Get tables from Wikipedia that match specific text
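I'm new to Python and BeautifulSoup, and I've been trying to solve this for hours...
First, I want to extract the data from every table that has "general election" in its title, from the link below:
https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)
I do have another dataframe that contains the name of each table (e.g. "1961 general election", "1965 general election"), but I'm hoping to get by with searching each table for "general election" to confirm that it's one I need.
I then want to get all the names shown in bold (which indicates they won), and finally I want another list of the "Count 1" (sometimes "1st Pref") values in their original order, which I want to compare against the "bold" list. I haven't looked at that part yet, because I haven't got past the first hurdle.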
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)"
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')
my_tables = soup.find_all("table", {"class": "wikitable"})
for table in my_tables:
    rows = table.find_all('tr', text="general election")
    print(rows)
Any help would be greatly appreciated...
This page takes a few tricks, but it can be done:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

req = requests.get('https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)')
soup = bs(req.text, 'lxml')

# first - select all the tables on the page
tables = soup.select('table.wikitable')
for table in tables:
    ttr = table.select('tbody tr')
    # next, filter out any table that doesn't involve general elections
    # (the first row holds the table title)
    if ttr and "general election" in ttr[0].text:
        # the second row holds the column headings, one per line
        s_ttr = ttr[1].text.replace('\n', 'xxx').strip()
        # find and clean up column headings, dropping empty ones
        columns = [col.strip() for col in s_ttr.split('xxx') if len(col.strip()) > 0]
        rows = []  # initialize a list to house the table rows
        for c in ttr[2:]:
            # from here, start processing each row and loading it into the list;
            # empty cells become 'NA'
            row = [a.text.strip() if len(a.text.strip()) > 0 else 'NA' for a in c.select('td')]
            # drop an empty leading cell; checking `row` first avoids an
            # IndexError on rows that contain no <td> cells
            if row and row[0] == "NA":
                row = row[1:]
            if len(row) > 0:
                rows.append(row)
        # load the whole thing into a dataframe
        df = pd.DataFrame(rows, columns=columns)
        print(df)
The output should be all of the general election tables on the page.
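For the follow-up about the bolded names, the same title filter can be reused. What follows is only a rough sketch, not something tested against the live page: it runs on a small made-up HTML snippet (`SAMPLE_HTML`, with placeholder party and candidate names), and it assumes that the first row of each table holds the title, the second row holds the column headings, and winners are wrapped in `<b>` tags. The helper name `bold_names_by_table` is hypothetical too.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the tables on the page;
# the real Wikipedia tables have more columns and rows.
SAMPLE_HTML = """
<table class="wikitable">
  <tbody>
    <tr><th colspan="3">1961 general election: Carlow-Kilkenny</th></tr>
    <tr><th>Party</th><th>Candidate</th><th>Count 1</th></tr>
    <tr><td>Party A</td><td><b>Alice Example</b></td><td>9,000</td></tr>
    <tr><td>Party B</td><td>Bob Example</td><td>4,000</td></tr>
    <tr><td>Party C</td><td><b>Carol Example</b></td><td>7,500</td></tr>
  </tbody>
</table>
<table class="wikitable">
  <tbody>
    <tr><th colspan="2">1960 by-election: Carlow-Kilkenny</th></tr>
    <tr><th>Candidate</th><th>Count 1</th></tr>
    <tr><td><b>Dan Example</b></td><td>5,000</td></tr>
  </tbody>
</table>
"""


def bold_names_by_table(html):
    """Map each general-election table title to the bolded names in that table."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for table in soup.select("table.wikitable"):
        rows = table.select("tr")
        if not rows:
            continue
        # Assumption: the first row is the table title.
        title = rows[0].get_text(strip=True)
        # Substring test, so "1961 general election: ..." matches.
        if "general election" not in title:
            continue
        # Skip the title and heading rows, then collect every bold cell text.
        result[title] = [
            b.get_text(strip=True)
            for row in rows[2:]
            for b in row.select("td b")
        ]
    return result


winners = bold_names_by_table(SAMPLE_HTML)
print(winners)
```

The names come back in table order, so they can be compared against a "Count 1" column read from the same rows. On the real page, the `html` argument would be the `req.text` fetched above.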