为什么我的网站抓取缺少预期的 table 使用 python?
Why is my website scrape missing the intended table using python?
我正在尝试使用此代码从 Ballotpedia (https://ballotpedia.org/Governor_(state_executive_office)) 中抓取信息,特别是高管的姓名。我这里的代码只给我以下输出:
,Governor_(state_executive_office),Lieutenant_Governor_(state_executive_office),Secretary_of_State_(state_executive_office),Attorney_General_(state_executive_office)
我也在尝试获取名称。这是我当前的代码:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
temp_dict = {}
for page in list:
r = requests.get(page)
soup = BeautifulSoup(r.content, 'html.parser')
temp_dict[page.split('/')[-1]] = [item.text for item in
soup.select("table.bptable.gray.sortable.tablesorter
tablesorter-default tablesorter17e7f0d6cf4b4 jquery-
tablesorter")]
最后一行是我认为存在问题的那一行。我尝试删除代码并将其添加到“table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter”部分,但仍然得到相同的结果。我直接从网站上复制了它,不,我不确定我遗漏了什么。如果不是这个,那么该行中的其余代码是否有问题?谢谢!
有一种更简单的方法。随机取一个你的网址,试试这个:
import pandas as pd
tables = pd.read_html("https://ballotpedia.org/Governor_(state_executive_office)")
tables[4]
输出:
Office Name Party Date assumed office
0 Governor of Georgia Brian Kemp Republican January 14, 2019
1 Governor of Tennessee Bill Lee Republican January 15, 2019
2 Governor of Missouri Mike Parson Republican June 1, 2018
等等
您可以尝试通过选择器到达 table:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
temp_dict = {}
for page in list:
r = requests.get(page)
soup = BeautifulSoup(r.content, 'html.parser')
temp_dict[page.split('/')[-1]] = [item.text for item in soup.select('#officeholder-table')]
使用下面的 css 选择器先找到 table 然后使用 pandas
到 read_html()
并加载到数据框中。
这将为您提供单个数据框中的所有数据。
import pandas as pd
import requests
from bs4 import BeautifulSoup
listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
df1=pd.DataFrame()
for l in listurl:
res=requests.get(l)
soup=BeautifulSoup(res.text,'html.parser')
table=soup.select("table#officeholder-table")[-1]
df= pd.read_html(str(table))[0]
df1=df1.append(df,ignore_index=True)
print(df1)
如果您想获取单个数据框,请尝试此操作。
import pandas as pd
import requests
from bs4 import BeautifulSoup
listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
for l in listurl:
res=requests.get(l)
soup=BeautifulSoup(res.text,'html.parser')
table=soup.select("table#officeholder-table")[-1]
df= pd.read_html(str(table))[0]
print(df)
我正在尝试使用此代码从 Ballotpedia (https://ballotpedia.org/Governor_(state_executive_office)) 中抓取信息,特别是高管的姓名。我这里的代码只给我以下输出:
,Governor_(state_executive_office),Lieutenant_Governor_(state_executive_office),Secretary_of_State_(state_executive_office),Attorney_General_(state_executive_office)
我也在尝试获取名称。这是我当前的代码:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
temp_dict = {}
for page in list:
r = requests.get(page)
soup = BeautifulSoup(r.content, 'html.parser')
temp_dict[page.split('/')[-1]] = [item.text for item in
soup.select("table.bptable.gray.sortable.tablesorter
tablesorter-default tablesorter17e7f0d6cf4b4 jquery-
tablesorter")]
最后一行是我认为存在问题的那一行。我尝试删除代码并将其添加到“table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter”部分,但仍然得到相同的结果。我直接从网站上复制了它,不,我不确定我遗漏了什么。如果不是这个,那么该行中的其余代码是否有问题?谢谢!
有一种更简单的方法。随机取一个你的网址,试试这个:
import pandas as pd
tables = pd.read_html("https://ballotpedia.org/Governor_(state_executive_office)")
tables[4]
输出:
Office Name Party Date assumed office
0 Governor of Georgia Brian Kemp Republican January 14, 2019
1 Governor of Tennessee Bill Lee Republican January 15, 2019
2 Governor of Missouri Mike Parson Republican June 1, 2018
等等
您可以尝试通过选择器到达 table:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
temp_dict = {}
for page in list:
r = requests.get(page)
soup = BeautifulSoup(r.content, 'html.parser')
temp_dict[page.split('/')[-1]] = [item.text for item in soup.select('#officeholder-table')]
使用下面的 css 选择器先找到 table 然后使用 pandas
到 read_html()
并加载到数据框中。
这将为您提供单个数据框中的所有数据。
import pandas as pd
import requests
from bs4 import BeautifulSoup
listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
df1=pd.DataFrame()
for l in listurl:
res=requests.get(l)
soup=BeautifulSoup(res.text,'html.parser')
table=soup.select("table#officeholder-table")[-1]
df= pd.read_html(str(table))[0]
df1=df1.append(df,ignore_index=True)
print(df1)
如果您想获取单个数据框,请尝试此操作。
import pandas as pd
import requests
from bs4 import BeautifulSoup
listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
for l in listurl:
res=requests.get(l)
soup=BeautifulSoup(res.text,'html.parser')
table=soup.select("table#officeholder-table")[-1]
df= pd.read_html(str(table))[0]
print(df)