I am trying to scrape the names of the members of the U.S. Congress from Ballotpedia
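I am trying to scrape the names of the members of the U.S. Congress from this page (https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress) on Ballotpedia with Python. The code I am using worked fine in the past (as recently as last week). Now, instead of the legislators' names, it gives me the page title: ",List_of_current_members_of_the_U.S._Congress".
Here is my code: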
import requests
from bs4 import BeautifulSoup
import pandas as pd

list = ['https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress']

temp_dict = {}

for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')

    temp_dict[page.split('/')[-1]] = [item.text for item in
                                      soup.select('table.wikitable.sortable.jquery-tablesorter')]

df = pd.DataFrame.from_dict(temp_dict,
                            orient='index').transpose()
df.to_csv('3-New Congressmen.csv')
I think the problem is in line 13:
    temp_dict[page.split('/')[-1]] = [item.text for item in
                                      soup.select('table.wikitable.sortable.jquery-tablesorter')]
I tried taking out
table.wikitable.sortable.jquery-tablesorter
and replacing it with
bptable gray sortable tablesorter tablesorter-default tablesortera6303b5b2311e jquery-tablesorter
I would also need to add a new line for the U.S. House, because the line above would only give me the senators:
bptable gray sortable tablesorter tablesorter-default tablesorter2e5ec79e370a5 jquery-tablesorter
However, this new code gives me exactly the same title that I got with the original code.
Do you have any advice for me? Thanks a lot!
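The symptom described above (a CSV that contains only the page title) is consistent with the selector matching no tables at all, so the list stored in temp_dict is empty. A minimal sketch of how to check that, assuming a small hypothetical HTML snippet (SAMPLE_HTML) in place of the downloaded page; the real page's markup may differ:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (an assumption, not the real
# markup): the table has none of the classes the selector asks for together.
SAMPLE_HTML = '<table class="bptable gray sortable"><tr><td>Jerry Moran</td></tr></table>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = soup.select("table.wikitable.sortable.jquery-tablesorter")
print(len(matches))  # number of tables the selector found in the sample

# Same steps as in the question, with whatever the selector returned.
temp_dict = {"List_of_current_members_of_the_U.S._Congress": [item.text for item in matches]}
df = pd.DataFrame.from_dict(temp_dict, orient="index").transpose()
csv_text = df.to_csv()  # with no path argument, to_csv returns the CSV as a string
print(repr(csv_text))
```

If the match count is zero, the resulting CSV should hold nothing but a header row with the dictionary key, which looks like the output reported in the question. Printing len(soup.select(...)) against the real response is a quick way to confirm whether a given selector matches anything.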
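If you are only interested in the tables on the site, pandas has a built-in function read_html()
(requires the lxml package) that scrapes them and returns each one as a DataFrame: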
import pandas as pd

url = 'https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress'
tables = pd.read_html(url)  # returns a list with one DataFrame per table found at the URL
print(tables[3])  # pick a table by its index in that list
Output:
                      Office             Name       Party  Date assumed office
0         U.S. Senate Kansas      Jerry Moran  Republican      January 5, 2011
1         U.S. Senate Kansas   Roger Marshall  Republican      January 3, 2021
2       U.S. Senate Michigan      Gary Peters  Democratic      January 6, 2015
3       U.S. Senate Michigan  Debbie Stabenow  Democratic      January 3, 2001
4       U.S. Senate Virginia        Tim Kaine  Democratic      January 3, 2013
..                       ...              ...         ...                  ...
95      U.S. Senate Missouri      Josh Hawley  Republican      January 3, 2019
96  U.S. Senate Pennsylvania       Pat Toomey  Republican      January 5, 2011
97  U.S. Senate Pennsylvania    Bob Casey Jr.  Democratic      January 4, 2007
98          U.S. Senate Utah         Mike Lee  Republican      January 5, 2011
99          U.S. Senate Utah      Mitt Romney  Republican      January 3, 2019
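Since the question asks for House members as well as senators, one possible extension of the read_html() approach is to collect the "Name" column from every table that has one, rather than relying on a fixed index. The sketch below is only an illustration: collect_names and SAMPLE_HTML are hypothetical names, the sample rows for the second and third tables are placeholders, and it assumes the real tables label the column "Name" as in the output above.

```python
from io import StringIO

import pandas as pd


def collect_names(html: str, column: str = "Name") -> list[str]:
    """Return the values of `column` from every table in `html` that has it."""
    tables = pd.read_html(StringIO(html))  # one DataFrame per <table>
    names: list[str] = []
    for table in tables:
        if column in table.columns:
            names.extend(table[column].dropna().astype(str).tolist())
    return names


# Hypothetical sample page: two member tables and one unrelated table.
SAMPLE_HTML = """
<table>
  <tr><th>Office</th><th>Name</th><th>Party</th></tr>
  <tr><td>U.S. Senate Kansas</td><td>Jerry Moran</td><td>Republican</td></tr>
  <tr><td>U.S. Senate Utah</td><td>Mike Lee</td><td>Republican</td></tr>
</table>
<table>
  <tr><th>Party</th><th>Seats</th></tr>
  <tr><td>Placeholder</td><td>1</td></tr>
</table>
<table>
  <tr><th>Office</th><th>Name</th><th>Party</th></tr>
  <tr><td>Placeholder House seat</td><td>Jane Doe</td><td>Independent</td></tr>
</table>
"""

sample_names = collect_names(SAMPLE_HTML)
print(sample_names)
```

For the live page, the HTML text would come from something like requests.get(url).text; passing it through StringIO avoids handing a raw HTML string to read_html(), which newer pandas versions warn about.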
The best way would be to use padding-left:10px;text-align:center;,
the style that is unique to the U.S. Senate member name cells, to add the names to a list.
This should do the trick:
import requests
import bs4 as bs

# Send a browser-like user agent with the request.
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)'}
url = "https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress"
response = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(response.text, 'lxml')

# Select the <td> cells whose style attribute is exactly this string.
tds = soup.find_all("td", {"style": "padding-left:10px;text-align:center;"})
names = [td.get_text() for td in tds]
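Matching on an exact inline style string stops working if the site changes that style even slightly. As a possible alternative, the sketch below locates the "Name" column by its header text instead. It is an illustration under assumptions: names_by_header and SAMPLE_HTML are hypothetical, and it assumes each table's first row is the header row, which may not hold for the real page.

```python
from bs4 import BeautifulSoup


def names_by_header(html: str, header: str = "Name") -> list[str]:
    """Collect the text of the `header` column from every table that has it."""
    soup = BeautifulSoup(html, "html.parser")
    names: list[str] = []
    for table in soup.find_all("table"):
        rows = table.find_all("tr")
        if not rows:
            continue
        # Assumption: the first row of each table holds the column headers.
        headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
        if header not in headers:
            continue
        idx = headers.index(header)
        for row in rows[1:]:
            cells = row.find_all(["th", "td"])
            if len(cells) > idx:  # skip short or spacer rows
                names.append(cells[idx].get_text(strip=True))
    return names


# Hypothetical sample markup, not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Office</th><th>Name</th><th>Party</th></tr>
  <tr><td>U.S. Senate Kansas</td><td> Jerry Moran </td><td>Republican</td></tr>
  <tr><td>U.S. Senate Utah</td><td>Mitt Romney</td><td>Republican</td></tr>
</table>
<table>
  <tr><th>Party</th><th>Seats</th></tr>
  <tr><td>Placeholder</td><td>1</td></tr>
</table>
"""

header_names = names_by_header(SAMPLE_HTML)
print(header_names)
```

With the live page, response.text from the snippet above could be passed in place of SAMPLE_HTML.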