I am trying to scrape the names of the members of the U.S. Congress from Ballotpedia

Question

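I am trying to scrape the names of the members of the U.S. Congress from this page (https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress) on Ballotpedia with Python. The code I am using worked fine in the past (as recently as last week). Now, instead of the legislators' names, it gives me only the page title: ",List_of_current_members_of_the_U.S._Congress".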

Here is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

list = ['https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress']

temp_dict = {}

for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')

    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select('table.wikitable.sortable.jquery-tablesorter')]

df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
df.to_csv('3-New Congressmen.csv')

I think the problem is on line 13:

temp_dict[page.split('/')[-1]] = [item.text for item in soup.select('table.wikitable.sortable.jquery-tablesorter')]

I tried taking out

table.wikitable.sortable.jquery-tablesorter

and replacing it with

bptable gray sortable tablesorter tablesorter-default tablesortera6303b5b2311e jquery-tablesorter

I also need to add a new line for the U.S. House, because the line above only returns senators:

bptable gray sortable tablesorter tablesorter-default tablesorter2e5ec79e370a5 jquery-tablesorter

However, this new code gives me exactly the same title that I got with the original code.

Do you have any suggestions for me? Thank you very much!

Answer 1

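If you are only interested in the tables on the site, pandas has a built-in function, read_html() (it requires the lxml package), that scrapes them and returns each one as a DataFrame: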

import pandas as pd

url = 'https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress'
tables = pd.read_html(url)  # returns a list of DataFrames, one per table found at the URL
print(tables[3])  # pick a table by its position in the list

Output:

                      Office             Name       Party Date assumed office
0         U.S. Senate Kansas      Jerry Moran  Republican     January 5, 2011
1         U.S. Senate Kansas   Roger Marshall  Republican     January 3, 2021
2       U.S. Senate Michigan      Gary Peters  Democratic     January 6, 2015
3       U.S. Senate Michigan  Debbie Stabenow  Democratic     January 3, 2001
4       U.S. Senate Virginia        Tim Kaine  Democratic     January 3, 2013
..                       ...              ...         ...                 ...
95      U.S. Senate Missouri      Josh Hawley  Republican     January 3, 2019
96  U.S. Senate Pennsylvania       Pat Toomey  Republican     January 5, 2011
97  U.S. Senate Pennsylvania    Bob Casey Jr.  Democratic     January 4, 2007
98          U.S. Senate Utah         Mike Lee  Republican     January 5, 2011
99          U.S. Senate Utah      Mitt Romney  Republican     January 3, 2019
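
Picking a table by position (index 3 above) can break if the page layout changes, and the question also asks for House members. A minimal sketch of an alternative is to keep every table that has a "Name" column and combine them. The helper `collect_names` is hypothetical, and the DataFrames below are made-up stand-ins for the list that read_html() returns; it is an assumption here that the House table also carries a "Name" column.

```python
import pandas as pd


def collect_names(tables, column="Name"):
    """Return the values of `column` from every table that has it."""
    # Keep only the tables that contain the column of interest; a page
    # usually also yields layout or navigation tables without it.
    matching = [t for t in tables if column in t.columns]
    if not matching:
        return []
    combined = pd.concat(matching, ignore_index=True)
    return combined[column].dropna().tolist()


# Stand-ins for the result of pd.read_html(url); the House row is invented.
senate = pd.DataFrame({"Office": ["U.S. Senate Kansas"], "Name": ["Jerry Moran"]})
house = pd.DataFrame({"Office": ["U.S. House (example)"], "Name": ["Example Member"]})
other = pd.DataFrame({0: ["navigation"], 1: ["links"]})

names = collect_names([other, senate, house])
print(names)
```

With the real page, the same helper would be called as collect_names(pd.read_html(url)), and the result could be written out with pd.Series(names).to_csv(...).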

Answer 2

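The best way is to select on the style padding-left:10px;text-align:center;, which is unique to the cells that hold the names of the U.S. Senate members, and add those names to a list.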

This should do the trick:

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header with the request
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)'}
url = "https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress"

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')  # the 'lxml' parser requires the lxml package

# Select the table cells by their inline style
tds = soup.find_all("td", {"style": "padding-left:10px;text-align:center;"})

# Collect the text of each matching cell
names = [td.get_text() for td in tds]
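
One possible reason the selectors in the question match nothing, offered as an assumption rather than something confirmed in this thread: classes such as jquery-tablesorter are typically added by a script in the browser, so they may be missing from the HTML that requests downloads. An empty match would be consistent with the question's output, since the CSV would then contain only the header row. The sketch below uses simplified, made-up markup (not the real page) to show the difference between a selector that depends on such a class and one that does not.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for what the server sends
# before any JavaScript runs. The real page's structure may differ.
static_html = """
<table class="bptable gray sortable">
  <tr><th>Office</th><th>Name</th></tr>
  <tr><td>U.S. Senate Kansas</td>
      <td style="padding-left:10px;text-align:center;">Jerry Moran</td></tr>
</table>
"""

soup = BeautifulSoup(static_html, "html.parser")

# A selector that requires a class absent from the markup matches nothing.
runtime_matches = soup.select("table.jquery-tablesorter")

# A selector built only from classes present in the markup does match.
static_matches = soup.select("table.bptable.sortable")

# The style-based lookup from the answer above also works on this markup.
names = [
    td.get_text(strip=True)
    for td in soup.find_all("td", {"style": "padding-left:10px;text-align:center;"})
]

print(len(runtime_matches), len(static_matches), names)
```

A quick way to check this against the live page is to search response.text for the class name before relying on it in a selector.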
