lxml errors when scraping website HTML for text data. Tried several iterations
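I am trying to get the text of the congressional members from the https://www.congress.gov/members website. I am new to this. I followed a YouTube tutorial and think I am very close.
Here is a snippet of the HTML I am trying to get. The text is shown in bold.
Here is the syntax I think comes closest (using Python 2.7, a work restriction):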
import requests, lxml
import lxml.html
#from bs4 import BeautifulSoup
html = requests.get('https://www.congress.gov/members?q=%7B%22congress%22%3A%22117%22%2C%22chamber%22%3A%22Senate%22%7D')
doc = lxml.html.fromstring(html.content)
house = doc.xpath('//div[@id="houseMemberNavigator"]')[0]
print(house)#got printed element div
members = house.xpath('.//select[@id="members-representatives"]/text()')
#returns ['\n ', ' ']
print(members)
I'm sure the problem is my syntax, but I haven't been able to figure it out....
Using BeautifulSoup:
from bs4 import BeautifulSoup
# `html` is the requests response from the question's code
soup = BeautifulSoup(html.text, 'lxml')
[data.text for data in soup.find(id='members-representatives').select('option[value]')]
['Find a Representative',
'Adams, Alma S. [D-NC-12]',
'Aderholt, Robert B. [R-AL-4]',
'Aguilar, Pete [D-CA-31]',
'Allen, Rick W. [R-GA-12]',
'Allred, Colin Z. [D-TX-32]',
'Amodei, Mark E. [R-NV-2]',
'Armstrong, Kelly [R-ND]',
'Arrington, Jodey C. [R-TX-19]',
'Auchincloss, Jake [D-MA-4]',
'Axne, Cynthia [D-IA-3]',
'Babin, Brian [R-TX-36]',
...]
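If staying with lxml is preferred, the likely cause of the `['\n ', ' ']` result is that `/text()` on the `select` element returns only the whitespace sitting directly inside it, not the text of its `option` children. A minimal sketch of that idea follows. It runs against a small hand-written HTML sample rather than the live page, so the sample markup, the placeholder `value` attributes, and the helper name `option_texts` are all assumptions made for illustration.

```python
import lxml.html

# Hypothetical, trimmed-down stand-in for the page markup (not copied from the site).
# The value attributes are placeholders.
SAMPLE_HTML = """
<div id="houseMemberNavigator">
  <select id="members-representatives">
    <option value="">Find a Representative</option>
    <option value="placeholder-1">Adams, Alma S. [D-NC-12]</option>
    <option value="placeholder-2">Aderholt, Robert B. [R-AL-4]</option>
  </select>
</div>
"""


def option_texts(html_source, select_id):
    """Return the stripped text of each <option> under the given <select>."""
    doc = lxml.html.fromstring(html_source)
    # Step down into the <option> children before asking for text()
    xpath = '//select[@id="%s"]/option/text()' % select_id
    return [text.strip() for text in doc.xpath(xpath)]


doc = lxml.html.fromstring(SAMPLE_HTML)
# The question's XPath: only whitespace text nodes directly under <select>
whitespace_only = doc.xpath('//select[@id="members-representatives"]/text()')

members = option_texts(SAMPLE_HTML, "members-representatives")
print(members)
```

Against the real page, the same XPath could be applied to `doc` from the question's code, assuming the served HTML actually contains the `option` elements rather than filling them in with JavaScript.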