Getting empty list when scraping web page content using xpath

When I try to retrieve some data from a URL using XPath in the code below, I get an empty list:
from lxml import html
import requests

if __name__ == '__main__':
    url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'
    page = requests.get(url)
    tree = html.fromstring(page.content)

    # XPath to get the XP
    print(tree.xpath('//*[@id="graphDD1"]/text()'))

>>> []
What I expect is a string value like this:

>>> ['\r\n 5.0% ']
This is because the element your XPath is looking for is generated by some JavaScript. You need to find the cookie that is set after the JavaScript runs, so that you can make the same call to the URL.

- Go to the 'Network' tab of your browser's developer console
- Find the difference in the request headers after abg_lite.js has run (mine was cookie: __cf_bm=TtnYbPlIA0J_GOhNj2muKa1pi8pU38iqA3Yglaua7q8-1636535361-0-AQcpStbhEdH3oPnKSuPIRLHVBXaqVwo+zf6d3YI/rhmk/RvN5B7OaIcfwtvVyR0IolwcoCk4ClrSvbBP4DVJ70I=)
- Add the cookie to your request
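Before hunting for cookies, you can confirm the element really is rendered client-side. A minimal sketch (using only the URL and id from the question): if the id never appears in the raw response body, no XPath query against it can succeed.

```python
import requests

url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'
page = requests.get(url)

# If the id is absent from the raw HTML, the element is generated by
# JavaScript and a plain requests call will never see it.
print(b'graphDD1' in page.content)
```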
from lxml import html
import requests

if __name__ == '__main__':
    url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'

    # Create a session to add cookies and headers to
    s = requests.Session()

    # After finding the correct cookie, update your session's cookie jar
    # (add your own cookie here; adjacent string literals are joined by Python)
    s.cookies['__cf_bm'] = (
        'TtnYbPlIA0J_GOhNj2muKa1pi8pU38iqA3Yglaua7q8-1636535361-0-'
        'AQcpStbhEdH3oPnKSuPIRLHVBXaqVwo+zf6d3YI/rhmk/RvN5B7OaIcfwtvVyR0IolwcoCk4ClrSvbBP4DVJ70I='
    )

    # Update headers to spoof a regular browser; this may not be necessary
    # but is good practice to bypass any basic bot detection
    s.headers.update({
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
                      ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
    })

    page = s.get(url)
    tree = html.fromstring(page.content)

    # XPath to get the XP
    print(tree.xpath('//*[@id="graphDD1"]/text()'))
This gives the following output:

['\r\n 5.0% ']
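The returned string still carries the surrounding whitespace from the page markup. A small cleanup step (plain standard library; the sample value is taken from the output above) turns it into a usable number:

```python
# Raw value as returned by the XPath query
raw = ['\r\n                                    5.0% ']

# Strip the surrounding whitespace, then the trailing percent sign
text = raw[0].strip()            # '5.0%'
value = float(text.rstrip('%'))  # 5.0
print(text, value)
```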