Problem with lxml.xpath not putting elements into a list

Question

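Here is my problem. I am trying to use lxml to scrape a website and get some information, but the elements holding that information are not found when I use the var.xpath command. The page itself is being fetched, but the xpath query finds nothing.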

import requests
from lxml import html

def main():
    result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

    # the root of the tracker website
    page = html.fromstring(result.content)
    print('its getting the element from here', page)

    threesRank = page.xpath('//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tbody/tr[*]/td[3]/div/div[2]/div[1]/div')
    print('the 3s rank is: ', threesRank)

if __name__ == "__main__":
    main()

OUTPUT:
"D:\Python projects\venv\Scripts\python.exe" "D:/Python projects/main.py"

its getting the element from here <Element html at 0x20eb01006d0>
the 3s rank is:  []

Process finished with exit code 0

The output next to "the 3s rank is:" should look something like this:

[<Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>]

Answer 1

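lxml does not add "tbody" the way a browser does. A browser inserts a tbody element into the DOM even when the HTML source has none, so an xpath copied from the browser's developer tools can contain a tbody step that does not exist in what lxml parses. Change your xpath to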

'//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tr[*]/td[3]/div/div[2]/div[1]/div'
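The difference can be reproduced offline. This is a minimal sketch, assuming a made-up HTML snippet (the table markup and its values are illustrative, not taken from the tracker page). The source contains no tbody, so the path with a tbody step matches nothing, while the path without it finds the rows.

```python
from lxml import html

# Hypothetical markup for illustration only: note there is no <tbody> in the source.
SAMPLE = """
<html><body>
  <table>
    <tr><td>Playlist A</td><td>Rank 1</td></tr>
    <tr><td>Playlist B</td><td>Rank 2</td></tr>
  </table>
</body></html>
"""

page = html.fromstring(SAMPLE)

with_tbody = page.xpath('//table/tbody/tr')   # path as a browser would report it
without_tbody = page.xpath('//table/tr')      # path matching the raw source
any_depth = page.xpath('//table//tr')         # matches whether or not tbody exists

print(len(with_tbody), len(without_tbody), len(any_depth))
```

Using `//table//tr` is one way to write a path that works both in the browser and in lxml, since it does not depend on whether a tbody sits between the table and its rows.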

Answer 2

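page.xpath(..) returns an empty result set because the xpath string does not match anything. It is hard to say exactly what you are looking for, but given the name "threesRank" I assume you want all the table values, i.e. the ranks and so on.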

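You can get a more accurate and self-explanatory xpath with the Chrome extension "Xpath helper". Usage: open the site and activate the extension, then hold Shift and hover over the element you are interested in.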

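Because the HTML used by tracker.network is built dynamically with JavaScript and BootstrapVue (plus Moment/Typeahead/jQuery), there is a significant risk that the dynamic rendering will sometimes produce different results.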
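A minimal offline sketch of what this means for lxml. The markup below is a made-up stand-in for a JavaScript-rendered page, not the tracker site's actual HTML: the server sends an empty container plus a script, and any table only exists after a browser has run that script.

```python
from lxml import html

# Hypothetical server response: an empty mount point, with the data in a script.
RAW = """
<html><body>
  <div id="app"></div>
  <script>window.__INITIAL_STATE__={"example":"value"};</script>
</body></html>
"""

page = html.fromstring(RAW)

tables = page.xpath('//*[@id="app"]//table')  # nothing has been rendered
scripts = page.xpath('//script/text()')       # but the data is in the source

print(tables)
print('__INITIAL_STATE__' in scripts[0])
```

lxml only parses the text it is given and never executes scripts, so an xpath that describes the rendered page can come back empty even though the underlying data was downloaded.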

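I recommend that, instead of scraping the rendered HTML, you use the structured data that the rendering is based on. In this case it is stored as JSON in a JavaScript variable named __INITIAL_STATE__: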

import requests
import re
import json
from contextlib import suppress

# get page
result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

# Extract everything needed to render the current page. Data is stored as Json in the
# JavaScript variable: window.__INITIAL_STATE__={"route":{"path":"\u0 ... }};
json_string = re.search(r"window\.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text).group(1)

# convert text string to structured json data
rocketleague = json.loads(json_string)

# Save structured json data to a text file that helps you orient yourself and pick
# the parts you are interested in.
with open('rocketleague_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(rocketleague, indent=4, sort_keys=True))

# Access members using names
print(rocketleague['titles']['currentTitle']['platforms'][0]['name'])

# To avoid an exception when a key is missing or an index is out of range, use
# "with suppress" as in the example below. A missing key raises KeyError and an
# out-of-range list index raises IndexError, so both are suppressed here. Since
# there is no platform number 99, the variable "platform99" is left unassigned
# and no exception propagates.
with suppress(KeyError, IndexError):
    platform1 = rocketleague['titles']['currentTitle']['platforms'][0]['name']
    platform99 = rocketleague['titles']['currentTitle']['platforms'][99]['name']

# print platforms used by currentTitle
for platform in rocketleague['titles']['currentTitle']['platforms']:
    print(platform['name'])

# print all titles with corresponding platforms
for title in rocketleague['titles']['titles']:
    print(f"\nTitle: {title['name']}")
    for platform in title['platforms']:
        print(f"\tPlatform: {platform['name']}")
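One caveat with the non-greedy regex above: it stops at the first "};" it meets, which would cut the JSON short if that sequence ever appeared inside a string value. A possible alternative, sketched here under assumptions, is json.JSONDecoder.raw_decode, which parses one complete JSON value starting at a given index and ignores whatever follows it. The function name extract_initial_state and the sample text are hypothetical and only for illustration.

```python
import json

MARKER = 'window.__INITIAL_STATE__'

def extract_initial_state(page_text):
    """Return the object assigned to window.__INITIAL_STATE__, or None."""
    start = page_text.find(MARKER)
    if start == -1:
        return None
    brace = page_text.find('{', start + len(MARKER))
    if brace == -1:
        return None
    # raw_decode parses a single JSON value beginning at index `brace`
    # and returns (value, end_index); text after the value is ignored.
    try:
        value, _end = json.JSONDecoder().raw_decode(page_text, brace)
    except json.JSONDecodeError:
        return None
    return value

# Made-up sample where a string value contains "};"
sample = '<script>window.__INITIAL_STATE__={"note":"a};b","n":[1,2]};</script>'
state = extract_initial_state(sample)
print(state)
```

With real page text you would call extract_initial_state(result.text) and check for None before indexing into the result.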
