Unexpectedly renaming field in file

I'm using Python with BS4/lxml to parse an XML RSS feed (specifically https://itch.io/games/on-sale.xml). Somewhere between Requests receiving the page data and BS4 parsing the text, the link field is being changed. Specifically, res.text contains ...</saleends><link>https://foo.itch.io/bar</link><description>... but reading it into BS4/lxml and printing the result gives ...</saleends><link/>https://foo.itch.io/bar<description>..., which BS4 is unable to parse correctly. My code is available here, line 237.

I can provide a stripped-down version of the project, without the login and logging parts, to make testing easier.

Edit, with simplified code:

import requests
from bs4 import BeautifulSoup
res = requests.get("https://itch.io/feed/sales.xml")
soup = BeautifulSoup(res.text, 'lxml')
print(soup.item.link)

Expected behavior: prints "https://itch.io/s/12345/foobar" (whatever the latest link in the RSS feed is).
Actual behavior: prints "<link/>".

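lxml is lxml's HTML parser, while lxml-xml (or xml) is lxml's XML parser. In HTML, link is a void element with no content and no closing tag, so the HTML parser emits <link/> and leaves the URL as stray text after it. (See this answer, which points to this documentation.)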

So you should use the lxml-xml (or xml) parser instead of the lxml parser.

import requests
from bs4 import BeautifulSoup
res = requests.get("https://itch.io/feed/sales.xml")
soup = BeautifulSoup(res.text, 'lxml-xml')
print(soup.item.link.text)

Output: https://itch.io/s/38593/halloween-event-sale
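
To check the difference without hitting the network, here is a minimal sketch that feeds the same snippet to both parsers. The SAMPLE string and the first_link_text helper are made up for illustration (modeled on the fragment quoted in the question, not real feed data), and it assumes both beautifulsoup4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a single feed item; not actual itch.io data.
SAMPLE = (
    "<rss><channel><item>"
    "<link>https://foo.itch.io/bar</link>"
    "<description>demo</description>"
    "</item></channel></rss>"
)


def first_link_text(markup, parser):
    """Return the text inside the first <item>'s <link> using the given parser."""
    soup = BeautifulSoup(markup, parser)
    return soup.item.link.get_text()


# HTML parser: <link> is treated as a void element, so it holds no text.
html_result = first_link_text(SAMPLE, "lxml")

# XML parser: <link> is an ordinary element that wraps the URL.
xml_result = first_link_text(SAMPLE, "lxml-xml")

print(repr(html_result))
print(repr(xml_result))
```

If the two printed values differ in the way described above, the parser choice is the cause and the feed data itself is fine.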