Unexpectedly renaming field in file

Question

I'm parsing an XML-format RSS feed with Python BS4/lxml (specifically https://itch.io/games/on-sale.xml). Between Requests receiving the page data and BS4 reading it from text, the link field is being changed. Specifically, res.text contains ...</saleends><link>https://foo.itch.io/bar</link><description>... but reading it into BS4/lxml and printing the result gives ...</saleends><link/>https://foo.itch.io/bar<description>..., which BS4 is unable to parse correctly. My code is available here, line 237.

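I can provide a stripped-down version of the project, without the login and logging parts, to make testing easier.

Edit with simplified code: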

import requests
from bs4 import BeautifulSoup
res = requests.get("https://itch.io/feed/sales.xml")
soup = BeautifulSoup(res.text, 'lxml')
print(soup.item.link)

Expected behavior: prints "https://itch.io/s/12345/foobar" (or whatever the latest link in the RSS is).

Actual behavior: prints "".

Answer 1

In Beautiful Soup, lxml selects lxml's HTML parser, while lxml-xml and xml select lxml's XML parser. (See this answer, which points to this documentation.)

So you should use the lxml-xml or xml parser instead of the lxml parser.

import requests
from bs4 import BeautifulSoup
res = requests.get("https://itch.io/feed/sales.xml")
# 'lxml-xml' selects lxml's XML parser, so <link> keeps its text content
soup = BeautifulSoup(res.text, 'lxml-xml')
print(soup.item.link.text)

Output: https://itch.io/s/38593/halloween-event-sale
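To see the difference without hitting the network, the two parsers can be compared on a small inline snippet. This is only a sketch: SAMPLE_RSS is a made-up fragment modeled on the snippet in the question, not the real feed. The likely cause is that link is a void element in HTML (it cannot have content), so the HTML parser closes it immediately and leaves the URL as stray text after it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real feed (assumption: the actual feed has
# more fields; only the shape around <link> matters for this comparison).
SAMPLE_RSS = (
    "<rss><channel><item>"
    "<saleends>2020-11-01</saleends>"
    "<link>https://foo.itch.io/bar</link>"
    "<description>Example sale</description>"
    "</item></channel></rss>"
)

# 'lxml' is the HTML parser: <link> is treated as an empty (void) tag.
html_soup = BeautifulSoup(SAMPLE_RSS, "lxml")
# 'lxml-xml' is the XML parser: <link> is an ordinary element with text.
xml_soup = BeautifulSoup(SAMPLE_RSS, "lxml-xml")

html_link = html_soup.item.link.text
xml_link = xml_soup.item.link.text

print(repr(html_link))  # empty string: the URL is no longer inside <link>
print(repr(xml_link))   # the URL, as expected
```

With the HTML parser the URL is still in the tree, just as a sibling text node after the empty link tag, which is why reading link's own text comes back empty.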

python

rss

lxml

beautifulsoup

python-requests