Unexpectedly renaming field in file
I'm parsing an XML RSS feed with Python BS4/lxml (specifically https://itch.io/games/on-sale.xml). I'm finding that somewhere between Requests receiving the page data and BS4 reading it from text, the name of the link field is being changed. Specifically, res.text contains ...</saleends><link>https://foo.itch.io/bar</link><description>...
but reading it into BS4/lxml and printing the soup results in ...</saleends><link/>https://foo.itch.io/bar<description>...
, which BS4 is unable to parse correctly. My code is available here, at line 237.
I can provide a stripped-down version of the project without the login and logging parts to make testing easier.
Edit with simplified code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://itch.io/feed/sales.xml")
soup = BeautifulSoup(res.text, 'lxml')
print(soup.item.link)
Expected behavior: prints "https://itch.io/s/12345/foobar" (or whatever the latest link in the RSS feed is)
Actual behavior: prints ""
lxml is lxml's HTML parser; lxml-xml and xml are lxml's XML parsers (see this answer, which points to this documentation).
Therefore, you should use the lxml-xml or xml parser instead of the lxml parser.
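For illustration, the difference shows up even on a small literal snippet (a minimal sketch; the element content here is made up):

from bs4 import BeautifulSoup

snippet = "<item><link>https://foo.itch.io/bar</link></item>"

# HTML parser: <link> is a void element in HTML, so it gets self-closed
# and its text spills out after the tag
print(BeautifulSoup(snippet, 'lxml'))

# XML parser: the <link> element keeps its content
print(BeautifulSoup(snippet, 'lxml-xml'))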
import requests
from bs4 import BeautifulSoup
res = requests.get("https://itch.io/feed/sales.xml")
# 'lxml-xml' selects lxml's XML parser, which leaves the <link> tags intact
soup = BeautifulSoup(res.text, 'lxml-xml')
print(soup.item.link.text)
Output:
https://itch.io/s/38593/halloween-event-sale
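As a follow-up, the same XML parser lets you walk the whole feed in the usual way; for example (a small sketch, assuming every <item> in the feed carries a <link> element):

import requests
from bs4 import BeautifulSoup

res = requests.get("https://itch.io/feed/sales.xml")
soup = BeautifulSoup(res.text, 'lxml-xml')

# Collect the link text of every <item> in the feed
sale_links = [item.link.text for item in soup.find_all("item")]
print(sale_links[:5])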