Can't parse XML properly with BeautifulSoup

Question

I'm trying to scrape this page: https://www.france24.com/en/europe/rss

My code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

xml = urlopen("https://www.france24.com/en/europe/rss")
data = xml.read()
text = data.decode('utf-8')
bs = BeautifulSoup(text, "lxml")
items = bs.find("rss").find("channel").find_all("item")
for n, item in enumerate(items):
    print(f"\n{n+1} - {item.find('title').get_text()}")
    print(item.find("pubDate"))
    print(item.find("description").get_text().replace("\n", ""))
    print(item.find("link").get_text())

The structure I'm interested in:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
    <channel>
        <lastBuildDate>Fri, 23 Apr 2021 06:08:56 GMT</lastBuildDate>
        <item>
            <category>Europe</category>
            <title>French fishermen seek to block British shipments in Brexit protest</title>
            <link>https://www.france24.com/en/europe/20210423-french-fishermen-seek-to-block-british-shipments-in-brexit-protest</link>
            <description>
French trawlermen angered by the slow issuance of licenses to fish inside British waters after Brexit on Thursday blocked lorries carrying UK-landed fish as they arrived in Boulogne-sur-Mer, Europe’s largest seafood processing centre.
</description>
            <media:thumbnail url="https://s.france24.com/media/display/70e354e8-a3e7-11eb-a6eb-005056bf87d6/w:1024/p:16x9/Brexit%20fishermen%20protest.jpg" />
            <enclosure url="https://s.france24.com/media/display/70e354e8-a3e7-11eb-a6eb-005056bf87d6/w:1024/p:16x9/Brexit%20fishermen%20protest.jpg" type="image/jpeg" length="0" />
            <guid isPermaLink="false">cadc08aa-a3e7-11eb-91c0-005056bff4a8</guid>
            <pubDate>Fri, 23 Apr 2021 03:55:46 GMT</pubDate>
            <source url="https://s.france24.com/media/display/70e354e8-a3e7-11eb-a6eb-005056bf87d6/w:1024/p:16x9/Brexit%20fishermen%20protest.jpg">© Denis Charlet, AFP</source>
            <dc:creator>NEWS WIRES</dc:creator>
        </item>

Output:

1 - French fishermen seek to block British shipments in Brexit protest
None
French trawlermen angered by the slow issuance of licenses to fish inside British waters after Brexit on Thursday blocked lorries carrying UK-landed fish as they arrived in Boulogne-sur-Mer, Europe’s largest seafood processing centre.

(...)

As you can see, pubDate prints as None and link prints as an empty line.

This is the result of print(bs), formatted, to show how BS parses the XML:

<?xml version="1.0" encoding="UTF-8"?>
<html>
    <body>
        <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/">
            <channel>
                <lastbuilddate>Fri, 23 Apr 2021 06:08:56 GMT</lastbuilddate>
                <item>
                    <category>Europe</category>
                    <title>French fishermen seek to block British shipments in Brexit protest</title>
                    <link />
                    https://www.france24.com/en/europe/20210423-french-fishermen-seek-to-block-british-shipments-in-brexit-protest
                    <description>
French trawlermen angered by the slow issuance of licenses to fish inside British waters after Brexit on Thursday blocked lorries carrying UK-landed fish as they arrived in Boulogne-sur-Mer, Europe’s largest seafood processing centre.
</description>
                    <media:thumbnail url="https://s.france24.com/media/display/70e354e8-a3e7-11eb-a6eb-005056bf87d6/w:1024/p:16x9/Brexit%20fishermen%20protest.jpg"></media:thumbnail>
                    <enclosure length="0" type="image/jpeg" url="https://s.france24.com/media/display/70e354e8-a3e7-11eb-a6eb-005056bf87d6/w:1024/p:16x9/Brexit%20fishermen%20protest.jpg"></enclosure>
                    <guid ispermalink="false">cadc08aa-a3e7-11eb-91c0-005056bff4a8</guid>
                    <pubdate>Fri, 23 Apr 2021 03:55:46 GMT</pubdate>
                    <source url="https://s.france24.com/media/display/70e354e8-a3e7-11eb-a6eb-005056bf87d6/w:1024/p:16x9/Brexit%20fishermen%20protest.jpg">© Denis Charlet, AFP</source>
                    <dc:creator>NEWS WIRES</dc:creator>
                </item>

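Note the lowercased pubdate and the empty <link /> tag, with the URL left outside it.

I can't see what the problem is. Any ideas on why it is being parsed incorrectly?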

Answer 1

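pubDate:

In the XML as parsed by BS, the pubDate tag becomes pubdate, but your code is looking for pubDate.

You could try searching for the lowercase name, pubdate, instead.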
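As a minimal sketch of that workaround, the snippet below keeps the "lxml" HTML parser from the question and looks up the lowercased tag name. The SAMPLE feed and the example.com URL are hypothetical stand-ins for the real feed. It also shows one way to recover the link: because an HTML parser treats <link> as an empty element, the URL should end up as the text node immediately after it.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down feed used only for illustration.
SAMPLE = """<rss version="2.0">
  <channel>
    <item>
      <title>Example headline</title>
      <link>https://example.com/story</link>
      <pubDate>Fri, 23 Apr 2021 03:55:46 GMT</pubDate>
    </item>
  </channel>
</rss>"""

# "lxml" parses the text as HTML, so tag names are lowercased.
soup = BeautifulSoup(SAMPLE, "lxml")
item = soup.find("item")

# The mixed-case name no longer exists in the parsed tree.
missing = item.find("pubDate")

# The lowercased name does.
pub_date = item.find("pubdate").get_text()

# <link> is closed immediately, so the URL is its next sibling text node.
link = item.find("link").next_sibling.strip()

print(missing)
print(pub_date)
print(link)
```

This only works around the symptoms; the document is still being parsed as HTML rather than XML.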

Answer 2

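Try the xml parser. "lxml" parses the document as HTML, so tag names are lowercased and <link> is treated as an empty element, whereas "xml" keeps the document as written: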

bs = BeautifulSoup(text, "xml")

This is the output I get for the first item:

1 - Without licenses to fish in British waters, French trawlermen block deliveries of UK-landed fish
<pubDate>Fri, 23 Apr 2021 03:55:46 GMT</pubDate>
French trawlermen angered by the slow issuance of licenses to fish inside British waters after Brexit on Thursday blocked lorries carrying UK-landed fish as they arrived in Boulogne-sur-Mer, Europe’s largest seafood processing centre.
https://www.france24.com/en/europe/20210423-french-fishermen-seek-to-block-british-shipments-in-brexit-protest
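For a version that runs without fetching the live feed, here is a small sketch of the same fix. The SAMPLE string and the example.com URL are hypothetical, and the "xml" parser assumes the lxml package is installed. With it, the original mixed-case name pubDate and the text inside link can be looked up directly.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down feed used only for illustration.
SAMPLE = """<rss version="2.0">
  <channel>
    <item>
      <title>Example headline</title>
      <link>https://example.com/story</link>
      <pubDate>Fri, 23 Apr 2021 03:55:46 GMT</pubDate>
    </item>
  </channel>
</rss>"""

# "xml" selects lxml's XML parser, which preserves tag-name case
# and does not treat <link> as an empty HTML element.
soup = BeautifulSoup(SAMPLE, "xml")

rows = []
for item in soup.find_all("item"):
    rows.append({
        "title": item.find("title").get_text(strip=True),
        "link": item.find("link").get_text(strip=True),
        "pubDate": item.find("pubDate").get_text(strip=True),
    })

print(rows)
```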

python

lxml

beautifulsoup

web-scraping