How can I parse XML using Python?

Question

I want to parse an XML file from a website. Can anyone help me?

This is the XML; I only want to get the <loc> information.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>
http://www.habergazete.com/haber-detay/1/69364/cAYKUR-3-bin-500-personel-alimi-yapacagini-duyurdu-cAYKUR-3-bin-500-personel-alim-sarlari--2015-01-29.html
</loc>
<news:news>
<news:publication>
<news:name>Haber Gazete</news:name>
<news:language>tr</news:language>
</news:publication>
<news:publication_date>2015-01-29T15:04:01+02:00</news:publication_date>
<news:title>
ÇAYKUR 3 bin 500 personel alımı yapacağını duyurdu! (ÇAYKUR 3 bin 500 personel alım şarları)
</news:title>
</news:news>
<image:image>
<image:loc>
http://www.habergazete.com/resimler/haber/haber_detay/611x395-alim-54c8f335b176e-1422536816.jpg
</image:loc>
</image:image>
</url>

I tried to parse it with this code, but it gives me null:

from http import client
import xml.etree.ElementTree as ET

conn = client.HTTPConnection("www.habergazete.com")
conn.request("GET", "/sitemaps/1/haberler.xml")
response = conn.getresponse()
xmlData = response.read()
conn.close()
root = ET.fromstring(xmlData)
print(root.findall("loc"))

Any suggestions?

Thanks :)

Answer 1

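First, the XML you show is not well-formed, so parsing it should raise an exception: it is missing the final closing '</urlset>'. I suspect you just haven't shown us the actual XML you're trying to parse.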
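As a rough illustration of that first point, the sketch below feeds ElementTree a deliberately truncated document and then the same bytes with the closing tag appended. The sample data, the example.com URL and the helper name try_parse are assumptions made up for this example; they are not the asker's real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical truncated sample: the closing </urlset> tag is missing.
truncated = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>http://example.com/a.html</loc></url>"
)


def try_parse(data):
    """Return the root element, or None if the data is not well-formed."""
    try:
        return ET.fromstring(data)
    except ET.ParseError as exc:
        print("not well-formed:", exc)
        return None


broken_root = try_parse(truncated)                # parsing fails, returns None
fixed_root = try_parse(truncated + b"</urlset>")  # parses once the tag is closed
```

If your real data parses without an exception, it was not actually truncated and only the next two problems apply.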

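Once you fix that (for example, if the XML data really is truncated somehow, by parsing xmlData + b'</urlset>', since xmlData is bytes in Python 3), you then run into a namespace problem, which is easy to show: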

>>> ET.tostring(root)
b'<ns0:urlset xmlns:ns0="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:ns1="http://www.google.com/schemas/sitemap-news/0.9" xmlns:ns2="http://www.google.com/schemas/sitemap-image/1.1">\n<ns0:url>\n<ns0:loc>\nhttp://www.habergazete.com/haber-detay/1/69364/cAYKUR-3-bin-500-personel-alimi-yapacagini-duyurdu-cAYKUR-3-bin-500-personel-alim-sarlari--2015-01-29.html\n</ns0:loc>\n<ns1:news>\n<ns1:publication>\n<ns1:name>Haber Gazete</ns1:name>\n<ns1:language>tr</ns1:language>\n</ns1:publication>\n<ns1:publication_date>2015-01-29T15:04:01+02:00</ns1:publication_date>\n<ns1:title>\n&#199;AYKUR 3 bin 500 personel al&#305;m&#305; yapaca&#287;&#305;n&#305; duyurdu! (&#199;AYKUR 3 bin 500 personel al&#305;m &#351;arlar&#305;)\n</ns1:title>\n</ns1:news>\n<ns2:image>\n<ns2:loc>\nhttp://www.habergazete.com/resimler/haber/haber_detay/611x395-alim-54c8f335b176e-1422536816.jpg\n</ns2:loc>\n</ns2:image>\n</ns0:url></ns0:urlset>'

Yes, it's a long string, but very early on you'll see:

<ns0:loc>

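This shows that the loc you're looking for is actually recorded as being in "namespace 0" (the ns0: prefix, which ElementTree generates for the document's default namespace, http://www.sitemaps.org/schemas/sitemap/0.9).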
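Another way to see the namespace, without reading a long serialized string, is to look at the tag attribute of each element. This is a minimal sketch on a made-up, shortened sitemap (the example.com URL and the variable names are assumptions); ElementTree stores namespaced tags in the form {namespace-uri}localname.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical, shortened sitemap that uses the same default namespace.
sample = (
    f'<urlset xmlns="{SITEMAP_NS}">'
    "<url><loc>http://example.com/a.html</loc></url>"
    "</urlset>"
)

root = ET.fromstring(sample)

# Every tag carries its namespace URI in braces, so a plain "loc" never matches.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```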

Third, the documentation at https://docs.python.org/2/library/xml.etree.elementtree.html carefully explains, and I quote:

Element.findall() finds only elements with a tag which are direct children of the current element.

The emphasis is mine: you will only find tags that are direct children of urlset, not general descendants (children of children, and so on).
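The direct-children rule is easier to see on a tiny tree with no namespaces involved. The document below is invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-level tree: <c> is a grandchild of the root <a>.
tree = ET.fromstring("<a><b><c>deep</c></b><b/></a>")

direct_b = tree.findall("b")   # <b> elements are direct children of <a>
direct_c = tree.findall("c")   # <c> is not a direct child, so nothing matches
any_c = tree.findall(".//c")   # ".//" searches all descendants

print(len(direct_b), len(direct_c), len(any_c))
```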

So, expanding the namespace and using a bit of XPath syntax for a recursive search:

>>> root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}loc')
[<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}loc' at 0x1022a50e8>]

...you finally find the element you were looking for.
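If writing the full namespace URI in every path gets tedious, findall() and findtext() also accept a prefix-to-URI mapping. The sketch below is one possible way to pull out each loc and news:title; the prefixes "sm" and "news", the helper extract_entries and the sample document are assumptions chosen for this example, and the text is stripped because the real sitemap puts newlines around the values.

```python
import xml.etree.ElementTree as ET

# Prefixes are local to this script; only the URIs have to match the document.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def extract_entries(xml_data):
    """Return a list of (loc, title) tuples, one per <url> element."""
    root = ET.fromstring(xml_data)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        title = url.findtext(
            "news:news/news:title", default="", namespaces=NS
        ).strip()
        entries.append((loc, title))
    return entries


# Hypothetical sample laid out like the sitemap in the question.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
<url>
<loc>
http://example.com/a.html
</loc>
<news:news>
<news:title>
Example title
</news:title>
</news:news>
</url>
</urlset>"""

entries = extract_entries(sample)
print(entries)
```

In the question's code, the same function could be called as extract_entries(xmlData) once the response has been read.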

By the way, some of us find BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/bs4/doc/) a convenient alternative for cases where etree or lxml is not required.

Tags: python, xml, parsing, xml-parsing, python-3.x