Extract an unusual XML tag with BS4 and Python

Question

I couldn't find an answer to this anywhere. I have this XML:

<channel>
    <title>xxx</title>
    <description>aaa</description>
    <item>
        <title>theTitle</title>
        <link/>link
    </item>
    <item>
        <title>theTitle2</title>
        <link/>link
    </item>

I need to extract all the links from this file.

I iterate over the items:

for link in soup.channel.findAll('item'):
    links = link.link
    linkdict.append(links)

But the output is:

[<link/>, <link/>, <link/>]

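How can I parse this XML, with or without a regular expression? I'd like the code to be as simple as possible.

Update

I found a way to do it by changing one line of code: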

soup = bs4.BeautifulSoup(output, features='xml')
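
To illustrate why switching the parser can help, here is a minimal sketch. It assumes the original feed used ordinary `<link>URL</link>` elements (the question does not show the raw input), and the URLs and the helper name extract_links are placeholders made up for the example. An HTML parser treats link as a void tag and leaves its text outside the element, while the XML parser keeps the text inside. The 'xml' feature needs lxml to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical feed; the URLs are placeholders, not taken from the question.
feed = """<channel>
  <item><title>theTitle</title><link>http://example.com/1</link></item>
  <item><title>theTitle2</title><link>http://example.com/2</link></item>
</channel>"""


def extract_links(markup, parser):
    """Return the text inside each item's <link> element for a given parser."""
    soup = BeautifulSoup(markup, features=parser)
    return [item.link.get_text(strip=True) for item in soup.find_all("item")]


# HTML rules: <link> is a void element, so it ends up empty.
html_links = extract_links(feed, "html.parser")
# XML rules: <link> is an ordinary element and keeps its text.
xml_links = extract_links(feed, "xml")

print(html_links)  # ['', '']
print(xml_links)   # ['http://example.com/1', 'http://example.com/2']
```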

Answer 1

Install lxml with pip install lxml, and then you can parse the document easily with:

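 soup = BeautifulSoup(xmlString, "lxml")

Note that "lxml" selects lxml's HTML parser. To parse the input as XML, pass "xml" (or "lxml-xml") instead, as in the update to the question.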
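
If the parsed tree really does look like the output in the question, with the link text sitting right after an empty `<link/>` tag, the text may still be reachable from BS4 as the tag's next sibling. The sketch below assumes the built-in html.parser (chosen here only so that no extra install is needed) and reuses the markup from the question; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Markup as it appears in the question: the text follows a self-closed <link/>.
markup = """<channel>
  <item>
    <title>theTitle</title>
    <link/>link
  </item>
  <item>
    <title>theTitle2</title>
    <link/>link
  </item>
</channel>"""

soup = BeautifulSoup(markup, "html.parser")

links = []
for item in soup.find_all("item"):
    # The text node directly after <link/> holds the value we want.
    sibling = item.link.next_sibling
    links.append(str(sibling).strip() if sibling is not None else "")

print(links)  # ['link', 'link']
```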

Answer 2

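Given that you already have lxml installed, you can use it directly instead of going through BeautifulSoup. In the lxml tree model, the link text is available as the tail of the corresponding <link/> element: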

from lxml import etree

raw = '''<channel> 
  <title>xxx</title>  
  <description>aaa</description>  
  <item> 
    <title>theTitle</title>  
    <link/>link
  </item>  
  <item> 
    <title>theTitle2</title>  
    <link/>link
  </item> 
</channel>'''

root = etree.fromstring(raw)
for link in root.xpath('//item/link'):
    print(link.tail.strip())

Output:

link
link

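The XPath expression //item/link means: find item elements anywhere in the current document and return their link child elements. It is also worth mentioning that lxml is faster than BS4 in most cases.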
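
The same tail-based idea can also be sketched with the standard library's xml.etree.ElementTree, which is a different module from the lxml used in this answer but exposes a similar .tail attribute. This is an illustrative variant, not part of the original answer; it is handy when installing lxml is not an option.

```python
import xml.etree.ElementTree as ET

raw = """<channel>
  <title>xxx</title>
  <description>aaa</description>
  <item>
    <title>theTitle</title>
    <link/>link
  </item>
  <item>
    <title>theTitle2</title>
    <link/>link
  </item>
</channel>"""

root = ET.fromstring(raw)

# ".//item/link" is ElementTree's limited-XPath spelling of //item/link.
# .tail is the text that follows the element's closing tag (None if absent).
links = [(link.tail or "").strip() for link in root.findall(".//item/link")]

print(links)  # ['link', 'link']
```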

References:
1) BeautifulSoup 4 Benchmark
2) Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?
