How can I get the line of the text where an XML tag is found in Python using bs4 or lxml?

Question

I have an XML document, and I want to get the line on which a tag extracted by BeautifulSoup or lxml is located. Is there a way to do that?

Answer 1

For BeautifulSoup, this information is stored in the sourceline attribute of the Tag class, and it is populated by the parsers (the html.parser and html5lib tree builders).
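
As a minimal sketch of the BeautifulSoup case, assuming a Beautiful Soup version that exposes Tag.sourceline and the built-in "html.parser" builder (which treats the input as HTML rather than strict XML), the line of each tag could be read like this. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

xml = """<a>
  <b>
    <c>
    </c>
  </b>
  <d>
  </d>
</a>"""

# Assumption: the "html.parser" builder, which records source positions.
soup = BeautifulSoup(xml, "html.parser")

# find_all(True) matches every tag, in document order.
tag_lines = [(tag.name, tag.sourceline) for tag in soup.find_all(True)]

for name, line in tag_lines:
    print(name, line)
```

Other builders may not fill in sourceline, so it is worth checking for None before relying on the value.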

For lxml, this is also available through the sourceline attribute. Here is an example:

#!/usr/bin/python3
from lxml import etree
xml = '''
<a>
  <b>
    <c>
    </c>
  </b>
  <d>
  </d>
</a>
'''
root = etree.fromstring(xml)

for e in root.iter():
    print(e.tag, e.sourceline)

Output:

a 2
b 3
c 4
d 7

If you want to see how the sourceline property is implemented: it calls xmlGetLineNo, lxml's binding of xmlGetLineNo from libxml2, which in turn is a wrapper for xmlGetLineNoInternal (the actual logic lives in libxml2).

You can also count how many line endings there are in the text representation of that tag's subtree.
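
One way to read that suggestion, as a rough sketch: serialize the element and count the newlines to learn how many lines its subtree covers. The example below uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made to keep it dependency-free), and line_span is a hypothetical helper name. The count is approximate, because serialization can normalize whitespace inside tags.

```python
import xml.etree.ElementTree as ET

xml = """
<a>
  <b>
    <c>
    </c>
  </b>
  <d>
  </d>
</a>
"""
root = ET.fromstring(xml)


def line_span(element):
    """Approximate number of source lines covered by the element's subtree."""
    text = ET.tostring(element, encoding="unicode")
    # ET.tostring also emits the element's tail (text after the closing tag),
    # so the newlines that belong to the tail are subtracted again.
    tail = element.tail or ""
    return text.count("\n") - tail.count("\n") + 1


spans = {element.tag: line_span(element) for element in root.iter()}
print(spans)
```

Combined with a known starting line, the span gives the line on which the subtree ends (start + span - 1).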

xml.etree.ElementTree can also be extended to provide the line numbers where the elements have been found by the parser (the parser being xmlparser from the module xml.parsers.expat).
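
A sketch of that idea, under stated assumptions: instead of subclassing the ElementTree parser, it drives an ET.TreeBuilder from an expat parser directly and records the parser's CurrentLineNumber for each start tag. The helper name parse_with_lines is hypothetical, line numbers are kept in a side dictionary rather than on the elements, and namespaces, comments and processing instructions are not handled.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def parse_with_lines(text):
    """Parse XML text; return (root, lines), where lines maps Element -> line."""
    builder = ET.TreeBuilder()
    parser = expat.ParserCreate()
    lines = {}

    def start(tag, attrs):
        # TreeBuilder.start returns the element it has just opened.
        element = builder.start(tag, attrs)
        # During a callback, CurrentLineNumber is the 1-based line of the
        # markup that triggered it (here, the start tag).
        lines[element] = parser.CurrentLineNumber

    parser.StartElementHandler = start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data
    parser.Parse(text, True)
    return builder.close(), lines


xml = """
<a>
  <b>
    <c>
    </c>
  </b>
  <d>
  </d>
</a>
"""
root, lines = parse_with_lines(xml)
found = [(element.tag, lines[element]) for element in root.iter()]
for tag, line in found:
    print(tag, line)
```

Keeping the numbers in a dictionary avoids setting extra attributes on Element objects, which the C-accelerated implementation may not allow.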

Answer 2

Try using the enumerate() function.

For example, if we have the following HTML:

html = """
<!DOCTYPE html>
<html>
<body>
<h1>My Heading</h1>
<p>My paragraph.</p>
</body>
</html>"""

We want to find the line number of the <h1> tag (<h1>My Heading</h1>).

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

for index, value in enumerate(
    # Skip empty lines so that they are not part of the line count
    (x for x in str(soup).splitlines() if x != ""),
    start=1,
):
    # Specify the tag you want to find.
    # str.find returns the index of the first match, or -1 if there is none;
    # index 1 means "h1" directly follows the "<" at the start of the line.
    if value.find("h1") == 1:
        print(f"Line: {index}.\t Found: '{value}' ")
        break

Output:

Line: 4.     Found: '<h1>My Heading</h1>'
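
Note that the loop above numbers the lines of the re-serialized soup after dropping blank lines, so the result can differ from the line number in the original text. As a hedged variation, assuming the raw source string is available, the lines can be enumerated without filtering and matched with a regular expression. The helper name find_tag_lines is hypothetical, and the pattern is a simple heuristic, not a real parser.

```python
import re

html = """
<!DOCTYPE html>
<html>
<body>
<h1>My Heading</h1>
<p>My paragraph.</p>
</body>
</html>"""


def find_tag_lines(source, tag):
    """Return the 1-based numbers of the lines containing an opening <tag>."""
    # The lookahead requires whitespace, ">" or "/" after the tag name,
    # so searching for "h1" does not also match a longer name such as "h10".
    pattern = re.compile(rf"<{re.escape(tag)}(?=[\s>/])")
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


h1_lines = find_tag_lines(html, "h1")
print(h1_lines)
```

Because blank lines are counted here, the leading empty line of the string shifts every number by one compared with the filtered loop.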
