How can I get the text from this HTML snippet using lxml?

Question

Can anyone explain why this snippet fails on the assert?

from lxml import etree

s = '<div><h2><img />XYZZY</h2></div>'

root = etree.fromstring(s)

elements = root.xpath(".//*[contains(text(),'XYZZY')]")  # Finds 1 element, as expected

for el in elements:
    assert el.text is not None

And then... how can I get access to "XYZZY" and change it to "ZYX"?

Answer 1

Can anyone explain why this snippet fails on the assert?

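Because lxml does not store that text in the .text of the <h2> element. It stores it on the <img> child element of <h2>, as that child's .tail, so h2.text is None. You can use itertext() to get what you are looking for.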

from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)
elements = root.xpath(".//*[contains(text(),'XYZZY')]")
for el in elements:
    el_text = ''.join(el.itertext())
    assert el_text == 'XYZZY'
    print(el_text)

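Update: After looking into this further, it turns out that every element has three relevant attributes: .tag, .text, and .tail.

As for the .tail attribute, there is a small part in the tutorial that explains it: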

<html><body>Hello<br/>World</body></html>

Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.
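To make the quoted passage concrete, here is a small sketch that parses the tutorial's example and prints where each piece of text ends up. The variable names are only for illustration.

```python
from lxml import etree

# The tutorial's mixed-content example: text before and after <br/>.
root = etree.fromstring("<html><body>Hello<br/>World</body></html>")
body = root.find("body")
br = body.find("br")

print(repr(body.text))  # text before the first child of <body>
print(repr(br.text))    # <br/> has no content of its own
print(repr(br.tail))    # text that follows <br/>, up to the next element

# tostring() serializes the tail along with the element by default;
# with_tail=False leaves it out.
print(etree.tostring(br))
print(etree.tostring(br, with_tail=False))
```

Note that "World" belongs to <br/> rather than to <body>, which is the same situation as XYZZY and <img> in the question.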

How .tail gets populated is again explained here:

LXML appends trailing text, which is not wrapped inside its own tag, as the .tail attribute of the tag just prior.

So we can write the following code, which walks every element in the tree and shows where the text XYZZY actually lives:

from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)

context = etree.iterwalk(root, events=("start","end"))
for action, elem in context:
    print("%s: %s : [text=%s : tail=%s]" % (action, elem.tag, elem.text, elem.tail))

Output:

start: div : [text=None : tail=None]
start: h2 : [text=None : tail=None]
start: img : [text=None : tail=XYZZY]
end: img : [text=None : tail=XYZZY]
end: h2 : [text=None : tail=None]
end: div : [text=None : tail=None]

So it is in the .tail attribute of the <img> element.
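This also explains why the XPath in the question matched <h2> at all: in XPath's data model the XYZZY text node is a child of <h2>, so text() sees it there, while lxml's API hangs it on <img> as a tail. As a rough sketch, you can also ask XPath for the text node itself. lxml returns such results as "smart strings" that offer getparent() and the is_text / is_tail flags. The names hits, owner and slot below are just illustrative.

```python
from lxml import etree

root = etree.fromstring('<div><h2><img />XYZZY</h2></div>')

# text() results are "smart strings": they remember the element they came from.
hits = root.xpath(".//text()[contains(., 'XYZZY')]")
for node in hits:
    owner = node.getparent()                  # element that stores this string
    slot = "tail" if node.is_tail else "text"
    print("%r is stored in <%s>.%s" % (str(node), owner.tag, slot))
```

For this snippet it should report a single hit, stored as the tail of <img>, without walking the whole tree by hand.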

As for your second question:

And then... how can I get access to "XYZZY" and change it to "ZYX"?

One solution is to walk the element tree, check whether each element's text or tail equals the search string, and replace it:

#!/usr/bin/python3
from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)

search_string = "XYZZY"
replace_string = "ZYX"

context = etree.iterwalk(root, events=("start","end"))
for action, elem in context:
    if elem.text and elem.text.strip() == search_string:
        elem.text = replace_string
    elif elem.tail and elem.tail.strip() == search_string:
        elem.tail = replace_string

print(etree.tostring(root).decode("utf-8"))

Output:

<div><h2><img/>ZYX</h2></div>
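A possible variation, if you would rather let XPath find the strings than walk the tree: the sketch below uses the smart strings that lxml returns for text() to decide whether to write to .text or .tail. replace_text is a hypothetical helper name, not part of lxml, and unlike the loop above it replaces substrings instead of requiring the stripped text to equal the search string.

```python
from lxml import etree

def replace_text(root, old, new):
    """Hypothetical helper: replace `old` with `new` in every .text/.tail below root."""
    count = 0
    # xpath() returns a plain list, so modifying the tree inside the loop is safe.
    # $old is an XPath variable, filled in from the keyword argument.
    for node in root.xpath(".//text()[contains(., $old)]", old=old):
        owner = node.getparent()
        if node.is_tail:
            owner.tail = node.replace(old, new)
        else:
            owner.text = node.replace(old, new)
        count += 1
    return count

root = etree.fromstring('<div><h2><img />XYZZY</h2></div>')
replace_text(root, "XYZZY", "ZYX")
print(etree.tostring(root).decode("utf-8"))
```

Passing the search string as an XPath variable avoids having to escape quotes inside the expression. The return value is the number of text nodes that were changed.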

python

xml

lxml