python lxml.html: proper way to iterate through text with .tail in docstring order

Question

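I am trying to iterate through an HTML string and concatenate its text content with a join string that changes depending on the type of HTML tag encountered.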

Example HTML: html_str='<td>This is how we parse our string together</td>'

I wrote a helper function called smart_itertext() that walks an HTML element e via e.iter(). For each element yielded by e.iter(), it checks the tag and then appends the .text or .tail content.

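My challenge is getting the tail text to show up in the right place. When I iterate by tag, the element that carries the trailing text 'together' as its tail comes up early, and that seems to be my only chance to access it.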

Desired result:

>>> smart_itertext(lxml.html.fromstring(html_str))
'This is how::we::parse::our string::together'

Actual result:

>>> smart_itertext(lxml.html.fromstring(html_str))
'This is how:: together::::we::parse::::our string'

Here is my function:

def smart_itertext(tree, cross_joiner='::'):
    empty_join = ['strong', 'b', 'em', 'i', 'small', 'marked', 'deleted',
                  'ins', 'sub', 'sup']
    cross_join = ['td', 'tr', 'br', 'p']
    output = ''
    for element in tree.iter():
        if element.tag in empty_join:
            if element.text:
                output += element.text
            if element.tail:
                output += element.tail
        elif element.tag in cross_join:
            if element.text:
                output += cross_joiner + element.text
            else:
                output += cross_joiner
            if element.tail:
                output += cross_joiner + element.tail
        else:
            print('unknown tag in smart_itertext:', element.tag)
    return output

What is the proper way to accomplish this?

Answer 1

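The answer is to use XPath, which lets you build a list of the text content in the order it appears in the document. Each string in that list has the attributes is_tail and is_text and the method getparent().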

From the lxml tutorial:

Note that a string result returned by XPath is a special 'smart' object that knows about its origins. You can ask it where it came from through its getparent() method, just as you would with Elements:
>>> texts = build_text_list(html)
>>> print(texts[0])
TEXT
>>> parent = texts[0].getparent()
>>> print(parent.tag)
body

>>> print(texts[1])
TAIL
>>> print(texts[1].getparent().tag)
br
You can also find out if it's normal text content or tail text:
>>> print(texts[0].is_text)
True
>>> print(texts[1].is_text)
False
>>> print(texts[1].is_tail)
True
