XPath: extract the current node's content, including all child nodes

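I'm having trouble extracting the content of the current node, including all of its child nodes.

In the code below, I want to get the string abcdefg<b>b1b2b3</b> from inside the pre tag.

But I can't get it with "child::*", and if I use "/text()" I lose the b tag formatting. Please help.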

# -*- coding: utf-8 -*-
from lxml import html
import lxml.etree as le

input = "<pre>abcdefg<b>b1b2b3</b></pre>"
input_xpath = "//pre/child::*"
tree = html.fromstring(input)
result = tree.xpath(input_xpath)
# encoding="unicode" makes tostring() return str instead of bytes
result1 = [le.tostring(item, encoding="unicode") for item in result]
result2 = ''.join(result1)
print(result2)

output: <b>b1b2b3</b>

Try replacing your XPath with the following:

In [1]: input = "<pre>abcdefg<b>b1b2b3</b></pre>"

In [2]: input_xpath = "//pre//text()"

In [3]: tree = html.fromstring(input)

In [4]: result = tree.xpath(input_xpath)

In [5]: result
Out[5]: ['abcdefg', 'b1b2b3']

To get the content markup of an XML node (sometimes called "innerXML"), start by selecting the node itself (rather than its child nodes or text content):

from lxml import html
import lxml.etree as le

input = "<pre>abcdefg<b>b1b2b3</b></pre>"
tree = html.fromstring(input)
node = tree.xpath("//pre")[0]

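Then combine its text content with the markup of all its child nodes:

# node.text is None when the node has no leading text
result = (node.text or '') + ''.join(le.tostring(e, encoding="unicode") for e in node)
print(result)

Output: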

abcdefg<b>b1b2b3</b>