XPath: extract the current node's content, including all child nodes
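I'm having trouble extracting the content of the current node, including all of its child nodes.
In the code below, I want to get the string
abcdefg<b>b1b2b3</b>
from inside the pre tag.
But I can't get it with "child::*".
If I use "/text()", I lose the b tag formatting. Please help.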
# -*- coding: utf-8 -*-
from lxml import html
import lxml.etree as le

input = "<pre>abcdefg<b>b1b2b3</b></pre>"
# select the children of <pre>
input_xpath = "//pre/child::*"
tree = html.fromstring(input)
result = tree.xpath(input_xpath)
# encoding="unicode" makes tostring() return str instead of bytes (Python 3)
result1 = [le.tostring(item, encoding="unicode") for item in result]
result2 = ''.join(result1)
print(result2)

Output: <b>b1b2b3</b>
Try replacing your XPath with the following:
In [1]: input = "<pre>abcdefg<b>b1b2b3</b></pre>"
In [2]: input_xpath = "//pre//text()"
In [3]: tree = html.fromstring(input)
In [4]: result = tree.xpath(input_xpath)
In [5]: result
Out[5]: ['abcdefg', 'b1b2b3']
To get the content markup of an XML node (sometimes called "innerXML"), start by selecting the node itself (rather than its child nodes or text content):
from lxml import html
import lxml.etree as le
input = "<pre>abcdefg<b>b1b2b3</b></pre>"
tree = html.fromstring(input)
node = tree.xpath("//pre")[0]
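Then combine the text content with the markup of all child nodes:
# node.text is None when the node starts with a child element, hence "or ''"
result = (node.text or '') + ''.join(le.tostring(e, encoding="unicode") for e in node)
print(result)
Output: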
abcdefg<b>b1b2b3</b>