Get inner text from lxml
lxml.html.fromstring insists on wrapping everything in a tag (p by default). From this tag tree,
<p>this is <b>the</b> good stuff</p>
I want to extract the string:
this is <b>the</b> good stuff
How do I do that?
This is usually called "inner XML" rather than "inner text". Here is one possible way to get an element's inner XML:
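import lxml.etree as etree
import lxml.html
html = "<p>this is <b>the</b> good stuff</p>"
tree = lxml.html.fromstring(html)
node = tree.xpath("//p")[0]
# node.text holds the text before the first child; each child is serialized
# together with its tail text. encoding="unicode" returns str instead of bytes.
result = (node.text or "") + "".join(etree.tostring(e, encoding="unicode") for e in node)
print(result)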
Output:
this is <b>the</b> good stuff