ElementTree text mixed with tags
Imagine the following text:
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
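How would I manage to parse this using the etree interface? Given the description tag, the .text property returns only the first words: the thing. The .getchildren() method returns the <b> elements, but not the rest of the text.
Many thanks!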
Use .text_content(). A working example using lxml.html:
from lxml.html import fromstring
data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""
tree = fromstring(data)
print(tree.xpath("//description")[0].text_content().strip())
Prints:
the thing stuff is very important for various reasons, notably other things.
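For reference, a similar result can be had without lxml: the standard-library xml.etree.ElementTree has itertext(), which walks an element and yields every text fragment in document order. A minimal sketch, assuming the input is well-formed XML (the variable names are only for illustration):

```python
import xml.etree.ElementTree as ET

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""

# ET.fromstring returns the root element, here <description>
description = ET.fromstring(data)

# itertext() yields the element's own text, each child's text, and each child's tail
full_text = "".join(description.itertext()).strip()
print(full_text)
```

This only works on well-formed XML; for sloppy HTML input, lxml.html as shown above is the more forgiving choice.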
I forgot to specify one thing though, sorry. My ideal parsed version would contain a list of subsections: [normal("the thing"), bold("stuff"), normal("....")], is that possible with the lxml.html library?
Assuming the description contains only text nodes and b elements:
# text() results are str subclasses; everything else is a <b> element
for item in tree.xpath("//description/*|//description/text()"):
    print([item.strip(), 'normal'] if isinstance(item, str) else [item.text, 'bold'])
Prints:
['the thing', 'normal']
['stuff', 'bold']
['is very important for various reasons, notably', 'normal']
['other things', 'bold']
['.', 'normal']
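The reason .text alone is not enough is that ElementTree keeps the text following a child element on that child's .tail attribute, not on the parent. Below is a sketch of the same split using only xml.etree.ElementTree, under the same assumption that the description holds nothing but text and b children. The helper name split_segments is made up for this example:

```python
import xml.etree.ElementTree as ET


def split_segments(element):
    """Return [text, kind] pairs for an element holding text and <b> children."""
    segments = []
    # Text before the first child lives on the parent's .text
    if element.text and element.text.strip():
        segments.append([element.text.strip(), "normal"])
    for child in element:
        # The child's own content
        segments.append([child.text or "", "bold"])
        # Text after the child's closing tag lives on the child's .tail
        if child.tail and child.tail.strip():
            segments.append([child.tail.strip(), "normal"])
    return segments


description = ET.fromstring(
    "<description>the thing <b>stuff</b> is very important for various "
    "reasons, notably <b>other things</b>.</description>"
)
for segment in split_segments(description):
    print(segment)
```

Whitespace-only fragments are skipped here, which is a choice of this sketch rather than something ElementTree does on its own.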