lxml XPath expression for selecting all text under a given child node, including its children
Given that I have the following XML:
<node1>
<text title='book'>
<div chapter='0'>
<div id='theNode'>
<p xml:id="40">
A House that has:
<p xml:id="45">- a window;</p>
<p xml:id="46">- a door</p>
<p xml:id="46">- a door</p>
its a beuatiful house
</p>
</div>
</div>
</text>
</node1>
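I want to locate the text node by its title and get all the text from the first p tag that appears inside the text node titled "book".
So far I have: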
from lxml import etree
XML_tree = etree.fromstring(XML_content,parser=parser)
text = XML_tree.xpath('//text[@title="book"]/div/div/p/text()')
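This gives: "A house that has is a beautiful house"
But I also want all the text from every possible child and grandchild of the first <p> under <text title='book'>.
Basically: find <text title='book'>, then find the first <p>,
and give me all the text under that p tag, no matter how deeply it is nested.
Pseudocode: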
text = XML_tree.xpath('//text[@title="book"]/... any number of nodes.../p/ ....all text under p')
Thanks.
Try using string() or normalize-space():
...
from lxml import etree
XML_content = """
<node1>
<text title='book'>
<div chapter='0'>
<div id='theNode'>
<p xml:id="x40">
A House that has:
<p xml:id="x45">- a window;</p>
<p xml:id="x46">- a door</p>
<p xml:id="x47">- a door</p>
its a beuatiful house
</p>
</div>
</div>
</text>
</node1>
"""
XML_tree = etree.fromstring(XML_content)
text = XML_tree.xpath('string(//text[@title="book"]/div/div/p)')
# text = XML_tree.xpath('normalize-space(//text[@title="book"]/div/div/p)')
print(text)
Output using string():
...
A House that has:
- a window;
- a door
- a door
its a beuatiful house
Output using normalize-space():
...
A House that has: - a window; - a door - a door its a beuatiful house
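The pseudocode in the question asks for "any number of nodes" between the text node and the p tag, while the expression above hard-codes div/div. A minimal sketch of one way to relax that, assuming the same sample document: the descendant axis (//p) matches a p at any depth, and wrapping the path in (...)[1] keeps only the first match in document order. The variable names tree, first_p, raw and flat are illustrative only.

```python
from lxml import etree

XML_content = """
<node1>
<text title='book'>
<div chapter='0'>
<div id='theNode'>
<p xml:id="x40">
A House that has:
<p xml:id="x45">- a window;</p>
<p xml:id="x46">- a door</p>
<p xml:id="x47">- a door</p>
its a beuatiful house
</p>
</div>
</div>
</text>
</node1>
"""

tree = etree.fromstring(XML_content)

# First <p> in document order, at any depth below <text title="book">.
first_p = '(//text[@title="book"]//p)[1]'

# string() concatenates every descendant text node, whitespace included.
raw = tree.xpath('string(%s)' % first_p)

# normalize-space() does the same, then trims and collapses whitespace runs.
flat = tree.xpath('normalize-space(%s)' % first_p)

print(raw)
print(flat)
```

Note that string() applied to a node-set only uses its first node anyway, so the explicit [1] mainly documents the intent; the descendant axis is what removes the dependency on the exact nesting.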
Another option:
XML_tree = etree.fromstring(XML_content)
text = [el.strip() for el in XML_tree.xpath('//text()[ancestor::text[@title="book"]][normalize-space()]')]
print(" ".join(text))
print("\n".join(text))
Output:
A House that has: - a window; - a door - a door its a beuatiful house
A House that has:
- a window;
- a door
- a door
its a beuatiful house
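One caveat with the ancestor-based query: it collects every text node under the text element, not only those under the first p. In the sample document the two are the same, but they differ as soon as a second paragraph follows. A small sketch of the difference, using a hypothetical shortened document with an extra paragraph added for illustration, and lxml's Element.itertext() to walk the text of a single element:

```python
from lxml import etree

# Hypothetical variant of the sample: a second <p> follows the first one.
XML_content = """<node1><text title='book'><div><div>
<p>A House that has:
<p>- a window;</p>
<p>- a door</p>
its a beuatiful house
</p>
<p>another paragraph</p>
</div></div></text></node1>"""

tree = etree.fromstring(XML_content)

# Ancestor-based query: every non-blank text node under <text title="book">.
all_text = [
    t.strip()
    for t in tree.xpath('//text()[ancestor::text[@title="book"]][normalize-space()]')
]

# Restrict to the first <p>, then walk its text with itertext().
matches = tree.xpath('(//text[@title="book"]//p)[1]')
lines = []
if matches:
    # itertext() yields the element's text and the text/tail of all descendants.
    lines = [t.strip() for t in matches[0].itertext() if t.strip()]

print("\n".join(lines))
```

Here all_text ends with "another paragraph" while lines does not, so the second form is the one that matches "the first p tag only".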