lxml etree get all text before element
How can I get all the text before an element in an etree, separately from the text after that element?
from lxml import etree
tree = etree.fromstring('''
<a>
find
<b>
the
</b>
text
<dd></dd>
<c>
before
</c>
<dd></dd>
and after
</a>
''')
What do I want? In this example, the <dd>
tags are delimiters, and for each of them
for el in tree.findall('.//dd'):
I want all the text before and after it:
[
    {
        el : <Element dd at 0xsomedistinctaddress>,
        before : 'find the text',
        after : 'before and after'
    },
    {
        el : <Element dd at 0xsomeotherdistinctaddress>,
        before : 'find the text before',
        after : 'and after'
    }
]
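My idea was to replace the <dd>
tags in the tree with some kind of placeholder and then cut the string at that placeholder, but I need to keep the correspondence with the actual element.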
There might be a simpler way, but I would use the following XPath expressions:
preceding-sibling::*/text()|preceding-sibling::text()
following-sibling::*/text()|following-sibling::text()
Sample implementation (it definitely violates the DRY principle):
def get_text_before(element):
    # Text inside preceding sibling elements, plus bare text nodes between them
    for item in element.xpath("preceding-sibling::*/text()|preceding-sibling::text()"):
        item = item.strip()
        if item:  # skip whitespace-only nodes
            yield item

def get_text_after(element):
    # Same idea, looking at the siblings that follow the element
    for item in element.xpath("following-sibling::*/text()|following-sibling::text()"):
        item = item.strip()
        if item:
            yield item

for el in tree.findall('.//dd'):
    before = " ".join(get_text_before(el))
    after = " ".join(get_text_after(el))
    print({
        "el": el,
        "before": before,
        "after": after
    })
This prints:
{'el': <Element dd at 0x10af81488>, 'before': 'find the text', 'after': 'before and after'}
{'el': <Element dd at 0x10af81200>, 'before': 'find the text before', 'after': 'and after'}
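Since the two generators differ only in the XPath axis, one way to address the DRY concern is to pass the axis in as a parameter. The following is a minimal sketch rather than a definitive implementation; the names sibling_text and split_around are made up for illustration and are not part of lxml.

```python
from lxml import etree

XML = """
<a>
find
<b>
the
</b>
text
<dd></dd>
<c>
before
</c>
<dd></dd>
and after
</a>
"""


def sibling_text(element, axis):
    """Join the non-blank text found along a sibling axis of `element`.

    `axis` is expected to be "preceding-sibling" or "following-sibling".
    (Hypothetical helper, not an lxml function.)
    """
    # Text directly inside sibling elements, plus bare text nodes between them.
    query = "{0}::*/text()|{0}::text()".format(axis)
    parts = (text.strip() for text in element.xpath(query))
    # Drop whitespace-only nodes before joining.
    return " ".join(part for part in parts if part)


def split_around(tree, tag):
    """Return one dict per `tag` element with the text before and after it."""
    return [
        {
            "el": el,
            "before": sibling_text(el, "preceding-sibling"),
            "after": sibling_text(el, "following-sibling"),
        }
        for el in tree.findall(".//" + tag)
    ]


tree = etree.fromstring(XML)
results = split_around(tree, "dd")
for row in results:
    print(row["before"], "|", row["after"])
```

Like the two-function version, this only looks at siblings of the delimiter and one level of text inside them. If the delimiters can be nested deeper, or the text inside siblings sits in grandchildren, the preceding:: and following:: axes (which cover everything before or after the node in document order, excluding ancestors and descendants) may be a better fit.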