XML parsing in Python: how to get the string indexes of child nodes with regard to the flattened string

Question

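I'm new to XML parsing in Python, and I need to get some data about the inner text of certain phrase nodes and their children (preferably using Minidom, but that's not a requirement).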

Example:

<phrase id="x.y">This example
    <foo id="x.y.z">
        <bar type="likelihood" ref="x.y.z">might</bar> 
    be useful</foo>.
</phrase>

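What I want to get is the following data:

The whole text as a single string, combining the parent node and its children (like the recursive getText method in the Minidom documentation)
A list of triples with data about the children:
- tag name
- start index with respect to the whole string
- end index with respect to the whole string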

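In the XML example, the inner text of <bar> (might) starts at index 14 and ends at index 18, while the contents of <foo> (be useful) start at index 19 and end at index 28. Running this example should return something like the following (the order of the children doesn't matter):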

('This example might be useful.', [('bar', 14, 18), ('foo', 19, 28)])

Answer 1

This is an interesting project! It's a bit convoluted, and I'm not sure how far it will go in other cases, but try something like this:

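from lxml import etree

phrase = """<phrase id="x.y">This example
    <foo id="x.y.z">
        <bar type="likelihood" ref="x.y.z">might</bar>
    be useful</foo>.
</phrase>"""
doc = etree.fromstring(phrase)

# This requires a couple of helper functions to clean up spaces, find indexes, etc.:

def space_rem(text):
    # Collapse runs of spaces into a single space
    while '  ' in text:
        text = text.replace('  ', ' ')
    return text

def build(tag):
    # Join the stripped text nodes that sit directly under <tag>,
    # then locate that text in the flattened string ttxt
    parts = doc.xpath(f'//{tag}/text()')
    inner = ''
    for s in parts:
        inner += s.strip()
    inner = space_rem(inner)  # keep the cleaned-up result
    start = ttxt.find(inner)
    return start, start + len(inner)

foo_lst = ['foo']
bar_lst = ['bar']
ttxt = ''

# Build the flattened string from every text node, in document order
for t in doc.xpath('//*/text()'):
    ttxt += t.replace('\n', '')
ttxt = space_rem(ttxt)

foo_lst.extend(build('foo'))
bar_lst.extend(build('bar'))

ttxt, foo_lst, bar_lst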

Output:

('This example might be useful.', ['foo', 19, 28], ['bar', 13, 18])
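
Since the question mentions Minidom, here is a minimal sketch of the same idea using only the standard library's xml.dom.minidom. It is not the code above, just one possible variant: it walks the tree once, collapses whitespace character by character, and records offsets while the flattened string is being built, so it does not rely on searching for the text afterwards. The function name flatten_with_spans is made up for illustration. It assumes that offsets are 0-based with an exclusive end (Python slice bounds), under which "might" spans 13 to 18, and that an element's span covers only its own direct text, from its first to its last non-space character.

```python
from xml.dom import minidom

XML = """<phrase id="x.y">This example
    <foo id="x.y.z">
        <bar type="likelihood" ref="x.y.z">might</bar>
    be useful</foo>.
</phrase>"""


def flatten_with_spans(xml_text):
    """Return (flat_text, [(tag, start, end), ...]) for the root's descendants.

    Hypothetical helper: offsets are 0-based, end is exclusive, and each
    span covers only the element's direct text (assumptions, see above).
    """
    root = minidom.parseString(xml_text).documentElement
    out = []              # characters of the flattened string
    spans = []            # one [tag, start, end] entry per descendant element
    pending_space = False # True when whitespace was seen but not yet emitted

    def walk(element, span):
        nonlocal pending_space
        for child in element.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                child_span = [child.tagName, None, None]
                spans.append(child_span)
                walk(child, child_span)
            elif child.nodeType == child.TEXT_NODE:
                for ch in child.data:
                    if ch.isspace():
                        pending_space = True
                        continue
                    # Emit a single space for any run of whitespace,
                    # but never at the very start of the string
                    if pending_space and out:
                        out.append(" ")
                    pending_space = False
                    if span is not None:
                        if span[1] is None:
                            span[1] = len(out)   # first character of this element
                        span[2] = len(out) + 1   # one past its latest character
                    out.append(ch)

    walk(root, None)  # the root itself gets no span
    found = [tuple(s) for s in spans if s[1] is not None]
    return "".join(out), found


text, spans = flatten_with_spans(XML)
print((text, spans))
```

Because the spans are slice bounds, text[start:end] gives back each element's text, for example text[13:18] for bar. Elements with no non-space text of their own are left out of the list, and if an element has direct text on both sides of a child, its span will also cover that child.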

python

minidom

xml-parsing

python-3.x