Python xml parsing - Remove br tags

Question

I am trying to parse an XML file containing articles published in a journal. See the sample below:

<?xml version="1.0" encoding="UTF-8"?>
<docs>
<doc>
<articolo>
<data>Mercoledí 24 Febbraio 2021</data>
<testo>
<p>Row1</p>
<p>Row2</p>
<p>Row3 <br/>Row4<br/>Row5<br/>Row6</p>
<p>Row7</p>
<p>Row8</p>
<p>Row9<br/>Row10 <br/>Row11 <br/>Row12 <br/>Row13</p>
</testo>
</articolo>
</doc></docs>

My code is as follows:

import os
import xml.etree.ElementTree as ET

file = os.path.join(directory, "myfile.xml")
tree = ET.parse(file)
root = tree.getroot()

for doc in root.findall('doc'):
    for articolo in doc.findall('articolo'):
        data = articolo.find('data').text
        testo = ""
        for x in articolo.find('testo').findall('p'):
            if x.text is not None:
                testo = testo + x.text + "\n"

        print(testo)

I expect the following result:

Row1
Row2
Row3 Row4 Row5 Row6
Row7
Row8
Row9 Row10 Row11 Row12 Row13

But I get:

Row1
Row2
Row3 
Row7
Row8
Row9

The main problem is that part of each sentence (everything after a <br/> tag) is completely missing. Is there a way to remove the <br/> tags?

Thanks, Francesca

Answer 1

You can do something simpler:

data = [list(row.itertext()) for row in root.findall('.//testo/p')]
for datum in data:
    print([dat.strip() for dat in datum])

Output:

['Row1']
['Row2']
['Row3', 'Row4', 'Row5', 'Row6']
['Row7']
['Row8']
['Row9', 'Row10', 'Row11', 'Row12', 'Row13']
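As background on why itertext() recovers the missing rows: in ElementTree, an element's .text holds only the text before its first child, and any text that follows a child element is stored in that child's .tail. The sketch below illustrates this on a single paragraph taken from the question's sample; the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# One paragraph from the question's sample, parsed on its own.
p = ET.fromstring("<p>Row3 <br/>Row4<br/>Row5<br/>Row6</p>")

# .text stops at the first child element, which is why x.text lost the rest.
print(repr(p.text))  # 'Row3 '

# The text after each <br/> lives in that <br/> element's .tail.
tails = [br.tail for br in p.findall("br")]
print(tails)  # ['Row4', 'Row5', 'Row6']

# itertext() yields text and tails in document order.
parts = list(p.itertext())
print(parts)  # ['Row3 ', 'Row4', 'Row5', 'Row6']
```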

Answer 2

Using itertext() (mentioned by Jack Fleeting in the other answer), your code would look like this:

        for x in articolo.find('testo').findall('p'):
            testo = ""
            for child in x.itertext():
                testo += child.strip() + " "
            #testo += "\n"
            print(testo)

If you want everything in a single string:

        testo = ""
        for x in articolo.find('testo').findall('p'):
            for child in x.itertext():
                testo += child.strip() + " "
            testo += "\n"
        print(testo)
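Note that appending " " after every fragment leaves a trailing space on each line. If that matters, one possible variant (a sketch, not the only way to do it) is to collect the stripped fragments and join them. The shortened xml_text sample and the names testo_el, fragments and lines are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the question's <testo> element (illustrative sample).
xml_text = """<testo>
<p>Row1</p>
<p>Row3 <br/>Row4<br/>Row5<br/>Row6</p>
<p>Row9<br/>Row10 <br/>Row11 <br/>Row12 <br/>Row13</p>
</testo>"""

testo_el = ET.fromstring(xml_text)

lines = []
for p in testo_el.findall("p"):
    # Strip each text fragment, drop empty ones, and join with single spaces.
    fragments = [t.strip() for t in p.itertext() if t.strip()]
    lines.append(" ".join(fragments))

# One paragraph per line, with no trailing spaces.
testo = "\n".join(lines)
print(testo)
```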

Full working example:

text = '''<?xml version="1.0" encoding="UTF-8"?>
<docs>
<doc>
<articolo>
<data>Mercoledí 24 Febbraio 2021</data>
<testo>
<p>Row1</p>
<p>Row2</p>
<p>Row3 <br/>Row4<br/>Row5<br/>Row6</p>
<p>Row7</p>
<p>Row8</p>
<p>Row9<br/>Row10 <br/>Row11 <br/>Row12 <br/>Row13</p>
</testo>
</articolo>
</doc></docs>'''

import xml.etree.ElementTree as ET

#file = os.path.join(directory, "20210224_SOLE_2.xml")
#tree = ET.parse(file)
#root = tree.getroot()

root = ET.fromstring(text)

for doc in root.findall('doc'):

    for articolo in doc.findall('articolo'):
        data = articolo.find('data').text
        for x in articolo.find('testo').findall('p'):
            testo = ""
            for child in x.itertext():
                testo += child.strip() + " "
            #testo += "\n"
            print(testo)
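If the goal is to literally remove the <br/> elements from the tree rather than just read around them, the text held in each element's .tail has to be moved somewhere before the element is removed, otherwise it is lost along with the element. The following is a minimal sketch of that idea; remove_br is a hypothetical helper written for this example, not part of ElementTree, and it only handles <br/> elements that are direct children of the given element.

```python
import xml.etree.ElementTree as ET

def remove_br(parent):
    """Remove direct <br/> children of parent, keeping the text that followed them."""
    previous = None  # last non-<br/> child seen so far
    for child in list(parent):  # iterate over a copy, since parent is mutated
        if child.tag != "br":
            previous = child
            continue
        kept = " " + (child.tail or "")  # a space stands in for the line break
        if previous is None:
            # No earlier sibling survives: the text belongs to the parent itself.
            parent.text = (parent.text or "") + kept
        else:
            # Otherwise it becomes part of the previous sibling's tail.
            previous.tail = (previous.tail or "") + kept
        parent.remove(child)

p = ET.fromstring("<p>Row9<br/>Row10 <br/>Row11 <br/>Row12 <br/>Row13</p>")
remove_br(p)

# Collapse the repeated whitespace left over from the original markup.
cleaned = " ".join(p.text.split())
print(cleaned)
```

For simply printing the text, itertext() as shown above is less work; a helper like this is mainly useful when the modified tree needs to be written back out.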

python

xml-parsing