Python xml parsing - Remove br tags
I am trying to parse XML files that contain articles published in a journal. See the example below:
<?xml version="1.0" encoding="UTF-8"?>
<docs>
<doc>
<articolo>
<data>Mercoledí 24 Febbraio 2021</data>
<testo>
<p>Row1</p>
<p>Row2</p>
<p>Row3 <br/>Row4<br/>Row5<br/>Row6</p>
<p>Row7</p>
<p>Row8</p>
<p>Row9<br/>Row10 <br/>Row11 <br/>Row12 <br/>Row13</p>
</testo>
</articolo>
</doc></docs>
My code is as follows:
import os
import xml.etree.ElementTree as ET

file = os.path.join(directory, "myfile.xml")
tree = ET.parse(file)
root = tree.getroot()
for doc in root.findall('doc'):
    for articolo in doc.findall('articolo'):
        data = articolo.find('data').text
        testo = ""
        for x in articolo.find('testo').findall('p'):
            if x.text is not None:
                testo = testo + x.text + "\n"
        print(testo)
I would like to get the following result:
Row1
Row2 Row3 Row4 Row5 Row6
Row7
Row8
Row9 Row10 Row11 Row12 Row13
But I get:
Row1
Row2
Row3
Row7
Row8
Row9
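The main problem is that part of each sentence is missing: everything after a <br/> tag is dropped completely. Is there a way to remove the <br/> tags?
Thanks,
Francesca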
You can do something simpler:
data = [list(row.itertext()) for row in root.findall('.//testo/p')]
for datum in data:
    print([dat.strip() for dat in datum])
Output:
['Row1']
['Row2']
['Row3', 'Row4', 'Row5', 'Row6']
['Row7']
['Row8']
['Row9', 'Row10', 'Row11', 'Row12', 'Row13']
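For context on why the original loop loses text: in ElementTree, an element's .text holds only the text before its first child, and the text that follows a child element is stored on that child's .tail. The sketch below illustrates this on a single paragraph taken from the sample file; the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Row3 <br/>Row4<br/>Row5<br/>Row6</p>")

# .text is only the text that precedes the first child element.
first_piece = p.text

# The text following each <br/> lives on that element's .tail.
tails = [br.tail for br in p.findall("br")]

# itertext() yields the text and tail values in document order.
pieces = list(p.itertext())

print(repr(first_piece))
print(tails)
print(pieces)
```

This is why reading x.text alone stops at the first <br/>, while itertext() recovers every piece.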
Using itertext() (as Jack Fleeting mentioned in the other answer), your code would look like this:
for x in articolo.find('testo').findall('p'):
    testo = ""
    for child in x.itertext():
        testo += child.strip() + " "
    #testo += "\n"
    print(testo)
If you want everything in a single string:
testo = ""
for x in articolo.find('testo').findall('p'):
    for child in x.itertext():
        testo += child.strip() + " "
    testo += "\n"
print(testo)
Complete working example:
text = '''<?xml version="1.0" encoding="UTF-8"?>
<docs>
<doc>
<articolo>
<data>Mercoledí 24 Febbraio 2021</data>
<testo>
<p>Row1</p>
<p>Row2</p>
<p>Row3 <br/>Row4<br/>Row5<br/>Row6</p>
<p>Row7</p>
<p>Row8</p>
<p>Row9<br/>Row10 <br/>Row11 <br/>Row12 <br/>Row13</p>
</testo>
</articolo>
</doc></docs>'''
import xml.etree.ElementTree as ET

#file = os.path.join(directory, "20210224_SOLE_2.xml")
#tree = ET.parse(file)
#root = tree.getroot()
root = ET.fromstring(text)

for doc in root.findall('doc'):
    for articolo in doc.findall('articolo'):
        data = articolo.find('data').text
        for x in articolo.find('testo').findall('p'):
            testo = ""
            for child in x.itertext():
                testo += child.strip() + " "
            #testo += "\n"
            print(testo)
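The loops above leave a trailing space on every line. As a possible variation, the sketch below joins the itertext() pieces and normalizes whitespace in one step. The helper name paragraph_text and the shortened sample are hypothetical, and the approach assumes every child element (here only <br/>) should act as a word separator.

```python
import xml.etree.ElementTree as ET

sample = """<testo>
<p>Row1</p>
<p>Row3 <br/>Row4<br/>Row5<br/>Row6</p>
<p>Row9<br/>Row10 <br/>Row11 <br/>Row12 <br/>Row13</p>
</testo>"""


def paragraph_text(p):
    """Return the text of p on one line, with each <br/> replaced by a space."""
    # Join the pieces with a space, then split() and re-join to collapse
    # repeated whitespace and drop leading/trailing spaces.
    return " ".join(" ".join(p.itertext()).split())


testo = ET.fromstring(sample)
lines = [paragraph_text(p) for p in testo.findall("p")]
print("\n".join(lines))
```

If the paragraphs could contain inline markup in the middle of a word, joining with a space would split that word, so this only fits input shaped like the sample.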