Parsing XML child tags with the same name using ElementTree
I am trying to parse an XML file with the following structure (in Python, which is new to me):
<xml>
<document>
<fit>
<grp> some tags </grp>
<prp>
<p> <id> 1674 </id> </p>
<drp>
<name> Joe </name>
<post>
<company> abc </company>
<company> Ltd. </company>
</post>
</drp>
</prp>
</fit>
</document>
<document>
.
.
.
</xml>
To extract the id, name, and company values and write them to a CSV file, I tried the following code:
import csv
import xml.etree.ElementTree as ET

tree = ET.parse(file)
root = tree.getroot()
with open(csvfile, 'a') as f:
    writer = csv.DictWriter(f, ['ID', 'NAME', 'NCOMP'], delimiter=',')
    writer.writeheader()
    result = {}
    for child in root.findall('./fit'):
        result['ID'] = "".join(child.find('p').find('id').text)
        result['NAME'] = "".join(child.find('drp').find('name').text)
        result['NCOMP'] = "".join(child.find('drp').find('post').find('company').text)
        writer.writerow(result)
However, for the company name I only get the content of the first tag. I then tried a for loop that appends to a list, like this:
Com = []
for each in child.find('drp').find('post'):
    coms = each.find('company')
    Com = Com.append[coms]
result['NCOMP'] = Com
Expected output:
ID. NAME. NCOMP
1674. Joe. abc Ltd.
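How can I change the code so that it includes the values of both tags?
Try something along these lines; it uses lxml to collect the data via XPath and pandas to store it in a DataFrame: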
data = """
<xml>
<document>
<fit>
<grp>some tags</grp>
<prp>
<p>
<id>1674</id>
</p>
<drp>
<name>Joe</name>
<post>
<company>abc</company>
<company>Ltd.</company>
</post>
</drp>
</prp>
</fit>
</document>
</xml>
"""
from lxml import etree
import pandas as pd
columns = ["ID", "NAME", "NCOMP"]
rows = []
doc = etree.XML(data)
targets = doc.xpath('//prp')
for target in targets:
    row = []
    id = target.xpath('./p/id/text()')[0]
    name = target.xpath('./drp/name/text()')[0]
    ncomp = target.xpath('./drp//post//company/text()')
    row.extend([id, name, ' '.join(ncomp)])
    rows.append(row)
pd.DataFrame(rows, columns=columns)
Output:
ID NAME NCOMP
0 1674 Joe abc Ltd.
Edit: ET version.
First:
import xml.etree.ElementTree as ET
Then replace everything from the doc assignment onward with:
doc = ET.fromstring(data)
et_targets = doc.findall('.//prp')
for target in et_targets:
    row = []
    id = target.findall('./p/id')[0]
    name = target.findall('./drp/name')[0]
    # keep every <company> element, not just the first one
    ncomp = target.findall('./drp//post//company')
    row.extend([id.text, name.text, ' '.join(c.text for c in ncomp)])
    rows.append(row)
pd.DataFrame(rows, columns=columns)
The output should be the same.
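The reason the original attempt returned only the first company is that find() returns just the first matching element, whereas findall() returns every match in document order. A minimal standard-library sketch of the difference, assuming the XML structure from the question (the sample string and variable names here are illustrative only):

```python
import xml.etree.ElementTree as ET

# Sample mirroring the question's structure, including the
# whitespace around the text values.
sample = """
<xml>
  <document>
    <fit>
      <prp>
        <p> <id> 1674 </id> </p>
        <drp>
          <name> Joe </name>
          <post>
            <company> abc </company>
            <company> Ltd. </company>
          </post>
        </drp>
      </prp>
    </fit>
  </document>
</xml>
"""

root = ET.fromstring(sample)
post = root.find('.//drp/post')

# find() stops at the first matching child element.
first_only = post.find('company').text.strip()

# findall() returns all matching children, in document order.
all_companies = [c.text.strip() for c in post.findall('company')]
ncomp = ' '.join(all_companies)

print(first_only)  # abc
print(ncomp)       # abc Ltd.
```

The strip() calls are only there because the question's sample has spaces inside the tags; with tightly packed text they would be harmless no-ops.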
Another solution.
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''
<xml>
<document>
<fit>
<grp> some tags </grp>
<prp>
<p> <id> 1674 </id> </p>
<drp>
<name> Joe </name>
<post>
<company> abc </company>
<company> Ltd. </company>
</post>
</drp>
</prp>
</fit>
</document>
<document>
.
.
.
</xml>
'''
doc = SimplifiedDoc(html)
rows = []
rows.append(['ID', 'NAME', 'NCOMP'])
for document in doc.documents:
    rows.append([document.id.text, document.name.text, " ".join(document.companys.text)])
utils.save2csv('test.csv', rows)
Result:
ID,NAME,NCOMP
1674,Joe,abc Ltd.
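If a third-party package is not wanted, the CSV from the question can probably also be produced with only xml.etree.ElementTree and csv. The following is a sketch under that assumption: text_of, rows_from_xml, and write_csv are hypothetical helper names (not part of any library), and the output goes to an in-memory buffer instead of a real file so the example stays self-contained.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ['ID', 'NAME', 'NCOMP']

def text_of(parent, path):
    """Return the stripped text of the first match, or '' if absent."""
    node = parent.find(path)
    return (node.text or '').strip() if node is not None else ''

def rows_from_xml(xml_text):
    """Yield one dict per <prp> element found anywhere in the tree."""
    root = ET.fromstring(xml_text)
    for prp in root.iter('prp'):
        # findall() collects every <company>, not only the first one.
        companies = [(c.text or '').strip()
                     for c in prp.findall('./drp/post/company')]
        yield {
            'ID': text_of(prp, './p/id'),
            'NAME': text_of(prp, './drp/name'),
            'NCOMP': ' '.join(companies),
        }

def write_csv(xml_text, out):
    """Write a header row plus one row per record to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows_from_xml(xml_text):
        writer.writerow(row)

sample = """
<xml>
  <document>
    <fit>
      <grp> some tags </grp>
      <prp>
        <p> <id> 1674 </id> </p>
        <drp>
          <name> Joe </name>
          <post>
            <company> abc </company>
            <company> Ltd. </company>
          </post>
        </drp>
      </prp>
    </fit>
  </document>
</xml>
"""

buffer = io.StringIO()
write_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, 'w', newline='') in place of the buffer; the newline='' argument is what the csv module documentation recommends to avoid blank lines on some platforms.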