Parsing nested and complex XML in Python
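I am trying to parse a fairly complex XML file and store its contents in a DataFrame. I tried xml.etree.ElementTree and managed to retrieve some of the elements, but somehow I retrieve them multiple times, as if there were more objects than there are. I am trying to extract the following: category, created, last_updated, accession type, name type identifier, and name type synonym as a list.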
<cellosaurus>
  <cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">
    <accession-list>
      <accession type="primary">CVCL_B375</accession>
    </accession-list>
    <name-list>
      <name type="identifier">#490</name>
      <name type="synonym">490</name>
      <name type="synonym">Mab 7</name>
      <name type="synonym">Mab7</name>
    </name-list>
    <comment-list>
      <comment category="Monoclonal antibody target"> Cronartium ribicola antigens </comment>
      <comment category="Monoclonal antibody isotype"> IgM, kappa </comment>
    </comment-list>
    <species-list>
      <cv-term terminology="NCBI-Taxonomy" accession="10090">Mus musculus</cv-term>
    </species-list>
    <derived-from>
      <cv-term terminology="Cellosaurus" accession="CVCL_4032">P3X63Ag8.653</cv-term>
    </derived-from>
    <reference-list>
      <reference resource-internal-ref="Patent=US5616470"/>
    </reference-list>
    <xref-list>
      <xref database="CLO" category="Ontologies" accession="CLO_0001018">
        <url><![CDATA[https://www.ebi.ac.uk/ols/ontologies/clo/terms?iri=http://purl.obolibrary.org/obo/CLO_0001018]]></url>
      </xref>
      <xref database="ATCC" category="Cell line collections" accession="HB-12029">
        <url><![CDATA[https://www.atcc.org/Products/All/HB-12029.aspx]]></url>
      </xref>
      <xref database="Wikidata" category="Other" accession="Q54422073">
        <url><![CDATA[https://www.wikidata.org/wiki/Q54422073]]></url>
      </xref>
    </xref-list>
  </cell-line>
</cellosaurus>
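Your question is a little unclear, because in some cases you want to parse tag attributes and in others you want the tag text.
My understanding is as follows. You need these values:
- The value of the category attribute of the cell-line tag.
- The value of the created attribute of the cell-line tag.
- The value of the last_updated attribute of the cell-line tag.
- The value of the type attribute of the accession tag.
- The text of the name tag whose type attribute is identifier.
- The text of each name tag whose type attribute is synonym.
These values can be extracted from the XML file with the xml.etree.ElementTree module. In particular, take a look at the findall and iter methods of the Element class.
Assuming the XML is in a file named cellosaurus.xml, the snippet below should do the trick.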
import xml.etree.ElementTree as et


def main():
    tree = et.parse('cellosaurus.xml')
    root = tree.getroot()
    results = []
    for element in root.findall('.//cell-line'):
        key_values = {}
        # These three attributes live on the <cell-line> tag itself.
        for key in ['category', 'created', 'last_updated']:
            key_values[key] = element.attrib.get(key)
        synonyms = []
        for child in element.iter():
            if child.tag == 'accession':
                key_values['accession type'] = child.attrib.get('type')
            elif child.tag == 'name' and child.attrib.get('type') == 'identifier':
                key_values['name type identifier'] = child.text
            elif child.tag == 'name' and child.attrib.get('type') == 'synonym':
                # Collect every synonym instead of overwriting the previous one.
                synonyms.append(child.text)
        key_values['name type synonym'] = synonyms
        results.append([
            # Using the get method of the dict object in case any particular
            # entry does not have all the required attributes.
            key_values.get('category'),
            key_values.get('created'),
            key_values.get('last_updated'),
            key_values.get('accession type'),
            key_values.get('name type identifier'),
            key_values.get('name type synonym'),
        ])
    print(results)


if __name__ == '__main__':
    main()
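To make the difference between findall and iter concrete, here is a minimal sketch on a trimmed copy of the XML. The SAMPLE string and the variable names are made up for illustration. It also shows one possible reason for seeing values "multiple times": the category attribute appears on xref (and comment) tags as well as on cell-line, so reading it from every element returns more than one hit. That is only a guess at what happened in the original attempt.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative copy of the XML from the question.
SAMPLE = """<cellosaurus>
  <cell-line category="Hybridoma">
    <name-list>
      <name type="identifier">#490</name>
      <name type="synonym">490</name>
    </name-list>
    <xref-list>
      <xref database="ATCC" category="Cell line collections"/>
    </xref-list>
  </cell-line>
</cellosaurus>"""

root = ET.fromstring(SAMPLE)
cell_line = root.find('cell-line')

# A bare tag name in findall only matches direct children,
# and <name> is nested one level down inside <name-list>.
direct_names = cell_line.findall('name')

# The './/' prefix searches all descendants instead.
nested_names = cell_line.findall('.//name')

# iter() yields the element itself and every descendant in document order.
all_tags = [el.tag for el in cell_line.iter()]

# iter(tag) restricts the walk to elements with that tag.
name_texts = [el.text for el in cell_line.iter('name')]

# Reading 'category' from every element picks up <xref> as well as <cell-line>.
categories = [el.get('category') for el in root.iter() if 'category' in el.attrib]

print(direct_names)
print(len(nested_names))
print(all_tags)
print(name_texts)
print(categories)
```

Restricting attribute lookups to the cell-line element itself, as the snippet above does, avoids picking up the category values of nested tags.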
In my humble opinion, the easiest way to parse XML is to use lxml.
from lxml import etree

data = """[your xml above]"""
doc = etree.XML(data)
# Each match is one <cell-line> element; the nested XPath queries are relative to it.
for cell_line in doc.xpath('//cell-line'):
    print(cell_line.attrib['category'])
    print(cell_line.attrib['last_updated'])
    print(cell_line.xpath('.//accession/@type')[0])
    print(cell_line.xpath('.//name[@type="identifier"]/text()')[0])
    print(cell_line.xpath('.//name[@type="synonym"]/text()'))
Output:
Hybridoma
2020-03-12
primary
#490
['490', 'Mab 7', 'Mab7']
You can then assign the output to variables, append it to a list, and so on.
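As a rough sketch of that last step, the values can be collected into one dict per cell line, which is a shape a DataFrame accepts directly. Note that this sketch swaps lxml for the standard-library ElementTree, whose limited XPath subset supports the [@type='...'] predicate but not text() or @attr selection, so .text and .get() are used instead. The SAMPLE string and the cell_line_records helper are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative copy of the XML from the question.
SAMPLE = """<cellosaurus>
  <cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12">
    <accession-list>
      <accession type="primary">CVCL_B375</accession>
    </accession-list>
    <name-list>
      <name type="identifier">#490</name>
      <name type="synonym">490</name>
      <name type="synonym">Mab 7</name>
      <name type="synonym">Mab7</name>
    </name-list>
  </cell-line>
</cellosaurus>"""


def cell_line_records(xml_text):
    """Return one dict per <cell-line> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    records = []
    for cell in root.iter('cell-line'):
        accession = cell.find('.//accession')
        identifier = cell.find(".//name[@type='identifier']")
        records.append({
            'category': cell.get('category'),
            'created': cell.get('created'),
            'last_updated': cell.get('last_updated'),
            # Compare with None explicitly: an Element without children is falsy.
            'accession type': accession.get('type') if accession is not None else None,
            'name type identifier': identifier.text if identifier is not None else None,
            'name type synonym': [n.text for n in cell.findall(".//name[@type='synonym']")],
        })
    return records


records = cell_line_records(SAMPLE)
print(records)

# Assuming pandas is installed, the records become one row per cell line:
#   import pandas as pd
#   df = pd.DataFrame(records)
```

Keeping the synonyms as a list in a single column matches what the question asks for; if one row per synonym is preferred, the list would need to be expanded afterwards.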
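Here is another approach. I recently compared several XML parsing libraries and found simplified_scrapy easy to use, so I recommend it.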
from simplified_scrapy import SimplifiedDoc, utils

xml = '''your xml above'''
# xml = utils.getFileContent('your file name.xml')
results = []
doc = SimplifiedDoc(xml)
for ele in doc.selects('cell-line'):
    key_values = {}
    # Copy every attribute of <cell-line>, skipping the library's own keys.
    for k in ele:
        if k not in ['tag', 'html']:
            key_values[k] = ele[k]
    key_values['name type identifier'] = ele.select('name@type="identifier">text()')
    key_values['name type synonym'] = ele.selects('name@type="synonym">text()')
    results.append(key_values)
print(results)
Result:
[{'category': 'Hybridoma', 'created': '2012-06-06', 'last_updated': '2020-03-12', 'entry_version': '6', 'name type identifier': '#490', 'name type synonym': ['490', 'Mab 7', 'Mab7']}]