Parsing nested XML structure using minidom in Python
I am a beginner with XML in Python, and I cannot extract the data from the following XML file:
<?xml version="1.0" encoding="UTF-8"?>
<martif xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en">
  <cat>
    <desc type="No">1</desc>
    <desc type="Main">DES1.1</desc>
    <desc type="Sub">DES1.2</desc>
    <lang xml:lang="EN">
      <t>
        <term>T1.1</term>
        <Typ type="TermType">main</Typ>
      </t>
      <t>
        <term>T1.2</term>
        <Typ type="TermType">option</Typ>
      </t>
    </lang>
    <lang xml:lang="FR">
      <t>
        <term>T1.3</term>
        <Typ type="TermType">main</Typ>
      </t>
      <t>
        <term>T1.4</term>
        <Typ type="TermType">option</Typ>
      </t>
    </lang>
  </cat>
  <cat>
    <desc type="No">2</desc>
    <desc type="Main">DES2.1</desc>
    <desc type="Sub">DES2.2</desc>
    <lang xml:lang="EN">
      <t>
        <term>T2.1</term>
        <Typ type="TermType">main</Typ>
      </t>
      <t>
        <term>T2.2</term>
        <Typ type="TermType">option</Typ>
      </t>
    </lang>
    <lang xml:lang="FR">
      <t>
        <term>T2.3</term>
        <Typ type="TermType">main</Typ>
      </t>
      <t>
        <term>T2.4</term>
        <Typ type="TermType">option</Typ>
      </t>
    </lang>
  </cat>
</martif>
The expected result should be:
Type: Main Category: DES1.1
Type: Sub Category: DES1.2
lang: EN
Term: T1.1
TermType: main
Term: T1.2
TermType: option
lang: FR
Term: T1.3
TermType: main
Term: T1.4
TermType: option
Type: Main Category: DES2.1
Type: Sub Category: DES2.2
lang: EN
Term: T2.1
TermType: main
Term: T2.2
TermType: option
lang: FR
Term: T2.3
TermType: main
Term: T2.4
TermType: option
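I have tried, but I still cannot get the desired result. My question is how to extract the data so that the output follows the structure of the XML.
Here is my code: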
from xml.dom import minidom

doc = minidom.parse("data.xml")
descs = doc.getElementsByTagName("desc")
for desSetElem in descs:
    type = desSetElem.getAttribute("type")
    if type!='No':
        print('Type: ',type,' Category:',desSetElem.firstChild.nodeValue)
        lang_termSetElem = doc.getElementsByTagName('lang')
        for lang_term in lang_termSetElem:
            # for lang_tig in lang_tigSetElem:
            lang_type=lang_term.getAttribute(('xml:lang'))
            print('lang: ',lang_type)
            print('Term: ',lang_term.getElementsByTagName("term")[0].firstChild.nodeValue)
            print('Term Type:',lang_term.getElementsByTagName("Typ")[0].firstChild.nodeValue)
This is the result I get:
Type: Main Category: DES1.1
lang: EN
Term: T1.1
Term Type: main
lang: FR
Term: T1.3
Term Type: main
lang: EN
Term: T2.1
Term Type: main
lang: FR
Term: T2.3
Term Type: main
Type: Sub Category: DES1.2
lang: EN
Term: T1.1
Term Type: main
lang: FR
Term: T1.3
Term Type: main
lang: EN
Term: T2.1
Term Type: main
lang: FR
Term: T2.3
Term Type: main
Type: Main Category: DES2.1
lang: EN
Term: T1.1
Term Type: main
lang: FR
Term: T1.3
Term Type: main
lang: EN
Term: T2.1
Term Type: main
lang: FR
Term: T2.3
Term Type: main
Type: Sub Category: DES2.2
lang: EN
Term: T1.1
Term Type: main
lang: FR
Term: T1.3
Term Type: main
lang: EN
Term: T2.1
Term Type: main
lang: FR
Term: T2.3
Term Type: main
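Consider walking down the three levels of the XML with loops: <cat>, <desc>/<lang>, and <t>. Specifically, because <lang> is a sibling of <desc>, the <lang> loop should not be nested inside the <desc> loop; both belong directly under the <cat> loop. In addition, you need to iterate over the <t> elements, because indexing with [0] returns only the first term of each language.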
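The scoping difference behind this advice can be checked on a trimmed-down copy of the document. The sketch below is only an illustration: the `SAMPLE` string and the variable names are made up for this example, and it assumes the same two-`<cat>` layout as the question's file. It shows that `getElementsByTagName` searches all descendants of whatever node it is called on, so calling it on the document returns the `<lang>` elements of every `<cat>`.

```python
from xml.dom import minidom

# Hypothetical, shortened stand-in for the question's file:
# two <cat> blocks, each with two <lang> children.
SAMPLE = """<martif>
  <cat>
    <desc type="Main">DES1.1</desc>
    <lang xml:lang="EN"><t><term>T1.1</term></t></lang>
    <lang xml:lang="FR"><t><term>T1.3</term></t></lang>
  </cat>
  <cat>
    <desc type="Main">DES2.1</desc>
    <lang xml:lang="EN"><t><term>T2.1</term></t></lang>
    <lang xml:lang="FR"><t><term>T2.3</term></t></lang>
  </cat>
</martif>"""

doc = minidom.parseString(SAMPLE)

# Called on the document, the search covers every <cat>.
all_langs = doc.getElementsByTagName("lang")

# Called on one <cat>, the search covers only that element's descendants.
first_cat = doc.getElementsByTagName("cat")[0]
cat_langs = first_cat.getElementsByTagName("lang")

print(len(all_langs))  # 4
print(len(cat_langs))  # 2
print([lang.getAttribute("xml:lang") for lang in cat_langs])  # ['EN', 'FR']
```

This is why a document-wide search placed inside the `<desc>` loop prints all four `<lang>` blocks once per description.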
Also consider using f-strings (Python 3.6+) and line breaks to stay within the 79-character line limit recommended by PEP 8.
from xml.dom import minidom

# The XML from the question, saved under this file name.
doc = minidom.parse("MiniDOMPrintOutput.xml")

# Level 1: every <cat> block.
cats = doc.getElementsByTagName("cat")

for catElem in cats:
    # Level 2a: <desc> children of this <cat> only.
    descs = catElem.getElementsByTagName("desc")

    for desSetElem in descs:
        # Named desc_type to avoid shadowing the built-in type().
        desc_type = desSetElem.getAttribute("type")
        if desc_type != 'No':
            print(f"Type: {desc_type.ljust(9)}"
                  f"Category: {desSetElem.firstChild.nodeValue}")

    # Level 2b: <lang> is a sibling of <desc>, so this loop sits
    # beside the <desc> loop, not inside it.
    lang_termSetElem = catElem.getElementsByTagName("lang")
    for lang_term in lang_termSetElem:
        lang_type = lang_term.getAttribute("xml:lang")
        print(f"lang: {lang_type}")

        # Level 3: every <t> entry of this language.
        lang_tigSetElem = lang_term.getElementsByTagName("t")
        for lang_tig in lang_tigSetElem:
            term = (lang_tig.getElementsByTagName('term')[0]
                            .firstChild
                            .nodeValue)
            Typ = (lang_tig.getElementsByTagName('Typ')[0]
                           .firstChild
                           .nodeValue)
            print(f"Term: {term}")
            print(f"Term Type: {Typ}")
Output
Type: Main     Category: DES1.1
Type: Sub      Category: DES1.2
lang: EN
Term: T1.1
Term Type: main
Term: T1.2
Term Type: option
lang: FR
Term: T1.3
Term Type: main
Term: T1.4
Term Type: option
Type: Main     Category: DES2.1
Type: Sub      Category: DES2.2
lang: EN
Term: T2.1
Term Type: main
Term: T2.2
Term Type: option
lang: FR
Term: T2.3
Term Type: main
Term: T2.4
Term Type: option
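If the values are needed for further processing rather than printing, the same three-level walk can fill a nested structure. The following is a minimal sketch, not part of the original answer: `text_of`, `parse_cats`, and the `"descs"`/`"langs"` keys are hypothetical names chosen for illustration, and the inline `SAMPLE` is a shortened copy of the question's XML so the code runs without a file on disk.

```python
from xml.dom import minidom


def text_of(node):
    """Return the text of a leaf element, or "" when it is empty."""
    child = node.firstChild
    return child.nodeValue if child is not None else ""


def parse_cats(doc):
    """Collect each <cat> into a dict instead of printing it."""
    result = []
    for cat in doc.getElementsByTagName("cat"):
        # <desc> texts keyed by their type attribute, skipping "No".
        descs = {
            d.getAttribute("type"): text_of(d)
            for d in cat.getElementsByTagName("desc")
            if d.getAttribute("type") != "No"
        }
        # One list of (term, term type) pairs per language code.
        langs = {}
        for lang in cat.getElementsByTagName("lang"):
            langs[lang.getAttribute("xml:lang")] = [
                (text_of(t.getElementsByTagName("term")[0]),
                 text_of(t.getElementsByTagName("Typ")[0]))
                for t in lang.getElementsByTagName("t")
            ]
        result.append({"descs": descs, "langs": langs})
    return result


# Shortened copy of the question's XML (one <cat>, one <lang>).
SAMPLE = """<martif>
  <cat>
    <desc type="No">1</desc>
    <desc type="Main">DES1.1</desc>
    <desc type="Sub">DES1.2</desc>
    <lang xml:lang="EN">
      <t><term>T1.1</term><Typ type="TermType">main</Typ></t>
      <t><term>T1.2</term><Typ type="TermType">option</Typ></t>
    </lang>
  </cat>
</martif>"""

cats = parse_cats(minidom.parseString(SAMPLE))
print(cats[0]["descs"])        # {'Main': 'DES1.1', 'Sub': 'DES1.2'}
print(cats[0]["langs"]["EN"])  # [('T1.1', 'main'), ('T1.2', 'option')]
```

The `text_of` helper guards against an empty element such as `<term/>`, where `firstChild` is `None` and `.nodeValue` would otherwise raise an `AttributeError`.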