Issue pulling data from XML with etree.ElementTree
I am working with JMDict (https://www.edrdg.org/jmdict/j_jmdict.html). Here is a small sample of the data I am having trouble with:
<entry>
<ent_seq>1265070</ent_seq>
<k_ele>
<keb>古い</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf16</ke_pri>
</k_ele>
<k_ele>
<keb>故い</keb>
</k_ele>
<k_ele>
<keb>旧い</keb>
</k_ele>
<r_ele>
<reb>ふるい</reb>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf16</re_pri>
</r_ele>
<sense>
<pos>&adj-i;</pos>
<s_inf>of things, not people</s_inf>
<gloss lang="eng">old</gloss>
<gloss lang="eng">aged</gloss>
<gloss lang="eng">ancient</gloss>
<gloss lang="eng">antiquated</gloss>
<gloss lang="eng">antique</gloss>
<gloss lang="eng">timeworn</gloss>
</sense>
<sense>
<pos>&adj-i;</pos>
<gloss lang="eng">long</gloss>
<gloss lang="eng">since long ago</gloss>
<gloss lang="eng">time-honored</gloss>
</sense>
<sense>
<pos>&adj-i;</pos>
<gloss lang="eng">of the distant past</gloss>
<gloss lang="eng">long-ago</gloss>
</sense>
<sense>
<pos>&adj-i;</pos>
<gloss lang="eng">stale</gloss>
<gloss lang="eng">threadbare</gloss>
<gloss lang="eng">hackneyed</gloss>
<gloss lang="eng">corny</gloss>
</sense>
<sense>
<pos>&adj-i;</pos>
<gloss lang="eng">old-fashioned</gloss>
<gloss lang="eng">outmoded</gloss>
<gloss lang="eng">out-of-date</gloss>
</sense>
<sense>
<gloss lang="dut">oud</gloss>
</sense>
<sense>
<gloss lang="dut">aloud</gloss>
<gloss lang="dut">verouderd</gloss>
<gloss lang="dut">oubollig</gloss>
<gloss lang="dut">gedateerd</gloss>
<gloss lang="dut">ouderwets</gloss>
<gloss lang="dut">oudmodisch</gloss>
<gloss lang="dut">archaïsch</gloss>
<gloss lang="dut">antiek</gloss>
<gloss lang="dut">{i.h.b.} afgezaagd</gloss>
</sense>
<sense>
<gloss lang="dut">niet vers</gloss>
<gloss lang="dut">onfris</gloss>
<gloss lang="dut">belegen</gloss>
<gloss lang="dut">oud</gloss>
<gloss lang="dut">oudbakken</gloss>
<gloss lang="dut">verschaald</gloss>
<gloss lang="dut">muf</gloss>
</sense>
<sense>
<gloss lang="dut">gebruikt</gloss>
<gloss lang="dut">afgewerkt</gloss>
<gloss lang="dut">sleets</gloss>
<gloss lang="dut">versleten</gloss>
</sense>
<sense>
<gloss lang="fre">vieux (sauf pour les personnes)</gloss>
<gloss lang="fre">âgé</gloss>
<gloss lang="fre">ancien</gloss>
<gloss lang="fre">antique</gloss>
<gloss lang="fre">vieilli</gloss>
<gloss lang="fre">vieillot</gloss>
<gloss lang="fre">caduque</gloss>
<gloss lang="fre">démodé</gloss>
<gloss lang="fre">obsolète</gloss>
<gloss lang="fre">passé</gloss>
<gloss lang="fre">vicié</gloss>
<gloss lang="fre">usé</gloss>
</sense>
<sense>
<gloss lang="ger">alt</gloss>
<gloss lang="ger">altertümlich</gloss>
</sense>
<sense>
<gloss lang="ger">langjährig</gloss>
<gloss lang="ger">sich über lange Zeit erstreckend</gloss>
</sense>
<sense>
<gloss lang="ger">altehrwürdig</gloss>
<gloss lang="ger">althergebracht</gloss>
</sense>
<sense>
<gloss lang="ger">ehemalig</gloss>
<gloss lang="ger">noch nicht reformiert</gloss>
<gloss lang="ger">in der alten Version</gloss>
</sense>
<sense>
<gloss lang="ger">altmodisch</gloss>
<gloss lang="ger">veraltet</gloss>
<gloss lang="ger">altbacken</gloss>
<gloss lang="ger">unmodern</gloss>
<gloss lang="ger">abgestanden</gloss>
<gloss lang="ger">abgenutzt</gloss>
</sense>
<sense>
<gloss lang="ger">alterserfahren</gloss>
<gloss lang="ger">routiniert</gloss>
</sense>
<sense>
<gloss lang="hun">öreg</gloss>
<gloss lang="hun">régi</gloss>
<gloss lang="hun">divatjamúlt</gloss>
<gloss lang="hun">elavult</gloss>
<gloss lang="hun">állott</gloss>
<gloss lang="hun">áporodott</gloss>
<gloss lang="hun">banális</gloss>
<gloss lang="hun">elcsépelt</gloss>
<gloss lang="hun">elévült</gloss>
<gloss lang="hun">nem friss</gloss>
<gloss lang="hun">poshadt</gloss>
<gloss lang="hun">foszlott</gloss>
<gloss lang="hun">kopott</gloss>
</sense>
<sense>
<gloss lang="rus">старый</gloss>
<gloss lang="rus">1) старый</gloss>
<gloss lang="rus">(ср.) ふるく</gloss>
<gloss lang="rus">2) устарелый, отсталый</gloss>
</sense>
<sense>
<gloss lang="slv">star (za predmete)</gloss>
</sense>
<sense>
<gloss lang="spa">(objetos) viejo</gloss>
<gloss lang="spa">antiguo</gloss>
<gloss lang="spa">anticuado</gloss>
<gloss lang="spa">antigüedad</gloss>
<gloss lang="spa">articulo obsoleto</gloss>
<gloss lang="spa">(objeto) viejo</gloss>
<gloss lang="spa">antiguo</gloss>
<gloss lang="spa">anticuado</gloss>
<gloss lang="spa">antigüedad</gloss>
<gloss lang="spa">articulo obsoleto</gloss>
</sense>
<sense>
<gloss lang="swe">gammal</gloss>
</sense>
</entry>
I am using Django and etree.ElementTree to extract the data.
This is my code:
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Raw string so the backslashes in the Windows path are taken literally.
treeW = ET.parse(r'D:\Dev\kanjitest\kanjitest\static\dbxml\JMdict.xml')
rootW = treeW.getroot()
wordsKanjiXml = rootW.findall(".//entry")
entryId = 0
for wordsKanjiEntry in wordsKanjiXml:
    entry = {}
    entryKanji = ''
    entryKana = ''
    entryMeanings = []
    # -------------
    for parts in wordsKanjiEntry:
        sensesList = []
        if parts.tag == 'k_ele':
            for kanji in parts:
                if kanji.tag == 'keb':
                    entryKanji = kanji.text
        if parts.tag == 'sense':
            for sense in parts:
                if sense.tag == 'gloss':
                    if 'spa' in sense.attrib.values():
                        sensesList.append(sense.text)
            if sensesList:
                entryMeanings = sensesList
        if parts.tag == 'r_ele':
            for kana in parts:
                if kana.tag == 'reb':
                    entryKana = kana.text
    # -------------
    entryId = entryId + 1
    entry = dict(
        kanji=entryKanji,
        kana=entryKana,
        meanings=entryMeanings
    )
    # literal and words are defined in another part of the code.
    if literal in entryKanji:
        words.append(entry)
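So the problem is in the final condition. literal is a variable from another part of the code that holds a kanji as a string, for example 古. If a word contains that string, the entry is added to a list called words. The trouble is that the same word can have several written forms (as in the XML sample I posted): 古い can be written as 古い, 故い, or 旧い. Because the entry has several <keb> tags, the condition does not seem to apply even when one of them actually matches. I am not sure I have explained myself well, but I hope someone understands and can help me fix the final condition so that the code runs if any of the <keb> tags contains literal.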
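Each <keb> overwrites entryKanji, so after the loop only the last value is left (旧い in the sample). That value may or may not contain the string in literal.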
Make entryKanji a list instead, and test with literal='古'.
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

treeW = ET.parse('/home/luis/tmp/test.xml')
rootW = treeW.getroot()
literal = '古'
words = []
wordsKanjiXml = rootW.findall(".//entry")
entryId = 0
for wordsKanjiEntry in wordsKanjiXml:
    entry = {}
    entryKanji = []  # now a list: one item per <keb>
    entryKana = ''
    entryMeanings = []
    # -------------
    for parts in wordsKanjiEntry:
        sensesList = []
        if parts.tag == 'k_ele':
            for kanji in parts:
                if kanji.tag == 'keb':
                    entryKanji.append(kanji.text)
        if parts.tag == 'sense':
            for sense in parts:
                if sense.tag == 'gloss':
                    if 'spa' in sense.attrib.values():
                        sensesList.append(sense.text)
            if sensesList:
                entryMeanings = sensesList
        if parts.tag == 'r_ele':
            for kana in parts:
                if kana.tag == 'reb':
                    entryKana = kana.text
    # -------------
    entryId = entryId + 1
    entry = dict(
        kanji=entryKanji,
        kana=entryKana,
        meanings=entryMeanings
    )
    # Join all writings into one string and search it for literal.
    if literal in ''.join(entryKanji):
        words.append(entry)
print(words)
Result:
[{'kanji': ['古い', '故い', '旧い'], 'kana': 'ふるい', 'meanings': []}]
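Joining the list works for a single-character literal, but a longer search string could in principle match across the boundary between two joined writings. One possible variant, sketched here on the assumption that every <keb> sits directly inside a <k_ele> child of <entry> as in the sample, tests each writing separately with any(). The function name entry_matches and the inline SAMPLE are illustrative only and are not from the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for the dictionary file.
SAMPLE = """
<JMdict>
<entry>
<k_ele><keb>古い</keb></k_ele>
<k_ele><keb>故い</keb></k_ele>
<k_ele><keb>旧い</keb></k_ele>
<r_ele><reb>ふるい</reb></r_ele>
</entry>
</JMdict>
"""

def entry_matches(entry, literal):
    """Return True if any <keb> text in this <entry> contains literal."""
    kebs = [keb.text or '' for keb in entry.findall('k_ele/keb')]
    return any(literal in keb for keb in kebs)

root = ET.fromstring(SAMPLE)
matches = [e for e in root.findall('.//entry') if entry_matches(e, '古')]
print(len(matches))  # 1

# 'い故' only appears when the writings are glued together, so it is rejected here.
print(entry_matches(root.find('entry'), 'い故'))  # False
```

The path 'k_ele/keb' uses the limited XPath support built into ElementTree, which replaces the nested tag checks with a single lookup.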
Building a list of the XML character references and checking against that also works:
entryKanji = []
entryKanjiEnt = []
# existing code
for kanji in parts:
    if kanji.tag == 'keb':
        entryKanji.append(kanji.text)
        entryKanjiEnt.append(kanji.text.encode('ascii', 'xmlcharrefreplace'))
# existing code
if b''.join(entryKanjiEnt).find(literal.encode('ascii', 'xmlcharrefreplace')) != -1:
    words.append(entry)
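For reference, 'xmlcharrefreplace' is one of Python's standard encoding error handlers: it replaces every character the target encoding cannot represent with a numeric character reference. The short sketch below only shows what the resulting bytes look like for the values in this example; the variable names are illustrative.

```python
literal = '古'
kanji_forms = ['古い', '故い', '旧い']

# Encode to ASCII, turning each non-ASCII character into &#NNNN;
encoded_literal = literal.encode('ascii', 'xmlcharrefreplace')
encoded_forms = [k.encode('ascii', 'xmlcharrefreplace') for k in kanji_forms]

print(encoded_literal)   # b'&#21476;'
print(encoded_forms[0])  # b'&#21476;&#12356;'
print(encoded_literal in b''.join(encoded_forms))  # True
```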