Issue pulling data from XML with etree.ElementTree

I'm working with JMDict (https://www.edrdg.org/jmdict/j_jmdict.html). Here is a small sample of the data I'm having trouble with:

<entry>
<ent_seq>1265070</ent_seq>
<k_ele>
<keb>古い</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf16</ke_pri>
</k_ele>
<k_ele>
<keb>故い</keb>
</k_ele>
<k_ele>
<keb>旧い</keb>
</k_ele>
<r_ele>
<reb>ふるい</reb>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf16</re_pri>
</r_ele>
<sense>
<pos>&adj-i;</pos>
<s_inf>of things, not people</s_inf>
<gloss lang="eng">old</gloss>
<gloss lang="eng">aged</gloss>
<gloss lang="eng">ancient</gloss>
<gloss lang="eng">antiquated</gloss>
<gloss lang="eng">antique</gloss>
<gloss lang="eng">timeworn</gloss>
</sense>
<sense>
<pos>&adj-i;</pos>
<gloss lang="eng">long</gloss>
<gloss lang="eng">since long ago</gloss>
<gloss lang="eng">time-honored</gloss>
</sense>
<sense>
<pos>&adj-i;</pos>
<gloss lang="eng">of the distant past</gloss>
<gloss lang="eng">long-ago</gloss>
</sense>
<sense>
<pos>&adj-i;</pos>
<gloss lang="eng">stale</gloss>
<gloss lang="eng">threadbare</gloss>
<gloss lang="eng">hackneyed</gloss>
<gloss lang="eng">corny</gloss>
</sense>
<sense>
<pos>&adj-i;</pos>
<gloss lang="eng">old-fashioned</gloss>
<gloss lang="eng">outmoded</gloss>
<gloss lang="eng">out-of-date</gloss>
</sense>
<sense>
<gloss lang="dut">oud</gloss>
</sense>
<sense>
<gloss lang="dut">aloud</gloss>
<gloss lang="dut">verouderd</gloss>
<gloss lang="dut">oubollig</gloss>
<gloss lang="dut">gedateerd</gloss>
<gloss lang="dut">ouderwets</gloss>
<gloss lang="dut">oudmodisch</gloss>
<gloss lang="dut">archaïsch</gloss>
<gloss lang="dut">antiek</gloss>
<gloss lang="dut">{i.h.b.} afgezaagd</gloss>
</sense>
<sense>
<gloss lang="dut">niet vers</gloss>
<gloss lang="dut">onfris</gloss>
<gloss lang="dut">belegen</gloss>
<gloss lang="dut">oud</gloss>
<gloss lang="dut">oudbakken</gloss>
<gloss lang="dut">verschaald</gloss>
<gloss lang="dut">muf</gloss>
</sense>
<sense>
<gloss lang="dut">gebruikt</gloss>
<gloss lang="dut">afgewerkt</gloss>
<gloss lang="dut">sleets</gloss>
<gloss lang="dut">versleten</gloss>
</sense>
<sense>
<gloss lang="fre">vieux (sauf pour les personnes)</gloss>
<gloss lang="fre">âgé</gloss>
<gloss lang="fre">ancien</gloss>
<gloss lang="fre">antique</gloss>
<gloss lang="fre">vieilli</gloss>
<gloss lang="fre">vieillot</gloss>
<gloss lang="fre">caduque</gloss>
<gloss lang="fre">démodé</gloss>
<gloss lang="fre">obsolète</gloss>
<gloss lang="fre">passé</gloss>
<gloss lang="fre">vicié</gloss>
<gloss lang="fre">usé</gloss>
</sense>
<sense>
<gloss lang="ger">alt</gloss>
<gloss lang="ger">altertümlich</gloss>
</sense>
<sense>
<gloss lang="ger">langjährig</gloss>
<gloss lang="ger">sich über lange Zeit erstreckend</gloss>
</sense>
<sense>
<gloss lang="ger">altehrwürdig</gloss>
<gloss lang="ger">althergebracht</gloss>
</sense>
<sense>
<gloss lang="ger">ehemalig</gloss>
<gloss lang="ger">noch nicht reformiert</gloss>
<gloss lang="ger">in der alten Version</gloss>
</sense>
<sense>
<gloss lang="ger">altmodisch</gloss>
<gloss lang="ger">veraltet</gloss>
<gloss lang="ger">altbacken</gloss>
<gloss lang="ger">unmodern</gloss>
<gloss lang="ger">abgestanden</gloss>
<gloss lang="ger">abgenutzt</gloss>
</sense>
<sense>
<gloss lang="ger">alterserfahren</gloss>
<gloss lang="ger">routiniert</gloss>
</sense>
<sense>
<gloss lang="hun">öreg</gloss>
<gloss lang="hun">régi</gloss>
<gloss lang="hun">divatjamúlt</gloss>
<gloss lang="hun">elavult</gloss>
<gloss lang="hun">állott</gloss>
<gloss lang="hun">áporodott</gloss>
<gloss lang="hun">banális</gloss>
<gloss lang="hun">elcsépelt</gloss>
<gloss lang="hun">elévült</gloss>
<gloss lang="hun">nem friss</gloss>
<gloss lang="hun">poshadt</gloss>
<gloss lang="hun">foszlott</gloss>
<gloss lang="hun">kopott</gloss>
</sense>
<sense>
<gloss lang="rus">старый</gloss>
<gloss lang="rus">1) старый</gloss>
<gloss lang="rus">(ср.) ふるく</gloss>
<gloss lang="rus">2) устарелый, отсталый</gloss>
</sense>
<sense>
<gloss lang="slv">star (za predmete)</gloss>
</sense>
<sense>
<gloss lang="spa">(objetos) viejo</gloss>
<gloss lang="spa">antiguo</gloss>
<gloss lang="spa">anticuado</gloss>
<gloss lang="spa">antigüedad</gloss>
<gloss lang="spa">articulo obsoleto</gloss>
<gloss lang="spa">(objeto) viejo</gloss>
<gloss lang="spa">antiguo</gloss>
<gloss lang="spa">anticuado</gloss>
<gloss lang="spa">antigüedad</gloss>
<gloss lang="spa">articulo obsoleto</gloss>
</sense>
<sense>
<gloss lang="swe">gammal</gloss>
</sense>
</entry>

I'm using Django and etree.ElementTree to extract the data. Here is my code:

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
    
# Raw string so the backslashes in the Windows path are not read as escape sequences
treeW = ET.parse(r'D:\Dev\kanjitest\kanjitest\static\dbxml\JMdict.xml')
rootW = treeW.getroot()

wordsKanjiXml = rootW.findall(".//entry")
entryId =0
for wordsKanjiEntry in wordsKanjiXml:
    entry = {}
    entryKanji =''
    entryKana =''
    entryMeanings = []
    # -------------
    for parts in wordsKanjiEntry:
        sensesList=[]
        if parts.tag == 'k_ele':
            for kanji in parts:
                if kanji.tag == 'keb':
                    entryKanji=kanji.text
                                                
        if parts.tag=='sense':
           
            for sense in parts:
                if sense.tag == 'gloss':
                    if 'spa' in sense.attrib.values():
                        sensesList.append(sense.text)
                        
        if sensesList:
            entryMeanings=sensesList
        
        if parts.tag == 'r_ele':
            for kana in parts:
                if kana.tag == 'reb':
                    entryKana=kana.text
    # -------------
    
    entryId =entryId+1
       
    entry=dict(
        kanji = entryKanji,
        kana = entryKana,
        meanings = entryMeanings
    )
    if literal in entryKanji:
        words.append(entry)

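So the problem is with the final condition. literal is a variable from another part of the code that holds a kanji as a string, for example 古. If a word contains that string, the entry is appended to a list called words. The trouble is that the same word can have several written forms (as in the XML sample I posted): 古い can be written as 古い, 故い, or 旧い. Because the entry has several <keb> tags, the condition does not seem to apply even when one of them actually matches. I'm not sure I've explained myself well, but I hope someone understands and can help me fix the final condition so that the code runs if any of the <keb> tags contains literal.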

Only the last value assigned to entryKanji is kept, so it may or may not match the value in literal.
Make entryKanji a list and test with literal='古'.

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
    
treeW = ET.parse('/home/luis/tmp/test.xml')
rootW = treeW.getroot()
literal='古'
words = []
wordsKanjiXml = rootW.findall(".//entry")
entryId =0
for wordsKanjiEntry in wordsKanjiXml:
    entry = {}
    entryKanji =[]
    entryKana =''
    entryMeanings = []
    # -------------
    for parts in wordsKanjiEntry:
        sensesList=[]
        if parts.tag == 'k_ele':
            for kanji in parts:
                if kanji.tag == 'keb':
                    entryKanji.append(kanji.text)
                                                
        if parts.tag=='sense':
           
            for sense in parts:
                if sense.tag == 'gloss':
                    if 'spa' in sense.attrib.values():
                        sensesList.append(sense.text)
                        
        if sensesList:
            entryMeanings=sensesList
        
        if parts.tag == 'r_ele':
            for kana in parts:
                if kana.tag == 'reb':
                    entryKana=kana.text
    # -------------
    
    entryId =entryId+1
       
    entry=dict(
        kanji = entryKanji,
        kana = entryKana,
        meanings = entryMeanings
    )
    
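    # Test each spelling on its own so a match cannot span two <keb> values
    if any(literal in kanji for kanji in entryKanji):
        words.append(entry)

# Print once, after every entry has been processed
print(words)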

Result:

[{'kanji': ['古い', '故い', '旧い'], 'kana': 'ふるい', 'meanings': []}]
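
The same check can also be written more compactly with ElementTree's path queries. What follows is only a sketch, not the original poster's code: the helper name find_words, its lang parameter, and the inline SAMPLE string are assumptions made for illustration. SAMPLE is a trimmed copy of the entry above with the &adj-i; entities left out, because ElementTree rejects entities that are not declared in the document. Unlike the code above, kana is collected as a list here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample: the entry from the question reduced to a few elements.
SAMPLE = """<JMdict>
<entry>
<ent_seq>1265070</ent_seq>
<k_ele><keb>古い</keb></k_ele>
<k_ele><keb>故い</keb></k_ele>
<k_ele><keb>旧い</keb></k_ele>
<r_ele><reb>ふるい</reb></r_ele>
<sense><gloss lang="eng">old</gloss></sense>
<sense><gloss lang="spa">antiguo</gloss><gloss lang="spa">anticuado</gloss></sense>
</entry>
</JMdict>"""


def find_words(root, literal, lang="spa"):
    """Return the entries in which any <keb> spelling contains `literal`."""
    words = []
    for entry in root.iter("entry"):
        # Collect every spelling, not just the last one seen.
        kanji = [keb.text for keb in entry.findall("k_ele/keb")]
        if not any(literal in k for k in kanji):
            continue
        kana = [reb.text for reb in entry.findall("r_ele/reb")]
        # Same language test as the original code: look at the attribute values.
        meanings = [
            gloss.text
            for gloss in entry.findall("sense/gloss")
            if lang in gloss.attrib.values()
        ]
        words.append({"kanji": kanji, "kana": kana, "meanings": meanings})
    return words


root = ET.fromstring(SAMPLE)
print(find_words(root, "故"))
```

Searching for 故 matches here because of the second spelling, which is exactly the case the single-string version of entryKanji would miss.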

Building a list of XML character references and checking against that also works:

entryKanji = []
entryKanjiEnt = []

# existing code

            for kanji in parts:
                if kanji.tag == 'keb':
                    entryKanji.append(kanji.text)
                    # Also keep an ASCII-only copy with characters as &#NNNN; references
                    entryKanjiEnt.append(kanji.text.encode('ascii', 'xmlcharrefreplace'))
# existing code

if b''.join(entryKanjiEnt).find(literal.encode('ascii', 'xmlcharrefreplace')) != -1:
    words.append(entry)
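
For reference, here is a small sketch of what encode('ascii', 'xmlcharrefreplace') actually produces. The values are hard-coded for illustration (in the real loop they come from the <keb> elements), and literalEnt, matches_encoded and matches_plain are names introduced only for this example. Each non-ASCII character becomes a decimal character reference, so for this example the byte-level test agrees with the plain string test.

```python
# Illustrative values only; in the loop above these come from the <keb> elements.
entryKanji = ['古い', '故い', '旧い']
literal = '古'

# Every non-ASCII character is replaced by a reference of the form &#NNNN;
entryKanjiEnt = [k.encode('ascii', 'xmlcharrefreplace') for k in entryKanji]
literalEnt = literal.encode('ascii', 'xmlcharrefreplace')

print(literalEnt)        # b'&#21476;'
print(entryKanjiEnt[0])  # b'&#21476;&#12356;'

# The encoded and the plain comparison give the same answer here.
matches_encoded = any(literalEnt in k for k in entryKanjiEnt)
matches_plain = any(literal in k for k in entryKanji)
print(matches_encoded, matches_plain)
```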