Unable to fetch expected items properly from XML using lxml in Python

Question

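I wrote code that removes from tes.xml every country whose rank is not in the list lis, and then writes the updated XML to output.xml. However, countries that are not in the list still appear in the output. XML: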

tes.xml

<?xml version="1.0"?>
<data>
  <continents>
    <country>
      <state>
        <rank updated="yes">123456</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
      </state>
      <zones>
        <pretty>yes</pretty>
      </zones>
    </country>
    <country>
      <state>
        <rank updated="yes">789045</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <gpc>59900</gpc>
        <neighbor name="Malaysia" direction="N"/>
      </state>
      <zones>
        <pretty>No</pretty>
      </zones>
      <market>
        <pretty>cool</pretty>
      </market>  
    </country>
    <country>
      <state>
        <rank updated="yes">67846464</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <gpc>59900</gpc>
        <neighbor name="Malaysia" direction="N"/>
      </state>
      <zones>
        <pretty>No</pretty>
      </zones>
      <market>
        <pretty>cool</pretty>
      </market>  
    </country>
  </continents>  
</data>

Code:

import xml.etree.ElementTree as ET
tree = ET.parse('tes.xml')

lis = ["123456"]
root = tree.getroot()
print('root is', root)
print(type(root))

for continent in root.findall('.//continents'):
    for country in continent:
        rank = country.find('state/rank').text
        print(rank)
        if rank not in lis:
            continent.remove(country)

tree.write('output.xml')

Console output: it does not even print all the ranks from the XML. It skips 67846464, so that rank is also written to output.xml even though it is not in the list.

root is <Element 'data' at 0x7f5929a9d8b0>
<class 'xml.etree.ElementTree.Element'>
123456
789045

Current output: contains two IDs, 123456 and 67846464

<data>
  <continents>
    <country>
      <state>
        <rank updated="yes">123456</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E" />
        <neighbor name="Switzerland" direction="W" />
      </state>
      <zones>
        <pretty>yes</pretty>
      </zones>
    </country>
    <country>
      <state>
        <rank updated="yes">67846464</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <gpc>59900</gpc>
        <neighbor name="Malaysia" direction="N" />
      </state>
      <zones>
        <pretty>No</pretty>
      </zones>
      <market>
        <pretty>cool</pretty>
      </market>  
    </country>
  </continents>  
</data>

Expected output: only 123456 should appear, because 67846464 is not in the list

<data>
  <continents>
    <country>
      <state>
        <rank updated="yes">123456</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E" />
        <neighbor name="Switzerland" direction="W" />
      </state>
      <zones>
        <pretty>yes</pretty>
      </zones>
    </country>
  </continents>  
</data>

Answer 1

I got it working with BeautifulSoup. I simply pasted the XML in as a string:

input = """
<?xml version="1.0"?>
<data>
  <continents>
    <country>
      <state>
        <rank updated="yes">123456</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
      </state>
      <zones>
        <pretty>yes</pretty>
      </zones>
    </country>
    <country>
      <state>
        <rank updated="yes">789045</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <gpc>59900</gpc>
        <neighbor name="Malaysia" direction="N"/>
      </state>
      <zones>
        <pretty>No</pretty>
      </zones>
      <market>
        <pretty>cool</pretty>
      </market>  
    </country>
    <country>
      <state>
        <rank updated="yes">67846464</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <gpc>59900</gpc>
        <neighbor name="Malaysia" direction="N"/>
      </state>
      <zones>
        <pretty>No</pretty>
      </zones>
      <market>
        <pretty>cool</pretty>
      </market>  
    </country>
  </continents>  
</data>
"""

Here is the actual code:

from bs4 import BeautifulSoup

lis = ["123456"]

# Turn the XML into one big BS object
soup = BeautifulSoup(input, "lxml")

# Parse through to find all <country> tags.  
# From each, grab the <rank> value.  If the rank value
# is not in the list, delete the respective <country> tag.
for country in soup.find_all("country"):
    rank = country.find("rank").text
    if rank not in lis:
        country.decompose()

print(soup.prettify())

This gives the expected output containing only the matching country. When I changed lis to ["123456", "67846464"], I got the expected two countries in the output.

Answer 2

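The problem in your code is that you remove elements from continent while iterating over it, so the loop skips the element that follows each removed one. Iterate over the list returned by findall() instead, which is not affected by the removals: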

for continent in root.findall('.//continents'):
    for country in continent.findall('./country'):
        if country.find('state/rank').text not in lis:
            continent.remove(country)
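
To make the difference visible, here is a minimal sketch that runs both loops on a trimmed-down copy of the question's data. The `prune` helper, its `snapshot` flag, and the shortened `XML` string are made up for this illustration and are not part of the original code; `list(continent)` is used as an equivalent way to take a copy of the children before removing any of them.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of tes.xml (only the rank of each country is kept).
XML = """<data>
  <continents>
    <country><state><rank>123456</rank></state></country>
    <country><state><rank>789045</rank></state></country>
    <country><state><rank>67846464</rank></state></country>
  </continents>
</data>"""


def prune(xml_text, keep, snapshot):
    """Remove every <country> whose rank is not in `keep`.

    Returns (ranks the loop visited, ranks left in the tree afterwards).
    """
    root = ET.fromstring(xml_text)
    visited = []
    for continent in root.findall('.//continents'):
        # snapshot=True iterates over a copy of the children;
        # snapshot=False iterates over the live element, as in the question.
        children = list(continent) if snapshot else continent
        for country in children:
            rank = country.find('state/rank').text
            visited.append(rank)
            if rank not in keep:
                continent.remove(country)
    remaining = [rank.text for rank in root.iter('rank')]
    return visited, remaining


print(prune(XML, ["123456"], snapshot=False))  # live iteration
print(prune(XML, ["123456"], snapshot=True))   # iteration over a copy
```

With this sample data, the live loop should visit only 123456 and 789045 and leave 67846464 in the tree, matching the console output in the question, while the loop over a copy visits all three ranks and leaves only 123456.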
