Unable to fetch expected items properly from XML using lxml in Python
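I wrote code that removes from tes.xml every country whose rank is not in the list lis, and then writes the updated XML to output.xml. However, the output still contains a country whose rank is not in the list.
XML (tes.xml):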
<?xml version="1.0"?>
<data>
<continents>
<country>
<state>
<rank updated="yes">123456</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</state>
<zones>
<pretty>yes</pretty>
</zones>
</country>
<country>
<state>
<rank updated="yes">789045</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<gpc>59900</gpc>
<neighbor name="Malaysia" direction="N"/>
</state>
<zones>
<pretty>No</pretty>
</zones>
<market>
<pretty>cool</pretty>
</market>
</country>
<country>
<state>
<rank updated="yes">67846464</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<gpc>59900</gpc>
<neighbor name="Malaysia" direction="N"/>
</state>
<zones>
<pretty>No</pretty>
</zones>
<market>
<pretty>cool</pretty>
</market>
</country>
</continents>
</data>
Code:
import xml.etree.ElementTree as ET

tree = ET.parse('tes.xml')
lis = ["123456"]
root = tree.getroot()
print('root is', root)
print(type(root))
for continent in root.findall('.//continents'):
    for country in continent:
        rank = country.find('state/rank').text
        print(rank)
        if rank not in lis:
            continent.remove(country)
tree.write('output.xml')
Console output: it does not even print all the ranks from the XML. 67846464 is skipped, so that country also ends up in output.xml even though it is not in the list.
root is <Element 'data' at 0x7f5929a9d8b0>
<class 'xml.etree.ElementTree.Element'>
123456
789045
Current output: it contains two ranks, 123456 and 67846464
<data>
<continents>
<country>
<state>
<rank updated="yes">123456</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E" />
<neighbor name="Switzerland" direction="W" />
</state>
<zones>
<pretty>yes</pretty>
</zones>
</country>
<country>
<state>
<rank updated="yes">67846464</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<gpc>59900</gpc>
<neighbor name="Malaysia" direction="N" />
</state>
<zones>
<pretty>No</pretty>
</zones>
<market>
<pretty>cool</pretty>
</market>
</country>
</continents>
</data>
Expected output: only 123456 should be present, because 67846464 is not in the list
<data>
<continents>
<country>
<state>
<rank updated="yes">123456</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E" />
<neighbor name="Switzerland" direction="W" />
</state>
<zones>
<pretty>yes</pretty>
</zones>
</country>
</continents>
</data>
I got it working with BeautifulSoup. I simply pasted the XML in as a string:
input = """
<?xml version="1.0"?>
<data>
<continents>
<country>
<state>
<rank updated="yes">123456</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</state>
<zones>
<pretty>yes</pretty>
</zones>
</country>
<country>
<state>
<rank updated="yes">789045</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<gpc>59900</gpc>
<neighbor name="Malaysia" direction="N"/>
</state>
<zones>
<pretty>No</pretty>
</zones>
<market>
<pretty>cool</pretty>
</market>
</country>
<country>
<state>
<rank updated="yes">67846464</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<gpc>59900</gpc>
<neighbor name="Malaysia" direction="N"/>
</state>
<zones>
<pretty>No</pretty>
</zones>
<market>
<pretty>cool</pretty>
</market>
</country>
</continents>
</data>
"""
Here is the actual code:
from bs4 import BeautifulSoup

lis = ["123456"]
# Turn the XML into one big BS object
soup = BeautifulSoup(input, "lxml")
# Parse through to find all <country> tags.
# From each, grab the <rank> value. If the rank value
# is not in the list, delete the respective <country> tag.
for country in soup.find_all("country"):
    rank = country.find("rank").text
    if rank not in lis:
        country.decompose()
print(soup.prettify())
This gives the expected output with only the matching country. When I change lis to ["123456", "67846464"], I get the expected two countries in the output.
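The problem in your code is that you remove elements from continent while iterating over it. Each removal shifts the remaining children one position to the left, so the loop skips the element that follows the removed one (here, the country with rank 67846464). Iterating over the list returned by findall() avoids this, because that list is not changed by the removal:
for continent in root.findall('.//continents'):
    for country in continent.findall('./country'):
        if country.find('state/rank').text not in lis:
            continent.remove(country)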
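For anyone who wants to try the fix without creating tes.xml, here is a minimal self-contained sketch. The inline XML is a trimmed-down stand-in for the file in the question, and the helper name keep_listed_countries is made up for illustration; only the loop itself comes from the answer above.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for tes.xml (assumption: same element layout,
# with only the <rank> values kept).
XML = """<data>
  <continents>
    <country><state><rank updated="yes">123456</rank></state></country>
    <country><state><rank updated="yes">789045</rank></state></country>
    <country><state><rank updated="yes">67846464</rank></state></country>
  </continents>
</data>"""


def keep_listed_countries(root, allowed):
    """Remove every <country> whose state/rank text is not in `allowed`.

    Hypothetical helper name, used here only to wrap the loop.
    """
    for continent in root.findall('.//continents'):
        # findall() returns a separate list, so removing children from
        # `continent` does not disturb the sequence being iterated.
        for country in continent.findall('./country'):
            if country.find('state/rank').text not in allowed:
                continent.remove(country)
    return root


root = keep_listed_countries(ET.fromstring(XML), ["123456"])
remaining = [rank.text for rank in root.iter('rank')]
print(remaining)
```

With lis set to ["123456"], only the first country should be left. Iterating over list(continent) instead of continent.findall('./country') would work the same way, since it also takes a snapshot of the children before anything is removed.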