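How do I properly extract information from this XML?
I am trying to query an XML document and print higher-level element attributes alongside the lower-level elements they belong to. The results I get do not match the XML structure. This is the code I have so far: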
import xml.etree.ElementTree as ET

tree = ET.parse('movies2.xml')
root = tree.getroot()

for child in root:
    print(child.tag, child.attrib)
    print()

mov = root.findall("./genre/decade/movie/[year='2000']")
for movie in mov:
    print(child.attrib['category'], movie.attrib['title'])
This produces:
genre {'category': 'Action'}
genre {'category': 'Thriller'}
genre {'category': 'Comedy'}
Comedy X-Men
Comedy American Psycho
If you check the XML, the last two lines should actually show two different genre categories, one for each movie title:
Action X-Men
Thriller American Psycho
Here is the XML for reference:
<?xml version='1.0' encoding='utf8'?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones
                is hired by the U.S. government to find the Ark of the
                Covenant before the Nazis.'
                </description>
            </movie>
            <movie favorite="True" title="THE KARATE KID">
                <format multiple="Yes">DVD,Online</format>
                <year>1984</year>
                <rating>PG</rating>
                <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back 2 the Future">
                <format multiple="False">Blu-ray</format>
                <year>1985</year>
                <rating>PG</rating>
                <description>Marty McFly</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="False" title="X-Men">
                <format multiple="Yes">dvd, digital</format>
                <year>2000</year>
                <rating>PG-13</rating>
                <description>Two mutants come to a private academy for their kind whose resident superhero team must
                oppose a terrorist organization with similar powers.
                </description>
            </movie>
            <movie favorite="True" title="Batman Returns">
                <format multiple="No">VHS</format>
                <year>1992</year>
                <rating>PG13</rating>
                <description>NA.</description>
            </movie>
            <movie favorite="False" title="Reservoir Dogs">
                <format multiple="No">Online</format>
                <year>1992</year>
                <rating>R</rating>
                <description>WhAtEvER I Want!!!?!</description>
            </movie>
        </decade>
    </genre>
    <genre category="Thriller">
        <decade years="1970s">
            <movie favorite="False" title="ALIEN">
                <format multiple="Yes">DVD</format>
                <year>1979</year>
                <rating>R</rating>
                <description>"""""""""</description>
            </movie>
        </decade>
        <decade years="1980s">
            <movie favorite="True" title="Ferris Bueller's Day Off">
                <format multiple="No">DVD</format>
                <year>1986</year>
                <rating>PG13</rating>
                <description>Funny movie about a funny guy</description>
            </movie>
            <movie favorite="FALSE" title="American Psycho">
                <format multiple="No">blue-ray</format>
                <year>2000</year>
                <rating>Unrated</rating>
                <description>psychopathic Bateman</description>
            </movie>
        </decade>
    </genre>
    <genre category="Comedy">
        <decade years="1960s">
            <movie favorite="False" title="Batman: The Movie">
                <format multiple="Yes">DVD,VHS</format>
                <year>1966</year>
                <rating>PG</rating>
                <description>What a joke!</description>
            </movie>
        </decade>
        <decade years="2010s">
            <movie favorite="True" title="Easy A">
                <format multiple="No">DVD</format>
                <year>2010</year>
                <rating>PG--13</rating>
                <description>Emma Stone = Hester Prynne</description>
            </movie>
            <movie favorite="True" title="Dinner for SCHMUCKS">
                <format multiple="Yes">DVD,digital,Netflix</format>
                <year>2011</year>
                <rating>Unrated</rating>
                <description>Tim (Rudd) is a rising executive who
                'succeeds' in finding the perfect guest, IRS employee
                Barry (Carell), for his boss' monthly event, a so-called
                'dinner for idiots,' which offers certain advantages to
                the exec who shows up with the biggest buffoon.
                </description>
            </movie>
        </decade>
        <decade years="1980s">
            <movie favorite="False" title="Ghostbusters">
                <format multiple="No">Online,VHS</format>
                <year>1984</year>
                <rating>PG</rating>
                <description>Who ya gonna call?</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="True" title="Robin Hood: Prince of Thieves">
                <format multiple="No">Blu_Ray</format>
                <year>1991</year>
                <rating>Unknown</rating>
                <description>Robin Hood slaying</description>
            </movie>
        </decade>
    </genre>
</collection>
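Your initial loop:
for child in root:
    print(child.tag, child.attrib)
    print()
leaves child bound to the last child of root once it finishes, so child.attrib['category'] will always be the category of that last child. In your case, the last child is the Comedy genre. For each movie in the second loop:
for movie in mov:
    print(child.attrib['category'], movie.attrib['title'])
you are printing the category of the last child found in the first loop, so every line prints "Comedy".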
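The effect described above can be reproduced in isolation. The following is a minimal sketch, assuming a trimmed-down stand-in for movies2.xml (the SAMPLE string and the names leftover_category and mislabeled are made up for illustration); it shows that the loop variable keeps its final value after the loop ends.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for movies2.xml (not the full file).
SAMPLE = """
<collection>
  <genre category="Action">
    <decade years="1990s">
      <movie title="X-Men"><year>2000</year></movie>
    </decade>
  </genre>
  <genre category="Thriller">
    <decade years="1980s">
      <movie title="American Psycho"><year>2000</year></movie>
    </decade>
  </genre>
  <genre category="Comedy">
    <decade years="2010s">
      <movie title="Easy A"><year>2010</year></movie>
    </decade>
  </genre>
</collection>
"""

root = ET.fromstring(SAMPLE)

for child in root:
    pass  # the loop body is irrelevant; only the final binding of `child` matters

# Python does not unset the loop variable: `child` is still the last <genre>.
leftover_category = child.attrib["category"]
print(leftover_category)

# Reusing `child` here labels every match with that one leftover category.
mislabeled = []
for movie in root.findall("./genre/decade/movie[year='2000']"):
    mislabeled.append((child.attrib["category"], movie.attrib["title"]))
print(mislabeled)
```

Nothing in the second loop ever rebinds child, which is why the genre has to be looked up per movie instead.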
Edit: This will at least select the same movies with the correct genre label, though possibly in a different order:
for child in root:
    mov = child.findall("./decade/movie/[year='2000']")
    for movie in mov:
        print(child.attrib['category'], movie.attrib['title'])
Another approach, using lxml instead of ElementTree:
from lxml import etree as ET

tree = ET.parse('movies2.xml')
root = tree.getroot()

mov = root.findall("./genre/decade/movie/[year='2000']")
for movie in mov:
    print(movie.getparent().getparent().attrib['category'], movie.attrib['title'])
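If installing lxml is not an option, the standard-library ElementTree has no getparent(), but a child-to-parent mapping can be built once and used the same way. This is only a sketch under assumptions: the SAMPLE string is a shortened stand-in for movies2.xml, and parent_of, genre_of and results are illustrative names, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for movies2.xml.
SAMPLE = """
<collection>
  <genre category="Action">
    <decade years="1990s">
      <movie title="X-Men"><year>2000</year></movie>
      <movie title="Batman Returns"><year>1992</year></movie>
    </decade>
  </genre>
  <genre category="Thriller">
    <decade years="1980s">
      <movie title="American Psycho"><year>2000</year></movie>
    </decade>
  </genre>
</collection>
"""

root = ET.fromstring(SAMPLE)

# Build a child -> parent lookup by walking every element once.
parent_of = {}
for parent in root.iter():
    for element in parent:
        parent_of[element] = parent


def genre_of(movie):
    """Walk movie -> decade -> genre and return the genre's category."""
    decade = parent_of[movie]
    genre = parent_of[decade]
    return genre.attrib["category"]


results = []
for movie in root.findall("./genre/decade/movie[year='2000']"):
    results.append((genre_of(movie), movie.attrib["title"]))

for category, title in results:
    print(category, title)
```

The mapping costs one extra pass over the tree, which is usually negligible for a file of this size.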