How do I convert an XML file to a pandas DataFrame?
I am trying to convert an XML file in the following format:
<ann>
<anime id="24235" gid="2583955622" type="TV" name="Love After World Domination" precision="TV" generated-on="2021-04-06T00:15:25Z">
<related-prev rel="adapted from" id="24234"/>
<info gid="1661578035" type="Main title" lang="EN">Love After World Domination</info>
<info gid="2103040388" type="Alternative title" lang="JA">Sekai Seifuku no Ato de</info>
<info gid="2069464047" type="Alternative title" lang="JA">恋は世界征服のあとで</info>
<staff gid="1364018953">
...
</staff>
<staff gid="2582001321">
...
</staff>
</anime>
<manga id="24225" gid="1003998999" type="manga" name="She's My Knight" precision="manga" generated-on="2021-04-06T00:21:21Z">
<info gid="2757138724" type="Picture" src="https://cdn.animenewsnetwork.com/thumbnails/fit200x200/encyc/A24225-2757138724.1617642733.jpg" width="140" height="200">
...
</info>
<info gid="1643119455" type="Main title" lang="EN">She's My Knight</info>
<info gid="2475002983" type="Alternative title" lang="JA">Ikemen Kanojo to Heroine na Ore!?</info>
<info gid="2034824415" type="Alternative title" lang="JA">イケメン彼女とヒロインな俺!?</info>
<info gid="1694554971" type="Plot Summary">Haruma Ichinose, 17, has been popular since he was born. So popular, in fact, that he figured no one could even come close until he met Yuki Mogami. She's tall, cool, collected, and totally makes him crazy. He may just be in love but falling for someone even more dashing than himself is hard to swallow.</info>
<info gid="2542157561" type="Vintage">2019 (serialized on Palcy)</info>
<info gid="851836011" type="Vintage">2019-10-22 (serialized on Palcy)</info>
<staff gid="307631293">
<task>Story & Art</task>
<person id="206223">Saisou</person>
</staff>
</manga>
<anime id="24224" gid="885535394" type="TV" name="Watanuki-san Chi to" precision="TV" generated-on="2021-04-06T00:21:21Z">
...
</anime>
...
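into a pandas DataFrame with each anime's ID, name, and plot summary (if there is one) as columns. I have been able to get a DataFrame with the anime IDs and names using this code, but I can't get the plot summaries: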
import requests
import pandas as pd
import xml.etree.ElementTree as ET
response = requests.get('https://cdn.animenewsnetwork.com/encyclopedia/api.xml?title=24235/24233/24232/24231/24230/24229/24227/24225/24224/24223/24222/24220/24218/24217/24216/24215/24214/24213/24212/24211/24210/24209/24208/24207/24206/24205/24204/24203/24202/24201/24200/24199/24198/24196/24195/24194/24193/24192/24191/24189/24187/24186/24185/24183/24182/24180/24179/24178/24177/24176/')
root = ET.fromstring(response.text)
dfcols = ['id', 'name']
anime_df = pd.DataFrame(columns=dfcols)
# Note: DataFrame.append was removed in pandas 2.0, so this loop needs an older pandas.
for i in root.iter(tag='anime'):
    anime_df = anime_df.append(
        pd.Series([i.get('id'), i.get('name')], index=dfcols),
        ignore_index=True)
anime_df.head()
I can also get the existing plot summaries with this code:
plot_list = root.findall('.//info[@type="Plot Summary"]')
for i in range(len(plot_list)):
    print(plot_list[i].text)
However, because I'm using findall, I can't link the plot summaries to their corresponding IDs and names. Any ideas?
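I suggest pulling all the data into dictionaries and doing the final work in the DataFrame. That is more efficient than creating individual Series and appending them one at a time.
The solution I propose below puts id and name into one dictionary (a defaultdict), while pulling the plot summary into a separate dictionary (mapping) keyed by id.
After that, you can convert both to pandas data structures and merge them.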
from collections import defaultdict
data = defaultdict(list)
mapping = {}
In [142]: for entry in root:
     ...:     data['id'].append(entry.attrib['id'])
     ...:     data['name'].append(entry.attrib['name'])
     ...:     for ent in entry.findall("./info"):
     ...:         if ent.attrib['type'] == "Plot Summary":
     ...:             mapping[entry.attrib['id']] = ent.text

In [150]: pd.DataFrame(data).merge(pd.Series(mapping, name='plot_summary'),
     ...:                          left_on='id',
     ...:                          right_index=True,
     ...:                          how='left')
Out[150]:
    id     name                                               plot_summary
 0  24235  Love After World Domination                        NaN
 1  24233  Himitsu Kessha Yaruminati                          NaN
 2  24232  Enman Kaiketsu! Enma-chan                          NaN
 3  24231  Zenryoku Kaihi Flag-chan!                          NaN
 4  24230  Konketsu no Karekore                               NaN
 5  24229  Teikō Penguin                                      NaN
 6  24227  Black Channel                                      NaN
 7  24225  She's My Knight                                    Haruma Ichinose, 17, has been popular since he...
 8  24224  Watanuki-san Chi to                                NaN
 9  24223  Watanuki-san Chi no                                NaN
10  24222  Tiger & Bunny 2                                    NaN
11  24220  Super Cub                                          NaN
12  24218  FUUTO PI                                           NaN
13  24217  Fūto Tantei                                        NaN
14  24216  Inō no Aicis                                       NaN
15  24215  Gyakuten Sekai no Denchi Shōjo                     NaN
16  24214  Eiga Yurukyan△                                     NaN
17  24213  Re:cycle of Penguindrum                            NaN
18  24212  That Time I Got Reincarnated as a Slime            NaN
19  24211  Wonder Egg Priority                                NaN
20  24210  Dosukoi Sushi-Zumō                                 NaN
21  24209  Motto! Majime ni Fumajime Kaiketsu Zorori          NaN
22  24208  Pui Pui Molcar                                     NaN
23  24207  Case Study of Vanitas                              NaN
24  24206  HOME!                                              NaN
25  24205  Hachimitsu Suicide Machine                         NaN
26  24204  Deliver Police: Nishitokyo-shi Deliver Keisats...  NaN
27  24203  Ryūsatsu no Kyōkotsu                               NaN
28  24202  Muteking the Dancing Hero                          NaN
29  24201  World Trigger                                      NaN
30  24200  Gekijō-ban Utano☆Princesama♪ Maji Love ST☆RISH...  NaN
31  24199  My Hero Academia THE MOVIE: World Heroes' Mission  NaN
32  24198  Vampire Dies in No Time                            NaN
33  24196  Visual Prison                                      NaN
34  24195  IDOLiSH7 Third Beat!                               Kujo starts carrying out his plans to defame G...
35  24194  Jujutsu Kaisen 0                                   NaN
36  24193  Gekijō-ban Jujutsu Kaisen 0                        NaN
37  24192  takt op.                                           NaN
38  24191  She Professed Herself Pupil of the Wise Man        NaN
39  24189  Akebi's Sailor Uniform                             NaN
40  24187  Love and Heart                                     Sure, university freshman Yagisawa has a lot o...
41  24186  Do It Yourself!!                                   NaN
42  24185  Ningen Kaishūsha                                   NaN
43  24183  Kanashiki Debu Neko-chan                           NaN
44  24182  Ikinuke! Bakusō! Kusohamu-chan!                    NaN
45  24180  Kaiju No. 8                                        NaN
46  24179  Phantom Seer                                       NaN
47  24178  Magu-chan: God of Destruction                      The God of Destruction Magu Menueku has been s...
48  24177  i tell c                                           NaN
49  24176  High School Family: Kokosei Kazoku                 NaN
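If you would rather skip the merge, the same result can probably be built in a single pass by collecting one dict per entry. The sketch below is only an illustration under a few assumptions: SAMPLE_XML is a made-up, trimmed-down stand-in for the API response, entries_to_frame is a hypothetical helper name, and only the first Plot Summary child of each entry is kept (findtext returns the first match, or None when there is none).

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical, trimmed-down sample standing in for the real API response.
SAMPLE_XML = """
<ann>
  <anime id="24235" name="Love After World Domination">
    <info type="Main title" lang="EN">Love After World Domination</info>
  </anime>
  <manga id="24225" name="She's My Knight">
    <info type="Main title" lang="EN">She's My Knight</info>
    <info type="Plot Summary">Haruma Ichinose, 17, has been popular since he was born.</info>
  </manga>
</ann>
"""


def entries_to_frame(root):
    """Build one row per direct child of the root element."""
    rows = []
    for entry in root:
        rows.append({
            # .get returns None instead of raising if the attribute is missing
            "id": entry.get("id"),
            "name": entry.get("name"),
            # First <info type="Plot Summary"> child, or None if absent
            "plot_summary": entry.findtext("info[@type='Plot Summary']"),
        })
    return pd.DataFrame(rows, columns=["id", "name", "plot_summary"])


root = ET.fromstring(SAMPLE_XML)
df = entries_to_frame(root)
print(df)
```

With the real response you would pass the root parsed from response.text instead of the sample. Because each summary is looked up relative to its own entry, the ID, name, and summary stay on the same row without a separate mapping.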