Flatten JSON to pandas using json_normalize
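I need to flatten XML data and then convert it to JSON for pandas.
The problem is that I want the gare value of the passages element (<passages gare="87271460">) to appear in every row of the DataFrame.
(Context: a gare is a train station, identified by an ID. I call the API like this: https://api.transilien.com/gare/87271460/depart, and I intend to call it for 17 different train stations.)
I'm not sure how I should go about this. Here is what I have so far: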
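import json
import pandas as pd
import requests
import xmltodict

# url, headers and payload are defined elsewhere
response = requests.request("GET", url, headers=headers, data=payload)
# print(response.text)
parsed = xmltodict.parse(response.content)  # renamed from dict to avoid shadowing the built-in
# print(parsed)
# Strip the '@' and '#' prefixes that xmltodict puts on attribute and text keys
s = json.dumps(parsed).replace('\'', '"').replace('#', '').replace('@', '')
json_object = json.loads(s)
# print(json_object)
df = pd.json_normalize(json_object['passages'], record_path=['train'])
print(df)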
This is the XML I retrieve from the request (after removing unwanted characters):
<?xml version="1.0" encoding="UTF-8"?>
<passages gare="87271460">
<train>
<date mode="R">10/04/2022 14:05</date>
<num>PIST64</num>
<miss>PIST</miss>
<term>87758896</term>
</train>
<train>
<date mode="R">10/04/2022 14:09</date>
<num>KALI66</num>
<miss>KALI</miss>
<term>87393579</term>
</train>
</passages>
The final output I need is:
passage num miss term etat date.mode date.text
0 87271460 ERBE85 ERBE 87271486 Supprimé R 10/04/2022 16:09
1 87271460 PINS74 PINS 87758896 NaN R 10/04/2022 16:10
2 87271460 PINS80 PINS 87758896 NaN R 10/04/2022 16:17
3 87271460 KARE82 KARE 87758623 Supprimé R 10/04/2022 16:23
4 87271460 EPAU81 EPAU 87001479 NaN R 10/04/2022 16:29
5 87271460 ERIO91 ERIO 87001479 NaN R 10/04/2022 16:30
6 87271460 PINS86 PINS 87758896 NaN R 10/04/2022 16:32
7 87271460 KARE88 KARE 87393579 NaN R 10/04/2022 16:38
8 87271460 ERBE97 ERBE 87271486 Supprimé R 10/04/2022 16:39
9 87271460 EPAU93 EPAU 87001479 NaN R 10/04/2022 16:43
10 87271460 PINS92 PINS 87758896 NaN R 10/04/2022 16:47
11 87271460 EPIN99 EPIN 87001479 NaN R 10/04/2022 16:52
12 87271460 KARE94 KARE 87758623 Supprimé R 10/04/2022 16:53
13 87271460 ERAN67 ERAN 87001479 NaN R 10/04/2022 16:54
14 87271460 EPOL69 EPOL 87001479 NaN R 10/04/2022 17:01
15 87271460 PINS98 PINS 87758896 NaN R 10/04/2022 17:02
16 87271460 KABE02 KABE 87393579 NaN R 10/04/2022 17:08
17 87271460 ERAN73 ERAN 87001479 NaN R 10/04/2022 17:09
18 87271460 EPOL75 EPOL 87001479 NaN R 10/04/2022 17:16
19 87271460 PITA06 PITA 87758896 NaN R 10/04/2022 17:17
20 87271460 KABE08 KABE 87393579 NaN R 10/04/2022 17:23
21 87271460 ERAN79 ERAN 87001479 NaN R 10/04/2022 17:24
22 87271460 EPOL81 EPOL 87001479 NaN R 10/04/2022 17:31
23 87271460 PITA12 PITA 87758896 NaN R 10/04/2022 17:32
24 87271460 KABE14 KABE 87393579 NaN R 10/04/2022 17:38
25 87271460 ERAN85 ERAN 87001479 NaN R 10/04/2022 17:39
26 87271460 EPOL87 EPOL 87001479 NaN R 10/04/2022 17:46
27 87271460 PITA18 PITA 87758896 NaN R 10/04/2022 17:47
28 87271460 KABE20 KABE 87393579 NaN R 10/04/2022 17:53
29 87271460 ERAN91 ERAN 87001479 NaN R 10/04/2022 17:54
You can create the DataFrame from the full JSON (not just passages), then join the gare column with the normalized train columns:
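import json
import pandas as pd
import xmltodict

response = """<?xml version="1.0" encoding="UTF-8"?>
<passages gare="87271460">
<train>
<date mode="R">10/04/2022 14:05</date>
<num>PIST64</num>
<miss>PIST</miss>
<term>87758896</term>
</train>
<train>
<date mode="R">10/04/2022 14:09</date>
<num>KALI66</num>
<miss>KALI</miss>
<term>87393579</term>
</train>
</passages>
"""
parsed = xmltodict.parse(response)  # renamed from dict to avoid shadowing the built-in
# Strip the '@' and '#' prefixes that xmltodict puts on attribute and text keys
s = json.dumps(parsed).replace('\'', '"').replace('#', '').replace('@', '')
json_object = json.loads(s)
# One row per top-level key ('passages'), with 'gare' and 'train' as columns
df = pd.DataFrame.from_dict(json_object, orient='index')
# One row per train; 'gare' is repeated on every row
df = df.explode('train').reset_index(drop=True)
# drop('train', 1) no longer works in pandas 2.x, so name the axis explicitly
df = df.join(pd.json_normalize(df['train'])).drop(columns='train')
print(df)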
Output:
gare num miss term date.mode date.text
0 87271460 PIST64 PIST 87758896 R 10/04/2022 14:05
1 87271460 KALI66 KALI 87393579 R 10/04/2022 14:09
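As a possible alternative, json_normalize can carry the station ID itself through its meta argument, which avoids the explode and join steps. The sketch below is only an illustration under a few assumptions: departures_to_frame is a hypothetical helper name, the string-replace step is swapped for a column rename, force_list is used so that a station with a single train still yields a list, and the second XML document (gare="00000000") is made-up sample data rather than a real API response.

```python
import pandas as pd
import xmltodict


def departures_to_frame(xml_text: str) -> pd.DataFrame:
    """Flatten one <passages> document into one row per <train> (hypothetical helper)."""
    # force_list keeps 'train' a list even when there is only one <train> element
    parsed = xmltodict.parse(xml_text, force_list=("train",))
    # record_path selects the rows; meta copies the station ID onto every row
    df = pd.json_normalize(parsed["passages"], record_path="train", meta=["@gare"])
    # xmltodict prefixes attributes with '@' and element text with '#'
    df.columns = df.columns.str.replace(r"[@#]", "", regex=True)
    return df


# Sample taken from the question
sample = """<passages gare="87271460">
<train><date mode="R">10/04/2022 14:05</date><num>PIST64</num><miss>PIST</miss><term>87758896</term></train>
<train><date mode="R">10/04/2022 14:09</date><num>KALI66</num><miss>KALI</miss><term>87393579</term></train>
</passages>"""

# Made-up document with a single train, only to exercise force_list
single = """<passages gare="00000000">
<train><date mode="R">10/04/2022 14:05</date><num>PIST64</num><miss>PIST</miss><term>87758896</term></train>
</passages>"""

df = departures_to_frame(sample)
print(df)

# For several stations, build one frame per response and concatenate them
all_stations = pd.concat(
    [departures_to_frame(doc) for doc in (sample, single)],
    ignore_index=True,
)
print(all_stations)
```

With the real API, each response.content from the 17 station calls could be passed to the helper in the same way. A response with no train elements at all is not handled here and would need its own check.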