Flatten JSON to pandas using json_normalize

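I need to flatten some XML data and convert it, via JSON, into a pandas DataFrame. The problem is that I want the value of the gare attribute of passages (<passages gare="87271460">) to appear on every row of the DataFrame.

(Context: a gare is a train station, identified by an ID. I call the API like this: https://api.transilien.com/gare/87271460/depart, and I plan to call it for 17 different train stations.)

I don't know how to do that. Here is what I have so far: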

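    import json

    import pandas as pd
    import requests
    import xmltodict

    # url, headers and payload are assumed to be defined earlier
    response = requests.request("GET", url, headers=headers, data=payload)
    # print(response.text)
    parsed = xmltodict.parse(response.content)
    # Strip the '@' and '#' prefixes that xmltodict adds to attribute and text keys
    s = json.dumps(parsed).replace('#', '').replace('@', '')
    json_object = json.loads(s)
    # print(json_object)
    df = pd.json_normalize(json_object['passages'], record_path=['train'])
    print(df)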

Here is the XML I get back from the request (after removing the unneeded characters):

<?xml version="1.0" encoding="UTF-8"?>
<passages gare="87271460">
    <train>
        <date mode="R">10/04/2022 14:05</date>
        <num>PIST64</num>
        <miss>PIST</miss>
        <term>87758896</term>
    </train>
    <train>
        <date mode="R">10/04/2022 14:09</date>
        <num>KALI66</num>
        <miss>KALI</miss>
        <term>87393579</term>
    </train>
</passages>

The final output I need is:

    passage   num     miss      term      etat date.mode         date.text
0   87271460  ERBE85  ERBE  87271486  Supprimé         R  10/04/2022 16:09
1   87271460  PINS74  PINS  87758896       NaN         R  10/04/2022 16:10
2   87271460  PINS80  PINS  87758896       NaN         R  10/04/2022 16:17
3   87271460  KARE82  KARE  87758623  Supprimé         R  10/04/2022 16:23
4   87271460  EPAU81  EPAU  87001479       NaN         R  10/04/2022 16:29
5   87271460  ERIO91  ERIO  87001479       NaN         R  10/04/2022 16:30
6   87271460  PINS86  PINS  87758896       NaN         R  10/04/2022 16:32
7   87271460  KARE88  KARE  87393579       NaN         R  10/04/2022 16:38
8   87271460  ERBE97  ERBE  87271486  Supprimé         R  10/04/2022 16:39
9   87271460  EPAU93  EPAU  87001479       NaN         R  10/04/2022 16:43
10  87271460  PINS92  PINS  87758896       NaN         R  10/04/2022 16:47
11  87271460  EPIN99  EPIN  87001479       NaN         R  10/04/2022 16:52
12  87271460  KARE94  KARE  87758623  Supprimé         R  10/04/2022 16:53
13  87271460  ERAN67  ERAN  87001479       NaN         R  10/04/2022 16:54
14  87271460  EPOL69  EPOL  87001479       NaN         R  10/04/2022 17:01
15  87271460  PINS98  PINS  87758896       NaN         R  10/04/2022 17:02
16  87271460  KABE02  KABE  87393579       NaN         R  10/04/2022 17:08
17  87271460  ERAN73  ERAN  87001479       NaN         R  10/04/2022 17:09
18  87271460  EPOL75  EPOL  87001479       NaN         R  10/04/2022 17:16
19  87271460  PITA06  PITA  87758896       NaN         R  10/04/2022 17:17
20  87271460  KABE08  KABE  87393579       NaN         R  10/04/2022 17:23
21  87271460  ERAN79  ERAN  87001479       NaN         R  10/04/2022 17:24
22  87271460  EPOL81  EPOL  87001479       NaN         R  10/04/2022 17:31
23  87271460  PITA12  PITA  87758896       NaN         R  10/04/2022 17:32
24  87271460  KABE14  KABE  87393579       NaN         R  10/04/2022 17:38
25  87271460  ERAN85  ERAN  87001479       NaN         R  10/04/2022 17:39
26  87271460  EPOL87  EPOL  87001479       NaN         R  10/04/2022 17:46
27  87271460  PITA18  PITA  87758896       NaN         R  10/04/2022 17:47
28  87271460  KABE20  KABE  87393579       NaN         R  10/04/2022 17:53
29  87271460  ERAN91  ERAN  87001479       NaN         R  10/04/2022 17:54

You can build the DataFrame from the full JSON (not just passages) and then join the gare column to the normalized train columns:

import json

import pandas as pd
import xmltodict

response = """<?xml version="1.0" encoding="UTF-8"?>
<passages gare="87271460">
    <train>
        <date mode="R">10/04/2022 14:05</date>
        <num>PIST64</num>
        <miss>PIST</miss>
        <term>87758896</term>
    </train>
    <train>
        <date mode="R">10/04/2022 14:09</date>
        <num>KALI66</num>
        <miss>KALI</miss>
        <term>87393579</term>
    </train>
</passages>
"""

parsed = xmltodict.parse(response)
# Strip the '@' and '#' prefixes that xmltodict adds to attribute and text keys
s = json.dumps(parsed).replace('#', '').replace('@', '')
json_object = json.loads(s)

# One row for <passages>, with a 'gare' column and a 'train' column holding a list of dicts
df = pd.DataFrame.from_dict(json_object, orient='index')
# One row per train, then expand each train dict into its own columns
df = df.explode('train').reset_index(drop=True)
df = df.join(pd.json_normalize(df['train'])).drop(columns='train')
print(df)

Output:

       gare     num  miss      term date.mode         date.text
0  87271460  PIST64  PIST  87758896         R  10/04/2022 14:05
1  87271460  KALI66  KALI  87393579         R  10/04/2022 14:09
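
Since the question asks about json_normalize specifically, here is a possible alternative sketch that stays with it: record_path picks the train records and meta repeats the gare value on every row. It also passes attr_prefix, cdata_key and force_list to xmltodict, so the string replacements are not needed and a station that returns a single train still yields a list. The helper name trains_frame and the second sample payload (station 00000000, train TEST01) are made up for illustration; only the first payload comes from the question.

```python
import pandas as pd
import xmltodict


def trains_frame(xml_text):
    """Return one row per <train>, with the station ID repeated on each row."""
    parsed = xmltodict.parse(
        xml_text,
        attr_prefix="",          # key 'gare' instead of '@gare'
        cdata_key="text",        # key 'text' instead of '#text'
        force_list=("train",),   # keep a list even when there is only one <train>
    )
    # record_path selects the train records; meta copies 'gare' onto each of them
    return pd.json_normalize(parsed["passages"], record_path="train", meta="gare")


# First payload: the XML from the question.
sample_a = """<passages gare="87271460">
    <train>
        <date mode="R">10/04/2022 14:05</date>
        <num>PIST64</num>
        <miss>PIST</miss>
        <term>87758896</term>
    </train>
    <train>
        <date mode="R">10/04/2022 14:09</date>
        <num>KALI66</num>
        <miss>KALI</miss>
        <term>87393579</term>
    </train>
</passages>"""

# Second payload: hypothetical placeholder values, with a single <train>.
sample_b = """<passages gare="00000000">
    <train>
        <date mode="R">01/01/2000 00:00</date>
        <num>TEST01</num>
        <miss>TEST</miss>
        <term>00000001</term>
    </train>
</passages>"""

# Stack the per-station frames into one DataFrame.
df = pd.concat([trains_frame(x) for x in (sample_a, sample_b)], ignore_index=True)
print(df)
```

With the real API, response.content from each of the 17 requests could presumably be passed to trains_frame in the same way, and df.rename(columns={"gare": "passage"}) would give the column name used in the desired output. Note that this sketch does not handle a station that returns no train at all.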