Generate a new DataFrame from a nested array in a MongoDB collection

Question

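I am trying to generate a new DataFrame from a MongoDB collection. The goal is to build a new df that contains only the contents of the 'events' field:

For example: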

{
    "_id" : 1641008579,
    "status" : "init",
    "description" : "Test",
    "attachment" : null,
    "start" : "08:00",
    "user" : "Jenny",
    "timestamp" : ISODate("2022-01-01T04:43:11.380Z"),
    "events" : [ 
        {
            "id" : 1641008580,
            "status" : "start",
            "description" : "First Event",
            "user" : "Jenny",
            "timestamp" : ISODate("2022-01-01T04:43:11.380Z")
        }, 
        {
            "id" : 1641008581,
            "status" : "progress",
            "description" : "Middle of the Event",
            "user" : "Joe",
            "timestamp" : ISODate("2022-01-01T05:43:11.380Z")
        }, 
        {
            "id" : 1641008582,
            "status" : "end",
            "description" : "Last Event",
            "user" : "Alain",
            "timestamp" : ISODate("2022-01-01T06:43:11.380Z")
        }
    ]
}

Any idea how I could approach this to get the following result?

event_df should look like this:

    id          status      description             user    timestamp
0   1641008580  start       First Event             Jenny   "2022-01-01T04:43:11.380Z"
1   1641008581  progress    Middle of the Event     Joe     "2022-01-01T05:43:11.380Z"
2   1641008582  end         Last Event              Alain   "2022-01-01T06:43:11.380Z"

/K

Answer 1

The pandas.json_normalize function works well here. According to the docs, it will "normalize semi-structured JSON data into a flat table", and it returns a DataFrame.

API reference -> pandas.json_normalize

import json
import pandas as pd

with open('mongo.json') as json_file: # retrieve the json file
    data = json.load(json_file) # deserialize the json file to a dict 
    events_df = pd.json_normalize(data['events']) # normalize and create a dataframe 
    print(events_df)
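
If the documents are already in memory (for example, as a list of dicts returned by a query) rather than in a file, json_normalize can also take the whole list, with record_path naming the nested array and meta carrying over fields from the parent document. A minimal sketch, assuming the documents have the shape shown in the question; the sample `docs` list below is made up for illustration:

```python
import pandas as pd

# Hypothetical sample documents shaped like the one in the question
docs = [
    {
        "_id": 1641008579,
        "status": "init",
        "user": "Jenny",
        "events": [
            {"id": 1641008580, "status": "start",
             "description": "First Event", "user": "Jenny"},
            {"id": 1641008581, "status": "progress",
             "description": "Middle of the Event", "user": "Joe"},
        ],
    },
    {
        "_id": 1641008600,
        "status": "init",
        "user": "Joe",
        "events": [
            {"id": 1641008601, "status": "start",
             "description": "Another Event", "user": "Joe"},
        ],
    },
]

# One row per element of each document's 'events' array;
# meta=['_id'] adds the parent document's _id to every event row.
events_df = pd.json_normalize(docs, record_path="events", meta=["_id"])
print(events_df)
```

The `_id` column is optional; drop the meta argument if only the event fields are needed.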

Answer 2

Here is the function I use once the collection has been loaded:

import pandas as pd

def set_event_2_df(last_situation):
    # Collect one dict per event and build the DataFrame once at the end;
    # DataFrame.append was removed in pandas 2.0.
    rows = []
    for doc in last_situation:
        # Documents without an 'events' array contribute no rows
        for event in doc.get('events') or []:
            try:
                rows.append({
                    'id': str(event['id']),
                    'status': event['status'],
                    'description': event['description'],
                    'user': event['user'],
                    'timestamp': event['timestamp'],
                })
            except (KeyError, TypeError) as e:
                # Report malformed events and skip them
                print('EXP - {}'.format(e))
    return pd.DataFrame(rows, columns=['id', 'status', 'description', 'user', 'timestamp'])
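
The expected output in the question shows the timestamps as ISO 8601 strings, whereas BSON dates normally arrive in Python as datetime objects. A minimal sketch of converting such a column, assuming the values are naive datetimes in UTC (which is why a literal "Z" is appended); the small `events_df` here is made up for illustration:

```python
from datetime import datetime

import pandas as pd

# Hypothetical events frame with datetime values, as a driver would return them
events_df = pd.DataFrame({
    "id": [1641008580, 1641008581],
    "timestamp": [
        datetime(2022, 1, 1, 4, 43, 11, 380000),
        datetime(2022, 1, 1, 5, 43, 11, 380000),
    ],
})

# %f yields six-digit microseconds; trim the last three digits to get
# milliseconds, then append 'Z' (assumes the values are UTC).
events_df["timestamp"] = (
    events_df["timestamp"].dt.strftime("%Y-%m-%dT%H:%M:%S.%f").str[:-3] + "Z"
)
print(events_df)
```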
