Convert nested mongo db documents into pandas dataframe
I have a MongoDB collection containing documents like this:
doc = {
"_id": {
"$oid": "516622c9ce21150200000d87"
},
"SubmissionDate": {
"$date": "2013-04-11T02:41:13.162Z"
},
"isComplete": True,
"Rounds": [
{
"Photo": [
],
"A": {
"Complexity": 55,
"Colour": 85,
"Deep": 51,
"Effervescence": 44
},
"B": {
"QualityPIDs": [
],
"QualityScales": [
],
"Complexity": 43,
"Qualities": [
]
},
"C": {
"QualityPIDs": [
],
"QualityScales": [
],
"Complexity": 60,
"UHS": 46,
"Colour": 33,
"Qualities": [
]
},
"D": {
"Complexity": 73,
"Duration": 68,
"Quality": 65
}
}
],
"Item": {
"_id": {
"$oid": "51e6d678c06918db21156f92"
},
"Country": "Australia",
"Name": "King",
"PeopleId": {
"$oid": "51dddb69a9d9350200000"
},
"Style": "Apple",
"Type": "Flat",
"UserSubmitted": False
}
}
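I need to convert this collection into a pandas DataFrame.
The solution suggested in "How to import data from mongodb to pandas?" does most of the work, but I am still left with a Rounds column that contains dictionaries.
To access the sub-dictionaries of Rounds, I wrote a loop: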
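import pandas as pd
df = pd.json_normalize(doc)
A_data = pd.DataFrame(columns=df.Rounds[0][0]['A'].keys())
for i in range(len(df.Rounds)):
    # index with i (not 0) so each row's own Rounds entry is read;
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
    A_data = pd.concat([A_data, pd.json_normalize(df.Rounds[i][0]['A'])], ignore_index=True)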
Finally, I join A_data to my main DataFrame.
Is there a faster way? Right now the loop takes a long time. Thanks!
Each level of the dict can be specified with the meta parameter, and record_path is set to 'Rounds'.
import pandas as pd
meta = [['_id', '$oid'],
['Item', 'Country'],
['Item', 'Name'],
['Item', 'Style'],
['Item', 'Type'],
['Item', 'UserSubmitted'],
['Item', '_id', '$oid'],
['Item', 'PeopleId', '$oid'],
['SubmissionDate', '$date'],
'isComplete']
df = pd.json_normalize(doc, record_path='Rounds', meta=meta)
# display(df)
Photo A.Complexity A.Colour A.Deep A.Effervescence B.QualityPIDs B.QualityScales B.Complexity B.Qualities C.QualityPIDs C.QualityScales C.Complexity C.UHS C.Colour C.Qualities D.Complexity D.Duration D.Quality _id.$oid Item.Country Item.Name Item.Style Item.Type Item.UserSubmitted Item._id.$oid Item.PeopleId.$oid SubmissionDate.$date isComplete
0 [] 55 85 51 44 [] [] 43 [] [] [] 60 46 33 [] 73 68 65 516622c9ce21150200000d87 Australia King Apple Flat False 51e6d678c06918db21156f92 51dddb69a9d9350200000 2013-04-11T02:41:13.162Z True
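The same call should also work on a whole list of documents, which is what removes the need for a row-by-row loop. Below is a minimal sketch, assuming the collection has already been loaded into a Python list of dicts shaped like `doc`. The list `docs`, the second document and its values are made up for illustration, and only a few of the fields are kept to keep it short.

```python
import pandas as pd

# Hypothetical documents shaped like `doc` above, trimmed to a few fields.
docs = [
    {
        "_id": {"$oid": "516622c9ce21150200000d87"},
        "isComplete": True,
        "Rounds": [{"A": {"Complexity": 55, "Colour": 85}, "D": {"Quality": 65}}],
        "Item": {"Country": "Australia", "Name": "King"},
    },
    {
        "_id": {"$oid": "hypothetical-second-id"},  # made-up id
        "isComplete": False,
        "Rounds": [
            {"A": {"Complexity": 10, "Colour": 20}, "D": {"Quality": 30}},
            {"A": {"Complexity": 11, "Colour": 21}, "D": {"Quality": 31}},
        ],
        "Item": {"Country": "France", "Name": "Queen"},
    },
]

meta = [["_id", "$oid"], ["Item", "Country"], ["Item", "Name"], "isComplete"]

# One call flattens every round of every document: one output row per round,
# with the meta fields of the parent document repeated on each of its rows.
df_all = pd.json_normalize(docs, record_path="Rounds", meta=meta)
print(df_all)
```

Each entry in Rounds becomes its own row, so a document with two rounds contributes two rows that share the same `_id.$oid` and `Item.*` values.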