使用 pandas json_normalize 展平包含多个嵌套列表的 Dict 列表
Flattening List of Dict containing multiple nested lists using pandas json_normalize
我们有以下列表 -
users_info = [
{
"Id": "21",
"Name": "ABC",
"Country": {
"Name": "Country 1"
},
"Addresses": {
"records": [
{
"addressId": "12",
"line1": "xyz, 102",
"city": "PQR"
},
{
"addressId": "13",
"line1": "YTR, 102",
"city": "NMS"
}
]
},
"Education": {
"records": [
{
"Id": "45",
"Degree": "Bachelors"
},
{
"Id": "49",
"Degree": "Masters"
}
]
}
},
{
"Id": "26",
"Name": "PEW",
"Country": {
"Name": "Country 2"
},
"Addresses": {
"records": [
{
"addressId": "10",
"line1": "BTR, 12",
"city": "UYT"
},
{
"addressId": "123",
"line1": "MEQW, 6",
"city": "KJH"
}
]
},
"Education": {
"records": [
{
"Id": "45",
"Degree": "Bachelors"
},
{
"Id": "49",
"Degree": "Masters"
}
]
}
},
{
"Id": "214",
"Name": "TUF",
"Country": None,
"Addresses": {
"records": None
},
"Education": {
"records": [
{
"Id": "45",
"Degree": "Bachelors"
},
{
"Id": "49",
"Degree": "Masters"
}
]
}
},
{
"Id": "2609",
"Name": "JJU",
"Country": {
"Name": "Country 2"
},
"Addresses": {
"records": [
{
"addressId": "10",
"line1": "BTR, 12",
"city": "UYT"
},
{
"addressId": "123",
"line1": "MEQW, 6",
"city": "KJH"
}
]
},
"Education": None
}
]
我想展平这个字典列表并创建 pandas 数据框 -
Id | Name | Country.Name | Addresses.addressId | Addresses.line1 | Addresses.city | Education.Id | Education.Degree
有两个字典列表 - 地址和教育。这些(国家、地址和教育)中的任何一个都可能是 None
如何使用上述数据创建数据框以获得所需的格式?
这是我试过的 -
dataset_1 = pandas.json_normalize([user for user in users_info],
record_path = ["Addresses"],
meta = ["Id", "Name", ["Country", "Name"]],
errors="ignore"
)
dataset_2 = pandas.json_normalize([user for user in users_info],
record_path = ["Education"],
meta = ["Id", "Name", ["Country", "Name"]],
errors="ignore"
)
dataset = pandas.concat([dataset_1, dataset_2], axis=1)
当 dataset_1 执行时,我得到 -
NoneType object is not iterable error
您可以使用:
df = pd.json_normalize(users_info)
df_addresses = df['Addresses.records'].explode().apply(pd.Series)
df_addresses.rename(columns={col:f'Addresses.{col}' for col in df_addresses.columns}, inplace=True)
df_education = df['Education.records'].explode().apply(pd.Series)
df_education.rename(columns={col:f'Education.{col}' for col in df_education.columns}, inplace=True)
cols = [col for col in df.columns if col not in ['Addresses.records', 'Education.records']]
df = df[cols].join(df_addresses).join(df_education)
df.dropna(axis=1, how='all', inplace=True)
print(df)
OUTPUT
Id Name Country.Name Addresses.addressId Addresses.line1 Addresses.city Education.Degree Education.Id
0 21 ABC Country 1 12 xyz, 102 PQR Bachelors 45
0 21 ABC Country 1 12 xyz, 102 PQR Masters 49
0 21 ABC Country 1 13 YTR, 102 NMS Bachelors 45
0 21 ABC Country 1 13 YTR, 102 NMS Masters 49
1 26 PEW Country 2 10 BTR, 12 UYT Bachelors 45
1 26 PEW Country 2 10 BTR, 12 UYT Masters 49
1 26 PEW Country 2 123 MEQW, 6 KJH Bachelors 45
1 26 PEW Country 2 123 MEQW, 6 KJH Masters 49
2 214 TUF NaN NaN NaN NaN Bachelors 45
2 214 TUF NaN NaN NaN NaN Masters 49
3 2609 JJU Country 2 10 BTR, 12 UYT NaN NaN
3 2609 JJU Country 2 123 MEQW, 6 KJH NaN NaN
我们有以下列表 -
users_info = [
{
"Id": "21",
"Name": "ABC",
"Country": {
"Name": "Country 1"
},
"Addresses": {
"records": [
{
"addressId": "12",
"line1": "xyz, 102",
"city": "PQR"
},
{
"addressId": "13",
"line1": "YTR, 102",
"city": "NMS"
}
]
},
"Education": {
"records": [
{
"Id": "45",
"Degree": "Bachelors"
},
{
"Id": "49",
"Degree": "Masters"
}
]
}
},
{
"Id": "26",
"Name": "PEW",
"Country": {
"Name": "Country 2"
},
"Addresses": {
"records": [
{
"addressId": "10",
"line1": "BTR, 12",
"city": "UYT"
},
{
"addressId": "123",
"line1": "MEQW, 6",
"city": "KJH"
}
]
},
"Education": {
"records": [
{
"Id": "45",
"Degree": "Bachelors"
},
{
"Id": "49",
"Degree": "Masters"
}
]
}
},
{
"Id": "214",
"Name": "TUF",
"Country": None,
"Addresses": {
"records": None
},
"Education": {
"records": [
{
"Id": "45",
"Degree": "Bachelors"
},
{
"Id": "49",
"Degree": "Masters"
}
]
}
},
{
"Id": "2609",
"Name": "JJU",
"Country": {
"Name": "Country 2"
},
"Addresses": {
"records": [
{
"addressId": "10",
"line1": "BTR, 12",
"city": "UYT"
},
{
"addressId": "123",
"line1": "MEQW, 6",
"city": "KJH"
}
]
},
"Education": None
}
]
我想展平这个字典列表并创建 pandas 数据框 -
Id | Name | Country.Name | Addresses.addressId | Addresses.line1 | Addresses.city | Education.Id | Education.Degree
有两个字典列表 - 地址和教育。这些(国家、地址和教育)中的任何一个都可能是 None
如何使用上述数据创建数据框以获得所需的格式?
这是我试过的 -
dataset_1 = pandas.json_normalize([user for user in users_info],
record_path = ["Addresses"],
meta = ["Id", "Name", ["Country", "Name"]],
errors="ignore"
)
dataset_2 = pandas.json_normalize([user for user in users_info],
record_path = ["Education"],
meta = ["Id", "Name", ["Country", "Name"]],
errors="ignore"
)
dataset = pandas.concat([dataset_1, dataset_2], axis=1)
当 dataset_1 执行时,我得到 -
NoneType object is not iterable error
您可以使用:
df = pd.json_normalize(users_info)
df_addresses = df['Addresses.records'].explode().apply(pd.Series)
df_addresses.rename(columns={col:f'Addresses.{col}' for col in df_addresses.columns}, inplace=True)
df_education = df['Education.records'].explode().apply(pd.Series)
df_education.rename(columns={col:f'Education.{col}' for col in df_education.columns}, inplace=True)
cols = [col for col in df.columns if col not in ['Addresses.records', 'Education.records']]
df = df[cols].join(df_addresses).join(df_education)
df.dropna(axis=1, how='all', inplace=True)
print(df)
OUTPUT
Id Name Country.Name Addresses.addressId Addresses.line1 Addresses.city Education.Degree Education.Id
0 21 ABC Country 1 12 xyz, 102 PQR Bachelors 45
0 21 ABC Country 1 12 xyz, 102 PQR Masters 49
0 21 ABC Country 1 13 YTR, 102 NMS Bachelors 45
0 21 ABC Country 1 13 YTR, 102 NMS Masters 49
1 26 PEW Country 2 10 BTR, 12 UYT Bachelors 45
1 26 PEW Country 2 10 BTR, 12 UYT Masters 49
1 26 PEW Country 2 123 MEQW, 6 KJH Bachelors 45
1 26 PEW Country 2 123 MEQW, 6 KJH Masters 49
2 214 TUF NaN NaN NaN NaN Bachelors 45
2 214 TUF NaN NaN NaN NaN Masters 49
3 2609 JJU Country 2 10 BTR, 12 UYT NaN NaN
3 2609 JJU Country 2 123 MEQW, 6 KJH NaN NaN