Flatten a triple-nested JSON into a DataFrame
Question
I have a very large JSON file that looks like this minimal example:
json_file = """
{
"products":
[
{
"id":"0",
"name": "First",
"emptylist":[],
"properties" :
{
"id" : "",
"name" : ""
}
},
{
"id":"1",
"name": "Second",
"emptylist":[],
"properties":
{
"id" : "23",
"name" : "a useful product",
"features" :
[
{
"name":"Features",
"id":"18",
"features":
[
{
"id":"1001",
"name":"Colour",
"value":"Black"
},
{
"id":"2093",
"name":"Material",
"value":"Plastic"
}
]
},
{
"name":"Sizes",
"id":"34",
"features":
[
{
"id":"4736",
"name":"Length",
"value":"56"
},
{
"id":"8745",
"name":"Width",
"value":"76"
}
]
}
]
}
},
{
"id":"2",
"name": "Third",
"properties" :
{
"id" : "876",
"name" : "another one",
"features" :
[
{
"name":"Box",
"id":"937",
"features":
[
{
"id":"3758",
"name":"Amount",
"value":"1"
},
{
"id":"2222",
"name":"Packaging",
"value":"Blister"
}
]
},
{
"name":"Features",
"id":"8473",
"features":
[
{
"id":"9372",
"name":"Colour",
"value":"White"
},
{
"id":"9375",
"name":"Position",
"value":"A"
},
{
"id":"2654",
"name":"Amount",
"value":"6"
}
]
}
]
}
}
]
}
"""
I want to turn it into a flat table. It should look like this:
id  name    emptylist  properties.id  properties.name     properties.features.name  properties.features.id  properties.features.features.id  properties.features.features.name  properties.features.features.value
0   First   []         ""             ""                  NaN                       NaN                     NaN                              NaN                                NaN
1   Second  []         "23"           "a useful product"  Features                  18                      1001                             Colour                             Black
1   Second  []         "23"           "a useful product"  Features                  18                      2093                             Material                           Plastic
1   Second  []         "23"           "a useful product"  Sizes                     34                      4736                             Length                             56
1   Second  []         "23"           "a useful product"  Sizes                     34                      8745                             Width                              76
2   Third              "876"          "another one"       Box                       937                     3758                             Amount                             1
2   Third              "876"          "another one"       Box                       937                     2222                             Packaging                          Blister
2   Third              "876"          "another one"       Features                  8473                    9372                             Colour                             White
2   Third              "876"          "another one"       Features                  8473                    9375                             Position                           A
2   Third              "876"          "another one"       Features                  8473                    2654                             Amount                             6
What I tried
I tried this:
import pandas as pd
import json
j = json.loads(json_file)
df = pd.json_normalize(j['products'])
df
   id  name    emptylist  properties.id  properties.name   properties.features
0  0   First   []                                          NaN
1  1   Second  []         23             a useful product  [{'name': 'Features', 'id': '18', 'features': ...
2  2   Third   NaN        876            another one       [{'name': 'Box', 'id': '937', 'features': [{'i...
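I also experimented with the other parameters, but got nowhere. This does not seem to be the right approach.
Can anyone help me?
Additional information
I have a working solution in R, but I need to be able to do this in Python.
If it helps, this is the R code I am trying to translate to Python: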
library(tidyr)
jsonlite::fromJSON(json_file)$products %>%
jsonlite::flatten() %>%
unnest(properties.features , names_sep = ".", keep_empty = TRUE) %>%
unnest(properties.features.features, names_sep = ".", keep_empty = TRUE)
Edit
With the help of @piterbarg and some research, I found this solution:
j = json.loads(json_file)
df = pd.json_normalize(j['products'])
df1 = df.explode('properties.features')
df2 = pd.concat([df1.reset_index(drop=True).drop('properties.features', axis = 1),
df1['properties.features'].apply(pd.Series).reset_index(drop=True).add_prefix("properties.features.").drop("properties.features.0", axis = 1)], axis = 1)
df2 = df2.explode('properties.features.features')
df3 = pd.concat([df2.reset_index(drop=True).drop('properties.features.features', axis = 1),
df2['properties.features.features'].apply(pd.Series).reset_index(drop=True).add_prefix("properties.features.features.").drop("properties.features.features.0", axis = 1)], axis = 1)
df3
With this I get the result I was looking for, but the code looks messy and I am not sure how efficient this solution is. Any help?
This can be done by repeatedly applying explode to expand the lists and apply(pd.Series) to expand the dicts, although it is a bit tedious:
df1 = df.explode('properties.features')
df2 = df1.join(df1['properties.features'].apply(pd.Series), lsuffix = '', rsuffix = '.properties.features').explode('features').drop(columns = 'properties.features')
df3 = df2.join(df2['features'].apply(pd.Series), lsuffix = '', rsuffix='.features').drop(columns = ['features','emptylist']).drop_duplicates()
df3
The result looks like this:
    id  name    properties.id  properties.name   0    id.properties.features  name.properties.features  0.features  id.features  name.features  value
--  --  ------  -------------  ----------------  ---  ----------------------  ------------------------  ----------  -----------  -------------  -------
0   0   First                                    nan  nan                     nan                       nan         nan          nan            nan
1   1   Second  23             a useful product  nan  18                      Features                  nan         1001         Colour         Black
1   1   Second  23             a useful product  nan  18                      Features                  nan         2093         Material       Plastic
1   1   Second  23             a useful product  nan  18                      Features                  nan         4736         Length         56
1   1   Second  23             a useful product  nan  18                      Features                  nan         8745         Width          76
1   1   Second  23             a useful product  nan  34                      Sizes                     nan         1001         Colour         Black
1   1   Second  23             a useful product  nan  34                      Sizes                     nan         2093         Material       Plastic
1   1   Second  23             a useful product  nan  34                      Sizes                     nan         4736         Length         56
1   1   Second  23             a useful product  nan  34                      Sizes                     nan         8745         Width          76
2   2   Third   876            another one       nan  937                     Box                       nan         3758         Amount         1
2   2   Third   876            another one       nan  937                     Box                       nan         2222         Packaging      Blister
2   2   Third   876            another one       nan  937                     Box                       nan         9372         Colour         White
2   2   Third   876            another one       nan  937                     Box                       nan         9375         Position       A
2   2   Third   876            another one       nan  937                     Box                       nan         2654         Amount         6
2   2   Third   876            another one       nan  8473                    Features                  nan         3758         Amount         1
2   2   Third   876            another one       nan  8473                    Features                  nan         2222         Packaging      Blister
2   2   Third   876            another one       nan  8473                    Features                  nan         9372         Colour         White
2   2   Third   876            another one       nan  8473                    Features                  nan         9375         Position       A
2   2   Third   876            another one       nan  8473                    Features                  nan         2654         Amount         6
The column names are not quite what you asked for; if needed, they can be fixed with .rename(columns = {...}).
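Note that the table above contains rows that are not in the input, such as Sizes paired with Colour. explode keeps the original index, so all rows of one product share an index label, and join matches on that label, which pairs every row with every other row of the same product. The following is a minimal sketch of that effect and of one way around it, assuming it is acceptable to reset the index after explode. The tiny frame is made up for illustration, and pd.DataFrame(... .tolist()) is used here as a stand-in for apply(pd.Series).

```python
import pandas as pd

# Made-up frame: one product whose "features" cell holds two dicts.
df = pd.DataFrame({
    "id": ["1"],
    "features": [[{"name": "Colour"}, {"name": "Material"}]],
})

# explode keeps the original index, so both new rows carry the label 0.
exploded = df.explode("features")
spread = pd.DataFrame(exploded["features"].tolist(), index=exploded.index)

# join matches on index labels: 2 left rows x 2 right rows share label 0.
crossed = exploded.join(spread)

# With a fresh index every row has its own label and the join is one-to-one.
exploded = exploded.reset_index(drop=True)
spread = pd.DataFrame(exploded["features"].tolist(), index=exploded.index)
clean = exploded.join(spread).drop(columns="features")

print(len(crossed), len(clean))
```

Here crossed has one row per pairing, while clean has one row per list element, which is the shape the question asks for.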
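This is similar to what you have in your Edit, but with shorter syntax and, in my timings below, much better performance.
If your DataFrame contains NaN, older versions of Pandas may fail on json_normalize.
This solution should work with Pandas 1.3+.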
df = pd.json_normalize(j['products'])  # j = json.loads(json_file), as in the question
df = df.explode('properties.features')
df = pd.concat([df.drop('properties.features', axis=1).reset_index(drop=True),
pd.json_normalize(df['properties.features']).add_prefix('properties.features.')], axis=1)
df = df.explode('properties.features.features')
df = pd.concat([df.drop('properties.features.features', axis=1).reset_index(drop=True),
pd.json_normalize(df['properties.features.features']).add_prefix('properties.features.features.')], axis=1)
Performance with 1000 products:
Code in Edit : 4.85 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This solution: 58.3 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
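The two explode/concat steps follow the same pattern, so they can be wrapped in a loop that keeps expanding until no column holds a list of dicts any more. This is only a sketch under assumptions: the names flatten_nested and is_dict_list are hypothetical, the sample data is a shortened copy of the question's JSON, NaN cells are replaced by empty dicts before normalizing, and no timings were taken for this variant.

```python
import pandas as pd


def is_dict_list(value):
    """True for a non-empty list whose items are all dicts."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(item, dict) for item in value)
    )


def flatten_nested(records, sep="."):
    """Flatten dicts with json_normalize, then expand list-of-dict columns."""
    df = pd.json_normalize(records, sep=sep)
    while True:
        nested = [col for col in df.columns if df[col].map(is_dict_list).any()]
        if not nested:
            return df
        col = nested[0]
        # One row per list element; rows without a list become a single NaN row.
        df = df.explode(col).reset_index(drop=True)
        # Replace NaN by {} so that every cell yields exactly one output row.
        cells = [cell if isinstance(cell, dict) else {} for cell in df[col]]
        wide = pd.json_normalize(cells, sep=sep).add_prefix(col + sep)
        df = pd.concat([df.drop(columns=col), wide], axis=1)


# Shortened copy of the question's data (first two products only).
products = [
    {"id": "0", "name": "First", "emptylist": [],
     "properties": {"id": "", "name": ""}},
    {"id": "1", "name": "Second", "emptylist": [],
     "properties": {"id": "23", "name": "a useful product", "features": [
         {"name": "Features", "id": "18", "features": [
             {"id": "1001", "name": "Colour", "value": "Black"},
             {"id": "2093", "name": "Material", "value": "Plastic"}]},
         {"name": "Sizes", "id": "34", "features": [
             {"id": "4736", "name": "Length", "value": "56"},
             {"id": "8745", "name": "Width", "value": "76"}]},
     ]}},
]

flat = flatten_nested(products)
print(flat)
```

Columns that only ever contain empty lists, such as emptylist, are left untouched because they never match is_dict_list.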