Complex dict to multicolumn dataframe
I have a complex dictionary that stores values at various "depths". The structure looks like this:
{
"key1":"value1",
"key2":[
{
"key2.1a":"value2.1a",
"key2.2a":"value2.2a",
"key2.3a":{
"keya2.3.1a":"value2.3.1a"
},
"key2.4a":"value2.4a",
"key2.5a":"value2.5a",
"key2.6a":"value2.6a",
"key2.7a":"value2.7a",
"key2.8a":"value2.8a",
"key2.9a":"value2.9a",
"key2.10a":{
"key2.10.1a":"value2.10.1a",
"key2.10.2a":"value2.10.2a",
"key2.10.3a":"value2.10.3a",
"key2.10.4a":{
"key2.10.4.1a":"value2.10.4.1a"
}
},
"key2.11a":{
"key2.11.1a":"value2.11.1a",
"key2.11.2a":"value2.11.2a"
},
"key2.12a":"value2.12a",
"key2.13a":"value2.13a"
},
{
"key2.1b":"value2.1b",
"key2.2b":"value2.2b",
"key2.3b":{
"keya2.3.1b":"value2.3.1b"
},
"key2.4b":"value2.4b",
"key2.5b":"value2.5b",
"key2.6b":"value2.6b",
"key2.7b":"value2.7b",
"key2.8b":"value2.8b",
"key2.9b":"value2.9b",
"key2.10b":{
"key2.10.1b":"value2.10.1b",
"key2.10.2b":"value2.10.2b",
"key2.10.3b":"value2.10.3b",
"key2.10.4b":{
"key2.10.4.1b":"value2.10.4.1b"
}
},
"key2.11b":{
"key2.11.1b":"value2.11.1b",
"key2.11.2b":"value2.11.2b"
},
"key2.12b":"value2.12b",
"key2.13b":"value2.13b"
}
],
"key3":"value3"
}
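The numbers represent the "depth" in the tree, and the letters ("a" and "b") are separate records.
I would like a DataFrame with hierarchically indexed columns that looks more or less like this:
So far I have tried using a MultiIndex for the columns: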
columns = pd.MultiIndex.from_product([["key1", "key2", "key3"], ["key2.1","key2.2","key2.3"]])
df = pd.DataFrame(dict, columns = columns)
But it gives me an empty DataFrame. Is there a way to specify a "path" for each column?
import pandas as pd
nested_dict = {
"key1":"value1",
"key2":[
{
"key2.1a":"value2.1a",
"key2.2a":"value2.2a",
"key2.3a":{
"keya2.3.1a":"value2.3.1a"
},
"key2.4a":"value2.4a",
"key2.5a":"value2.5a",
"key2.6a":"value2.6a",
"key2.7a":"value2.7a",
"key2.8a":"value2.8a",
"key2.9a":"value2.9a",
"key2.10a":{
"key2.10.1a":"value2.10.1a",
"key2.10.2a":"value2.10.2a",
"key2.10.3a":"value2.10.3a",
"key2.10.4a":{
"key2.10.4.1a":"value2.10.4.1a"
}
},
"key2.11a":{
"key2.11.1a":"value2.11.1a",
"key2.11.2a":"value2.11.2a"
},
"key2.12a":"value2.12a",
"key2.13a":"value2.13a"
},
{
"key2.1b":"value2.1b",
"key2.2b":"value2.2b",
"key2.3b":{
"keya2.3.1b":"value2.3.1b"
},
"key2.4b":"value2.4b",
"key2.5b":"value2.5b",
"key2.6b":"value2.6b",
"key2.7b":"value2.7b",
"key2.8b":"value2.8b",
"key2.9b":"value2.9b",
"key2.10b":{
"key2.10.1b":"value2.10.1b",
"key2.10.2b":"value2.10.2b",
"key2.10.3b":"value2.10.3b",
"key2.10.4b":{
"key2.10.4.1b":"value2.10.4.1b"
}
},
"key2.11b":{
"key2.11.1b":"value2.11.1b",
"key2.11.2b":"value2.11.2b"
},
"key2.12b":"value2.12b",
"key2.13b":"value2.13b"
}
],
"key3":"value3"
}
pd_dataframe = pd.DataFrame(nested_dict)
print(pd_dataframe)
pd_dataframe.transpose()
I figured it out. I thought I had to provide some kind of path for each branch of the nested dictionary, but there is a more Pythonic way to do it:
df = pd.json_normalize(dict, record_path='key2', max_level=4)
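This does not create the MultiIndex columns I originally wanted, only flat columns with repeated values in them. But it is a solution one can work with.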
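If MultiIndex columns are still wanted, one possible follow-up is to split the flattened column names back into tuples after `json_normalize`. The following is only a sketch under some assumptions: the dictionary is a trimmed-down stand-in for the one in the question, the names `data`, `flat`, and `paths` are illustrative, and `"|"` is used as the separator because the keys themselves contain dots.

```python
import pandas as pd

# Trimmed-down stand-in for the dictionary in the question (illustrative only).
data = {
    "key1": "value1",
    "key2": [
        {"key2.1a": "value2.1a", "key2.3a": {"key2.3.1a": "value2.3.1a"}},
        {"key2.1b": "value2.1b", "key2.3b": {"key2.3.1b": "value2.3.1b"}},
    ],
    "key3": "value3",
}

# The keys contain dots, so pick a separator that cannot occur inside a key.
flat = pd.json_normalize(data, record_path="key2", meta=["key1", "key3"], sep="|")

# Split every flattened name back into its path, pad the shorter paths with
# empty strings, and use the padded tuples as MultiIndex columns.
paths = [tuple(col.split("|")) for col in flat.columns]
depth = max(len(p) for p in paths)
flat.columns = pd.MultiIndex.from_tuples(
    [p + ("",) * (depth - len(p)) for p in paths]
)

print(flat)
```

Because the "a" and "b" records use different key names, each record only fills its own columns and the rest are NaN. If the records shared key names, they would line up in the same columns.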
How about this code?
The output should look like this:
import pandas as pd
nested_dict = {
"key1":"value1",
"key2":[
{
"key2.1a":"value2.1a",
"key2.2a":"value2.2a",
"key2.3a":{
"keya2.3.1a":"value2.3.1a"
},
"key2.4a":"value2.4a",
"key2.5a":"value2.5a",
"key2.6a":"value2.6a",
"key2.7a":"value2.7a",
"key2.8a":"value2.8a",
"key2.9a":"value2.9a",
"key2.10a":{
"key2.10.1a":"value2.10.1a",
"key2.10.2a":"value2.10.2a",
"key2.10.3a":"value2.10.3a",
"key2.10.4a":{
"key2.10.4.1a":"value2.10.4.1a"
}
},
"key2.11a":{
"key2.11.1a":"value2.11.1a",
"key2.11.2a":"value2.11.2a"
},
"key2.12a":"value2.12a",
"key2.13a":"value2.13a"
},
{
"key2.1b":"value2.1b",
"key2.2b":"value2.2b",
"key2.3b":{
"keya2.3.1b":"value2.3.1b"
},
"key2.4b":"value2.4b",
"key2.5b":"value2.5b",
"key2.6b":"value2.6b",
"key2.7b":"value2.7b",
"key2.8b":"value2.8b",
"key2.9b":"value2.9b",
"key2.10b":{
"key2.10.1b":"value2.10.1b",
"key2.10.2b":"value2.10.2b",
"key2.10.3b":"value2.10.3b",
"key2.10.4b":{
"key2.10.4.1b":"value2.10.4.1b"
}
},
"key2.11b":{
"key2.11.1b":"value2.11.1b",
"key2.11.2b":"value2.11.2b"
},
"key2.12b":"value2.12b",
"key2.13b":"value2.13b"
}
],
"key3":"value3"
}
pd_dataframe = pd.DataFrame(nested_dict)
print(pd_dataframe)
# DataFrame.iteritems() and Series.iteritems() were removed in pandas 2.0; items() is the replacement.
reform = {(outerKey, innerKey): values for outerKey, innerDict in pd_dataframe.items() for innerKey, values in innerDict.items()}
reform
pd.DataFrame(reform)
pd.DataFrame(reform).T
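The tuple-key idea above stops at two levels, so the nested dicts inside `key2` stay as plain dict objects in the cells. A possible extension, sketched here under assumptions, is to walk each record recursively and use the full key path as the column tuple. `iter_paths` is a hypothetical helper, the sample `records` are made up, and the sketch assumes every record has the same keys (unlike the "a"/"b" suffixed keys in the question).

```python
import pandas as pd

def iter_paths(node, prefix=()):
    """Yield (path, value) pairs for every leaf of a nested dict."""
    for key, value in node.items():
        if isinstance(value, dict):
            # Descend into the nested dict, extending the path by this key.
            yield from iter_paths(value, prefix + (key,))
        else:
            yield prefix + (key,), value

# Made-up records that share the same shape (an assumption of this sketch).
records = [
    {"k1": "a1", "k2": {"k2.1": "a2.1", "k2.2": {"k2.2.1": "a2.2.1"}}},
    {"k1": "b1", "k2": {"k2.1": "b2.1", "k2.2": {"k2.2.1": "b2.2.1"}}},
]

# One {path: value} mapping per record.
rows = [dict(iter_paths(rec)) for rec in records]

# Pad shorter paths with empty strings so all tuples have the same length.
paths = list(rows[0])
depth = max(len(p) for p in paths)
columns = pd.MultiIndex.from_tuples([p + ("",) * (depth - len(p)) for p in paths])

df = pd.DataFrame([[row[p] for p in paths] for row in rows], columns=columns)
print(df)
```

Each column is then addressed by its full path, for example `df[("k2", "k2.2", "k2.2.1")]`.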