将三重嵌套 json 展平为数据框

Flatten a tripled nested json into a dataframe

问题

我得到了一个非常大的 json 文件,看起来像这个最小的例子:

json_file = """
{
    "products":
    [

        {
            "id":"0",
            "name": "First",
            "emptylist":[],
            "properties" : 
            {
              "id" : "",
              "name" : ""
            }
        },
        {
            "id":"1",
            "name": "Second",
            "emptylist":[],
            "properties": 
            {
                "id" : "23",
                "name" : "a useful product",
                "features" :
                [
                    {
                        "name":"Features",
                        "id":"18",
                        "features":
                        [
                            {
                                "id":"1001",
                                "name":"Colour",
                                "value":"Black"
                            },
                            {
                                "id":"2093",
                                "name":"Material",
                                "value":"Plastic"
                            }
                        ]
                    },
                    {
                        "name":"Sizes",
                        "id":"34",
                        "features":
                        [
                            {
                                "id":"4736",
                                "name":"Length",
                                "value":"56"
                            },
                            {
                                "id":"8745",
                                "name":"Width",
                                "value":"76"
                            }
                        ]
                    }
                ]
            }
        },
        {
            "id":"2",
            "name": "Third",
            "properties" : 
            {
                "id" : "876",
                "name" : "another one",
                "features" : 
                [
                    {
                        "name":"Box",
                        "id":"937",
                        "features":
                        [
                            {
                                "id":"3758",
                                "name":"Amount",
                                "value":"1"
                            },
                            {
                                "id":"2222",
                                "name":"Packaging",
                                "value":"Blister"
                            }
                        ]
                    },
                    {
                        "name":"Features",
                        "id":"8473",
                        "features":
                        [
                            {
                                "id":"9372",
                                "name":"Colour",
                                "value":"White"
                            },
                            {
                                "id":"9375",
                                "name":"Position",
                                "value":"A"
                            },
                            {
                                "id":"2654",
                                "name":"Amount",
                                "value":"6"
                            }
                        ]
                    }
                ]
            }
        }
    ]
}
"""

我想用它做一个平面 table。它应该看起来像这样:

id    name   emptylist  properties.id properties.name    properties.features.name properties.features.id properties.features.features.id properties.features.features.name properties.features.features.value
0     First  []         ""            ""                 NaN                      NaN                    NaN                             NaN                               NaN                               
1     Second []         "23"          "a useful product" Features                 18                     1001                            Colour                            Black                             
1     Second []         "23"          "a useful product" Features                 18                     2093                            Material                          Plastic                           
1     Second []         "23"          "a useful product" Sizes                    34                     4736                            Length                            56                                
1     Second []         "23"          "a useful product" Sizes                    34                     8745                            Width                             76                                
2     Third             "876"         "another one"      Box                      937                    3758                            Amount                            1                                 
2     Third             "876"         "another one"      Box                      937                    2222                            Packaging                         Blister                           
2     Third             "876"         "another one"      Features                 8473                   9372                            Colour                            White                             
2     Third             "876"         "another one"      Features                 8473                   9375                            Position                          A                                 
2     Third             "876"         "another one"      Features                 8473                   2654                            Amount                            6                             

我试过的

我试过这个:

import pandas as pd
import json

j = json.loads(json_file)
df = pd.json_normalize(j['products'])
df

  id    name emptylist properties.id   properties.name                                 properties.features  
0  0   First        []                                                                                 NaN  
1  1  Second        []            23  a useful product   [{'name': 'Features', 'id': '18', 'features': ...  
2  2   Third       NaN           876       another one   [{'name': 'Box', 'id': '937', 'features': [{'i...  

   

并且我尝试尝试使用其他参数,但我一无所获。看来这不是正确的方法。

谁能帮帮我?


附加信息

我用 R 得到了一个可行的解决方案,但我需要能够用 Python 来完成。 如果有帮助,这将是我试图在 Python.

中翻译的 R 代码
library(tidyr)
jsonlite::fromJSON(json_file)$products %>% 
  jsonlite::flatten() %>%
  unnest(properties.features         , names_sep = ".", keep_empty = TRUE) %>% 
  unnest(properties.features.features, names_sep = ".", keep_empty = TRUE)

编辑

在@piterbarg 的帮助下和一些研究我找到了这个解决方案:

j = json.loads(json_file)
df = pd.json_normalize(j['products'])
df1 = df.explode('properties.features')
df2 = pd.concat([df1.reset_index(drop=True).drop('properties.features', axis = 1), 
                df1['properties.features'].apply(pd.Series).reset_index(drop=True).add_prefix("properties.features.").drop("properties.features.0", axis = 1)], axis = 1)
df2 = df2.explode('properties.features.features')
df3 = pd.concat([df2.reset_index(drop=True).drop('properties.features.features', axis = 1), 
                df2['properties.features.features'].apply(pd.Series).reset_index(drop=True).add_prefix("properties.features.features.").drop("properties.features.features.0", axis = 1)], axis = 1)
df3

有了这个,我得到了我正在寻找的解决方案,但代码看起来很乱,我不确定这个解决方案的效率如何。有帮助吗?

这可以通过重复应用 explode 来扩展列表和 apply(pd.Series) 来扩展字典来完成,虽然有点乏味:

df1 = df.explode('properties.features')
df2 = df1.join(df1['properties.features'].apply(pd.Series), lsuffix = '', rsuffix = '.properties.features').explode('features').drop(columns = 'properties.features')
df3 = df2.join(df2['features'].apply(pd.Series), lsuffix = '', rsuffix='.features').drop(columns = ['features','emptylist']).drop_duplicates()

df3 看起来像这样:

      id  name    properties.id    properties.name      0    id.properties.features  name.properties.features      0.features    id.features  name.features    value
--  ----  ------  ---------------  -----------------  ---  ------------------------  --------------------------  ------------  -------------  ---------------  -------
 0     0  First                                       nan                       nan  nan                                  nan            nan  nan              nan
 1     1  Second  23               a useful product   nan                        18  Features                             nan           1001  Colour           Black
 1     1  Second  23               a useful product   nan                        18  Features                             nan           2093  Material         Plastic
 1     1  Second  23               a useful product   nan                        18  Features                             nan           4736  Length           56
 1     1  Second  23               a useful product   nan                        18  Features                             nan           8745  Width            76
 1     1  Second  23               a useful product   nan                        34  Sizes                                nan           1001  Colour           Black
 1     1  Second  23               a useful product   nan                        34  Sizes                                nan           2093  Material         Plastic
 1     1  Second  23               a useful product   nan                        34  Sizes                                nan           4736  Length           56
 1     1  Second  23               a useful product   nan                        34  Sizes                                nan           8745  Width            76
 2     2  Third   876              another one        nan                       937  Box                                  nan           3758  Amount           1
 2     2  Third   876              another one        nan                       937  Box                                  nan           2222  Packaging        Blister
 2     2  Third   876              another one        nan                       937  Box                                  nan           9372  Colour           White
 2     2  Third   876              another one        nan                       937  Box                                  nan           9375  Position         A
 2     2  Third   876              another one        nan                       937  Box                                  nan           2654  Amount           6
 2     2  Third   876              another one        nan                      8473  Features                             nan           3758  Amount           1
 2     2  Third   876              another one        nan                      8473  Features                             nan           2222  Packaging        Blister
 2     2  Third   876              another one        nan                      8473  Features                             nan           9372  Colour           White
 2     2  Third   876              another one        nan                      8473  Features                             nan           9375  Position         A
 2     2  Third   876              another one        nan                      8473  Features                             nan           2654  Amount           6

名称与您想要的不太一样,如果需要,可以使用 .rename(columns = {...}) 修复

它与您在 Edit 中的类似,但可能语法更短且性能更高。

如果您的 DataFrame 中有 NaN,旧版本的 Pandas 可能会在 json_normalize 上失败。

此解决方案应适用于 Pandas 1.3+。

df = pd.json_normalize(products)
df = df.explode('properties.features')
df = pd.concat([df.drop('properties.features', axis=1).reset_index(drop=True),
                pd.json_normalize(df['properties.features']).add_prefix('properties.features.')], axis=1)
df = df.explode('properties.features.features')
df = pd.concat([df.drop('properties.features.features', axis=1).reset_index(drop=True),
                pd.json_normalize(df['properties.features.features']).add_prefix('properties.features.features.')], axis=1)

性能。有 1000 个产品。

Code in Edit : 4.85 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This solution: 58.3 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)