Parse a multi-level JSON file in R

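I have a good understanding of R but am unfamiliar with the JSON file type and the best practices for parsing it. I'm having difficulty building a data frame from a raw JSON file. The JSON file (data below) consists of repeated-measures data with multiple observations per user.

When the raw file is read into R with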

 jdata<-read_json("./raw.json")

it comes in as a "List of 1", which is user_id. Within each user_id there are further lists, for example:

jdata$user_id$`sjohnson`$date$`2020-09-25`$city

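The last position actually splits into two options: $city or $zip. At the top level, there are about 89 users in the full file.

My goal is to end up with a rectangular data frame, or multiple data frames that I can merge together, like this. I don't actually need the zip code.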

example table

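I've tried jsonlite and tidyverse, and the farthest I seem to get is a data frame with a single variable at the lowest level, with city and zip code in alternating rows, using this: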

df  <-  as.data.frame(matrix(unlist(jdata), nrow=length(unlist(jdata["users"]))))

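Any help or suggestions for getting closer to the table above would be greatly appreciated. I have a feeling I'm failing to loop back through the different levels.

Here is an example of the raw JSON file structure: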

{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}
 

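Here is a solution in the tidyverse: a custom function unnestable() designed to recursively unnest into a table the contents of a list like the one you describe. See Details for specifics on the format of such a list and its table.

Solution

First ensure that the necessary libraries are present: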

library(jsonlite)
library(tidyverse)

Then define the unnestable() function as follows:

unnestable <- function(v) {
  # If we've reached the bottommost list, simply treat it as a table...
  if(all(sapply(
    X = v,
    # Check that each element is a single value (or NULL).
    FUN = function(x) {
      is.null(x) || purrr::is_scalar_atomic(x)
    },
    simplify = TRUE
  ))) {
    v %>%
      # Replace any NULLs with NAs to preserve blank fields...
      sapply(
        FUN = function(x) {
          if(is.null(x))
            NA
          else
            x
        },
        simplify = FALSE
      ) %>%
      # ...and convert this bottommost list into a table.
      tidyr::as_tibble()
  }
  # ...but if this list contains another nested list, then recursively unnest its
  # contents and combine their tabular results.
  else if(purrr::is_scalar_list(v)) {
    # Take the contents within the nested list...
    v[[1]] %>%
      # ...apply this 'unnestable()' function to them recursively...
      sapply(
        FUN = unnestable,
        simplify = FALSE,
        USE.NAMES = TRUE
      ) %>%
      # ...and stack their results.
      dplyr::bind_rows(.id = names(v)[1])
  }
  # Otherwise, the format is unrecognized and yields no results.
  else {
    NULL
  }
}
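
To make the bottommost case of unnestable() more concrete, here is a minimal sketch of just that step, pulled out on its own. The `leaf` list is a hypothetical record modeled on the sample data, and the helper names (`is_leaf`, `leaf_filled`, `leaf_row`) are mine, not part of the function above.

```r
library(tibble)
library(purrr)

# Hypothetical bottom-level record, shaped like what read_json() returns:
# a JSON null arrives as NULL in the R list.
leaf <- list(location = NULL, timestamp = 1605016176013)

# The record counts as a "leaf" when every element is NULL or a single atomic value.
is_leaf <- all(map_lgl(leaf, function(x) is.null(x) || is_scalar_atomic(x)))

# Swap each NULL for NA so the field survives as a column instead of vanishing.
leaf_filled <- map(leaf, function(x) if (is.null(x)) NA else x)

# A named list of length-1 values converts to a one-row tibble.
leaf_row <- as_tibble(leaf_filled)
leaf_row
```

Without the NULL-to-NA replacement, the `location` field would have nothing to contribute to the row, which is why the function fills it in before converting.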

Finally, process the JSON data as follows:

# Read the JSON file into an R list.
jdata <- jsonlite::read_json("./raw.json")


# Flatten the R list into a table, via 'unnestable()'
flat_data <- unnestable(jdata)


# View the raw table.
flat_data

Of course, you can reformat the table as needed:

library(lubridate)

flat_data <- flat_data %>%
  dplyr::transmute(
    user_id = as.character(user_id),
    date = lubridate::as_datetime(date),
    city = as.character(city)
  ) %>%
  dplyr::distinct()


# View the reformatted table.
flat_data
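
The date keys in the sample mix bare dates with full timestamps, so it may be worth checking the conversion on its own before applying it to the whole column. A small sketch, using two keys copied from the sample data (the `keys` and `parsed` names are just for illustration):

```r
library(lubridate)

# Two key formats taken from the sample data: a bare date and a full timestamp.
keys <- c("2020-09-25", "2020-11-10 08:49:36")

# as_datetime() returns POSIXct values (UTC by default).
parsed <- as_datetime(keys)

parsed
class(parsed)

# Any key that fails to parse would show up as NA here.
anyNA(parsed)
```

If the full file contains keys in other formats, they would surface as NA at this step, so a check like the last line is a cheap safeguard.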

Result

Given a raw.json file like the one sampled here

{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}

then unnestable() will yield a tibble like this

# A tibble: 6 x 6
  user_id  date                city         zip   location     timestamp
  <chr>    <chr>               <chr>        <chr> <lgl>            <dbl>
1 sjohnson 2020-09-25          Denver       80014 NA                  NA
2 sjohnson 2020-10-01          Atlanta      30301 NA                  NA
3 sjohnson 2020-11-04          Jacksonville 14001 NA                  NA
4 asmith   2020-10-16          Cleavland    34321 NA                  NA
5 asmith   2020-11-10          Elmhurst     00013 NA                  NA
6 asmith   2020-11-10 08:49:36 NA           NA    NA       1605016176013

which dplyr will format into this result:

# A tibble: 6 x 3
  user_id  date                city        
  <chr>    <dttm>              <chr>       
1 sjohnson 2020-09-25 00:00:00 Denver      
2 sjohnson 2020-10-01 00:00:00 Atlanta     
3 sjohnson 2020-11-04 00:00:00 Jacksonville
4 asmith   2020-10-16 00:00:00 Cleavland   
5 asmith   2020-11-10 00:00:00 Elmhurst    
6 asmith   2020-11-10 08:49:36 NA          

Details

List Format

Strictly speaking, the list represents nested groupings by the fields {group_1, group_2, ..., group_n}, and it must be of the form:

list(
  group_1 = list(
    "value_1" = list(
      group_2 = list(
        "value_1.1" = list(
          # .
          #  .
          #   .
               group_n = list(
                 "value_1.1.….n.1" = list(
                   field_a =    1,
                   field_b = TRUE
                 ),
                 "value_1.1.….n.2" = list(
                   field_a =   2,
                   field_c = "2"
                 )
                 # ...
               )
        ),
        "value_1.2" = list(
          # .
          #  .
          #   .
        )
        # ...
      )
    ),
    "value_2" = list(
      group_2 = list(
        "value_2.1" = list(
          # .
          #  .
          #   .
               group_n = list(
                 "value_2.1.….n.1" = list(
                   field_a =   3,
                   field_d = 3.0
                 )
                 # ...
               )
        ),
        "value_2.2" = list(
          # .
          #  .
          #   .
        )
        # ...
      )
    )
    # ...
  )
)

Table Format

Given a list of this form, unnestable() will flatten it into a table of the following form:

# A tibble: … x …
  group_1 group_2   ... group_n         field_a field_b field_c field_d
  <chr>   <chr>     ... <chr>             <dbl> <lgl>   <chr>     <dbl>
1 value_1 value_1.1 ... value_1.1.….n.1       1 TRUE    NA           NA
2 value_1 value_1.1 ... value_1.1.….n.2       2 NA      2            NA
3 value_1 value_1.2 ... value_1.2.….n.1     ... ...     ...         ...
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
j value_2 value_2.1 ... value_2.1.….n.1       3 NA      NA            3
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
k value_2 value_2.2 ... value_2.2.….n.1     ... ...     ...         ...
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
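
The group columns in this table come from the names of each nested list, and the NA cells come from stacking tables whose fields differ. As a rough illustration of that single stacking step (not the full recursion), assuming two hypothetical per-date tables modeled on the sample data:

```r
library(dplyr)
library(tibble)

# Hypothetical one-row tables, keyed by date, with different sets of fields.
per_date <- list(
  "2020-11-10"          = tibble(city = "Elmhurst", zip = "00013"),
  "2020-11-10 08:49:36" = tibble(location = NA, timestamp = 1605016176013)
)

# bind_rows() takes the union of the columns, fills the gaps with NA,
# and stores each list name in the column given by .id.
stacked <- bind_rows(per_date, .id = "date")
stacked
```

Applying this once per nesting level, with `.id` set to the group name at that level, is what builds up the group_1, ..., group_n columns.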

We can build the structure we want step by step:

library(jsonlite)
library(tidyverse)

df <- fromJSON('{
   "user_id": {
    "sjohnson": {
       "date": {
        "2020-09-25": {
           "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
         },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
       }
    },
    "asmith": {
       "date": {
         "2020-10-16": {
           "city": "Cleavland",
           "zip": "34321"
         },
        "2020-11-10": {
           "city": "Elmhurst",
           "zip": "00013"
         },
         "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
       }
     }
   }
}')

df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(city = unlist(value)) %>%
  filter(var == 'city') %>%
  select(-var, -value)

This gives:

# A tibble: 5 x 3
  user_id  date       city        
  <chr>    <chr>      <chr>       
1 sjohnson 2020-09-25 Denver      
2 sjohnson 2020-10-01 Atlanta     
3 sjohnson 2020-11-04 Jacksonville
4 asmith   2020-10-16 Cleavland   
5 asmith   2020-11-10 Elmhurst
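
To see what a single unnest_longer() step in the pipeline above contributes, here is a small sketch on a toy tibble. The `nested` input is a hypothetical, simplified stand-in (one user, city stored directly under each date), not the real structure of the file:

```r
library(tidyr)
library(tibble)

# Hypothetical list-column: one user whose value is a named list keyed by date.
nested <- tibble(
  user_id = "sjohnson",
  value   = list(list("2020-09-25" = "Denver", "2020-10-01" = "Atlanta"))
)

# Each element of the named list becomes its own row;
# the element names are written to the column named by indices_to.
long <- unnest_longer(nested, value, indices_to = "date")
long
```

In the real data the same call is applied twice, once to turn the date keys into a column and once more for the field names (city, zip, and so on).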

An alternative solution inspired by @Greg, in which we change the last three lines:

df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(value = unlist(value)) %>%
  pivot_wider(names_from = "var") %>%
  select(user_id, date, city)

This gives almost the same result, apart from one additional row where the city is NA:

# A tibble: 6 x 3
  user_id  date                city        
  <chr>    <chr>               <chr>       
1 sjohnson 2020-09-25          Denver      
2 sjohnson 2020-10-01          Atlanta     
3 sjohnson 2020-11-04          Jacksonville
4 asmith   2020-10-16          Cleavland   
5 asmith   2020-11-10          Elmhurst    
6 asmith   2020-11-10 08:49:36 NA    

Another (direct) solution uses rrapply() from the rrapply package to do the heavy lifting:

library(rrapply)
library(dplyr)

rrapply(jdata, how = "melt") %>%
  filter(L5 == "city") %>%
  select(user_id = L2, date = L4, city = value)

#>    user_id       date         city
#> 1 sjohnson 2020-09-25       Denver
#> 2 sjohnson 2020-10-01      Atlanta
#> 3 sjohnson 2020-11-04 Jacksonville
#> 4   asmith 2020-10-16    Cleavland
#> 5   asmith 2020-11-10     Elmhurst
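
To understand where the L2, L4 and L5 columns come from, it can help to melt a tiny list and look at the raw result before filtering. A minimal sketch, assuming a hypothetical one-record list `x` with the same five levels of nesting as the sample data:

```r
library(rrapply)

# Hypothetical one-record list with the same nesting depth as the sample data.
x <- list(
  user_id = list(
    sjohnson = list(
      date = list(
        "2020-09-25" = list(city = "Denver", zip = "80014")
      )
    )
  )
)

# how = "melt" returns one row per leaf: the names along the path to the leaf
# go into level columns, and the leaf itself goes into 'value'.
melted <- rrapply(x, how = "melt")

names(melted)
melted
```

Here the user name sits at the second level and the date at the fourth, which is why the answer selects L2 and L4 and filters on L5.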

Data

jdata <- jsonlite::fromJSON('{
   "user_id": {
    "sjohnson": {
       "date": {
        "2020-09-25": {
           "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
         },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
       }
    },
    "asmith": {
       "date": {
         "2020-10-16": {
           "city": "Cleavland",
           "zip": "34321"
         },
        "2020-11-10": {
           "city": "Elmhurst",
           "zip": "00013"
         },
         "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
       }
     }
   }
}')