Parse JSON with embedded lists into a semi-long data frame

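I have a JSON file with several levels of nesting, and I am struggling to get it into a usable data frame. I created a toy example with simulated data based on the real structure: here is the gist.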

Here is my desired output. The output could be "longer" or carry additional variables from the original JSON, but I am showing the core of the problem.

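Here is a part of the JSON showing the deepest level of nesting, which I want to get into the semi-long format shown in white above (a fully wide format would be fine too).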

I have tried a lot of things with this object:

myList <- jsonlite::fromJSON("example.json", flatten=TRUE)$results

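From trying to subset with [], [[]], and cbind() to trying to unnest the embedded lists. Nothing comes out quite right. I would benefit greatly from suggestions about the best approach.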

Does this get you a bit further? (It's a rough structure):

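library(tidyverse)

# Read the whole JSON document (this path is from the answerer's machine; point it at your copy)
x <- (jsonlite::fromJSON("/Users/hrbrmstr/r7/gh/labs-research/2018-11-portland-ciso-event/example.json"))

# Write the `results` element back out as gzipped newline-delimited JSON
jsonlite::stream_out(x$results, con = gzfile("ex-res.json.gz"))

# Read it back with ndjson, which flattens nested fields into dot-separated column names
y <- ndjson::stream_in("ex-res.json.gz", "tbl")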

gather(y, path, path_val, starts_with("path")) %>%
  gather(flow, flow_val, starts_with("flow")) %>%
  gather(name, name_val, starts_with("values.pdep")) %>%
  gather(intervention, interv_val, starts_with("values.inter")) %>%
  glimpse()
## Observations: 87,696
## Variables: 18
## $ contact.name <chr> "Person 1", "Person 2", "Person 1", "Person 2", "Person 1", "Person 2", "Person 1", "Person 2"...
## $ contact.uuid <chr> "k0dcjs", "rd3jfui", "k0dcjs", "rd3jfui", "k0dcjs", "rd3jfui", "k0dcjs", "rd3jfui", "k0dcjs", ...
## $ created_on   <chr> "2016-02-08T07:00:15.093813Z", "2016-02-08T07:00:15.093813Z", "2016-02-08T07:00:15.093813Z", "...
## $ id           <dbl> 1234, 1235, 1234, 1235, 1234, 1235, 1234, 1235, 1234, 1235, 1234, 1235, 1234, 1235, 1234, 1235...
## $ modified_on  <chr> "2016-02-09T04:42:54.812323Z", "2016-02-08T08:09:51.545160Z", "2016-02-09T04:42:54.812323Z", "...
## $ responded    <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE...
## $ start.uuid   <chr> "dnxh4g", "kfj4dsi", "dnxh4g", "kfj4dsi", "dnxh4g", "kfj4dsi", "dnxh4g", "kfj4dsi", "dnxh4g", ...
## $ uuid         <chr> "esn4dk", "qask9dj", "esn4dk", "qask9dj", "esn4dk", "qask9dj", "esn4dk", "qask9dj", "esn4dk", ...
## $ exit_type    <chr> NA, "completed", NA, "completed", NA, "completed", NA, "completed", NA, "completed", NA, "comp...
## $ exited_on    <chr> NA, "2016-02-08T08:09:51.544998Z", NA, "2016-02-08T08:09:51.544998Z", NA, "2016-02-08T08:09:51...
## $ path         <chr> "path.0.node", "path.0.node", "path.0.time", "path.0.time", "path.1.node", "path.1.node", "pat...
## $ path_val     <chr> "ecb4cb11-6cca-4791-a950-c448e9300846", "ecb4cb11-6cca-4791-a950-c448e9300846", "2016-02-08T07...
## $ flow         <chr> "flow.name", "flow.name", "flow.name", "flow.name", "flow.name", "flow.name", "flow.name", "fl...
## $ flow_val     <chr> "weeklyratings", "weeklyratings", "weeklyratings", "weeklyratings", "weeklyratings", "weeklyra...
## $ name         <chr> "values.pdeps1.category", "values.pdeps1.category", "values.pdeps1.category", "values.pdeps1.c...
## $ name_val     <chr> "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 -...
## $ intervention <chr> "values.intervention", "values.intervention", "values.intervention", "values.intervention", "v...
## $ interv_val   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
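The row count above is large because each gather() call multiplies the number of rows by the number of columns it gathers, so chaining several of them produces a cross product. A minimal sketch of that effect, assuming a hypothetical two-row table `toy` standing in for `y` (the column values here are made up for illustration):

```r
library(tidyr)
library(dplyr)

# Hypothetical two-row stand-in for the flattened table `y`
toy <- tibble(
  id = c(1234, 1235),
  path.0.node = c("n0", "n0"),
  path.1.node = c("n1", "n1"),
  values.pdeps1.value = c("3", "5"),
  values.pdeps2.value = c("1", "2")
)

# Two chained gathers: 2 rows x 2 path columns x 2 value columns
long <- toy %>%
  gather(path, path_val, starts_with("path")) %>%
  gather(name, name_val, starts_with("values.pdep"))

nrow(long)
```

This is why the result needs filtering and de-duplication before it is usable.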

Full approach:

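gather(y, path, path_val, starts_with("path")) %>%
  gather(flow, flow_val, starts_with("flow")) %>%
  gather(name, name_val, starts_with("values.pdep")) %>%
  gather(intervention, interv_val, starts_with("values.inter")) %>%
  # keep only the ".value" entries and the "node" path rows
  # (fixed = TRUE so the dot is matched literally, not as a regex wildcard)
  filter(grepl(".value", name, fixed = TRUE)) %>%
  filter(grepl("node", path)) %>%
  # strip the "values." prefix and ".value" suffix to get the variable name
  mutate(variable = gsub("values.", "", name, fixed = TRUE)) %>%
  mutate(variable = gsub(".value", "", variable, fixed = TRUE)) %>%
  # drop the duplicate rows created by the chained gathers
  distinct(contact.name, uuid, name, .keep_all = TRUE) %>%
  select(id, uuid, contact.uuid, variable, name_val, created_on, modified_on) %>%
  arrange(id, created_on) # optionally go wide with: %>% spread(variable, name_val)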
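To illustrate the optional wide step, here is a small sketch of what spread(variable, name_val) does to a semi-long result. The table `long_toy` is a hypothetical stand-in for the output of the pipeline, and the values in `name_val` are invented:

```r
library(tidyr)
library(dplyr)

# Hypothetical semi-long result: one row per run and variable
long_toy <- tibble(
  id = c(1234, 1234, 1235, 1235),
  uuid = c("esn4dk", "esn4dk", "qask9dj", "qask9dj"),
  variable = c("pdeps1", "pdeps2", "pdeps1", "pdeps2"),
  name_val = c("3", "1", "5", "2")
)

# One column per variable, one row per id/uuid combination
wide_toy <- long_toy %>%
  spread(variable, name_val)

wide_toy
```

spread() needs the remaining columns to identify each row uniquely, which is one reason the distinct() step comes first.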