Parse JSON with embedded lists into a semi-long data frame
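I have a JSON file with several levels of nesting, and I'm struggling to get it into a usable data frame. I created a toy example with mock data based on the real structure: here is the gist.
This is my desired output. The output could be "longer" or carry additional variables from the original JSON, but I'm showing the core problem.
This is the part of the JSON that shows the deepest level of nesting, which I want to get into the semi-long format highlighted in white in the image above (a fully wide format would also be fine).
I've tried a lot of things with this object: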
myList <- jsonlite::fromJSON("example.json", flatten=TRUE)$results
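from subsetting with [] and [[]] and combining with cbind(), to trying to unnest the embedded lists. Nothing comes out quite right. I would benefit greatly from suggestions on the best approach.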
Does this get you further along? (It's a rough structure):
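library(tidyverse)

# Read the JSON (replace this path with the location of your own copy)
x <- jsonlite::fromJSON("/Users/hrbrmstr/r7/gh/labs-research/2018-11-portland-ciso-event/example.json")
# Write the results back out as gzipped ndjson, one record per line
jsonlite::stream_out(x$results, con = gzfile("ex-res.json.gz"))
# Read it back with ndjson, which flattens nested fields into dotted column names
y <- ndjson::stream_in("ex-res.json.gz", "tbl")

# Stack each family of flattened columns into key/value pairs
gather(y, path, path_val, starts_with("path")) %>%
  gather(flow, flow_val, starts_with("flow")) %>%
  gather(name, name_val, starts_with("values.pdep")) %>%
  gather(intervention, interv_val, starts_with("values.inter")) %>%
  glimpse()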
## Observations: 87,696
## Variables: 18
## $ contact.name <chr> "Person 1", "Person 2", "Person 1", "Person 2", "Person 1", "Person 2", "Person 1", "Person 2"...
## $ contact.uuid <chr> "k0dcjs", "rd3jfui", "k0dcjs", "rd3jfui", "k0dcjs", "rd3jfui", "k0dcjs", "rd3jfui", "k0dcjs", ...
## $ created_on <chr> "2016-02-08T07:00:15.093813Z", "2016-02-08T07:00:15.093813Z", "2016-02-08T07:00:15.093813Z", "...
## $ id <dbl> 1234, 1235, 1234, 1235, 1234, 1235, 1234, 1235, 1234, 1235, 1234, 1235, 1234, 1235, 1234, 1235...
## $ modified_on <chr> "2016-02-09T04:42:54.812323Z", "2016-02-08T08:09:51.545160Z", "2016-02-09T04:42:54.812323Z", "...
## $ responded <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE...
## $ start.uuid <chr> "dnxh4g", "kfj4dsi", "dnxh4g", "kfj4dsi", "dnxh4g", "kfj4dsi", "dnxh4g", "kfj4dsi", "dnxh4g", ...
## $ uuid <chr> "esn4dk", "qask9dj", "esn4dk", "qask9dj", "esn4dk", "qask9dj", "esn4dk", "qask9dj", "esn4dk", ...
## $ exit_type <chr> NA, "completed", NA, "completed", NA, "completed", NA, "completed", NA, "completed", NA, "comp...
## $ exited_on <chr> NA, "2016-02-08T08:09:51.544998Z", NA, "2016-02-08T08:09:51.544998Z", NA, "2016-02-08T08:09:51...
## $ path <chr> "path.0.node", "path.0.node", "path.0.time", "path.0.time", "path.1.node", "path.1.node", "pat...
## $ path_val <chr> "ecb4cb11-6cca-4791-a950-c448e9300846", "ecb4cb11-6cca-4791-a950-c448e9300846", "2016-02-08T07...
## $ flow <chr> "flow.name", "flow.name", "flow.name", "flow.name", "flow.name", "flow.name", "flow.name", "fl...
## $ flow_val <chr> "weeklyratings", "weeklyratings", "weeklyratings", "weeklyratings", "weeklyratings", "weeklyra...
## $ name <chr> "values.pdeps1.category", "values.pdeps1.category", "values.pdeps1.category", "values.pdeps1.c...
## $ name_val <chr> "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 - 7", "0 -...
## $ intervention <chr> "values.intervention", "values.intervention", "values.intervention", "values.intervention", "v...
## $ interv_val <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
Full approach:
# Stack each family of flattened columns into key/value pairs
gather(y, path, path_val, starts_with("path")) %>%
  gather(flow, flow_val, starts_with("flow")) %>%
  gather(name, name_val, starts_with("values.pdep")) %>%
  gather(intervention, interv_val, starts_with("values.inter")) %>%
  # Keep only the ".value" entries; fixed = TRUE makes the dot match literally
  filter(grepl(".value", name, fixed = TRUE)) %>%
  filter(grepl("node", path)) %>%
  # Strip the "values." prefix and the ".value" suffix to get the variable name
  mutate(variable = gsub("values.", "", name, fixed = TRUE)) %>%
  mutate(variable = gsub(".value", "", variable, fixed = TRUE)) %>%
  distinct(contact.name, uuid, name, .keep_all = TRUE) %>%
  select(id, uuid, contact.uuid, variable, name_val, created_on, modified_on) %>%
  arrange(id, created_on)
# Optional: for wide output, append %>% spread(variable, name_val) to the pipeline
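For anyone on a newer tidyr, the same reshaping idea can be sketched with pivot_longer() and pivot_wider(), which supersede gather() and spread(). This is only a minimal illustration, not the method used above: the toy tibble is hypothetical (its column names mimic the flattened values.pdeps* columns in the glimpse() output, but the cell values are made up), and it covers just the values.* columns rather than the path and flow ones.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: column names follow the flattened ndjson layout,
# the cell values are invented for illustration only
toy <- tibble(
  id = c(1234, 1235),
  uuid = c("esn4dk", "qask9dj"),
  values.pdeps1.value = c("3", "5"),
  values.pdeps1.category = c("0 - 7", "0 - 7"),
  values.pdeps2.value = c("1", "2"),
  values.pdeps2.category = c("0 - 7", "0 - 7")
)

# Semi-long format: one row per record and variable
long <- toy %>%
  pivot_longer(
    cols = starts_with("values.pdep"),
    names_to = "name",
    values_to = "name_val"
  ) %>%
  # Keep only the ".value" entries (dot escaped so it matches literally)
  filter(grepl("\\.value$", name)) %>%
  # Reduce "values.pdeps1.value" to "pdeps1"
  mutate(variable = sub("^values\\.(.*)\\.value$", "\\1", name)) %>%
  select(id, uuid, variable, name_val)

# Fully wide format: one column per variable
wide <- pivot_wider(long, names_from = variable, values_from = name_val)
```

Here long holds one row per id and variable, and wide spreads those variables back into columns, which corresponds to the optional spread() step mentioned above.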