Load a BigQuery JSON data dump into an R tibble
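I have downloaded a JSON extract from BigQuery that contains nested and repeated fields (similar to what the bigrquery package returns), and I am trying to manipulate the resulting tibble further.
I have the following code to load the JSON and convert it to a tibble: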
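library(tidyverse)
# Parse each line of the newline-delimited JSON file into an R list
ga.list <- lapply(readLines("temp.json"), jsonlite::fromJSON, flatten = TRUE)
ga.df <- tibble(dat = ga.list) %>%
  unnest_wider(dat) %>%            # one column per top-level field
  mutate(id = row_number()) %>%    # row id to tie repeated values back to their record
  unnest_wider(b_nested) %>%       # widen each nesting level in turn
  unnest_wider(b3) %>%
  unnest_wider(b33)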
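So there are two sets of list columns:
- b_nested: this column is a nested list (I unnested it level by level; maybe there is a more automated way. If so, please advise!)
- rr1 and rr2: these columns will always have the same number of elements, so element 1 of rr1 and element 1 of rr2 should be read together.
I am still working out how to extract id, rr1 and rr2 into a long table, with one row per repeated value for each id.
Note: this question has been edited several times as I made progress. Initially I was stuck getting from JSON to a tibble, until I discovered unnest_wider().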
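On the question of a more automated way to flatten the nested fields, one option (a sketch under assumptions, not a tested replacement for the pipeline above) is to let jsonlite do the flattening: jsonlite::stream_in() reads newline-delimited JSON into a data frame, and jsonlite::flatten() spreads nested data frames into dot-separated columns. The two inline records are simplified stand-ins for temp.json, and the names json_lines, raw_df and flat_df are illustrative only.

```r
library(jsonlite)

# Two simplified records standing in for temp.json (one JSON object per line)
json_lines <- c(
  '{"a":"4000","b_nested":{"b1":"x","b3":{"b31":"1","b33":{"b3311":"2"}},"b5":true},"rr1":["v1"],"rr2":["x1"]}',
  '{"a":"6000","b_nested":{"b1":"y","b3":{"b31":"3","b33":{"b3311":"4"}},"b5":false},"rr1":["v1","v2"],"rr2":["x1","x2"]}'
)

# stream_in() parses newline-delimited JSON; nested objects become nested data frames
raw_df <- jsonlite::stream_in(textConnection(json_lines), verbose = FALSE)

# flatten() recursively turns nested data frames into columns such as b_nested.b3.b31.
# The jsonlite:: prefix matters when purrr is attached, since purrr also exports flatten().
flat_df <- jsonlite::flatten(raw_df)
names(flat_df)
```

The repeated fields rr1 and rr2 are not data frames, so flatten() leaves them alone and they still need to be unnested separately.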
temp.json:
{"a":"4000","b_nested":{"b1":"(not set)","b2":"some - text","b3":{"b31":"1591558980","b32":"60259425255","b33":{"b3311":"133997175"},"b4":false},"b5":true},"rr1":[],"rr2":[]}
{"a":"4000","b_nested":{"b1":"asdfasdfa","b2":"some - text more","b3":{"b31":"11111","b32":"2222","b33":{"b3311":"3333333"},"b4":true},"b5":true}, "rr1":["v1","v2","v3"],"rr2":["x1","x2","x3"]}
{"a":"6000","b_nested":{"b1":"asdfasdfa","b2":"some - text more","b3":{"b31":"11111","b32":"2222","b33":{"b3311":"3333333"},"b4":true},"b5":true},"rr1":["v1","v2","v3","v4","v5"],"rr2":["aja1","aja2","aja3","aja14","aja5"]}
The final piece of the puzzle: to get one row per repeated value, unnest rr1 and rr2 together:
ga.df %>%
  select(id, rr1, rr2) %>%
  unnest(cols = c(rr1, rr2))
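To illustrate what unnesting two parallel list columns does to a record with no repeated values, here is a minimal self-contained sketch on toy data. The tibble toy is a hypothetical stand-in for ga.df, and it represents the empty record with character(0); whether the parsed JSON stores `[]` in exactly that form is an assumption.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for ga.df: record 1 has no repeated values
toy <- tibble(
  id  = 1:3,
  rr1 = list(character(0), c("v1", "v2", "v3"), c("v1", "v2")),
  rr2 = list(character(0), c("x1", "x2", "x3"), c("aja1", "aja2"))
)

# Default behaviour: rows whose list elements are empty are dropped
long <- toy %>% unnest(cols = c(rr1, rr2))

# keep_empty = TRUE keeps id 1 as a single row of missing values
long_all <- toy %>% unnest(cols = c(rr1, rr2), keep_empty = TRUE)
```

Unnesting both columns in one call keeps element i of rr1 next to element i of rr2, which is the pairing described in the question.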
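FYI: link to the BigQuery documentation, "Specifying nested and repeated columns".
Another solution (my preference) is to build a tibble from rr1 and rr2 and keep it as a single list column in ga.df, so that purrr functions can be applied to it: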
ga.df %>%
  mutate(rr = map2(rr1, rr2, function(x, y) {
    tibble(rr1 = x, rr2 = y)
  })) %>%
  select(-rr1, -rr2) %>%
  mutate(rr_length = map_int(rr, ~nrow(.x)))
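As a rough sketch of how the nested rr column might then be used with purrr, the example below rebuilds the same structure on toy data. The tibble toy and the column first_rr2 are hypothetical and only illustrate the pattern.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Toy stand-in for ga.df: record 1 has no repeated values
toy <- tibble(
  id  = 1:3,
  rr1 = list(character(0), c("v1", "v2", "v3"), c("v1", "v2")),
  rr2 = list(character(0), c("x1", "x2", "x3"), c("aja1", "aja2"))
)

# Pair rr1 and rr2 into one tibble per record and count its rows
nested <- toy %>%
  mutate(rr = map2(rr1, rr2, ~ tibble(rr1 = .x, rr2 = .y))) %>%
  select(-rr1, -rr2) %>%
  mutate(rr_length = map_int(rr, nrow))

# Example purrr use: keep records with repeated values and pull their first rr2
with_values <- nested %>%
  filter(rr_length > 0) %>%
  mutate(first_rr2 = map_chr(rr, ~ .x$rr2[[1]]))

# The nested column can still be expanded into the long table when needed
long_again <- nested %>% unnest(rr)
```

Keeping the pairs nested leaves one row per record, so record-level columns are not repeated until the long form is actually needed.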