Parse JSON containing embedded lists into a flattened data.frame, ignoring unwanted keys
A colleague sent me an Elasticsearch query result (100000 records, hundreds of attributes) that looks like this:
pets_json <- paste0('[{"animal":"cat","attributes":{"intelligence":"medium","noises":[{"noise":"meow","code":4},{"noise":"hiss","code":2}]}},',
'{"animal":"dog","attributes":{"intelligence":"high","noises":{"noise":"bark","code":1}}},',
'{"animal":"snake","attributes":{"intelligence":"low","noises":{"noise":"hiss","code":2}}}]')
There is a superfluous key, code, that I don't need to capture.
I'd like to produce a data.frame that looks like:
animal intelligence noises.bark noises.hiss noises.meow
cat medium 0 1 1
dog high 1 0 0
snake low 0 1 0
I can read the JSON, but flatten=TRUE doesn't flatten it completely:
library(jsonlite)
str(df <- fromJSON(txt=pets_json, flatten=TRUE))
# 'data.frame': 3 obs. of 3 variables:
# $ animal : chr "cat" "dog" "snake"
# $ attributes.intelligence: chr "medium" "high" "low"
# $ attributes.noises :List of 3
# ..$ :'data.frame': 2 obs. of 2 variables: \
# .. ..$ noise : chr "meow" "hiss" \
# .. ..$ code: int 4 2 |
# ..$ :List of 2 |
# .. ..$ noise : chr "bark" |- need to remove code and flatten
# .. ..$ code: int 1 |
# ..$ :List of 2 |
# .. ..$ noise : chr "hiss" /
# .. ..$ code: int 2 /
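Because the flattening is incomplete, I can use this intermediate stage to get rid of the unwanted key code before calling flatten() again, but the only way I know to get rid of the key is really slow: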
for( l in which(sapply(df, is.list)) ){
for( l2 in which(sapply(df[[l]], is.list))){
df[[l]][[l2]]['code'] <- NULL
}
}
( df <- data.frame(flatten(df)) )
# animal attributes.intelligence attributes.noises
# 1 cat medium meow, hiss
# 2 dog high bark
# 3 snake low hiss
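And then...? I know that with tidyr::separate I could probably come up with a hacky way to spread the noise values into columns and set flags. But that handles only one attribute at a time, and I may have hundreds of them. I don't know all the possible attribute values in advance.
How can I efficiently produce the desired data.frame? Thanks for your time!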
I don't think there is a super simple way to get this into the right format, but here's an attempt:
out <- fromJSON(pets_json)
# drop the "code" data and do some initial cleaning
out$noises <- lapply(
out$attributes$noises,
function(x) unlist(x[-match("code",names(x))])
)
# extract the key part of the intelligence variable
out$intelligence <- out$attributes$intelligence
# set up a vector of all possible noises
unq_noises <- unique(unlist(out$noises))
# make the new separate noise variables
# FUN.VALUE must match the length of each result, i.e. one flag per unique noise
out[unq_noises] <- t(vapply(
out$noises,
function(x) unq_noises %in% x,
FUN.VALUE=logical(length(unq_noises)))
)
# clean up no longer needed variables
out[c("attributes","noises")] <- list(NULL)
out
# animal intelligence meow hiss bark
#1 cat medium TRUE TRUE FALSE
#2 dog high FALSE FALSE TRUE
#3 snake low FALSE TRUE FALSE
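The question asked for 0/1 flags and column names prefixed with the attribute name, so here is a minimal sketch of a variant of the approach above. It picks the wanted key (noise) by name rather than dropping code, and builds integer flag columns. The helper name flag_columns is hypothetical, and the sketch assumes fromJSON() simplifies the data in the way the str() output in the question shows.

```r
library(jsonlite)

pets_json <- paste0(
  '[{"animal":"cat","attributes":{"intelligence":"medium","noises":[{"noise":"meow","code":4},{"noise":"hiss","code":2}]}},',
  '{"animal":"dog","attributes":{"intelligence":"high","noises":{"noise":"bark","code":1}}},',
  '{"animal":"snake","attributes":{"intelligence":"low","noises":{"noise":"hiss","code":2}}}]')

out <- fromJSON(pets_json)

# Hypothetical helper: one integer 0/1 column per distinct value found in
# `values` (a list with one character vector per record).
flag_columns <- function(values, prefix) {
  lvls <- sort(unique(unlist(values)))
  # rbind keeps an n x k matrix even when there is only one distinct value
  flags <- do.call(rbind, lapply(values, function(x) as.integer(lvls %in% x)))
  colnames(flags) <- paste(prefix, lvls, sep = ".")
  as.data.frame(flags)
}

# Keep only the "noise" key; [[ works for both the data.frame and the list elements
noises <- lapply(out$attributes$noises, function(x) unlist(x[["noise"]]))

res <- cbind(
  data.frame(animal = out$animal,
             intelligence = out$attributes$intelligence,
             stringsAsFactors = FALSE),
  flag_columns(noises, "noises")
)
res
```

Selecting the key to keep, instead of the key to drop, means additional unwanted keys would not need to be listed one by one.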
Basic case with magrittr and data.table
Here is another proposal, combining magrittr and data.table for extra zeitgeist brownie points:
library(jsonlite)
library(magrittr)
library(data.table)
# Do not simplify to data.frame
str(df <- fromJSON(txt=pets_json, simplifyDataFrame=FALSE))
# The %<>% operator creates a pipe and assigns the result back to the variable
df %<>%
lapply(. %>%
data.table(animal = .$animal,
intelligence = .$attributes$intelligence,
noises = unlist(.$attributes$noises)) %>% # Create a data.table
.[!noises %in% as.character(0:9)] ) %>% # Remove numeric values
rbindlist %>% # Combine into a single data.table
dcast(animal + intelligence ~ paste0("noises.", noises), # Cast the noises variables
value.var = "noises",
fill = 0, # Put 0 instead of NA
fun.aggregate = function(x) 1) # Put 1 instead of noise
The resulting format matches what you asked for:
df
# animal intelligence noises.bark noises.hiss noises.meow
# 1: cat medium 0 1 1
# 2: dog high 1 0 0
# 3: snake low 0 1 0
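One caveat with the filter above: as.character(0:9) only matches the strings "0" to "9", so a code of 10 or more would survive as if it were a noise. A possible alternative, sketched here with plain functions instead of pipes, is to select values by their key name. unlist() keeps the inner key names (noise, code) whether noises was parsed from a JSON array or from a single object. The helper name pick_key is hypothetical.

```r
library(jsonlite)
library(data.table)

pets_json <- paste0(
  '[{"animal":"cat","attributes":{"intelligence":"medium","noises":[{"noise":"meow","code":4},{"noise":"hiss","code":2}]}},',
  '{"animal":"dog","attributes":{"intelligence":"high","noises":{"noise":"bark","code":1}}},',
  '{"animal":"snake","attributes":{"intelligence":"low","noises":{"noise":"hiss","code":2}}}]')

pets <- fromJSON(pets_json, simplifyDataFrame = FALSE)

# Hypothetical helper: flatten a node and keep only the values stored under `key`
pick_key <- function(node, key) {
  vals <- unlist(node)
  unname(vals[names(vals) == key])
}

# One row per (animal, noise) pair
long <- rbindlist(lapply(pets, function(p) {
  data.table(animal = p$animal,
             intelligence = p$attributes$intelligence,
             noises = pick_key(p$attributes$noises, "noise"))
}))

# Build the target column names first, then count occurrences per animal
long[, flag := paste0("noises.", noises)]
wide <- dcast(long, animal + intelligence ~ flag,
              value.var = "flag", fun.aggregate = length, fill = 0L)
wide
```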
For multiple attributes
Now, it seems you want to generalize to multiple attributes. Suppose your data also has a colors attribute, for example:
pets_json <- paste0('[{"animal":"cat","attributes":{"intelligence":"medium","noises":[{"noise":"meow","code":4},{"noise":"hiss","code":2}],"colors":[{"color":"brown","code":4},{"color":"white","code":2}]}},',
'{"animal":"dog","attributes":{"intelligence":"high","noises":{"noise":"bark","code":1},"colors":{"color":"brown","code":4}}},',
'{"animal":"snake","attributes":{"intelligence":"low","noises":{"noise":"hiss","code":2},"colors":[{"color":"green","code":4},{"color":"brown","code":4}]}}]')
Then you can use this generalized code, which is rather ugly but should work fine:
# Do not simplify to data.frame
str(df <- fromJSON(txt=pets_json, simplifyDataFrame=FALSE))
# Set up the attribute names
attr.names <- c("noises", "colors")
# The %<>% operator creates a pipe and assigns the result back to the variable
df %<>%
lapply(function(.)
eval(parse(text=paste0(
"data.table(animal = .$animal, ",
"intelligence = .$attributes$intelligence, ",
paste0(attr.names, " = unlist(.$attributes$", attr.names, ")", collapse=", "),
")")))
%>%
.[eval(parse(text=paste("!", attr.names, "%in% as.character(0:9)", collapse = " & ")))] ) %>%
rbindlist
# Cast each variable and merge together
df <- dcast(melt(df, measure.vars=c(attr.names)),
animal + intelligence ~ variable + value, sep=".")
# animal intelligence noises.bark noises.hiss noises.meow colors.brown
# 1: cat medium 0 1 1 1
# 2: dog high 1 0 0 1
# 3: snake low 0 1 0 1
# colors.green colors.white
# 1: 0 1
# 2: 0 0
# 3: 1 0
This solution also works for a single attribute, e.g. attr.names <- c("noises").
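If you would rather avoid eval(parse()), the same generalization can be sketched with a long table that gets one row per (animal, attribute value) pair. The named vector keys is an assumption introduced for illustration: it maps each attribute to the key holding the value to keep (noises to noise, colors to color). Each attribute is processed on its own, so the sketch does not depend on the attributes of one record having compatible lengths.

```r
library(jsonlite)
library(data.table)

pets_json <- paste0(
  '[{"animal":"cat","attributes":{"intelligence":"medium","noises":[{"noise":"meow","code":4},{"noise":"hiss","code":2}],"colors":[{"color":"brown","code":4},{"color":"white","code":2}]}},',
  '{"animal":"dog","attributes":{"intelligence":"high","noises":{"noise":"bark","code":1},"colors":{"color":"brown","code":4}}},',
  '{"animal":"snake","attributes":{"intelligence":"low","noises":{"noise":"hiss","code":2},"colors":[{"color":"green","code":4},{"color":"brown","code":4}]}}]')

pets <- fromJSON(pets_json, simplifyDataFrame = FALSE)

# Assumed mapping: attribute name -> key whose values should become flags
keys <- c(noises = "noise", colors = "color")

long <- rbindlist(lapply(pets, function(p) {
  rbindlist(lapply(names(keys), function(a) {
    vals <- unlist(p$attributes[[a]])
    vals <- unname(vals[names(vals) == keys[[a]]])
    if (length(vals) == 0) return(NULL)  # attribute absent for this record
    data.table(animal = p$animal,
               intelligence = p$attributes$intelligence,
               flag = paste(a, vals, sep = "."))
  }))
}))

# One column per attribute.value, counting occurrences (0 when absent)
wide <- dcast(long, animal + intelligence ~ flag,
              value.var = "flag", fun.aggregate = length, fill = 0L)
wide
```

The flag columns come out in alphabetical order (colors before noises); use setcolorder() if a particular order matters.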