R JSON: Quickly create a JSON field as a new variable

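I want to add a JSON column to an existing data.frame, holding most of the information from each row. The code below shows my current approach and produces the result I want.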

# Establish test data
testData <- data.table::data.table(id = 1:10000,
                                   var1 = rep(letters[1:10], times = 1000),
                                   var2 = rep(letters[11:20], times = 1000))

# Establish which variables will be JSON-ed
jsonVars <- c("var1","var2")

# Initialize JSON column
testData[["json"]] <- as.character(NA)

# Loop through and populate JSON column
for (i in seq_len(nrow(testData))) {
  cat(paste0("Running ", i, " of ", nrow(testData), "\n"))
  # Serialize row i, then strip the enclosing [ and ] that toJSON adds
  testData[i, ][["json"]] <- gsub('^.|.$', '', jsonlite::toJSON(testData[i, jsonVars, with = FALSE]))
}

# Keep only the identifier and JSON fields
testData <- testData[, c("id", "json"), with = FALSE]

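As you can see, the result I want is just an "id" field and a "json" field holding all of the row's information.

While the approach above works, it is far too slow, because my "real" dataset has millions of records and about 300 columns.

I have already tried apply() (see below), but performance did not improve: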

# Establish test data
testData <- data.table::data.table(id = 1:10000,
                                   var1 = rep(letters[1:10], times = 1000),
                                   var2 = rep(letters[11:20], times = 1000))

# Establish which variables will be JSON-ed
jsonVars <- c("var1","var2")

# Initialize JSON column
testData[["json"]] <- as.character(NA)

# Use apply to populate JSON column; x is one row as a named character vector
jsonFUN <- function(x) {
  as.character(jsonlite::toJSON(as.list(x[jsonVars]), auto_unbox = TRUE))
}
testData$json <- apply(X = testData, MARGIN = 1, FUN = jsonFUN)

# Keep only the identifier and JSON fields
testData <- testData[, c("id", "json"), with = FALSE]

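Does anyone know an efficient way to do this faster?

Perhaps the answer lies in making better use of data.table's tooling. The documentation for jsonlite::toJSON also mentions handling of vectors, data.frames, and arrays, but I have not been able to work out a solution.

Thanks for your help!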

jsonlite

This is not especially fast, but it works and is the canonical JSON approach:

testData[, .(json = jsonlite::toJSON(.SD)), by = id
  ][, json := gsub("^\\[|\\]$", "", json)][]
#           id                    json
#        <int>                  <json>
#     1:     1 {"var1":"a","var2":"k"}
#     2:     2 {"var1":"b","var2":"l"}
#     3:     3 {"var1":"c","var2":"m"}
#     4:     4 {"var1":"d","var2":"n"}
#     5:     5 {"var1":"e","var2":"o"}
#     6:     6 {"var1":"f","var2":"p"}
#     7:     7 {"var1":"g","var2":"q"}
#     8:     8 {"var1":"h","var2":"r"}
#     9:     9 {"var1":"i","var2":"s"}
#    10:    10 {"var1":"j","var2":"t"}
#    ---                              
#  9991:  9991 {"var1":"a","var2":"k"}
#  9992:  9992 {"var1":"b","var2":"l"}
#  9993:  9993 {"var1":"c","var2":"m"}
#  9994:  9994 {"var1":"d","var2":"n"}
#  9995:  9995 {"var1":"e","var2":"o"}
#  9996:  9996 {"var1":"f","var2":"p"}
#  9997:  9997 {"var1":"g","var2":"q"}
#  9998:  9998 {"var1":"h","var2":"r"}
#  9999:  9999 {"var1":"i","var2":"s"}
# 10000: 10000 {"var1":"j","var2":"t"}

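The default output for each row is actually "[{"var1":"a","var2":"k"}]" (a JSON array of length 1); the gsub pattern strips the enclosing brackets. If you don't mind the length-1 array, you can drop the gsub step.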
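The brackets appear because toJSON serializes a data.frame as a JSON array of row objects. As a rough sketch of one way to get bare per-row objects without the gsub step, jsonlite::stream_out writes one object per line (NDJSON), and an in-memory text connection can capture those lines. The table "small" and the names "con" and "json_lines" are illustrative and not part of the original code; whether this is faster on a given dataset would need to be measured.

```r
library(data.table)

# Small illustrative table (hypothetical data, same shape as testData)
small <- data.table(id = 1:3,
                    var1 = c("a", "b", "c"),
                    var2 = c("k", "l", "m"))
jsonVars <- c("var1", "var2")

# Default behaviour: a one-row data.frame becomes an array holding one object
one_row <- as.character(jsonlite::toJSON(small[1, jsonVars, with = FALSE]))

# stream_out() writes one JSON object per row; capture the lines in memory
con <- textConnection(NULL, open = "w")
jsonlite::stream_out(as.data.frame(small[, jsonVars, with = FALSE]),
                     con, verbose = FALSE)
json_lines <- textConnectionValue(con)
close(con)

# One JSON string per row, in the original row order
result <- data.table(id = small$id, json = json_lines)
```

Because the whole table is serialized in a single call, this does not depend on id being unique.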

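Also, this assumes that id uniquely identifies each row; if an id has multiple rows, all of them are serialized into a single array, so the result differs for those rows.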
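To make the uniqueness caveat concrete, here is a small sketch with a deliberately duplicated id. The table "dup" and the helper column "row" are made up for illustration; grouping by a row index instead of id is one possible workaround, not something the original answer prescribes.

```r
library(data.table)

# Hypothetical table where id 1 appears twice
dup <- data.table(id = c(1L, 1L, 2L),
                  var1 = c("a", "b", "c"),
                  var2 = c("k", "l", "m"))

# Grouping by id: both rows of id 1 land in one JSON array
by_id <- dup[, .(json = as.character(jsonlite::toJSON(.SD))), by = id]
# by_id$json[1] is '[{"var1":"a","var2":"k"},{"var1":"b","var2":"l"}]'

# Grouping by a row index keeps one object per row regardless of id
by_row <- dup[, .(id = id, json = as.character(jsonlite::toJSON(.SD))),
              by = .(row = seq_len(nrow(dup))),
              .SDcols = c("var1", "var2")]
by_row[, json := gsub("^\\[|\\]$", "", json)]
```

Grouping by row still calls toJSON once per row, so it addresses correctness for duplicated ids rather than speed.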

sprintf

This is a hack and not very general, but it can be considerably faster:

testData[, .(id, json = sprintf('{"var1":%s,"var2":%s}', dQuote(var1, FALSE), dQuote(var2, FALSE)))]
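With about 300 columns, writing the format string by hand is impractical. The sketch below builds the same kind of string for an arbitrary set of columns using vectorized paste0. The helper name json_rows is hypothetical, and it assumes every column holds plain strings that need no JSON escaping (no quotes, backslashes, control characters, or NA values), so it inherits all the limitations of the sprintf hack.

```r
# Build one JSON object string per row from the named columns.
# Assumption: values are plain strings that need no JSON escaping.
json_rows <- function(dt, vars) {
  # One character vector per column, e.g. '"var1":"a"'
  pairs <- lapply(vars, function(v) paste0('"', v, '":"', dt[[v]], '"'))
  # Join the per-column pieces row by row and wrap them in braces
  paste0("{", do.call(paste, c(pairs, sep = ",")), "}")
}

small <- data.frame(id = 1:2,
                    var1 = c("a", "b"),
                    var2 = c("k", "l"),
                    stringsAsFactors = FALSE)
small$json <- json_rows(small, c("var1", "var2"))
```

Since dt[[v]] works on both data.frame and data.table, the same helper could be called as json_rows(testData, jsonVars).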

I caution against using this for anything more complex, since jsonlite::toJSON is the more cautious and correct way to handle anything non-trivial.
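A short illustration of why the hack is fragile: a value containing a double quote passes through sprintf unescaped and yields invalid JSON, while toJSON escapes it. The sample value is made up, and jsonlite::validate is used here only as a convenient validity check.

```r
x <- 'say "hi"'

# sprintf pastes the inner quotes through unescaped
hacked <- sprintf('{"var1":%s}', dQuote(x, FALSE))

# toJSON escapes the inner quotes as \"
safe <- as.character(jsonlite::toJSON(list(var1 = x), auto_unbox = TRUE))

isTRUE(jsonlite::validate(hacked))  # FALSE
isTRUE(jsonlite::validate(safe))    # TRUE
```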