R JSON: Quickly create a JSON field as a new variable
I want to add a JSON column to an existing data.frame that holds most of the information in each row. The code below shows my current approach, and it produces the result I want.
# Establish test data
testData <- data.table::data.table(id = 1:10000,
                                   var1 = rep(letters[1:10], times = 1000),
                                   var2 = rep(letters[11:20], times = 1000))
# Establish which variables will be JSON-ed
jsonVars <- c("var1", "var2")
# Initialize JSON column
testData[["json"]] <- as.character(NA)
# Loop through and populate JSON column
for (i in 1:nrow(testData)) {
  cat(paste0("Running ", i, " of ", nrow(testData), "\n"))
  testData[i, ][["json"]] <- gsub('^.|.$', '', jsonlite::toJSON(testData[i, jsonVars, with = FALSE]))
}
# Keep only the identifier and JSON fields
testData <- testData[, c("id", "json"), with = FALSE]
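As you can see, the result I want is just an "id" field plus a "json" field containing all of the row's information.
The approach above works, but it is far too slow: my real dataset has millions of records and about 300 columns.
I have already tried apply() (see below), but performance did not improve: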
# Establish test data
testData <- data.table::data.table(id = 1:10000,
                                   var1 = rep(letters[1:10], times = 1000),
                                   var2 = rep(letters[11:20], times = 1000))
# Establish which variables will be JSON-ed
jsonVars <- c("var1", "var2")
# Initialize JSON column
testData[["json"]] <- as.character(NA)
# Use apply to populate JSON column
# NOTE: jsonFUN ignores its argument x and serializes the whole table on every call
jsonFUN <- function(x) { gsub('^.|.$', '', jsonlite::toJSON(testData[, jsonVars, with = FALSE])) }
testData$json <- apply(X = testData, MARGIN = 1, FUN = jsonFUN)
# Keep only the identifier and JSON fields
testData <- testData[, c("id", "json"), with = FALSE]
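Does anyone know an efficient way to do this faster?
Perhaps the answer is to make better use of data.table's tools. The documentation for jsonlite::toJSON also mentions handling of vectors, data.frames, and arrays, but I have not been able to work out a solution.
Thanks for your help!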
jsonlite
This is not very fast, but it works and is the canonical JSON approach:
# Assumes testData still has its original columns (id, var1, var2)
testData[, .(json = jsonlite::toJSON(.SD)), by = id
         ][, json := gsub("^\\[|\\]$", "", json)][]
#           id                    json
#        <int>                  <json>
#     1:     1 {"var1":"a","var2":"k"}
#     2:     2 {"var1":"b","var2":"l"}
#     3:     3 {"var1":"c","var2":"m"}
#     4:     4 {"var1":"d","var2":"n"}
#     5:     5 {"var1":"e","var2":"o"}
#     6:     6 {"var1":"f","var2":"p"}
#     7:     7 {"var1":"g","var2":"q"}
#     8:     8 {"var1":"h","var2":"r"}
#     9:     9 {"var1":"i","var2":"s"}
#    10:    10 {"var1":"j","var2":"t"}
#    ---
#  9991:  9991 {"var1":"a","var2":"k"}
#  9992:  9992 {"var1":"b","var2":"l"}
#  9993:  9993 {"var1":"c","var2":"m"}
#  9994:  9994 {"var1":"d","var2":"n"}
#  9995:  9995 {"var1":"e","var2":"o"}
#  9996:  9996 {"var1":"f","var2":"p"}
#  9997:  9997 {"var1":"g","var2":"q"}
#  9998:  9998 {"var1":"h","var2":"r"}
#  9999:  9999 {"var1":"i","var2":"s"}
# 10000: 10000 {"var1":"j","var2":"t"}
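The default output for each row is actually [{"var1":"a","var2":"k"}] (a JSON array of length 1); the gsub pattern strips the surrounding brackets. If you don't mind the length-1 array, you can skip the gsub step.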
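As an alternative to stripping the brackets afterwards, each group can be passed to toJSON as a named list with auto_unbox = TRUE, which serializes it as a single JSON object. This is only a sketch on a three-row stand-in for testData, and it assumes id is unique so that every group holds exactly one row:

```r
library(data.table)

# Small stand-in for testData (assumption: same column layout as in the question)
testData <- data.table(id = 1:3,
                       var1 = c("a", "b", "c"),
                       var2 = c("k", "l", "m"))

# as.list(.SD) turns the one-row group into a named list of length-1 vectors;
# auto_unbox = TRUE then writes each of them as a scalar instead of an array
res <- testData[, .(json = jsonlite::toJSON(as.list(.SD), auto_unbox = TRUE)), by = id]
```

This still calls toJSON once per row, so it should not be expected to run meaningfully faster than the gsub version.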
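Also, this assumes that each id identifies exactly one row. If an id has several rows, they are serialized together into one array, so you get a single output row for that id containing several comma-separated objects.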
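The effect of duplicate ids can be seen on a tiny table. The sketch below uses a made-up table named dup, and shows one possible workaround, grouping by a row counter instead of id; the row column is an illustrative helper, not part of the original answer:

```r
library(data.table)

# Hypothetical example data in which id 1 occurs twice
dup <- data.table(id = c(1L, 1L, 2L),
                  var1 = c("a", "b", "c"),
                  var2 = c("k", "l", "m"))

# Grouping by id merges both id == 1 rows into one JSON array
by_id <- dup[, .(json = jsonlite::toJSON(.SD)), by = id]

# Grouping by a row counter keeps one JSON array per original row;
# .SDcols limits .SD to the columns that should be serialized
by_row <- dup[, .(id, json = jsonlite::toJSON(.SD)),
              by = .(row = seq_len(nrow(dup))),
              .SDcols = c("var1", "var2")]
```

Here by_id has two rows and by_row has three, one per input row.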
sprintf
This is a hack and not as general, but it can be much faster:
testData[, .(id, json = sprintf('{"var1":%s,"var2":%s}', dQuote(var1, FALSE), dQuote(var2, FALSE)))]
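I caution against using this for anything more complex: dQuote only wraps the value in quotes and does not escape embedded quotes or backslashes, so jsonlite::toJSON is the more cautious and correct way to handle anything non-trivial.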
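With around 300 columns, writing the sprintf template by hand is impractical. One way to generalize the same hack is to build the "name":"value" fragments from jsonVars and paste them together row-wise. The sketch below is a rough illustration only; json_pairs is a hypothetical helper name, and it assumes every column in jsonVars is character and contains no quotes, backslashes, or control characters:

```r
library(data.table)

# Small stand-in for testData (assumption: same column layout as in the question)
testData <- data.table(id = 1:10,
                       var1 = letters[1:10],
                       var2 = letters[11:20])
jsonVars <- c("var1", "var2")

# One vector of '"name":"value"' fragments per column, vectorized over rows.
# ASSUMPTION: values need no JSON escaping and should all be quoted as strings.
json_pairs <- lapply(jsonVars, function(v) sprintf('"%s":"%s"', v, testData[[v]]))

# Join the fragments of each row with commas and wrap them in braces
testData[, json := paste0("{", do.call(paste, c(json_pairs, sep = ",")), "}")]

# Keep only the identifier and JSON fields
result <- testData[, .(id, json)]
```

Numeric, logical, or missing values would need their own formatting rules, which is where falling back to jsonlite::toJSON becomes the safer choice.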