data.table operations with multiple group-by variable sets
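I have a data.table on which I would like to perform group-by operations, while retaining the null (empty) grouping set and using several different sets of group-by variables.
A toy example: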
library(data.table)
set.seed(1)
DT <- data.table(
id = sample(c("US", "Other"), 25, replace = TRUE),
loc = sample(LETTERS[1:5], 25, replace = TRUE),
index = runif(25)
)
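I would like to find the sum of index for every combination of the key variables, including the null set. The concept is analogous to "grouping sets" in Oracle SQL. Here is an example of my current workaround: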
rbind(
DT[, list(id = "", loc = "", sindex = sum(index)), by = NULL],
DT[, list(loc = "", sindex = sum(index)), by = "id"],
DT[, list(id = "", sindex = sum(index)), by = "loc"],
DT[, list(sindex = sum(index)), by = c("id", "loc")]
)[order(id, loc)]
       id loc      sindex
 1:           11.54218399
 2:         A  2.82172063
 3:         B  0.98639578
 4:         C  2.89149433
 5:         D  3.93292900
 6:         E  0.90964424
 7: Other      6.19514146
 8: Other   A  1.12107080
 9: Other   B  0.43809711
10: Other   C  2.80724742
11: Other   D  1.58392886
12: Other   E  0.24479728
13:    US      5.34704253
14:    US   A  1.70064983
15:    US   B  0.54829867
16:    US   C  0.08424691
17:    US   D  2.34900015
18:    US   E  0.66484697
Is there a preferred "data.table" way to accomplish this?
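Using dplyr, an adaptation of this should work, if I understand your question correctly. Note that it computes a single grouping set (vs, am) on the built-in mtcars data; each additional set would need its own group_by() call.
library(dplyr)
# sum mpg within each (vs, am) group; named `sums` to avoid masking base::sum
sums <- mtcars %>%
  group_by(vs, am) %>%
  summarise(Sum = sum(mpg))
I have not checked how it handles missing values, but it should just put them in one more group (the last one).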
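The remark about missing values can be checked on a small made-up data frame. This is only a sketch, assuming dplyr is installed; the data frame `df` and its columns `g` and `x` are hypothetical and chosen purely for illustration.

```r
library(dplyr)

# Hypothetical toy data: the grouping column `g` contains two missing values
df <- data.frame(
  g = c("a", NA, "a", "b", NA),
  x = c(1, 2, 3, 4, 5)
)

# group_by() keeps NA as a group of its own, placed after the non-missing groups
res <- df %>%
  group_by(g) %>%
  summarise(Sum = sum(x))

print(res)
```

The two rows with a missing `g` are summed together in a separate final group rather than being dropped.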
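I have a generic function to which you can pass a data frame (it must be a data.table, since the function uses data.table syntax) and a vector of the dimensions you want to group by. It returns the sum of all numeric fields, grouped by those dimensions.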
rollSum <- function(input, dimensions) {
  # cast dimension columns to character in case a dimension is numeric;
  # otherwise it would be summed instead of grouped on
  for (d in dimensions) {
    input[[d]] <- as.character(input[[d]])
  }
  # positions of the remaining numeric (integer or double) columns
  numericColumns <- which(vapply(input, is.numeric, logical(1)))
  # sum every numeric column within each combination of the dimensions
  output <- input[, lapply(.SD, sum, na.rm = TRUE), by = dimensions,
                  .SDcols = numericColumns]
  return(output)
}
You can then create a list of the different groupings, each given as a character vector:
groupings = list(c("id"),c("loc"),c("id","loc"))
Then apply the function to each grouping with lapply and combine the results with rbindlist:
groupedSets = rbindlist(lapply(groupings, function(x){
return(rollSum(DT,x))}), fill = TRUE)
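The list of groupings can also be generated rather than typed out, and the empty set can be included as well. The following is a minimal sketch under stated assumptions, not part of the answer above: the names `by_vars`, `sets`, and `res` are hypothetical, and the empty set is handled with a separate ungrouped call so that the sketch relies only on ordinary `by =` behaviour.

```r
library(data.table)

set.seed(1)
DT <- data.table(
  id = sample(c("US", "Other"), 25, replace = TRUE),
  loc = sample(LETTERS[1:5], 25, replace = TRUE),
  index = runif(25)
)

by_vars <- c("id", "loc")  # hypothetical name for the grouping variables

# Every subset of by_vars: the empty set first, then subsets of size 1, 2, ...
sets <- c(
  list(character()),
  unlist(
    lapply(seq_along(by_vars), function(k) combn(by_vars, k, simplify = FALSE)),
    recursive = FALSE
  )
)

# One aggregation per grouping set; the empty set is a plain ungrouped sum
res <- rbindlist(lapply(sets, function(vars) {
  if (length(vars) == 0L) {
    DT[, .(sindex = sum(index))]
  } else {
    DT[, .(sindex = sum(index)), by = vars]
  }
}), fill = TRUE)

# fill = TRUE leaves NA in grouping columns that a set did not use
setcolorder(res, c(by_vars, "sindex"))
print(res)
```

Rows where both `id` and `loc` are NA hold the grand total; rows with a single NA hold the totals for the other variable alone.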
As of this commit, the development version of data.table provides cube and groupingsets for this:
library("data.table")
# data.table 1.10.5 IN DEVELOPMENT built 2017-08-08 18:31:51 UTC
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
cube(DT, list(sindex = sum(index)), by = c("id", "loc"))
#        id loc      sindex
#  1:    US   B  0.54829867
#  2:    US   A  1.70064983
#  3: Other   B  0.43809711
#  4: Other   E  0.24479728
#  5: Other   C  2.80724742
#  6: Other   A  1.12107080
#  7:    US   E  0.66484697
#  8:    US   D  2.34900015
#  9: Other   D  1.58392886
# 10:    US   C  0.08424691
# 11:    NA   B  0.98639578
# 12:    NA   A  2.82172063
# 13:    NA   E  0.90964424
# 14:    NA   C  2.89149433
# 15:    NA   D  3.93292900
# 16:    US  NA  5.34704253
# 17: Other  NA  6.19514146
# 18:    NA  NA 11.54218399
groupingsets(DT, j = list(sindex = sum(index)), by = c("id", "loc"), sets = list(character(), "id", "loc", c("id", "loc")))
#        id loc      sindex
#  1:    NA  NA 11.54218399
#  2:    US  NA  5.34704253
#  3: Other  NA  6.19514146
#  4:    NA   B  0.98639578
#  5:    NA   A  2.82172063
#  6:    NA   E  0.90964424
#  7:    NA   C  2.89149433
#  8:    NA   D  3.93292900
#  9:    US   B  0.54829867
# 10:    US   A  1.70064983
# 11: Other   B  0.43809711
# 12: Other   E  0.24479728
# 13: Other   C  2.80724742
# 14: Other   A  1.12107080
# 15:    US   E  0.66484697
# 16:    US   D  2.34900015
# 17: Other   D  1.58392886
# 18:    US   C  0.08424691
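To get from the cube output to the layout shown in the question, the NA markers can be replaced with empty strings and the rows sorted. This is a sketch only, assuming a data.table version that ships cube and rollup; the names `res` and `hier` are hypothetical, and the row values depend on the R version's random number generator, so no output is shown.

```r
library(data.table)

set.seed(1)
DT <- data.table(
  id = sample(c("US", "Other"), 25, replace = TRUE),
  loc = sample(LETTERS[1:5], 25, replace = TRUE),
  index = runif(25)
)

# All grouping sets over id and loc; aggregated-away columns come back as NA
res <- cube(DT, j = list(sindex = sum(index)), by = c("id", "loc"))

# Replace the NA markers with "" and sort, mirroring the rbind() workaround
res[is.na(id), id := ""]
res[is.na(loc), loc := ""]
setorder(res, id, loc)
print(res)

# rollup() computes only the hierarchical sets: (id, loc), (id), and ()
hier <- rollup(DT, j = list(sindex = sum(index)), by = c("id", "loc"))
print(hier)
```

Replacing NA with "" is only safe when the grouping columns contain no genuine missing values, since those would become indistinguishable from the subtotal rows.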