How to do Group By Rollup in R? (Like SQL)
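I have a dataset and I want to perform something like a Group By Rollup, as we do in SQL for aggregated values.
Below is a reproducible example. I know that aggregate works really well, as explained here, but it does not fit my case.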
year<- c('2016','2016','2016','2016','2017','2017','2017','2017')
month<- c('1','1','1','1','2','2','2','2')
region<- c('east','west','east','west','east','west','east','west')
sales<- c(100,200,300,400,200,400,600,800)
df<- data.frame(year,month,region,sales)
df
year month region sales
1 2016 1 east 100
2 2016 1 west 200
3 2016 1 east 300
4 2016 1 west 400
5 2017 2 east 200
6 2017 2 west 400
7 2017 2 east 600
8 2017 2 west 800
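What I want to do now is aggregate (sum sales by year, month and region) and add the new aggregate rows to the existing data frame.
For example, there should be two additional rows, as shown below, and the aggregate rows should get the new region name 'USA':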
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 USA 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 USA 2000
I have figured out a way to do it (below), but I am pretty sure there is a more optimal solution, or at least a better workaround than mine:
df1<- setNames(aggregate(df$sales, by=list(df$year,df$month, df$region), FUN=sum),
c('year','month','region', 'sales'))
df2<- setNames(aggregate(df$sales, by=list(df$year,df$month), FUN=sum),
c('year','month', 'sales'))
df2$region<- 'USA' ## added a new column- region- for total USA
df2<- df2[, c('year','month','region', 'sales')] ## reordering the columns of df2
df3<- rbind(df1,df2)
df3<- df3[order(df3$year,df3$month,df3$region),] ## order by
rownames(df3)<- NULL ## renumbered the rows after order by
df3
Thanks for the help!
plyr::ddply(df, c("year", "month", "region"), plyr::summarise, sales = sum(sales))
melt/dcast in the reshape2 package can do subtotalling. After running dcast, we use na.locf from the zoo package to replace "(all)" in the month column with the month:
library(reshape2)
library(zoo)
m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")
dout$month <- na.locf(replace(dout$month, dout$month == "(all)", NA))
giving:
> dout
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 (all) 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 (all) 2000
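To make the na.locf step above easier to follow, here is a minimal base-R sketch of "last observation carried forward". The helper name locf and the sample vector are assumptions made for illustration; this is not zoo's implementation, only the same idea applied to a character vector shaped like the month column.

```r
# Hypothetical helper (not part of zoo): carry the last non-NA value forward.
locf <- function(x) {
  vals <- x[!is.na(x)]        # the observed (non-NA) values, in order
  idx  <- cumsum(!is.na(x))   # for each position, how many values seen so far
  c(NA, vals)[idx + 1]        # idx == 0 (a leading NA) maps to NA
}

# Assumed sample: a month column where subtotal rows are marked "(all)"
month <- c("1", "1", "(all)", "2", "2", "(all)")

# Turn the "(all)" markers into NA, then fill each NA from the row above
month_filled <- locf(replace(month, month == "(all)", NA))
print(month_filled)
```

Each "(all)" marker picks up the month of the row just above it, which is why the subtotal rows end up with month 1 and month 2. Unlike this sketch, na.locf drops leading NA values by default.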
In the recent development version of data.table, 1.10.5, you can use the new feature called "grouping sets" to produce subtotals:
library(data.table)
setDT(df)
res = groupingsets(df, .(sales=sum(sales)), sets=list(c("year","month"), c("year","month","region")), by=c("year","month","region"))
setorder(res, na.last=TRUE)
res
# year month region sales
#1: 2016 1 east 400
#2: 2016 1 west 600
#3: 2016 1 NA 1000
#4: 2017 2 east 800
#5: 2017 2 west 1200
#6: 2017 2 NA 2000
You can replace NA with USA using res[is.na(region), region := "USA"].
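For readers who cannot install a development version, the two grouping sets used above, (year, month, region) and (year, month), can also be spelled out by hand in base R. This is only a sketch under assumptions: it uses the formula interface of aggregate, rebuilds the sample data with stringsAsFactors = FALSE, and the names by_region, totals and out are chosen for illustration.

```r
# Sample data from the question, kept as character columns
df <- data.frame(
  year   = c("2016", "2016", "2016", "2016", "2017", "2017", "2017", "2017"),
  month  = c("1", "1", "1", "1", "2", "2", "2", "2"),
  region = c("east", "west", "east", "west", "east", "west", "east", "west"),
  sales  = c(100, 200, 300, 400, 200, 400, 600, 800),
  stringsAsFactors = FALSE
)

# Grouping set 1: one row per year, month and region
by_region <- aggregate(sales ~ year + month + region, data = df, FUN = sum)

# Grouping set 2: one subtotal row per year and month, labelled "USA"
totals <- aggregate(sales ~ year + month, data = df, FUN = sum)
totals$region <- "USA"

# Stack both sets with matching column order
out <- rbind(by_region, totals[names(by_region)])

# Sort so that the "USA" subtotal comes last within each year and month
out <- out[order(out$year, out$month, out$region == "USA", out$region), ]
rownames(out) <- NULL
print(out)
```

Sorting on out$region == "USA" keeps the subtotal row after the regional rows regardless of locale, which a plain sort on region does not guarantee.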