Recreate dplyr summarise in data.table
Out of curiosity, is there a way to recreate the summary output below using data.table rather than dplyr?
library(data.table)
library(dplyr)

dt1 <- data.table(
  uid   = c("A00111", "A00112", "A00113", "A00211", "A00212", "A00213", "A00214", "A00311", "A00312"),
  area  = c("A001", "A001", "A001", "A002", "A002", "A002", "A002", "A003", "A003"),
  price = c(325147, NA, 596020, 257409, 241206, 248371, 261076, 595218, 596678),
  type  = c("Type1", "Type2", "Type3", "Type2", "Type3", "Type2", "Type2", "Type2", "Type3"))

summary <- dt1 %>% group_by(area) %>% summarise(
  Total_Number = length(uid),
  # as written this equals length(uid); n_distinct(uid) would count distinct uids
  Total_Number_Check = unique(length(uid)),
  Number_of_Type_1 = length(uid[type == "Type1"]),
  Mean_Price_Type_1 = mean(price[type == "Type1"], na.rm = TRUE),
  Number_of_Type_2 = length(uid[type == "Type2"]),
  Mean_Price_Type_2 = mean(price[type == "Type2"], na.rm = TRUE),
  Number_of_Type_3 = length(uid[type == "Type3"]),
  Mean_Price_Type_3 = mean(price[type == "Type3"], na.rm = TRUE))
Here is a data.table approach.
The comment by @DavidArenburg above shows the default way to summarise with data.table.
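For reference, a one-pass summary of that kind might look like the sketch below. It is a direct translation of the dplyr call in the question, not a quote of the comment, and the name one_pass is only illustrative. Note that every type has to be written out by hand.

```r
library(data.table)

# Same sample data as in the question
dt1 <- data.table(
  uid   = c("A00111", "A00112", "A00113", "A00211", "A00212", "A00213", "A00214", "A00311", "A00312"),
  area  = c("A001", "A001", "A001", "A002", "A002", "A002", "A002", "A003", "A003"),
  price = c(325147, NA, 596020, 257409, 241206, 248371, 261076, 595218, 596678),
  type  = c("Type1", "Type2", "Type3", "Type2", "Type3", "Type2", "Type2", "Type2", "Type3"))

# One grouped call: every summary column is spelled out in j
one_pass <- dt1[, .(
  Total_Number       = .N,                      # rows per area
  Total_Number_Check = uniqueN(uid),            # distinct uids per area
  Number_of_Type_1   = sum(type == "Type1"),    # count of matching rows (0 if none)
  Mean_Price_Type_1  = mean(price[type == "Type1"], na.rm = TRUE),
  Number_of_Type_2   = sum(type == "Type2"),
  Mean_Price_Type_2  = mean(price[type == "Type2"], na.rm = TRUE),
  Number_of_Type_3   = sum(type == "Type3"),
  Mean_Price_Type_3  = mean(price[type == "Type3"], na.rm = TRUE)
), by = area]

print(one_pass)
```

With this form, a type that does not occur in an area gets a count of 0 and a mean of NaN, because the mean is taken over an empty vector.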
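However, I did not create the summary in one go, because you may have more than 3 values of type. In that case, summarising more than 10 types by hand is not practical: the code becomes long and tedious.
So I first summarise by area (DT1), then summarise by area AND type. I then cast the result of the second summary to wide format (DT2) and left-join DT2 to DT1 on area.
As a result, the code below works for any number of areas and any number of types.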
library( data.table )
#summarise by area
DT1 <- dt1[ , .( Total_Number       = .N,
                 Total_Number_Check = uniqueN( uid ) ),
            by = .(area)]
#summarise by area AND type and cast to wide format
DT2 <- dcast( dt1[ , .( Number_of  = .N,
                        Mean_Price = mean( price, na.rm = TRUE ) ),
                   by = .(area, type)],
              area ~ type,
              value.var = c("Number_of", "Mean_Price") )
#join
DT1[DT2, on = .(area)]
# area Total_Number Total_Number_Check Number_of_Type1 Number_of_Type2 Number_of_Type3 Mean_Price_Type1
# 1: A001 3 3 1 1 1 325147
# 2: A002 4 4 NA 3 1 NA
# 3: A003 2 2 NA 1 1 NA
# Mean_Price_Type2 Mean_Price_Type3
# 1: NA 596020
# 2: 255618.7 241206
# 3: 595218.0 596678
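In the output above, a type that is absent from an area shows NA in its Number_of_ column, whereas the dplyr version in the question reports 0. If matching counts matter, one possible follow-up is to replace NA with 0 in the count columns only after the join. This is a minimal sketch: the names result and count_cols are illustrative, and the pattern "^Number_of_" assumes the column names produced by the dcast call above.

```r
library(data.table)

# Same sample data as in the question
dt1 <- data.table(
  uid   = c("A00111", "A00112", "A00113", "A00211", "A00212", "A00213", "A00214", "A00311", "A00312"),
  area  = c("A001", "A001", "A001", "A002", "A002", "A002", "A002", "A003", "A003"),
  price = c(325147, NA, 596020, 257409, 241206, 248371, 261076, 595218, 596678),
  type  = c("Type1", "Type2", "Type3", "Type2", "Type3", "Type2", "Type2", "Type2", "Type3"))

# Rebuild the two summaries and the join from the answer
DT1 <- dt1[, .(Total_Number = .N, Total_Number_Check = uniqueN(uid)), by = .(area)]
DT2 <- dcast(dt1[, .(Number_of = .N, Mean_Price = mean(price, na.rm = TRUE)),
                 by = .(area, type)],
             area ~ type,
             value.var = c("Number_of", "Mean_Price"))
result <- DT1[DT2, on = .(area)]

# Replace NA by 0 in the count columns only; the mean columns are left untouched
count_cols <- grep("^Number_of_", names(result), value = TRUE)
for (col in count_cols) {
  set(result, i = which(is.na(result[[col]])), j = col, value = 0L)
}

print(result)
```

Passing fill = 0 to dcast is not used here because, with two value.var columns, it would also put 0 into the missing mean prices.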