Proper way of calculating means when aggregating data in long format
For a simple data frame:
client_id <- c("111", "111", "111", "112", "113", "113", "114")
transactions <- c(1, 2, 2, 2, 3, 17, 100)
transactions_sum <- c(5, 5, 5, 2, 20, 20, 100)  # precalculated sum of transaction counts for each client_id
segment <- c("low", "low", "low", "low", "low", "low", "high")
test <- data.frame(client_id, transactions, transactions_sum, segment)
  client_id transactions transactions_sum segment
1       111            1                5     low
2       111            2                5     low
3       111            2                5     low
4       112            2                2     low
5       113            3               20     low
6       113           17               20     low
7       114          100              100    high
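I am trying to aggregate by segment and calculate the mean number of transactions per client in each segment.
I expect the following result: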
  segment transactions_mean
1     low                 9
2    high               100
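Because the mean has to account for repeated client_ids, the individual transaction counts in each segment should be summed (1+2+2+2+3+17 = 27 for the low segment) and divided by the number of unique client_ids (3 for the low segment), giving 27/3 = 9 for the low segment. The precalculated per-client sums give the same result: (5+2+20)/3 = 9.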
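The arithmetic above can be checked directly in base R. This is only a sketch for the low segment; the variable names (`low`, `mean_from_rows`, `mean_from_sums`, `naive_mean`) are made up for illustration and are not part of the original question.

```r
# Rebuild the example data from the question.
test <- data.frame(
  client_id        = c("111", "111", "111", "112", "113", "113", "114"),
  transactions     = c(1, 2, 2, 2, 3, 17, 100),
  transactions_sum = c(5, 5, 5, 2, 20, 20, 100),
  segment          = c("low", "low", "low", "low", "low", "low", "high")
)

low <- test[test$segment == "low", ]

# Route 1: total transactions divided by the number of distinct clients.
mean_from_rows <- sum(low$transactions) / length(unique(low$client_id))

# Route 2: keep one row per client, then average the precalculated sums.
low_clients <- low[!duplicated(low$client_id), ]
mean_from_sums <- mean(low_clients$transactions_sum)

# The naive row-wise mean, for comparison (divides by 6 rows, not 3 clients).
naive_mean <- mean(low$transactions)
```

Both routes agree only because `transactions_sum` is constant within each client, so dropping the duplicated rows loses no information.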
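However, when I run "dcast" or "aggregate", I get the wrong numbers, because they evidently treat every row as a separate observation:
library(reshape2)  # data.table also provides dcast()
dcast(test, segment ~ ., mean, value.var="transactions")
which gives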
  segment     .
1     low   4.5
2    high 100.0
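In effect, it sums the transaction counts in each segment (1+2+2+2+3+17 for the low segment) and divides by the number of rows in that segment (6 for the low segment) rather than by the number of unique client_ids, which gives 27/6 = 4.5.
What is the correct way to calculate the mean in this case?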
We can use data.table:
library(data.table)
setDT(test)[, .(transactions_mean = sum(transactions)/uniqueN(client_id)), by = segment]
#    segment transactions_mean
# 1:     low                 9
# 2:    high               100
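If adding a package is not an option, the same "sum divided by distinct clients" idea can be sketched in base R with `tapply`. The intermediate names (`total_by_segment`, `clients_by_segment`) are illustrative assumptions, not part of the answer above.

```r
# Rebuild the example data from the question.
test <- data.frame(
  client_id    = c("111", "111", "111", "112", "113", "113", "114"),
  transactions = c(1, 2, 2, 2, 3, 17, 100),
  segment      = c("low", "low", "low", "low", "low", "low", "high")
)

# Total number of transactions in each segment.
total_by_segment <- tapply(test$transactions, test$segment, sum)

# Number of distinct clients in each segment.
clients_by_segment <- tapply(test$client_id, test$segment,
                             function(x) length(unique(x)))

# Mean transactions per client; names are the segment labels.
transactions_mean <- total_by_segment / clients_by_segment
```

Both `tapply` calls group by the same `segment` vector, so the two results line up element by element and the division keeps the segment names.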
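You can also do this in base R, by dividing each segment's total transactions by its number of unique clients:
low <- test$segment == "low"
high <- test$segment == "high"
meanLow <- sum(test$transactions[low]) / length(unique(test$client_id[low]))    # 27 / 3 = 9
meanHigh <- sum(test$transactions[high]) / length(unique(test$client_id[high])) # 100 / 1 = 100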
You can also use dplyr:
library(dplyr)
test_2 <- test %>%
  group_by(segment) %>%
  summarise(meanTransactions = sum(transactions) / n_distinct(client_id))
test_2
# A tibble: 2 × 2
  segment meanTransactions
  <chr>              <dbl>
1 high                 100
2 low                    9
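A two-step variant may be easier to extend: first collapse to one row per client, then summarise per segment. This is a sketch, assuming each client_id belongs to exactly one segment and a dplyr version that supports the `.groups` argument (1.0.0 or later). The names `per_client`, `client_total` and `per_segment` are hypothetical.

```r
library(dplyr)

# Rebuild the example data from the question.
test <- data.frame(
  client_id    = c("111", "111", "111", "112", "113", "113", "114"),
  transactions = c(1, 2, 2, 2, 3, 17, 100),
  segment      = c("low", "low", "low", "low", "low", "low", "high")
)

# Step 1: one row per client, holding that client's total transactions.
per_client <- test %>%
  group_by(segment, client_id) %>%
  summarise(client_total = sum(transactions), .groups = "drop")

# Step 2: any per-client statistic can now be taken within each segment.
per_segment <- per_client %>%
  group_by(segment) %>%
  summarise(
    transactions_mean   = mean(client_total),
    transactions_median = median(client_total)
  )
```

Once the data has one row per client, an ordinary `mean()` is correct, and statistics that cannot be written as a simple ratio (such as the median) come along for free.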