Summary statistics from aggregated groups using data.table
I have a dataset with this structure:
library(data.table)
dt <- data.table(
  record  = c(1:20),
  area    = rep(LETTERS[1:4], c(4, 6, 3, 7)),
  score   = c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
  cluster = c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)
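I would like to aggregate the data so that, for a given score (e.g. 1), I can identify the most common cluster in each area. I would also like to compute some basic frequencies and percentages, with output that looks like this: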
dt_summary_for_1_score <- data.table(
  area         = c("A","B","C","D"),
  cluster_mode = c("X","X","X","Z"),
  cluster_pct  = c(100,66.6,100,80),
  cluster_freq = c(2,2,1,4),
  record_freq  = c(2,3,1,5)
)
Ideally, I would like a solution that uses data.table. Thanks.
I would use frank, although a solution based on sort(table(cluster)) is also possible.
dt_summary =
  dt[ , .N, keyby = .(area, score, cluster)
      ][ , {
        idx = frank(-N, ties.method = 'min') == 1
        NN = sum(N)
        .(
          cluster_mode = cluster[idx],
          cluster_pct = 100*N[idx]/NN,
          cluster_freq = N[idx],
          record_freq = NN
        )
      }, by = .(area, score)]
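For comparison, the sort(table(cluster)) route mentioned above might look roughly like the sketch below. This is an illustration rather than part of the original answer: the name dt_summary_tbl is made up here so that dt_summary is left untouched, and unlike the frank version it keeps only one cluster per area/score group when several are tied for first place.

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  record  = c(1:20),
  area    = rep(LETTERS[1:4], c(4, 6, 3, 7)),
  score   = c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
  cluster = c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)

# dt_summary_tbl is a hypothetical name used only for this sketch
dt_summary_tbl <- dt[ , {
  # Tabulate clusters within the current area/score group,
  # most frequent first
  tab <- sort(table(cluster), decreasing = TRUE)
  .(
    cluster_mode = names(tab)[1L],         # label of the top cluster
    cluster_pct  = 100 * tab[[1L]] / .N,   # its share of the group's records
    cluster_freq = tab[[1L]],              # its count
    record_freq  = .N                      # records in the group
  )
}, keyby = .(area, score)]

dt_summary_tbl[score == 1]
```

Because table() is called once per group, this is likely to be slower than the two-step .N approach on data with many groups.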
To get the score == 1 example, we can subset dt_summary:
dt_summary[score == 1]
#    area score cluster_mode cluster_pct cluster_freq record_freq
# 1:    A     1            X   100.00000            2           2
# 2:    B     1            X    66.66667            2           3
# 3:    C     1            X   100.00000            1           1
# 4:    D     1            Z    80.00000            4           5
Note that this returns multiple rows for a group in the case of ties. You can try alternatives such as cluster_mode = paste(cluster[idx], collapse = '|') or cluster_mode = list(cluster[idx]).
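As a rough sketch of the paste alternative: in the example data, area C with score 2 has one X and one Y, so it is a tie. One thing to watch is that data.table recycles length-1 items in j, so if only cluster_mode is collapsed while the other columns still use N[idx], the group would still come out as two rows. The version below therefore also reduces the tied counts to a single value with [1L]. The name dt_summary_collapsed is hypothetical.

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  record  = c(1:20),
  area    = rep(LETTERS[1:4], c(4, 6, 3, 7)),
  score   = c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
  cluster = c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)

# dt_summary_collapsed is a hypothetical name used only for this sketch
dt_summary_collapsed <-
  dt[ , .N, keyby = .(area, score, cluster)
      ][ , {
        idx <- frank(-N, ties.method = 'min') == 1
        NN  <- sum(N)
        top <- N[idx][1L]   # tied clusters share the same count; keep one copy
        .(
          # all top-ranked clusters joined into one string, e.g. "X|Y"
          cluster_mode = paste(cluster[idx], collapse = '|'),
          cluster_pct  = 100 * top / NN,
          cluster_freq = top,
          record_freq  = NN
        )
      }, by = .(area, score)]

# The tied group now occupies a single row
dt_summary_collapsed[area == "C" & score == 2]
```

Here cluster_pct and cluster_freq describe each tied cluster individually, not the tied clusters combined.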
Breaking down the logic:
# Count how many times each cluster shows up with each area/score
dt[ , .N, keyby = .(area, score, cluster)
    ][ , {
      # Rank each cluster's count within each area/score & take the top;
      #   ties.method = 'min' guarantees that if there's
      #   a tie for "winner", _both_ will get rank 1
      #   (by default, ties.method = 'average')
      # Note that it is slightly inefficient to negate N
      #   in order to sort in descending order, especially
      #   if there are a large number of groups. We could
      #   either vectorize negation by using -.N in the
      #   previous step or by using frankv (a lower-level
      #   version of frank) which has an 'order' argument
      idx = frank(-N, ties.method = 'min') == 1
      # calculate here since it's used twice
      NN = sum(N)
      .(
        # use [idx] to subset and make sure there are
        #   only as many rows on output as there are
        #   top-ranked clusters for this area/score
        cluster_mode = cluster[idx],
        cluster_pct = 100*N[idx]/NN,
        cluster_freq = N[idx],
        record_freq = NN
      )
    }, by = .(area, score)]
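The frankv variant mentioned in the comments could be sketched as follows. This assumes that passing order = -1L to frankv ranks in descending order, so that N does not need to be negated inside each group. The name dt_summary_v is hypothetical, and whether this is measurably faster has not been benchmarked here.

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  record  = c(1:20),
  area    = rep(LETTERS[1:4], c(4, 6, 3, 7)),
  score   = c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
  cluster = c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)

# dt_summary_v is a hypothetical name used only for this sketch
dt_summary_v <-
  dt[ , .N, keyby = .(area, score, cluster)
      ][ , {
        # order = -1L asks for a descending ranking, so the
        # largest count gets rank 1 without computing -N
        idx <- frankv(N, order = -1L, ties.method = 'min') == 1
        NN  <- sum(N)
        .(
          cluster_mode = cluster[idx],
          cluster_pct  = 100 * N[idx] / NN,
          cluster_freq = N[idx],
          record_freq  = NN
        )
      }, by = .(area, score)]

dt_summary_v[score == 1]
```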