Return vector of majority factor levels per n rows/observations in R
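For example, I have a data frame with two factor variables and 1000 rows. I want to reduce the number of observations to 200 by returning, for every block of 5 rows, the most frequently occurring level of each variable.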
df <- data.frame(test=factor(sample(c("A","B", "C" ),1000,replace=TRUE)))
df$test2 <- factor(sample(c("dog", "cat", "fish"), 1000, replace=TRUE))
head(df, 15)
test test2
1 C fish
2 B dog
3 A fish
4 B fish
5 B dog
6 A cat
7 B cat
8 C fish
9 C fish
10 C cat
11 B dog
12 A fish
13 B dog
14 B cat
15 C dog
I would like the output to give two columns like this:
testANS test2ANS
B fish
C cat
B dog
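I have found examples where the most common category is found across columns within a single row, but none where it is found down a column in blocks of a given number of rows. Thanks in advance for any suggestions.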
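We can try data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), then count the rows (.N) grouped by 'test', 'test2' and a grouping variable ('grp') created by repeating each value of the sequence 1:200 five times. Next, grouped by 'grp', take the subset of the data.table (.SD) where 'N' is largest (which.max(N)). If needed, we can assign the 'grp' and 'N' columns to 'NULL' to drop them.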
library(data.table)
res <- setDT(df)[, .N, by = .(test, test2, grp = rep(1:200, each = 5))
][, .SD[which.max(N)], by = grp][, c("grp", "N") := NULL][]
dim(res)
#[1] 200 2
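The grouping vector rep(1:200, each = 5) hard-codes both the block size and the number of rows. As a minimal sketch, the same block index could be derived from the row count instead. block_id() is a hypothetical helper name introduced here for illustration; it is not part of data.table or base R.

```r
# block_id() is a hypothetical helper, not a data.table or base R function.
# It labels rows 1..n_rows with consecutive block numbers,
# each block covering block_size rows.
block_id <- function(n_rows, block_size) {
  ceiling(seq_len(n_rows) / block_size)
}

block_id(15, 5)
# 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

# A trailing partial block simply gets its own id
block_id(12, 5)
# 1 1 1 1 1 2 2 2 2 2 3 3
```

Under that assumption, grp = block_id(nrow(df), 5) should be usable in place of the rep() call, and it would also cope with a row count that is not a multiple of 5.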
Since the OP did not use set.seed when creating the sample, the output will not be the same. Using the first 15 rows shown in the OP's post:
setnames(setDT(df1)[, .N, by = .(test, test2, grp= rep(1:3, each = 5))
][, .SD[which.max(N)] , grp][, c("grp", "N") := NULL][], paste0(names(df1), "ANS"))[]
# testANS test2ANS
#1: B dog
#2: C fish
#3: B dog
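The first block shows why this result differs from the output requested in the question: the code above picks the most frequent (test, test2) pair, which need not match the most frequent level of each column taken on its own. A small base R sketch, using the first five rows from the question and variable names chosen here purely for illustration:

```r
# First five rows of the sample data shown in the question
test  <- c("C", "B", "A", "B", "B")
test2 <- c("fish", "dog", "fish", "fish", "dog")

# Most frequent (test, test2) combination within the block
pair_counts <- table(paste(test, test2))
joint_mode <- names(pair_counts)[which.max(pair_counts)]
joint_mode
# "B dog"

# Most frequent level of each column, computed independently
test_mode  <- names(which.max(table(test)))
test2_mode <- names(which.max(table(test2)))
c(test_mode, test2_mode)
# "B" "fish"
```

The pair (B, dog) occurs twice, so it wins the joint count, while 'fish' occurs three times in test2 and wins the per-column count.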
Update
Based on the comments, it seems the frequencies should be computed separately for each column:
setDT(df1)[, grp:= rep(1:3, each = 5)][,
testN := .N ,by = .(grp, test)][, test2N := .N, by = .(grp, test2)
][, .(testANS = test[which.max(testN)], test2ANS = test2[which.max(test2N)]), grp]
# grp testANS test2ANS
#1: 1 B fish
#2: 2 C cat
#3: 3 B dog
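For comparison, the same per-column logic can be sketched in base R without data.table. The helper name majority() is an assumption made for this example. One difference worth knowing: which.max() on a table breaks ties in favour of the first level in table order (alphabetical for character vectors), whereas the data.table code above keeps the level that appears first within the block, so the two may disagree when counts are tied.

```r
# Same 15 rows as the 'df1' sample data in this answer
df1 <- data.frame(
  test  = c("C", "B", "A", "B", "B", "A", "B", "C", "C", "C",
            "B", "A", "B", "B", "C"),
  test2 = c("fish", "dog", "fish", "fish", "dog", "cat", "cat", "fish",
            "fish", "cat", "dog", "fish", "dog", "cat", "dog"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most frequent value of x.
# Ties go to the first entry in table order.
majority <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

grp <- rep(1:3, each = 5)  # one id per block of 5 rows

res <- data.frame(
  testANS  = unname(vapply(split(df1$test,  grp), majority, character(1))),
  test2ANS = unname(vapply(split(df1$test2, grp), majority, character(1))),
  stringsAsFactors = FALSE
)
res
#   testANS test2ANS
# 1       B     fish
# 2       C      cat
# 3       B      dog
```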
Note: for the original dataset, change rep(1:3, each = 5) to rep(1:200, each = 5).
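Because the question draws its sample without a seed, the exact 1000 rows cannot be reproduced. A minimal base R sketch of a reproducible run on data of the same shape follows. The seed value 42 and the helper majority() are arbitrary choices made for illustration, not something the original post used.

```r
set.seed(42)  # arbitrary seed (an assumption); any fixed value would do
df <- data.frame(test = factor(sample(c("A", "B", "C"), 1000, replace = TRUE)))
df$test2 <- factor(sample(c("dog", "cat", "fish"), 1000, replace = TRUE))

# Hypothetical helper: most frequent value of x (ties go to the first level)
majority <- function(x) names(which.max(table(x)))

grp <- rep(1:200, each = 5)  # one id per block of 5 rows

# Apply the per-block majority to every column of df
res <- as.data.frame(lapply(df, function(col) {
  unname(vapply(split(as.character(col), grp), majority, character(1)))
}), stringsAsFactors = FALSE)
names(res) <- paste0(names(df), "ANS")

dim(res)
# [1] 200   2
```

Rerunning the whole block with the same seed should give the same 200 rows each time.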
Data
df1 <- structure(list(test = c("C", "B", "A", "B", "B", "A", "B", "C",
"C", "C", "B", "A", "B", "B", "C"), test2 = c("fish", "dog",
"fish", "fish", "dog", "cat", "cat", "fish", "fish", "cat", "dog",
"fish", "dog", "cat", "dog")), .Names = c("test", "test2"),
class = "data.frame", row.names = c(NA, -15L))