Return vector of majority factor levels per n rows/observations in R

Question

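For example, I have a data frame with two factor variables and 1000 rows. I would like to reduce the number of observations to 200 by returning, for every 5 rows, the most frequently occurring level of each variable.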

 df <- data.frame(test=factor(sample(c("A","B", "C" ),1000,replace=TRUE)))
 df$test2 <- factor(sample(c("dog", "cat", "fish"), 1000, replace=TRUE))
 head(df, 15)

     test test2
1     C  fish
2     B   dog
3     A  fish
4     B  fish
5     B   dog
6     A   cat
7     B   cat
8     C  fish
9     C  fish
10    C   cat
11    B   dog
12    A  fish
13    B   dog
14    B   cat
15    C   dog

I would like the output to give two columns, like this:

testANS      test2ANS
B            fish
C            cat
B            dog

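I have found some examples where the most common category is found across columns within a row, but not down a column over a fixed number of rows. Thanks in advance for any suggestions; they would be much appreciated.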

Answer 1

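We can try data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)). Grouped by 'test', 'test2', and a variable created by replicating the sequence 1 to 200 five times each ('grp'), we get the number of rows (.N). Then, grouped by 'grp', we take the Subset of Data.table (.SD) where 'N' is the maximum (which.max(N)). If needed, we can assign the 'grp' and 'N' columns to 'NULL'.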

library(data.table)
res <- setDT(df)[, .N, by = .(test, test2, grp = rep(1:200, each = 5))
             ][, .SD[which.max(N)], by = grp][, c("grp", "N") := NULL][]
dim(res)
#[1] 200   2

Since the OP did not call set.seed before creating the sample, the output will not be the same from run to run. Using the first 15 rows shown in the OP's post:

setnames(setDT(df1)[, .N, by = .(test, test2, grp= rep(1:3, each = 5))
   ][, .SD[which.max(N)] , grp][,  c("grp", "N") := NULL][], paste0(names(df1), "ANS"))[]
#    testANS test2ANS
#1:       B      dog
#2:       C     fish
#3:       B      dog
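
On the reproducibility point above, a minimal sketch of how the example data could be made repeatable by fixing the random seed before sampling. The seed value 123 and the name df_again are arbitrary choices made for illustration; they are not from the original post.

```r
# Fix the seed so that sample() draws the same values on every run.
set.seed(123)  # arbitrary seed, chosen only for illustration
df <- data.frame(test = factor(sample(c("A", "B", "C"), 1000, replace = TRUE)))
df$test2 <- factor(sample(c("dog", "cat", "fish"), 1000, replace = TRUE))

# Resetting the same seed and repeating the same calls reproduces the data.
set.seed(123)
df_again <- data.frame(test = factor(sample(c("A", "B", "C"), 1000, replace = TRUE)))
df_again$test2 <- factor(sample(c("dog", "cat", "fish"), 1000, replace = TRUE))

identical(df, df_again)  # TRUE
```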

Update

Based on the comments, it seems the frequency should be computed separately for each column:

setDT(df1)[,  grp:= rep(1:3, each = 5)][,
     testN := .N ,by = .(grp, test)][, test2N := .N, by = .(grp, test2)
       ][, .(testANS = test[which.max(testN)], test2ANS = test2[which.max(test2N)]), grp]
#   grp testANS  test2ANS
#1:   1       B      fish
#2:   2       C       cat
#3:   3       B       dog
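
For comparison, the same per-column idea can be sketched in base R without data.table. This is an alternative illustration, not the answer's method: majority_by_block is a hypothetical helper name, and it assumes the number of rows is an exact multiple of n.

```r
# Hypothetical helper (not from the answer): majority value per block of n rows.
majority_by_block <- function(x, n = 5) {
  stopifnot(length(x) %% n == 0)                  # assumes complete blocks only
  grp <- rep(seq_len(length(x) %/% n), each = n)  # same index as rep(1:3, each = 5)
  blocks <- split(as.character(x), grp)
  # table() counts each value; which.max() keeps the first maximum, so ties go
  # to the alphabetically first value rather than the first one encountered.
  vapply(blocks, function(v) names(which.max(table(v))), character(1),
         USE.NAMES = FALSE)
}

# The 15 rows shown in the question.
df1 <- data.frame(
  test  = c("C", "B", "A", "B", "B", "A", "B", "C", "C", "C",
            "B", "A", "B", "B", "C"),
  test2 = c("fish", "dog", "fish", "fish", "dog", "cat", "cat", "fish",
            "fish", "cat", "dog", "fish", "dog", "cat", "dog")
)

res <- data.frame(testANS  = majority_by_block(df1$test),
                  test2ANS = majority_by_block(df1$test2))
res
```

Tie handling differs from the data.table code above, which picks the value whose first row in the block has the highest count; the 15 sample rows contain no ties, so both approaches agree here.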

Note: for the original dataset, change rep(1:3, each = 5) to rep(1:200, each = 5).
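
Rather than hard-coding 1:3 or 1:200, the grouping index can be derived from the number of rows. The following is a small sketch under that assumption; block_index is a hypothetical helper, not part of data.table or the original answer. It also shows what happens when the row count is not a multiple of n: the last block is simply shorter.

```r
# Hypothetical helper: block number (1, 2, 3, ...) for each of n_rows rows,
# with n consecutive rows per block.
block_index <- function(n_rows, n) (seq_len(n_rows) - 1L) %/% n + 1L

n <- 5
grp_full    <- block_index(1000, n)  # equivalent to rep(1:200, each = 5)
grp_partial <- block_index(12, n)    # last block holds only 2 rows

max(grp_full)          # 200
tabulate(grp_partial)  # 5 5 2
```

A vector built this way could be supplied wherever the answer uses rep(1:200, each = 5).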

Data

df1 <- structure(list(test = c("C", "B", "A", "B", "B", "A", "B", "C", 
"C", "C", "B", "A", "B", "B", "C"), test2 = c("fish", "dog", 
"fish", "fish", "dog", "cat", "cat", "fish", "fish", "cat", "dog", 
"fish", "dog", "cat", "dog")), .Names = c("test", "test2"),
 class = "data.frame", row.names = c(NA, -15L))
