Question: Randomly pick rows based on the number of other elements with data.table?

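I have a dataset stored as a data.table, shown below.

I want to compare values, so I need to select rows based on the smallest number of elements in each group.

For a clearer picture, here is a toy sample.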

library(data.table)
set.seed(100)
DT <- data.table(CATA = sample(LETTERS[1:4], 4489, replace = TRUE),
                 ITEM = sample(LETTERS[24:26], 4489, prob = c(0.4, 0.3, 0.3), replace = TRUE),
                 VAL  = sample(100:999, 4489, replace = TRUE)
      )

# 4489 is an arbitrarily chosen number

If I count the items within each CATA:
DT[,.N, by = .(CATA, ITEM)][order(CATA)]
#     CATA ITEM   N
#  1:    A    X 433
#  2:    A    Y 323
#  3:    A    Z 342
#  4:    B    X 452
#  5:    B    Y 333
#  6:    B    Z 358
#  7:    C    X 461
#  8:    C    Y 302
#  9:    C    Z 359
# 10:    D    X 461
# 11:    D    Y 344
# 12:    D    Z 321

I can find the minimum count for each category.

DTmin <- DT[,.N, by = .(CATA, ITEM)][order(CATA)][,.(MIN = min(N)), by = CATA]

>DTmin
#    CATA MIN
# 1:    A 323
# 2:    B 333
# 3:    C 302
# 4:    D 321

What I need is to pass the DTmin values back so that all items within each category end up with the same number of rows, for example:

DT[,.N, by = .(CATA, ITEM)][order(CATA)]
#     CATA ITEM   N
#  1:    A    X 323 # was 433
#  2:    A    Y 323
#  3:    A    Z 323 # was 342
#  4:    B    X 333 # was 452
#  5:    B    Y 333
#  6:    B    Z 333 # was 358
#  7:    C    X 302 # was 461
#  8:    C    Y 302
#  9:    C    Z 302 # was 359
# 10:    D    X 321 # was 461
# 11:    D    Y 321 # was 344
# 12:    D    Z 321

The final DT should have 3837 rows ( sum(DTmin$MIN)*3 ).
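The expected row count follows directly from the minimum counts: each of the three items in a category contributes that category's MIN rows. A small worked check, with the MIN values copied by hand from the DTmin printout above (`min_per_cata`, `n_items`, and `expected_rows` are names introduced here for illustration only):

```r
# MIN values copied from the DTmin printout (assumption: they match your run)
min_per_cata <- c(A = 323, B = 333, C = 302, D = 321)

# Three items (X, Y, Z) per category, each reduced to that category's MIN
n_items <- 3

expected_rows <- sum(min_per_cata) * n_items
expected_rows
# [1] 3837
```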

Here is my approach. Please suggest a smoother way that does not break the chain or introduce a new column.

# create a temporary index column, filter on it, then remove it
# (tmpV is a random permutation of 1..N, so tmpV <= MIN keeps exactly MIN rows)
DTmin[DT, on = .(CATA)][, tmpV := sample(.N), by = .(CATA, ITEM)][tmpV <= MIN][, tmpV := NULL]

# check
DTmin[DT, on = .(CATA)][, tmpV := sample(.N), by = .(CATA, ITEM)][tmpV <= MIN][, tmpV := NULL][, .N, by = .(CATA, ITEM)][order(CATA, ITEM)]

#     CATA ITEM   N
#  1:    A    X 323
#  2:    A    Y 323
#  3:    A    Z 323
#  4:    B    X 333
#  5:    B    Y 333
#  6:    B    Z 333
#  7:    C    X 302
#  8:    C    Y 302
#  9:    C    Z 302
# 10:    D    X 321
# 11:    D    Y 321
# 12:    D    Z 321

Thanks to @ronak-shan

DT[DTmin, on = 'CATA'][, .SD[sample(.N, first(MIN))], .(CATA, ITEM)]

My reading is that DT[DTmin, on = 'CATA'], as a data.table, is passed to the next block, which selects MIN rows out of .N within each group defined by CATA & ITEM.
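The chain can be unpacked step by step to check that reading. This is a minimal sketch that rebuilds the toy data so it runs on its own; `joined`, `res`, and `chk` are names introduced here for illustration, and the exact rows drawn will differ between runs and R versions.

```r
library(data.table)

set.seed(100)
DT <- data.table(CATA = sample(LETTERS[1:4], 4489, replace = TRUE),
                 ITEM = sample(LETTERS[24:26], 4489, prob = c(0.4, 0.3, 0.3), replace = TRUE),
                 VAL  = sample(100:999, 4489, replace = TRUE))
DTmin <- DT[, .N, by = .(CATA, ITEM)][, .(MIN = min(N)), by = CATA]

# Step 1: the join attaches each category's MIN to every row of DT
joined <- DT[DTmin, on = "CATA"]

# Step 2: within each CATA/ITEM group, draw first(MIN) of the .N row positions
res <- joined[, .SD[sample(.N, first(MIN))], by = .(CATA, ITEM)]

# Every group should now hold exactly its category's MIN rows
chk <- res[, .(N = .N, MIN = first(MIN)), by = .(CATA, ITEM)]
all(chk$N == chk$MIN)
```

Note that the MIN column stays in `res`; it can be dropped afterwards if it is not wanted.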

We can join DT and DTmin so that the MIN value is available in the data frame, and then select MIN rows for each combination of CATA and ITEM.

library(data.table)
DT[DTmin, on = 'CATA'][, .SD[sample(.N, first(MIN))], .(CATA, ITEM)]

Similarly with dplyr:

library(dplyr)
DT %>%
  left_join(DTmin, by = 'CATA') %>%
  group_by(CATA, ITEM) %>%
  sample_n(first(MIN))

All MIN values are the same within a group, so we can use any of them; I use the first one.
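If the extra MIN column in the result is unwanted, one possible variation (not part of the answer above, just a sketch) is to sample row numbers with `.I` and subset DT once at the end, so neither a join nor a helper column is needed. The names `groups`, `keep`, and `res2` are hypothetical and introduced only for this example.

```r
library(data.table)

set.seed(100)
DT <- data.table(CATA = sample(LETTERS[1:4], 4489, replace = TRUE),
                 ITEM = sample(LETTERS[24:26], 4489, prob = c(0.4, 0.3, 0.3), replace = TRUE),
                 VAL  = sample(100:999, 4489, replace = TRUE))

# Row numbers of every CATA/ITEM group, stored as a list column
groups <- DT[, .(rows = list(.I)), by = .(CATA, ITEM)]

# Within each CATA, draw as many row numbers per ITEM as the smallest ITEM has
keep <- groups[, {
  k <- min(lengths(rows))  # smallest ITEM count in this CATA
  .(i = unlist(lapply(rows, function(r) r[sample.int(length(r), k)])))
}, by = CATA]$i

# Subset the original table; only CATA, ITEM and VAL remain
res2 <- DT[keep]
res2[, .N, by = .(CATA, ITEM)][order(CATA, ITEM)]
```

Using `sample.int(length(r), k)` to pick positions, rather than `sample(r, k)`, avoids the base R behaviour where `sample()` treats a single number as the range `1:x`.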