Is there a way to transform a data.table so that unique row elements become column names and then show element counts?

I have the following data.table:

structure(list(index = structure(c(1571270400, 1571356800, 1571616000, 
1571702400, 1571788800, 1571875200, 1571961600, 1572220800, 1572307200, 
1572393600), tzone = "", tclass = c("POSIXct", "POSIXt"), class = c("POSIXct", 
"POSIXt")), A = structure(c(10L, 10L, 7L, 7L, 9L, 9L, 4L, 4L, 
4L, 4L), .Label = c("12", "13", "14", "21", "24", "31", "34", 
"41", "42", "43"), class = "factor"), AA = structure(c(2L, 2L, 
2L, 2L, 2L, 7L, 7L, 7L, 7L, 7L), .Label = c("12", "13", "14", 
"21", "23", "24", "31", "32", "34", "41", "42", "43"), class = "factor"), 
    AAC = structure(c(6L, 11L, 7L, 7L, 7L, 7L, 7L, NA, NA, 7L
    ), .Label = c("12", "13", "14", "21", "23", "24", "31", "34", 
    "41", "42", "43"), class = "factor"), AAL = structure(c(2L, 
    2L, 2L, 2L, 2L, 7L, 7L, 7L, 7L, 7L), .Label = c("12", "13", 
    "14", "21", "23", "24", "31", "32", "34", "41", "42", "43"
    ), class = "factor")), class = c("data.table", "data.frame"
), row.names = c(NA, -10L), sorted = "index")

Here is what this data looks like as a table:

         index  A AA  AAC AAL
 1: 2019-10-17 43 13   24  13
 2: 2019-10-18 43 13   43  13
 3: 2019-10-21 34 13   31  13
 4: 2019-10-22 34 13   31  13
 5: 2019-10-23 42 13   31  13
 6: 2019-10-24 42 31   31  31
 7: 2019-10-25 21 31   31  31
 8: 2019-10-28 21 31 <NA>  31
 9: 2019-10-29 21 31 <NA>  31
10: 2019-10-30 21 31   31  31

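I would like to transform it so that the unique elements in the rows become column names, and those columns show how often each element occurs in that row.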

         index  13 21  24 31 34 42 43 <NA>
 1: 2019-10-17   2  0   1  0  0  0  1  0
 2: 2019-10-18   2  0   0  0  0  0  2  0
 3: 2019-10-21   2  0   0  1  1  0  0  0
 4: 2019-10-22   2  0   0  1  1  0  0  0
 5: 2019-10-23   2  0   0  1  0  1  0  0
 6: 2019-10-24   0  0   0  3  0  1  0  0
 7: 2019-10-25   0  1   0  3  0  0  0  0
 8: 2019-10-28   0  1   0  2  0  0  0  1
 9: 2019-10-29   0  1   0  2  0  0  0  1
10: 2019-10-30   0  1   0  3  0  0  0  0

I believe there should be a clever way to do this with reshape or data.table functions. A pointer in the right direction would be very helpful.

Here is a solution using the newer tidyverse functions. It works on data.tables as well.

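  1. First we convert from wide to long format
    • The cols argument accepts tidyselect helpers to select columns by name. matches() selects columns whose names match a regular expression. You can read more about them in the manual: ?tidyselect::select_helpers
  2. Then we spread back to wide form
  3. We use values_fn to apply the length function to the values. This gives the number of times each unique value occurs per index
  4. Then we can optionally replace NA with 0 in all numeric columns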

Example below:

library(tidyverse)
df %>%
  pivot_longer(cols = matches('^A')) %>%               # convert to long form
  pivot_wider(id_cols = 'index', names_from = 'value', # Then spread wide again
              values_fn = list(value = length)) %>%    # return length of vals
  mutate_if(is.numeric, ~ ifelse(is.na(.), 0, .))      # replace NA with 0

# A tibble: 10 x 9
   index                `43`  `13`  `24`  `34`  `31`  `42`  `21`  `NA`
   <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2019-10-16 17:00:00     1     2     1     0     0     0     0     0
 2 2019-10-17 17:00:00     2     2     0     0     0     0     0     0
 3 2019-10-20 17:00:00     0     2     0     1     1     0     0     0
 4 2019-10-21 17:00:00     0     2     0     1     1     0     0     0
 5 2019-10-22 17:00:00     0     2     0     0     1     1     0     0
 6 2019-10-23 17:00:00     0     0     0     0     3     1     0     0
 7 2019-10-24 17:00:00     0     0     0     0     3     0     1     0
 8 2019-10-27 17:00:00     0     0     0     0     2     0     1     1
 9 2019-10-28 17:00:00     0     0     0     0     2     0     1     1
10 2019-10-29 17:00:00     0     0     0     0     3     0     1     0
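
As a possible variation on the pipeline above, the zero-filling step can be handed to pivot_wider() through its values_fill argument instead of a separate mutate_if() call. This is only a sketch: it assumes a reasonably recent tidyr, and the three-row tibble `df_small` is a hypothetical stand-in for the question's data (character columns and plain dates), not the original object.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the question's data (not the original object)
df_small <- tibble(
  index = as.Date(c("2019-10-17", "2019-10-25", "2019-10-28")),
  A     = c("43", "21", "21"),
  AA    = c("13", "31", "31"),
  AAC   = c("24", "31", NA),
  AAL   = c("13", "31", "31")
)

wide <- df_small %>%
  pivot_longer(cols = matches("^A")) %>%      # wide -> long: columns name, value
  pivot_wider(id_cols     = index,
              names_from  = value,            # each distinct value becomes a column
              values_from = value,
              values_fn   = length,           # count occurrences per index
              values_fill = 0L)               # combinations that never occur get 0

wide
```

Because values_fn = length produces integer counts, the fill value is written as 0L so the count columns stay integer rather than being converted to double.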

We can melt the dataset into 'long' format by specifying id.vars, then dcast it back to 'wide' format with fun.aggregate set to length

library(data.table)
dcast(melt(dt, id.vars = 'index'), as.IDate(index) ~ value, fun.aggregate = length)
#          index NA 13 21 24 31 34 42 43
# 1: 2019-10-16  0  2  0  1  0  0  0  1
# 2: 2019-10-17  0  2  0  0  0  0  0  2
# 3: 2019-10-20  0  2  0  0  1  1  0  0
# 4: 2019-10-21  0  2  0  0  1  1  0  0
# 5: 2019-10-22  0  2  0  0  1  0  1  0
# 6: 2019-10-23  0  0  0  0  3  0  1  0
# 7: 2019-10-24  0  0  1  0  3  0  0  0
# 8: 2019-10-27  1  0  1  0  2  0  0  0
# 9: 2019-10-28  1  0  1  0  2  0  0  0
#10: 2019-10-29  0  0  1  0  3  0  0  0

Note: if we don't want the NA column, specify na.rm = TRUE in melt
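
A minimal sketch of that note follows. The small `dt` built here is a hypothetical three-row stand-in for the question's data (character columns and IDate dates, chosen for brevity), so it illustrates the call rather than reproducing the original object.

```r
library(data.table)

# Hypothetical stand-in for the question's data: three dates, one missing value
dt <- data.table(
  index = as.IDate(c("2019-10-17", "2019-10-25", "2019-10-28")),
  A     = c("43", "21", "21"),
  AA    = c("13", "31", "31"),
  AAC   = c("24", "31", NA),
  AAL   = c("13", "31", "31")
)

# na.rm = TRUE drops the NA rows during melt, so dcast never sees an NA value
long <- melt(dt, id.vars = "index", na.rm = TRUE)

# Count how often each remaining value occurs per index
res <- dcast(long, index ~ value, fun.aggregate = length)
res
```

With the missing value removed before reshaping, the result has one count column per observed value and no NA column.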