Creating multiple dummies from an existing data frame or data table

Question

I am looking for a quick extension of the solution posted here. In it, Frank starts from an example data table

test <- data.table("index"=rep(letters[1:10],100),"var1"=rnorm(1000,0,1))

and shows that you can quickly make dummies with the following code

inds <- unique(test$index) ; test[,(inds):=lapply(inds,function(x)index==x)]

Now I would like to extend this solution to a data.table that has several index columns, e.g.

new <- data.table("id" = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"),125), "index1"=rep(letters[1:5],200),"index2" = rep(letters[6:15],100),"index3" = rep(letters[16:19],250))

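I need to do this for many dummies, and ideally the solution would get me four things:

The total count of every index
The mean number of times every index occurs
The count of every index per id
The mean of every index per id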

In my real case the indices are named differently, so I think the solution needs to be able to loop over the column names.

Thanks

Simon

Answer 1

If you only need the four items in that list, you should just tabulate:

indcols <- paste0('index',1:3)
lapply(new[,indcols,with=FALSE],table) # counts
lapply(new[,indcols,with=FALSE],function(x)prop.table(table(x))) # means

# or...

lapply(
  new[,indcols,with=FALSE],
  function(x){
    z<-table(x)
    rbind(count=z,mean=prop.table(z))
  })

This gives

$index1
          a     b     c     d     e
count 200.0 200.0 200.0 200.0 200.0
mean    0.2   0.2   0.2   0.2   0.2

$index2
          f     g     h     i     j     k     l     m     n     o
count 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
mean    0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1

$index3
           p      q      r      s
count 250.00 250.00 250.00 250.00
mean    0.25   0.25   0.25   0.25

The approach above works on a data.frame or a data.table, but is rather complicated. With a data.table, you can use the melt syntax instead:

melt(new, id="id")[,.(
  N=.N, 
  mean=.N/nrow(new)
), by=.(variable,value)]

This gives

    variable value   N mean
 1:   index1     a 200 0.20
 2:   index1     b 200 0.20
 3:   index1     c 200 0.20
 4:   index1     d 200 0.20
 5:   index1     e 200 0.20
 6:   index2     f 100 0.10
 7:   index2     g 100 0.10
 8:   index2     h 100 0.10
 9:   index2     i 100 0.10
10:   index2     j 100 0.10
11:   index2     k 100 0.10
12:   index2     l 100 0.10
13:   index2     m 100 0.10
14:   index2     n 100 0.10
15:   index2     o 100 0.10
16:   index3     p 250 0.25
17:   index3     q 250 0.25
18:   index3     r 250 0.25
19:   index3     s 250 0.25

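@Arun mentioned this approach in a comment (and I think he also implemented it...?). To see how it works, first have a look at melt(new, id="id"), which reshapes the original data.table into long format: one row per combination of id, index column (variable) and index value (value).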

As noted in the comments, with some versions of the data.table package, melting a data.table requires reshape2 to be installed and loaded.
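The melt output above covers the overall counts and means. For the per-id items in the question, a minimal sketch along the same lines is to add id to the grouping. This assumes that "mean per id" means the share of that id's rows taking each value, which is my reading of the question rather than something it states; the names long and per_id are just illustrative.

```r
library(data.table)

# Example data from the question
new <- data.table(
  id     = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"), 125),
  index1 = rep(letters[1:5], 200),
  index2 = rep(letters[6:15], 100),
  index3 = rep(letters[16:19], 250)
)

# Long format: one row per (id, index column, index value)
long <- melt(new, id.vars = "id")

# Count of every index value per id
per_id <- long[, .(N = .N), by = .(id, variable, value)]

# Share of each value within an id, separately for each index column
# (assumed interpretation of "mean per id")
per_id[, mean := N / sum(N), by = .(id, variable)]
```

Note that combinations which never occur for an id are simply absent from per_id rather than listed with a count of zero.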

If you also need the dummies, you can make them in a loop, as in the linked question:

newcols <- list()
for (i in indcols){
    vals = unique(new[[i]])
    newcols[[i]] = paste(vals,i,sep='_')
    new[,(newcols[[i]]):=lapply(vals,function(x)get(i)==x)]
}

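For convenience, this stores the group of columns associated with each variable in newcols. If you want to tabulate using only these dummies (rather than the underlying variables, as in the solution above), you can do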

lapply(
  indcols,
  function(i) new[,lapply(.SD,function(x){
    z <- sum(x)
    list(z,z/.N)
  }),.SDcols=newcols[[i]] ])

This gives a similar result. I only wrote it this way to illustrate how the data.table syntax can be used. Here, too, you can avoid the square brackets and .SD:

lapply(
  indcols,
  function(i) sapply(
    new[, newcols[[i]], with=FALSE],
    function(x){
      z<-sum(x)
      rbind(z,z/length(x))
    }))

But anyway: if you can hold on to the underlying variables, just use table.
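As a rough illustration of that advice for the per-id case, table also accepts two vectors and cross-tabulates them. The sketch below uses a plain data.frame to show that no data.table features are needed; counts_by_id and shares_by_id are illustrative names, and the row shares again assume that "mean per id" means the proportion of an id's rows.

```r
# Example data from the question, as a plain data.frame
new <- data.frame(
  id     = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"), 125),
  index1 = rep(letters[1:5], 200),
  index2 = rep(letters[6:15], 100),
  index3 = rep(letters[16:19], 250),
  stringsAsFactors = FALSE
)

indcols <- paste0("index", 1:3)

# One id-by-value contingency table per index column
counts_by_id <- lapply(new[indcols], function(x) table(id = new$id, value = x))

# Row-wise proportions: within each id, the share of each index value
shares_by_id <- lapply(counts_by_id, prop.table, margin = 1)
```

Unlike a grouped count, these tables keep a cell (with a zero) for every id and value combination, including those that never occur.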
