From contingency tables to data.frame in R

Question

My starting point is several character vectors containing POS tags that I extracted from text. For example:

c("NNS", "VBP", "JJ",  "CC",  "DT")
c("NNS", "PRP", "JJ",  "RB",  "VB")

I use table() or ftable() to count the occurrences of each tag.

 CC  DT  JJ NNS VBP 
 1   1   1   1   1

The end goal is a data.frame that looks like this:

   NNS VBP PRP JJ CC RB DT VB
1  1   1   0   1  1  0  1  0
2  1   0   1   1  0  1  0  1

Using plyr::rbind.fill seems reasonable to me here, but it needs data.frame objects as input. However, calling as.data.frame.matrix(table(POS_vector)) raises an error:

Error in seq_len(ncols) : 
argument must be coercible to non-negative integer

Using as.data.frame.matrix(ftable(POS_vector)) does produce a data.frame, but without the tag names as column names:

V1 V2 V3 V4 V5 ...
1  1  1  1  1

Any help is much appreciated.

Answer 1

This is perhaps more of a workaround, but it may be a solution.

Let's assume all vectors are in a list:

dat <- list(c("NNS", "VBP", "JJ",  "CC",  "DT"),
c("NNS", "PRP", "JJ",  "RB",  "VB"))

Then we convert each table to a transposed matrix, which we turn into a data.table:

library(data.table)
temp <- lapply(dat,function(x){
  data.table(t(as.matrix(table(x))))
})

Then we use rbindlist to create the desired output:

rbindlist(temp, fill = TRUE)
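
One detail worth noting: with fill = TRUE, rbindlist fills tags that are missing from a vector with NA rather than 0. A minimal sketch of one way to get zeros instead, assuming a data.table version that provides setnafill (the name res is only used here for illustration):

```r
library(data.table)

# Same example data as above
dat <- list(c("NNS", "VBP", "JJ", "CC", "DT"),
            c("NNS", "PRP", "JJ", "RB", "VB"))

# One single-row data.table of counts per vector, as in the step above
temp <- lapply(dat, function(x) data.table(t(as.matrix(table(x)))))

# Combine; tags absent from a vector are NA at this point
res <- rbindlist(temp, fill = TRUE)

# Replace the NAs with 0 by reference (the count columns are integer)
setnafill(res, fill = 0L)
res
```

Because setnafill modifies res in place, no reassignment is needed.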

Alternatively, we can first put all the data into a single data.table and then aggregate. Note that this assumes the vectors are of equal length.

temp <- as.data.table(dat)
#turn to long format
temp_m <- melt(temp, measure.vars=colnames(temp))

#count values for each variable/value-combination, then reshape to wide
res <- dcast(temp_m[,.N,by=.(variable,value)], variable~value,value.var="N", fill=0)

Answer 2

In base R, you can try:

table(rev(stack(setNames(dat, seq_along(dat)))))
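
The one-liner above returns a table object rather than a data.frame. A small sketch of how it could be taken one step further in base R, assuming a plain data.frame with one row per vector is wanted (the names long, tab and res are only illustrative):

```r
# Same example data as in the answers
dat <- list(c("NNS", "VBP", "JJ", "CC", "DT"),
            c("NNS", "PRP", "JJ", "RB", "VB"))

# Long format: one row per tag, plus the index of the vector it came from
long <- stack(setNames(dat, seq_along(dat)))

# Two-way table: vector index in rows, tag in columns
tab <- table(long$ind, long$values)

# as.data.frame.matrix keeps the wide layout and the dimnames of a 2-D table
res <- as.data.frame.matrix(tab)
res
```

Unlike the one-dimensional table in the question, this table has two dimensions, so as.data.frame.matrix can use the tag names as column names.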

You can also use mtabulate from "qdapTools":

library(qdapTools)
mtabulate(dat)
#   CC DT JJ NNS PRP RB VB VBP
# 1  1  1  1   1   0  0  0   1
# 2  0  0  1   1   1  1  1   0

where dat is the same as defined in @Heroka's answer:

dat <- list(c("NNS", "VBP", "JJ",  "CC",  "DT"),
            c("NNS", "PRP", "JJ",  "RB",  "VB"))
