Stats properties among documents in R

Question

I have a table, for example:

test_data <- data.frame(
  doc = c(1,1,2,2,2,3,3),
  word = c("person", "grand", "person", "moment", "bout", "person", "moment"),
  frenq = c(9,8,5,4,3,5,3))

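I want to compute the mean and standard deviation for each "word" and create a new table, for example:

    word   freq   (number of docs)   mean    std
  person     19                  3   6.33  2.309
  moment      7                  2   2.33  2.081

The main problem is the std: for the word "person" it is sd(c(9,5,5)), but for the word "moment" it should be sd(c(0,4,3)). The zero comes first because the word does not appear in document 1.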
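The difference is easy to check directly. A minimal base R sketch (the variable names are only illustrative) comparing the zero-padded vector with the one that keeps only the documents where "moment" occurs:

```r
# "person" occurs in all three documents
person <- c(9, 5, 5)

# "moment" is missing from document 1: padded with a zero vs. present values only
moment_padded  <- c(0, 4, 3)
moment_present <- c(4, 3)

sd(person)          # about 2.309
sd(moment_padded)   # about 2.082, the value wanted in the table
sd(moment_present)  # about 0.707, what you get if the zero is dropped
```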

Answer 1

A simple approach is to first get the list of unique words in your data (d):

uw <- unique(d$word)

Then you can loop over uw and pull out all the data for the matching word (w):

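for (w in uw) {
    rows <- d$word == w
    # number of distinct documents containing w
    # (max(d$doc[rows]) would return the largest doc id, not a count)
    numdoc <- length(unique(d$doc[rows]))
    # the frequency column is named "frenq" in test_data
    freqs <- d$frenq[rows]
    # note: freqs holds only documents where w occurs; zeros for missing documents still need to be added
    m <- mean(freqs)
    ## etc ...
}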

I'm sure there are more elegant ways to do this with apply, but the above should give you a better idea of how to proceed.
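One apply-based variant might look like the sketch below. It is not the loop above, but a swapped-in base R approach: xtabs builds a doc-by-word matrix in which missing (doc, word) pairs become 0, and the statistics are then computed per column. It assumes the question's test_data, with each (doc, word) pair appearing at most once.

```r
test_data <- data.frame(
  doc   = c(1, 1, 2, 2, 2, 3, 3),
  word  = c("person", "grand", "person", "moment", "bout", "person", "moment"),
  frenq = c(9, 8, 5, 4, 3, 5, 3))

# doc-by-word matrix of frequencies; absent (doc, word) pairs are filled with 0
m <- xtabs(frenq ~ doc + word, data = test_data)

# one row per word, statistics computed over all documents (zeros included)
res <- data.frame(
  word         = colnames(m),
  freq         = colSums(m),
  NumberOfdocs = colSums(m != 0),
  mean         = colMeans(m),
  std          = apply(m, 2, sd),
  row.names    = NULL)

res
```

Because the zeros are already in the matrix, sd for "moment" is taken over c(0, 4, 3) rather than c(4, 3).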

Answer 2

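You could try dplyr. Create a new dataset ("d1") with every combination of the unique values of the "doc" and "word" columns of "test_data" (expand.grid(..)). Join "d1" with "test_data" (left_join), replace the NA values in "frenq" with 0 (replace(frenq, ..)), and compute the summary statistics with summarise_each after grouping by "word".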

library(dplyr)
d1 <- expand.grid(doc = unique(test_data$doc), word = unique(test_data$word))
res <- left_join(d1, test_data) %>%
    mutate(frenq = replace(frenq, is.na(frenq), 0)) %>%
    group_by(word) %>%
    summarise_each(funs(freq = sum, NumberOfdocs = sum(. != 0),
                        mean, std = sd), frenq)
res
#    word freq NumberOfdocs     mean      std
#1   bout    3            1 1.000000 1.732051
#2  grand    8            1 2.666667 4.618802
#3 moment    7            2 2.333333 2.081666
#4 person   19            3 6.333333 2.309401
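Note that summarise_each() and funs() are deprecated in recent dplyr releases. A minimal sketch of the same steps with plain summarise(), assuming a current dplyr and the question's test_data (res2 is just an illustrative name):

```r
library(dplyr)

test_data <- data.frame(
  doc   = c(1, 1, 2, 2, 2, 3, 3),
  word  = c("person", "grand", "person", "moment", "bout", "person", "moment"),
  frenq = c(9, 8, 5, 4, 3, 5, 3))

# every (doc, word) combination, keeping word as character
d1 <- expand.grid(doc  = unique(test_data$doc),
                  word = unique(test_data$word),
                  stringsAsFactors = FALSE)

res2 <- d1 %>%
  left_join(test_data, by = c("doc", "word")) %>%  # unmatched pairs get NA
  mutate(frenq = coalesce(frenq, 0)) %>%           # NA -> 0
  group_by(word) %>%
  summarise(freq         = sum(frenq),
            NumberOfdocs = sum(frenq != 0),
            mean         = mean(frenq),
            std          = sd(frenq))

res2
```

Spelling out the join keys with by = c("doc", "word") also avoids the message left_join prints when it has to guess them.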

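Or use a similar approach with data.table. Convert the "data.frame" to a "data.table" (setDT), set "doc" and "word" as the key columns (setkey), cross join the unique elements of "doc" and "word" (CJ(doc=...,)), assign 0 to the NA elements of "frenq" (is.na(frenq), frenq:=0), and get the summary statistics (list(freq=..)) grouped by "word".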

library(data.table)
setkey(setDT(test_data), doc, word)[CJ(doc=unique(doc),
      word=unique(word))][is.na(frenq), frenq:=0][,
         list(freq=sum(frenq), Numberofdocs=sum(frenq!=0),
                mean=mean(frenq), std=sd(frenq)) , by = word]
#    word freq Numberofdocs     mean      std
#1:   bout    3            1 1.000000 1.732051
#2:  grand    8            1 2.666667 4.618802
#3: moment    7            2 2.333333 2.081666
#4: person   19            3 6.333333 2.309401
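One thing to keep in mind is that setDT and setkey modify "test_data" in place. If the original data.frame should stay untouched, a step-by-step variant on a copy might look like this sketch (dt, full and res_dt are illustrative names, not part of the code above):

```r
library(data.table)

test_data <- data.frame(
  doc   = c(1, 1, 2, 2, 2, 3, 3),
  word  = c("person", "grand", "person", "moment", "bout", "person", "moment"),
  frenq = c(9, 8, 5, 4, 3, 5, 3))

# work on a copy so test_data itself remains a plain data.frame
dt <- as.data.table(test_data)

# join onto every (doc, word) combination; missing pairs get NA in frenq
full <- dt[CJ(doc = unique(dt$doc), word = unique(dt$word)),
           on = c("doc", "word")]

# replace the NAs with 0, then summarise per word
full[is.na(frenq), frenq := 0]
res_dt <- full[, .(freq         = sum(frenq),
                   Numberofdocs = sum(frenq != 0),
                   mean         = mean(frenq),
                   std          = sd(frenq)),
               by = word]

res_dt
```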
