Number of specific documents and greater

Question

Using quanteda, I create a dfm with the following code:

library(quanteda)

df <- data.frame(text = c("only a small text","only a small text","only a small text","only a small text","only a small text","only a small text","remove this word lower frequency"))
tdfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  dfm()

How can I keep only the words that appear in 6 or more documents? Every row is a document.

Answer 1

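dfm_keep combined with docfreq gives you what you are looking for. I selected features with a document frequency greater than 5 so that it works with your example; with a higher threshold the dfm would be empty.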

dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5])
Document-feature matrix of: 7 documents, 4 features (14.3% sparse).
       features
docs    only a small text
  text1    1 1     1    1
  text2    1 1     1    1
  text3    1 1     1    1
  text4    1 1     1    1
  text5    1 1     1    1
  text6    1 1     1    1
  text7    0 0     0    0

Answer 2

The simplest way is to use dfm_trim() with the argument min_docfreq = 6. Using v2.1.0, after running your code above:

> dfm_trim(tdfm, min_docfreq = 6) %>%
      print(max_ndoc = -1)
Document-feature matrix of: 7 documents, 4 features (14.3% sparse).
       features
docs    only a small text
  text1    1 1     1    1
  text2    1 1     1    1
  text3    1 1     1    1
  text4    1 1     1    1
  text5    1 1     1    1
  text6    1 1     1    1
  text7    0 0     0    0
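
As a rough cross-check, the sketch below rebuilds the example dfm and compares dfm_trim() against plain column indexing on docfreq(). The variable names (txt, df_counts, kept_trim, kept_index) are illustrative and not from the original post, and it assumes a quanteda version in which tokens() is called before dfm().

```r
library(quanteda)

# Same seven documents as in the question (names here are illustrative)
txt <- c(rep("only a small text", 6), "remove this word lower frequency")
tdfm <- dfm(tokens(txt, remove_punct = TRUE, remove_numbers = TRUE))

# Document frequency: the number of documents each feature occurs in
df_counts <- docfreq(tdfm)

# Two ways to keep features that occur in at least 6 documents
kept_trim  <- dfm_trim(tdfm, min_docfreq = 6)
kept_index <- tdfm[, df_counts >= 6]

# Both selections should contain the same features
identical(featnames(kept_trim), featnames(kept_index))
```

Note that min_docfreq = 6 is an inclusive threshold, so it matches the docfreq(tdfm) > 5 condition used in Answer 1.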

r

quanteda