Remove documents with zero frequency

Question

After running the following:

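library(quanteda)
library(magrittr)  # provides %>% in case the installed quanteda version does not export it

# stringsAsFactors = FALSE keeps the text column as character on R < 4.0
df <- data.frame(text = c("only a small text","only a small text","only a small text","only a small text","only a small text","only a small text","remove this word lower frequency"), stringsAsFactors = FALSE)
# tokenize the texts, then build a document-feature matrix
tdfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  dfm()
# keep only the features that occur in more than 5 documents
dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5])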

How can I remove the documents whose total token count is zero from the dfm?

Answer 1

After the feature selection, you can use dfm_subset to drop the empty rows (documents with no tokens left):

dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5]) %>% 
  dfm_subset(ntoken(.) > 0)

Document-feature matrix of: 6 documents, 4 features (0.0% sparse).
       features
docs    only a small text
  text1    1 1     1    1
  text2    1 1     1    1
  text3    1 1     1    1
  text4    1 1     1    1
  text5    1 1     1    1
  text6    1 1     1    1
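
The same idea can also be written without the `.` placeholder, which may be easier to follow and does not depend on the magrittr pipe. The sketch below is a minimal variation, assuming the same seven example texts; the object names `txt`, `trimmed` and `nonempty` are introduced here only for illustration.

```r
library(quanteda)

# Same example data as in the question: six identical texts plus one outlier
txt <- c(rep("only a small text", 6), "remove this word lower frequency")
tdfm <- dfm(tokens(txt, remove_punct = TRUE, remove_numbers = TRUE))

# Step 1: keep only the features that occur in more than 5 documents
trimmed <- dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5])

# ntoken() gives the token count per document; text7 has 0 tokens
# after the selection, because none of its words were kept
ntoken(trimmed)

# Step 2: keep only the documents that still contain at least one token
nonempty <- dfm_subset(trimmed, ntoken(trimmed) > 0)
ndoc(nonempty)  # 6
```

Storing the intermediate result in `trimmed` makes it possible to check with `ntoken()` which documents became empty before dropping them.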
