Remove documents with zero frequency
After this process:
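library(quanteda)
library(magrittr)  # provides %>% in case the installed quanteda version does not re-export it

df <- data.frame(text = c("only a small text","only a small text","only a small text","only a small text","only a small text","only a small text","remove this word lower frequency"), stringsAsFactors = FALSE)
tdfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  dfm()
# keep only the features that appear in more than 5 documents
dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5])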
How can I remove documents whose total token count is zero from the dfm?
After the selection, you can use dfm_subset to drop the empty rows:
dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5]) %>%
  dfm_subset(ntoken(.) > 0)
Document-feature matrix of: 6 documents, 4 features (0.0% sparse).
       features
docs    only a small text
  text1    1 1     1    1
  text2    1 1     1    1
  text3    1 1     1    1
  text4    1 1     1    1
  text5    1 1     1    1
  text6    1 1     1    1
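
For comparison, here is a minimal sketch of the same two steps written without the pipe. It is not part of the original answer: it swaps dfm_keep for dfm_trim with min_docfreq = 6 (equivalent to docfreq > 5 on this data), and the names txt, trimmed and nonempty are illustrative only.

```r
library(quanteda)

# Same toy corpus as above: six identical texts plus one outlier
txt <- c(rep("only a small text", 6), "remove this word lower frequency")

# Step 1: keep only features that occur in at least 6 documents
# (assumption: dfm_trim is used here in place of dfm_keep)
trimmed <- dfm_trim(dfm(tokens(txt)), min_docfreq = 6, docfreq_type = "count")

# Step 2: drop documents that are left with no tokens at all
nonempty <- dfm_subset(trimmed, ntoken(trimmed) > 0)

ndoc(nonempty)       # number of remaining documents
featnames(nonempty)  # remaining features
```

Passing the dfm by name to ntoken() avoids relying on the `.` placeholder, so this version should behave the same whether or not a pipe operator is loaded.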