Number of specific documents and greater
Using quanteda, I build a dfm with the following options:
library(quanteda)
# Six identical documents plus one document made of rare words
df <- data.frame(
  text = c(rep("only a small text", 6), "remove this word lower frequency"),
  stringsAsFactors = FALSE  # keep text as character (the default before R 4.0 was factor)
)
tdfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  dfm()
How can I keep only the most frequent words, that is, those that appear in 6 or more documents? Each row is one document.
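Combining dfm_keep with docfreq gives you what you are looking for. I select a document frequency greater than 5 so that it works with your example; with a cutoff greater than 6 the dfm would have no features left.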
dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5])
Document-feature matrix of: 7 documents, 4 features (14.3% sparse).
       features
docs    only a small text
  text1    1 1     1    1
  text2    1 1     1    1
  text3    1 1     1    1
  text4    1 1     1    1
  text5    1 1     1    1
  text6    1 1     1    1
  text7    0 0     0    0
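To see why a cutoff of "greater than 5" works here, it can help to look at the document frequencies directly. The following is only a minimal sketch: it rebuilds the example dfm from the question so that it runs on its own, and `txt`, `df_counts`, and `keep` are illustrative names that do not come from the thread.

```r
library(quanteda)

# Rebuild the example dfm from the question
# (six identical documents plus one document of rare words)
txt <- c(rep("only a small text", 6), "remove this word lower frequency")
tdfm <- dfm(tokens(txt, remove_punct = TRUE, remove_numbers = TRUE))

# docfreq() gives, for each feature, the number of documents it occurs in
df_counts <- docfreq(tdfm)
print(df_counts)

# Features occurring in at least 6 documents
# (the same set as docfreq(tdfm) > 5, since counts are whole numbers)
keep <- featnames(tdfm)[df_counts >= 6]
print(keep)
```

On this data the four words of the repeated sentence each occur in 6 documents and every other word occurs in only 1, so any cutoff above 6 selects nothing.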
The simplest way is to use dfm_trim() with the argument min_docfreq = 6. Running this after your code above, with v2.1.0:
> dfm_trim(tdfm, min_docfreq = 6) %>%
    print(max_ndoc = -1)
Document-feature matrix of: 7 documents, 4 features (14.3% sparse).
       features
docs    only a small text
  text1    1 1     1    1
  text2    1 1     1    1
  text3    1 1     1    1
  text4    1 1     1    1
  text5    1 1     1    1
  text6    1 1     1    1
  text7    0 0     0    0