(R) About stopwords in DocumentTermMatrix

I have some questions about DocumentTermMatrix() and its stopwords. I ran the following, but I can't get the result I want.

library(tm)
library(stringr)   # str_extract_all(), boundary()
library(magrittr)  # %>%

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(mycorpus, control=list(stopwords=F))
lapply(mycorpus, function(x){str_extract_all(x, boundary("word"))}) %>% unlist() %>% table()
## also  but  his   is   my text 
##    1    1    1    1    1    3 
apply(mydtm, 2, sum)
##  also   but   his  text text. 
##     1     1     1     2     1 

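First, even though I used stopwords=F, the dtm still removed some stopwords, such as "is". However, it did not remove "his", even though it is in both stopwords("en") and stopwords("SMART"). So I really don't understand which stopwords the DTM uses and why stopwords=F doesn't work. What should I do to make it work?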

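You could try an alternative package: quanteda. It lets you remove stopwords either after tokenization or after creating the document-feature matrix. Below, I use pad = TRUE only to show the positions where tokens matching stopwords were removed.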

library(quanteda)

text <- "text is my text but also his text."

tokens(text) %>%
  tokens_remove(stopwords("en"), pad = TRUE)
## tokens from 1 document.
## text1 :
## [1] "text" ""     ""     "text" ""     "also" ""     "text" "."

dfm(text)
## Document-feature matrix of: 1 document, 7 features (0.0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
##        features
## docs    text is my but also his .
##   text1    3  1  1   1    1   1 1

dfm(text, remove_punct = TRUE) %>%
  dfm_remove(stopwords("en"))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##        features
## docs    text also
##   text1    3    1
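
In more recent quanteda releases, calling dfm() directly on a character vector is, as far as I know, deprecated in favour of tokenizing first. A minimal sketch of the same result with an explicit tokens step, assuming quanteda 3 or later is installed (the intermediate names toks, toks_nostop and mydfm are only illustrative):

```r
library(quanteda)

text <- "text is my text but also his text."

# Tokenize first, dropping punctuation at the tokenization step
toks <- tokens(text, remove_punct = TRUE)

# Remove English stopwords from the tokens, then build the dfm
toks_nostop <- tokens_remove(toks, stopwords("en"))
mydfm <- dfm(toks_nostop)

featnames(mydfm)  # features that survive stopword removal
colSums(mydfm)    # their counts in the single document
```

Removing stopwords at the tokens stage rather than on the finished dfm keeps the option of using pad = TRUE, which only makes sense while token positions still exist.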

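The English stopword list is just the character vector returned by the stopwords() function (which actually comes from the stopwords package). The default English list is the same as tm::stopwords("en"), except that tm's list also includes "will". (If you want the SMART list, it is stopwords("en", source = "smart").)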
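
If you would rather stay with tm, the output in the question looks less like stopword removal and more like a word-length filter: according to the documentation for tm's termFreq(), the wordLengths control option defaults to c(3, Inf), which would drop "is" and "my" but keep "his". The stray "text." would be explained by punctuation not being removed by default. A sketch under those assumptions, not a definitive diagnosis:

```r
library(tm)

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))

# Assumption: short words vanish because of the default wordLengths = c(3, Inf),
# not because of stopword removal. Allow any word length and strip punctuation
# so that "text." is counted together with "text".
mydtm <- DocumentTermMatrix(
  mycorpus,
  control = list(
    stopwords = FALSE,
    wordLengths = c(1, Inf),
    removePunctuation = TRUE
  )
)

# Term counts for the single document
counts <- colSums(as.matrix(mydtm))
counts
```

Switching stopwords = FALSE to stopwords = TRUE in the same control list should then remove tm's English stopwords while still counting the remaining short words.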

