(R) About stopwords in DocumentTermMatrix
I have some questions about DocumentTermMatrix() and its stopwords.
I ran the following, but did not get the result I expected.
library(tm)
library(stringr)  # str_extract_all(), boundary(), and the %>% pipe
text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(mycorpus, control = list(stopwords = FALSE))
lapply(mycorpus, function(x) str_extract_all(x, boundary("word"))) %>% unlist() %>% table()
## .
## also  but  his   is   my text
##    1    1    1    1    1    3
apply(mydtm, 2, sum)
##  also   but   his  text text.
##     1     1     1     2     1
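First, even though I set stopwords=F, the DTM still drops some stopwords, such as "is". Yet it does not drop "his", even though that word is in both stopwords("en") and stopwords("SMART").
So I really don't understand which stopwords the DTM uses, or why stopwords=F has no effect. What should I do to make it work?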
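You could try an alternative package: quanteda. It lets you remove stopwords either after tokenizing or after creating the document-feature matrix. Below, I use pad = TRUE only to show the positions where tokens matching stopwords were removed.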
library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
text <- "text is my text but also his text."
tokens(text) %>%
tokens_remove(stopwords("en"), pad = TRUE)
## tokens from 1 document.
## text1 :
## [1] "text" "" "" "text" "" "also" "" "text" "."
Or:
dfm(text)
## Document-feature matrix of: 1 document, 7 features (0.0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
## features
## docs text is my but also his .
## text1 3 1 1 1 1 1 1
dfm(text, remove_punct = TRUE) %>%
dfm_remove(stopwords("en"))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
## features
## docs text also
## text1 3 1
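The output above comes from quanteda 1.4.1. As far as I know, newer releases (3.0 onward) deprecate calling dfm() directly on a character vector and no longer export the %>% pipe, so on a current installation the same result is probably best reached by tokenizing first. A minimal sketch under that assumption (the names toks and dfmat are only illustrative):

```r
library(quanteda)

text <- "text is my text but also his text."

# Tokenize and drop punctuation, then remove the English stopwords
toks <- tokens(text, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# Build the document-feature matrix from the cleaned tokens
dfmat <- dfm(toks)
featnames(dfmat)
```

Removing the stopwords at the token stage rather than from the finished matrix should give the same features here; it simply avoids the deprecated character method.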
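The English stopword list is just the character vector returned by the stopwords() function (which actually comes from the stopwords package). The default English list is the same as tm::stopwords("en") except for "will", which appears in the list printed below. (If you want the SMART list, it is stopwords("en", source = "smart").)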
stopwords("en")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very" "will"
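If you would rather stay with tm: the behaviour in the question is most likely not stopword removal at all. By default, DocumentTermMatrix() keeps only terms of at least three characters (the wordLengths control option defaults to c(3, Inf)) and does not strip punctuation, which would explain why "is" and "my" vanish while "his" stays and "text." is counted separately from "text". A minimal sketch, assuming tm is installed and that these defaults apply to your version (dtm_all, dtm_nostop and the counts vectors are illustrative names):

```r
library(tm)

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))

# Keep terms of any length and strip punctuation, with no stopword removal
dtm_all <- DocumentTermMatrix(
  mycorpus,
  control = list(
    stopwords = FALSE,
    removePunctuation = TRUE,
    wordLengths = c(1, Inf)
  )
)
counts_all <- colSums(as.matrix(dtm_all))

# Same settings, but now remove tm's English stopwords as well
dtm_nostop <- DocumentTermMatrix(
  mycorpus,
  control = list(
    stopwords = TRUE,
    removePunctuation = TRUE,
    wordLengths = c(1, Inf)
  )
)
counts_nostop <- colSums(as.matrix(dtm_nostop))

counts_all
counts_nostop
```

With these options, counts_all should contain the short words again, and counts_nostop should be left with only the terms that are not on tm's English list.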