Removing phrases (stop phrases) from a corpus in R?
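I can easily remove stopwords with the tm package, but is there a simple way to remove a specific phrase? I want to be able to remove the phrase "good morning", but leave "good" alone when it is not immediately followed by "morning".
Example: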
library(tm)

x <- "Good morning. Great question...I'd say we had a good time."
doc.vec <- VectorSource(x)
doc.corpus <- Corpus(doc.vec)
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
doc.corpus <- tm_map(doc.corpus, removePunctuation)
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
doc.corpus <- tm_map(doc.corpus, removeWords, c(stopwords("english"), "good"))
dtm <- DocumentTermMatrix(doc.corpus, control=list())
inspect(dtm)
I'm not sure I fully understand, but this may just be a simple job for gsub:
gsub("[Gg]ood.morning", "", x)
[1] ". Great question...I'd say we had a good time."
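The pattern above uses `.` between the two words, which matches any single character. A slightly stricter variant is sketched below, assuming the phrases contain no regex metacharacters. The helper name `remove_phrases` is made up for illustration and is not part of tm or base R. It anchors each phrase at word boundaries, ignores case, and tolerates runs of whitespace between the words.

```r
# Hypothetical helper (not part of tm): remove whole phrases from text.
# Assumes the phrases contain no regex metacharacters.
remove_phrases <- function(text, phrases) {
  # Let any run of whitespace separate the words of a phrase
  phrases <- gsub(" +", "\\\\s+", phrases)
  # Match any of the phrases, but only at word boundaries
  pattern <- paste0("\\b(", paste(phrases, collapse = "|"), ")\\b")
  gsub(pattern, "", text, ignore.case = TRUE, perl = TRUE)
}

x <- "Good morning. Great question...I'd say we had a good time."
remove_phrases(x, "good morning")
# [1] ". Great question...I'd say we had a good time."
```

If the text already lives in a tm corpus, a function like this could presumably be applied through `tm_map` by wrapping it in `content_transformer`.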
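Just add "good morning" to the list of words to remove. Note that this relies on the text having already been lowercased, as in your example: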
doc.corpus <- tm_map(doc.corpus, removeWords, c(stopwords("english"), "good morning"))
If you inspect the dtm, you'll see that only one "good" is left and there is no "morning".
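For reference, here is a minimal end-to-end sketch of that approach. It makes a few assumptions that differ from the question's code: it uses `VCorpus` rather than `Corpus`, lowercases before anything else, and removes the phrase in its own step before the ordinary stopwords. The ordering is the point: the phrase only matches once the text is lowercased and still contains both words.

```r
library(tm)

x <- "Good morning. Great question...I'd say we had a good time."
corpus <- VCorpus(VectorSource(x))

# Normalise first so the phrase matches regardless of case or spacing
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Remove the phrase before the single-word stopwords
corpus <- tm_map(corpus, removeWords, "good morning")
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)
Terms(dtm)  # "good" should still be listed, "morning" should not
```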