Prevent tm from removing stopwords inside two-word phrases
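I am trying to remove stopwords from a character vector. The problem is that one entry contains the phrase "king kong". Because 'king' is one of my stopwords, the "king" in "king kong" gets removed.
Is there a way to keep two-word phrases from being broken up like this?
My code is: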
library(tm)

# newmnt1$form is chr [1:4] "king kong lives" "foot" "island" "skull"
text <- VCorpus(VectorSource(newmnt1$form))

# Normal standardization of text.
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, removeWords, custom_stopwords)  # custom_stopwords contains "king"
text <- tm_map(text, stripWhitespace)
newmnt2 <- text[[1]]$content
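A quick hack is to convert your "king kong" pattern to "king_kong" before removing the stopwords. The underscore counts as a word character, so "king" is no longer matched as a separate word.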
a <- gsub("king kong", "king_kong", "This is a pattern with king and king kong")
a
[1] "This is a pattern with king and king_kong"
tm::removeWords(a, "king")
[1] "This is a pattern with and king_kong"
Best,
Colin
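The underscore trick above can be extended to several phrases and undone once the stopwords are gone. What follows is only a minimal sketch in base R. The helper names `protect_phrases`, `restore_phrases`, and `drop_words` are hypothetical and not part of tm, and `drop_words` is a simplified stand-in for `tm::removeWords` so that the example runs without tm installed.

```r
# Hypothetical helpers (not part of tm): protect multi-word phrases before
# stopword removal, then restore them afterwards.
protect_phrases <- function(x, phrases) {
  for (p in phrases) {
    # "king kong" -> "king_kong". Plain substring match, so a phrase that also
    # occurs at the end of a longer word would be joined as well.
    x <- gsub(p, gsub(" ", "_", p, fixed = TRUE), x, fixed = TRUE)
  }
  x
}

restore_phrases <- function(x, phrases) {
  for (p in phrases) {
    # "king_kong" -> "king kong"
    x <- gsub(gsub(" ", "_", p, fixed = TRUE), p, x, fixed = TRUE)
  }
  x
}

# Simplified stand-in for tm::removeWords: delete whole-word matches only.
drop_words <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

phrases <- c("king kong")
x <- "this is a pattern with king and king kong"

x <- protect_phrases(x, phrases)   # join the phrase with an underscore
x <- drop_words(x, "king")         # the lone "king" goes, "king_kong" stays
x <- gsub("\\s+", " ", x)          # collapse the gap left by the removed word
x <- restore_phrases(x, phrases)   # put the space back
print(x)
```

In a tm pipeline, the same idea could be applied by wrapping the protect and restore steps in `content_transformer()` and passing them to `tm_map()` before and after the `removeWords` step.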
If you are open to using another package, this works:
> text <- c("king kong lives", "foot", "island", "skull", "This is a pattern with king and king kong")
> corpus::term_matrix(text, drop = "king", combine = "king kong", transpose = TRUE)
11 x 5 sparse Matrix of class "dgCMatrix"

a         . . . . 1
and       . . . . 1
foot      . 1 . . .
is        . . . . 1
island    . . 1 . .
king kong 1 . . . 1
lives     1 . . . .
pattern   . . . . 1
skull     . . . 1 .
this      . . . . 1
with      . . . . 1
The combine argument tells corpus to treat "king kong" as a single token, so dropping "king" leaves the combined term intact.