Removing Stop Phrases from a DocumentTermMatrix
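Below, I do some basic topic modeling on the "crude" data. I know I can remove stop words with tm_map, but I don't know how to do that after the bigram tokenization has happened.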
library(topicmodels)
library(tm)
library(RWeka)
library(ggplot2)
library(dplyr)
library(tidytext)
data("crude")
words <- tm_map(crude, content_transformer(tolower))
words <- tm_map(words, removePunctuation)
words <- tm_map(words, stripWhitespace)
# Tokenize into unigrams and bigrams (min = 1, max = 2)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
dtm <- DocumentTermMatrix(words, control = list(tokenize = BigramTokenizer))
ui <- unique(dtm$i)
dtm <- dtm[ui, ] # remove "empty" tweets
lda <- LDA(dtm, k = 2,control = list(seed = 7272))
topics <- tidy(lda, matrix = "beta")
## Graphs: top 10 terms per topic by beta
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
# single words
stopwords1 <- stopwords("english") ## I actually use a custom list: read.csv("stopwords.txt", header = FALSE)
adnlstopwords1 <- c("ny", "new", "york", "yorks", "state", "nyc", "nys")
# two-word phrases: every pairing of the single stop words
stopwords2 <- levels(interaction(stopwords1, stopwords1, sep = " "))
adnlstopwords2 <- c(stopwords2, c("new york", "york state", "in ny", "in new",
                                  "new yorks"))
# combine the vectors (stopwords1, not the stopwords() function); unique() drops repeats
stopwords <- unique(c(stopwords1, adnlstopwords1, stopwords2, adnlstopwords2))
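My question is how to remove these bigrams from the dtm without using tm_map, or what a possible workaround might be. Note that the "new york" based bigrams may not appear in the crude data, but they matter for my other data.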
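One possible workaround that needs no extra package is to drop the unwanted columns from the matrix by term name. The sketch below is a minimal illustration, not a tested solution for the full data: the three toy documents, the names `tok`, `dtm_small`, `stop_phrases` and `dtm_clean` are made up for the example, and the tokenizer uses `NLP::ngrams` in place of the RWeka tokenizer so the snippet runs without Java.

```r
library(tm)

# Toy documents (hypothetical), standing in for the real corpus
docs <- c("oil prices rose in new york",
          "opec cut oil output",
          "new york traders sold oil")
corp <- VCorpus(VectorSource(docs))

# Unigram + bigram tokenizer built on NLP::ngrams (swapped in for RWeka here)
tok <- function(x) {
  w <- NLP::words(x)
  bigrams <- unlist(lapply(NLP::ngrams(w, 2L), paste, collapse = " "),
                    use.names = FALSE)
  c(w, bigrams)
}

# wordLengths = c(1, Inf) keeps short terms such as "in"
dtm_small <- DocumentTermMatrix(
  corp, control = list(tokenize = tok, wordLengths = c(1, Inf))
)

# Stop words and stop phrases to drop (hypothetical list)
stop_phrases <- c("in", "new", "york", "new york", "in new")

# Keep only the columns whose term is not a stop word/phrase
keep <- !(Terms(dtm_small) %in% stop_phrases)
dtm_clean <- dtm_small[, keep]

# Drop documents that ended up with no terms at all (LDA rejects empty rows)
dtm_clean <- dtm_clean[slam::row_sums(dtm_clean) > 0, ]
```

Note that this only removes exact matches: a bigram such as "york traders" stays unless it is also listed, so the stop-phrase list has to contain every term that should go.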
I found this solution in the "gofastr" package for R:
library(gofastr)
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)
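However, I still saw stop phrases in the result. After reading the documentation, it turns out remove_stopwords assumes it is given a sorted list; you can prepare your stop words/phrases with the prep_stopwords() function from the same package.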
stopwords<-prep_stopwords(stopwords)
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)
To do this and stem as well, we can do the stemming in the tm_map part of the code and then remove the stop words as follows:
stopwords<-prep_stopwords(stemDocument(stopwords))
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)
This works because it stems the stop words, which then match the already-stemmed terms in the dtm.
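To make the matching step concrete, here is a small sketch of stemming multi-word stop phrases word by word so they line up with stemmed terms. It calls `SnowballC::wordStem` directly rather than `stemDocument`; the helper `stem_phrase` and the example vectors are hypothetical names introduced only for illustration.

```r
library(SnowballC)

# Stem each word of each phrase separately, then glue the phrase back together
stem_phrase <- function(phrases) {
  vapply(
    strsplit(phrases, " ", fixed = TRUE),
    function(w) paste(wordStem(w, language = "english"), collapse = " "),
    character(1)
  )
}

# Unstemmed stop phrases (hypothetical list)
stop_phrases <- c("new yorks", "york state", "nys")
stemmed_stops <- unique(stem_phrase(stop_phrases))

# Terms as they might appear in a dtm built from a stemmed corpus (hypothetical)
terms <- c("new york", "oil price", "york state", "opec")

# Only terms that do not match a stemmed stop phrase are kept
kept <- terms[!(terms %in% stemmed_stops)]
```

Without the stemming step, "new yorks" would never match the stemmed term "new york", which is why the stop list has to go through the same stemmer as the corpus.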