After removing stopwords, my output is not saved when I further clean up my tweets in R
I am doing sentiment analysis and I have two documents in my directory. Corpus 1 contains the positive tweets and the other contains the negative tweets, but in the comparison wordcloud I still see words that are stopwords. That means it is not removing stopwords("english").

I created custom stopwords, but that output was not retained either. I then searched for and found a stopwords.txt file (a collection of stopwords), which I downloaded from GitHub and used to remove stopwords. To do this I had to read the file into a table (data frame) and then convert it to a vector, and I combined it with the stopwords from the tm library.

The output was as expected, but when I then tried to remove punctuation and inspected the corpus, the output only reflected removePunctuation and did not retain the stopword-removal output. Then I tried removeNumbers and inspected the corpus, but it did not retain the stopword-removal output, only the removePunctuation output.

So what is the problem here? What am I missing?
[Here is the code]
[1][This is the output after removing stopwords from the tweets using R]
[2][This is the output after applying further cleaning such as removePunctuation, removeNumbers, stripWhitespace, and stemDocument, but it does not retain the stopword-removal output]
[3]
[1]: https://i.stack.imgur.com/RMbvD.png
[2]: https://i.stack.imgur.com/18H3P.png
[3]: https://i.stack.imgur.com/SxaJE.png
Here is the code I used. I placed the two text files in the directory and converted them into a corpus.
library(tm)
tweets_corpus <- Corpus(DirSource(directory = "D:/New-RStudio-Project/tweets"))
summary(tweets_corpus)
##cleaning the tweets_corpus ##
clean_tweets_corpus <- tm_map(tweets_corpus, tolower)
##removing stopwords##
clean_tweets_corpus <- tm_map(tweets_corpus, removeWords,
stopwords("english"))
inspect(clean_tweets_corpus)
##having stopwords.txt (collection of stopwords) to remove the stopwords##
stop = read.table("stopwords.txt", header = TRUE)
class(stop)
stop
stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)
class(stop_vec)
stop_vec
clean_tweets_corpus <- tm_map(tweets_corpus, removeWords,
c(stopwords("english"), stop_vec))
inspect(clean_tweets_corpus)
## remove to have single characters ##
remove_multiplechar <- function(x) gsub("\\b[A-z]\\b", " ", x)
clean_tweets_corpus<-tm_map(tweets_corpus,
content_transformer(remove_multiplechar))
inspect(clean_tweets_corpus)
clean_tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
clean_tweets_corpus <- tm_map(tweets_corpus,removeNumbers)
clean_tweets_corpus <- tm_map(tweets_corpus, stripWhitespace)
clean_tweets_corpus <- tm_map(tweets_corpus, stemDocument)
inspect(clean_tweets_corpus)
str(clean_tweets_corpus)
Here is the corrected code, replacing "tweets_corpus" with "clean_tweets_corpus" in all calls to tm_map except the first. Every one of your tm_map calls took the original, uncleaned tweets_corpus as input, so each step overwrote clean_tweets_corpus with the result of a single transformation and discarded all the previous ones; each step has to be applied to the already-cleaned corpus instead:
library(tm)
tweets_corpus <- Corpus(DirSource(directory = "D:/New-RStudio-Project/tweets"))
summary(tweets_corpus)
##cleaning the tweets_corpus ##
# wrap tolower in content_transformer so the result remains a valid corpus
clean_tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
##removing stopwords##
##having stopwords.txt (collection of stopwords) to remove the stopwords##
stop = read.table("stopwords.txt", header = TRUE)
stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removeWords,
c(stopwords("english"), stop_vec))
## remove single-character words ##
# note the escaped \\b word boundaries and the [A-Za-z] range ([A-z] also matches characters such as _ and ^)
remove_multiplechar <- function(x) gsub("\\b[A-Za-z]\\b", " ", x)
clean_tweets_corpus <- tm_map(clean_tweets_corpus,
                              content_transformer(remove_multiplechar))
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removePunctuation)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removeNumbers)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, stripWhitespace)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, stemDocument)
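A quick way to confirm the fix worked is to build a term-document matrix from the cleaned corpus and check that no English stopwords survive. A minimal sketch, assuming the corrected pipeline above has already been run:

```r
library(tm)

# Inspect the final corpus: it should reflect ALL transformations at once
inspect(clean_tweets_corpus)

# Any stopwords still present will show up as terms in the matrix
tdm <- TermDocumentMatrix(clean_tweets_corpus)
remaining <- intersect(Terms(tdm), stopwords("english"))
remaining  # should be empty if stopword removal was retained
```

If `remaining` is non-empty, some words slipped through, typically because an earlier transformation in the pipeline was applied to the wrong corpus object.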