Making a wordcloud, but with combined words?
I am trying to make a word cloud of publication keywords. For example:
educational data mining; collaborative learning; computer science, etc.
My current code is as follows:
library(tm)

KeywordsCorpus <- Corpus(VectorSource(subset(Words$Author.Keywords, Words$Year==2012)))
KeywordsCorpus <- tm_map(KeywordsCorpus, removePunctuation)
KeywordsCorpus <- tm_map(KeywordsCorpus, removeNumbers)
# added tolower
KeywordsCorpus <- tm_map(KeywordsCorpus, tolower)
KeywordsCorpus <- tm_map(KeywordsCorpus, removeWords, stopwords("english"))
# moved stripWhitespace
KeywordsCorpus <- tm_map(KeywordsCorpus, stripWhitespace)
KeywordsCorpus <- tm_map(KeywordsCorpus, PlainTextDocument)
dtm4 <- TermDocumentMatrix(KeywordsCorpus)
m4 <- as.matrix(dtm4)
v4 <- sort(rowSums(m4),decreasing=TRUE)
d4 <- data.frame(word = names(v4),freq=v4)
However, with this code it splits up every individual word, whereas what I need are the combined words/phrases. For example, "Educational Data Mining" is one phrase that I need to show, rather than what currently happens: "Educational", "Data", "Mining". Is there a way to keep the words of each phrase together? The semicolon might help as a delimiter.
Thanks.
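As a minimal sketch of the semicolon idea (the sample strings below are made up; only the column Words$Author.Keywords comes from the code above), each record could be split on ";" and the spaces inside each phrase replaced with "_", so that every phrase counts as a single token:

# hypothetical keyword strings, one per publication
kw <- c("educational data mining; collaborative learning",
        "computer science; educational data mining")
# split on ";", trim, and glue the words of each phrase with "_"
phrases <- trimws(unlist(strsplit(kw, ";")))
tokens  <- gsub(" ", "_", phrases)
freqs   <- sort(table(tokens), decreasing = TRUE)
# library(wordcloud)
# wordcloud(names(freqs), as.numeric(freqs), min.freq = 1)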
OK, after a lot of research I found the perfect answer.
First of all, if you want a wordcloud of multiple words together, these are called bigrams. There are R packages available for this, such as "tau" and "RWeka".
This link will help you: This
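As a rough illustration of what these two packages do (the example sentence is made up, not from the original answer), both can count or produce word bigrams from a plain character vector:

library(tau)
# method = "string" counts n-grams of words rather than characters
textcnt("educational data mining improves collaborative learning",
        n = 2L, method = "string")

library(RWeka)
# the same word pairs via Weka's n-gram tokenizer
NGramTokenizer("educational data mining improves collaborative learning",
               Weka_control(min = 2, max = 2))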
Here's a solution using a different text package, one that allows you to form multi-word expressions either from statistically detected collocations, or simply by forming all bigrams. The package is called quanteda.
library(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.14’
First, a method to detect the top 1,500 bigram collocations and replace those collocations in the text with single-token versions (joined by the "_" character). Here I use the package's built-in corpus of US presidential inaugural addresses.
### for just the top 1500 collocations
# detect the collocations
colls <- collocations(inaugCorpus, n = 1500, size = 2)
# remove collocations containing stopwords
colls <- removeFeatures(colls, stopwords("SMART"))
## Removed 1,224 (81.6%) of 1,500 collocations containing one of 570 stopwords.
# replace the phrases with single-token versions
inaugCorpusColl2 <- phrasetotoken(inaugCorpus, colls)
# create the document-feature matrix
inaugColl2dfm <- dfm(inaugCorpusColl2, ignoredFeatures = stopwords("SMART"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... indexing features: 9,741 feature types
## ... removed 430 features, from 570 supplied (glob) feature types
## ... complete.
## ... created a 57 x 9311 sparse dfm
## Elapsed time: 0.163 seconds.
# plot the wordcloud
set.seed(1000)
png("~/Desktop/wcloud1.png", width = 800, height = 800)
plot(inaugColl2dfm["2013-Obama", ], min.freq = 2, random.order = FALSE,
colors = sample(colors()[2:128]))
dev.off()
This results in the following plot. Note the collocations, e.g. "generation's_task" and "fellow_americans".
The version formed from all bigrams is easier, but results in a huge number of low-frequency bigram features. For the word cloud, I selected a larger set of texts, not just the 2013 Obama address.
### version with all bi-grams
inaugbigramsDfm <- dfm(inaugCorpusColl2, ngrams = 2, ignoredFeatures = stopwords("SMART"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... removed 54,200 features, from 570 supplied (glob) feature types
## ... indexing features: 64,108 feature types
## ... created a 57 x 9908 sparse dfm
## ... complete.
## Elapsed time: 3.254 seconds.
# plot the bigram wordcloud - more texts because for a single speech,
# almost none occur more than once
png("~/Desktop/wcloud2.png", width = 800, height = 800)
plot(inaugbigramsDfm[40:57, ], min.freq = 2, random.order = FALSE,
colors = sample(colors()[2:128]))
dev.off()
This produces:
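Note that the code above targets quanteda 0.9.x; in current quanteda releases the same idea is expressed with textstat_collocations() and tokens_compound(). A rough sketch of that newer workflow (untested here, so treat the exact arguments as assumptions):

library(quanteda)
library(quanteda.textstats)   # textstat_collocations()
library(quanteda.textplots)   # textplot_wordcloud()
toks  <- tokens(data_corpus_inaugural, remove_punct = TRUE)
colls <- textstat_collocations(toks, size = 2, min_count = 5)
toks2 <- tokens_compound(toks, pattern = colls, concatenator = "_")
dfmat <- dfm_remove(dfm(toks2), stopwords("en"))
textplot_wordcloud(dfmat["2013-Obama", ], min_count = 2)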
The best advice is to watch the short five-minute video (link below):
If you want the R code directly, here it is:
library(tm)
library(RWeka)         # NGramTokenizer()
library(wordcloud)
library(RColorBrewer)

mycorpus <- Corpus(VectorSource(subset(Words$Author.Keywords, Words$Year==2012)))

# Text cleaning
# Convert the text to lower case
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
# Remove numbers
mycorpus <- tm_map(mycorpus, removeNumbers)
# Remove common English stopwords
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))
# Remove punctuation
mycorpus <- tm_map(mycorpus, removePunctuation)
# Eliminate extra white spaces
mycorpus <- tm_map(mycorpus, stripWhitespace)
as.character(mycorpus[[1]])

# Bigrams
minfreq_bigram <- 2
token_delim <- " \t\r\n.!?,;\"()"
# build all word bigrams (if this errors on a corpus object, pass the plain
# text instead, e.g. sapply(mycorpus, as.character))
bitoken <- NGramTokenizer(mycorpus, Weka_control(min = 2, max = 2, delimiters = token_delim))
two_word <- data.frame(table(bitoken))
sort_two <- two_word[order(two_word$Freq, decreasing = TRUE), ]
wordcloud(sort_two$bitoken, sort_two$Freq, random.order = FALSE,
          scale = c(2, 0.35), min.freq = minfreq_bigram,
          colors = brewer.pal(8, "Dark2"), max.words = 150)