Create wordcloud from a data frame in R

Question

I made a sample data frame. I am trying to make a word cloud from the Project column.

Hours<-c(2,3,4,2,1,1,3)
Project<-c("a","b","b","a","c","c","c")
Period<-c("2014-11-22","2014-11-23","2014-11-24","2014-11-22", "2014-11-23", "2014-11-23", "2014-11-24")
cd=data.frame(Project,Hours,Period)

Here is my code:

cd$Project<-as.character(cd$Project)
wordcloud(cd$Project,min.freq=1)

But I get the following error:

Error in strwidth(words[i], cex = size[i], ...) : invalid 'cex' value
In addition: Warning messages:
1: In max(freq) : no non-missing arguments to max; returning -Inf
2: In max(freq) : no non-missing arguments to max; returning -Inf

What am I doing wrong?

Answer 1

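I think you are missing the freq argument. You want a column that indicates how often each project occurs, so I transformed your data using count from the dplyr package.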

library(dplyr)
library(wordcloud)

cd <- data.frame(Hours = c(2,3,4,2,1,1,3),
                 Project = c("a","b","b","a","c","c","c"),             
                 Period = c("2014-11-22","2014-11-23","2014-11-24",
                            "2014-11-22", "2014-11-23", "2014-11-23",
                            "2014-11-24"),
                 stringsAsFactors = FALSE)

cd2 <- count(cd, Project)

#  Project n
#1       a 2
#2       b 2
#3       c 3

wordcloud(words = cd2$Project, freq = cd2$n, min.freq = 1)

Answer 2

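If you pass just a character column, the function creates a corpus and a term-document matrix for you behind the scenes. The problem is that the default behavior of the TermDocumentMatrix function in the tm package is to only track words that are at least three characters long (stop words are also removed, so values like "a" would be dropped). So if you change your example to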

Project<-c("aaa","bbb","bbb","aaa","ccc","ccc","ccc")

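it will work just fine. There does not seem to be a way to change the control options sent to TermDocumentMatrix. If you want to calculate the frequencies yourself in the same way the default wordcloud function does, you can do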

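library(tm)
library(wordcloud)

# Build the corpus by hand so the control options can be set explicitly
corpus <- Corpus(VectorSource(cd$Project))
corpus <- tm_map(corpus, removePunctuation)
# corpus <- tm_map(corpus, function(x) removeWords(x, stopwords()))
# wordLengths = c(1, Inf) keeps one-letter terms such as "a"
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
freq <- slam::row_sums(tdm)
words <- names(freq)

wordcloud(words, freq, min.freq = 1)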
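To see the length filter in isolation, here is a minimal sketch that builds the term-document matrix twice from the question's labels, once with the default control and once with wordLengths = c(1, Inf). The variable names (projects, tdm_default, tdm_all, freq_all) are only illustrative, and it assumes the tm and slam packages are installed; behavior may differ slightly between tm versions.

```r
library(tm)

# Same single-letter project labels as in the question
projects <- c("a", "b", "b", "a", "c", "c", "c")
corpus <- Corpus(VectorSource(projects))

# Default control: terms shorter than three characters should be filtered out
tdm_default <- TermDocumentMatrix(corpus)

# Explicit control: keep terms of any length
tdm_all <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Compare which terms survive in each matrix
Terms(tdm_default)
Terms(tdm_all)

# Per-term totals, usable as the freq argument of wordcloud()
freq_all <- slam::row_sums(tdm_all)
```

If the first matrix comes back with no terms, that matches the empty freq vector behind the max(freq) warnings in the question.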

However, for simple cases, you can use table() to calculate the frequencies:

tbl <- table(cd$Project)
wordcloud(names(tbl), tbl, min.freq=1)
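If you prefer to hand wordcloud plain vectors rather than a table object, the same idea can be wrapped in a small helper. This is only a sketch: word_freq is a hypothetical name, not part of any package, and the requireNamespace guard is there so the counting part runs even when wordcloud is not installed.

```r
# Hypothetical helper: tabulate a character vector into words and counts
word_freq <- function(x) {
  tbl <- table(x)
  data.frame(word = names(tbl),
             freq = as.integer(tbl),
             stringsAsFactors = FALSE)
}

# Same project labels as in the question
projects <- c("a", "b", "b", "a", "c", "c", "c")
wf <- word_freq(projects)

# Draw the cloud only when the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = wf$word, freq = wf$freq, min.freq = 1)
}
```

The data frame has the same shape as the cd2 result from count() in the first answer, so either one can feed the words and freq arguments.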
