Wordcloud from Data Table in R
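I have a data table consisting of positive and negative word associations. I want to create two word clouds, one for the positive words and one for the negative words.
Example of the sentiment_words table: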
      element_id sentence_id negative               positive
1115:          1        1115   limits        agree,available
1116:          1        1116     slow         strongly,agree
1117:          1        1117                      management
1118:          1        1118
1119:          1        1119 concerns strongly,agree,better,
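I am using library(wordcloud) and library(sentimentr).
For example, how can I extract only the words from the "positive" column to create a word cloud? I'm not sure how to handle the fact that each row can have several words associated with it (e.g., "agree, available" should be treated as two entries).
I have tried different things with the wordcloud() function, such as:
wordcloud(words = sentiment_words$positive, freq = 3, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
But this only returns a cloud built from the first entry.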
Edit: I tried the tidyverse answer below and got this result:
words n
<chr> <int>
1 " \"ability\"" 3
2 " \"ability\")" 1
3 " \"acceptable\")" 1
4 " \"accomplish\"" 1
5 " \"accomplished\")" 1
6 " \"accountability\"" 1
7 " \"accountability\")" 1
8 " \"accountable\"" 2
9 " \"accountable\")" 1
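I have tried multiple variations of gsub() and apply to remove the extra ) and c( but haven't found anything that works. As a result, words that should be counted together are counted separately (e.g., "acceptable" and "acceptable)" show up as two different words in the word cloud).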
Edit: To get this working, I first had to clean my sentiment_words, as suggested below.
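for (j in seq_along(sentiment_words)) {
  # Note: without fixed = TRUE the parentheses form a regex group, so this
  # pattern does not match the literal text "character(0)"; the leftover
  # "character(0" is filtered out later
  sentiment_words[[j]] <- gsub("character(0)", "", sentiment_words[[j]])
  sentiment_words[[j]] <- gsub('"', "", sentiment_words[[j]])     # drop quotes
  sentiment_words[[j]] <- gsub("c\\(", "", sentiment_words[[j]])  # drop the leading c(
  sentiment_words[[j]] <- gsub(" ", "", sentiment_words[[j]])     # drop spaces
  sentiment_words[[j]] <- gsub("\\)", "", sentiment_words[[j]])   # drop closing parentheses
}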
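I also had to filter out the remaining "character(0" strings inside the count_words function. Note that it filters "character(0" rather than "character(0)" because the closing parentheses were removed above: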
filter(!!var != "character(0") %>%
Doing the above gave the cleanest word clouds based on text polarity.
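For context, the stray c( and ) fragments are what you would expect if the positive and negative columns are list columns (one character vector per row) that get coerced to text. The sketch below uses a small made-up list, not real sentimentr output, to show the coercion and how unlist() avoids it without any gsub() clean-up.

```r
# Hypothetical stand-in for a list column such as sentiment_words$positive
positive <- list(c("agree", "available"), character(0), c("strongly", "agree"))

# Coercing a list to character deparses each element, which is where the
# c("...") wrappers and the "character(0)" strings come from
as_text <- as.character(positive)
print(as_text)

# Flattening the list keeps the individual words intact; empty rows vanish
words <- unlist(positive)
counts <- table(words)
print(counts)
```

The names and values of counts can then be passed to wordcloud() as words and freq.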
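Here is a tidyverse-based approach that should help you get started. I agree with Mr_Z in that it's not entirely clear to me where the problem is.
Let's define a function that generates a data.frame of word counts from the comma-separated words in a specific column var of the source data df.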
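library(tidyverse)
count_words <- function(df, var) {
  var <- enquo(var)                      # capture the unquoted column name
  df %>%
    separate_rows(!!var, sep = ",") %>%  # one row per comma-separated word
    filter(!!var != "") %>%              # drop empty entries
    group_by(!!var) %>%
    summarise(n = n()) %>%               # count occurrences of each word
    rename(words = !!var)
}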
We can then generate word counts for the positive and negative columns:
df.pos <- count_words(df, positive)
df.neg <- count_words(df, negative)
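If you would rather not depend on the tidyverse, roughly the same counts can be produced in base R. This is only a sketch: the toy df and the helper name count_words_base are made up for illustration, and it assumes the columns are plain comma-separated character vectors as in the printed example from the question.

```r
# Toy data mirroring the example table in the question (assumption:
# plain character columns, not list columns)
df <- data.frame(
  negative = c("limits", "slow", "", "", "concerns"),
  positive = c("agree,available", "strongly,agree", "management", "",
               "strongly,agree,better,"),
  stringsAsFactors = FALSE
)

# count_words_base is a hypothetical helper, not part of any package
count_words_base <- function(x) {
  # split on commas, trim whitespace, and drop empty strings
  words <- trimws(unlist(strsplit(x, ",", fixed = TRUE)))
  words <- words[nzchar(words)]
  counts <- table(words)
  data.frame(words = as.character(names(counts)),
             n = as.integer(counts),
             stringsAsFactors = FALSE)
}

pos_counts <- count_words_base(df$positive)
neg_counts <- count_words_base(df$negative)
print(pos_counts)
print(neg_counts)
```

The resulting words and n columns can be used with wordcloud() in the same way as df.pos and df.neg.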
Let's inspect the data.frames:
df.pos
# A tibble: 5 x 2
words n
<chr> <int>
1 agree 3
2 available 1
3 better 1
4 management 1
5 strongly 2
df.neg
# A tibble: 3 x 2
words n
<chr> <int>
1 concerns 1
2 limits 1
3 slow 1
Let's plot the word clouds:
library(wordcloud)
wordcloud(words = df.pos$words, freq = df.pos$n, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
wordcloud(words = df.neg$words, freq = df.neg$n, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
I strongly advise against using the accepted answer here, because it ignores the counts that sentimentr already computes and returns for you (via attributes(sentiment_words)$counts). The documentation for extract_sentiment_terms shows examples that make this clearer (there was room for improving the documentation about what is returned, and that has been added in the dev version: https://github.com/trinker/sentimentr/blob/master/R/extract_sentiment_terms.R). Below I show how to extract the counts for a word cloud, along with some potential layouts:
library(sentimentr)
library(wordcloud)
library(data.table)
set.seed(10)
x <- get_sentences(sample(hu_liu_cannon_reviews[[2]], 1000, TRUE))
sentiment_words <- extract_sentiment_terms(x)
sentiment_counts <- attributes(sentiment_words)$counts
sentiment_counts[polarity > 0,]
par(mfrow = c(1, 3), mar = c(0, 0, 0, 0))
## Positive Words
with(
  sentiment_counts[polarity > 0,],
  wordcloud(words = words, freq = n, min.freq = 1,
    max.words = 200, random.order = FALSE, rot.per = 0.35,
    colors = brewer.pal(8, "Dark2"), scale = c(4.5, .75)
  )
)
mtext("Positive Words", side = 3, padj = 5)
## Negative Words
with(
  sentiment_counts[polarity < 0,],
  wordcloud(words = words, freq = n, min.freq = 1,
    max.words = 200, random.order = FALSE, rot.per = 0.35,
    colors = brewer.pal(8, "Dark2"), scale = c(4.5, 1)
  )
)
mtext("Negative Words", side = 3, padj = 5)
sentiment_counts[,
  color := ifelse(polarity > 0, 'red',
    ifelse(polarity < 0, 'blue', 'gray70')
  )]
## Together
with(
  sentiment_counts[polarity != 0,],
  wordcloud(words = words, freq = n, min.freq = 1,
    max.words = 200, random.order = FALSE, rot.per = 0.35,
    colors = color, ordered.colors = TRUE, scale = c(5, .75)
  )
)
mtext("Positive (red) & Negative (blue) Words", side = 3, padj = 5)