Errors in counting + combining bing sentiment score variables in Tidytext?
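I'm running sentiment analysis on a large body of text. I'm using the bing lexicon in tidytext to get a simple binary pos/neg classification, but I want to compute, for each document, the ratio of positive words to total scored words (positive plus negative). I'm rusty with dplyr workflows, but I want to count the words coded as "positive" and divide that by the total number of words classified with a sentiment.
I've tried the approach below, using sample code and stand-in data: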
library(tidyverse)
library(tidytext)

# Creating a fake tidytext corpus
df_tidytext <- data.frame(
  doc_id = c("Iraq_Report_2001", "Iraq_Report_2002"),
  text = c("xxxx", "xxxx") # Placeholder for text
)

# Creating a fake set of scored words with bing sentiments
# for each doc in corpus
df_sentiment_bing <- data.frame(
  doc_id = c((rep("Iraq_Report_2001", each = 3)),
             rep("Iraq_Report_2002", each = 3)),
  word = c("improve", "democratic", "violence",
           "sectarian", "conflict", "insurgency"),
  bing_sentiment = c("positive", "positive", "negative",
                     "negative", "negative", "negative") # Stand-ins for sentiment classification
)

# Summarizing count of positive and negative words
# (number of positive words out of total scored words in each doc)
df_sentiment_scored <- df_tidytext %>%
  left_join(df_sentiment_bing) %>%
  group_by(doc_id) %>%
  count(bing_sentiment) %>%
  pivot_wider(names_from = bing_sentiment, values_from = n) %>%
  summarise(bing_score = count(positive)/(count(negative) + count(positive)))
But I get the following error:
"Error: Problem with `summarise()` input `bing_score`.
x no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
ℹ Input `bing_score` is `count(positive)/(count(negative) + count(positive))`.
ℹ The error occurred in group 1: doc_id = "Iraq_Report_2001".
I'd appreciate any insight into what I'm doing wrong in the summarising step of this workflow.
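I don't see the point of calling count() there: after pivot_wider(), the positive and negative columns are already numeric counts. That is also why you get the error, since count() expects a data frame, not an integer vector.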
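To make the cause concrete, here is a minimal sketch, assuming a hypothetical one-row table `counts` shaped like the pivot_wider() output for a single document. The names `counts`, `rows_per_value`, `err`, and `ratio` are illustrative only and do not appear in the question.

```r
library(dplyr)

# Hypothetical one-row table shaped like the pivot_wider() output for one document
counts <- data.frame(positive = 2L, negative = 1L)

# count() works on a data frame: here it tallies rows per value of `positive`
rows_per_value <- count(counts, positive)

# Called on a bare integer vector, count() has no method to dispatch to;
# tryCatch() captures the error message instead of stopping the script
err <- tryCatch(
  count(counts$positive),
  error = function(e) conditionMessage(e)
)

# The columns already hold counts, so ordinary arithmetic gives the ratio
ratio <- counts$positive / (counts$positive + counts$negative)
```

Inside summarise(), `positive` and `negative` are exactly this kind of bare vector, which is why the call fails there.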
One possible solution:
# Summarizing count of positive and negative words
# (number of positive words out of total scored words in each doc)
df_tidytext %>%
  left_join(df_sentiment_bing) %>%
  group_by(doc_id) %>%
  dplyr::count(bing_sentiment) %>%
  pivot_wider(names_from = bing_sentiment, values_from = n) %>%
  replace(is.na(.), 0) %>%
  summarise(bing_score = sum(positive)/(sum(negative) + sum(positive)))
The result you should get:
Joining, by = "doc_id"
# A tibble: 2 × 2
  doc_id           bing_score
  <fct>                 <dbl>
1 Iraq_Report_2001      0.667
2 Iraq_Report_2002      0
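As a possible alternative, the same ratio can be sketched without count() or pivot_wider() at all. This is not the method from the answer above, just a shorter variant under one assumption: every row of df_sentiment_bing is a scored word labelled either "positive" or "negative". The name `scores` is made up for illustration.

```r
library(dplyr)

# Same stand-in scored words as in the question
df_sentiment_bing <- data.frame(
  doc_id = rep(c("Iraq_Report_2001", "Iraq_Report_2002"), each = 3),
  word = c("improve", "democratic", "violence",
           "sectarian", "conflict", "insurgency"),
  bing_sentiment = c("positive", "positive", "negative",
                     "negative", "negative", "negative")
)

# The mean of a logical vector is the share of TRUE values, so this is
# positive / (positive + negative) within each document
scores <- df_sentiment_bing %>%
  group_by(doc_id) %>%
  summarise(bing_score = mean(bing_sentiment == "positive"))
```

With this variant, a document that has no scored words simply does not appear in `scores`, so how to represent such documents is a separate decision.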