With text analysis, inner_join removes more than a thousand words in R

Question

I am analyzing a column of words in the most_used_words data frame. It contains 2,180 words.

most_used_words

        word times_used
       <chr>      <int>
 1    people         70
 2      news         69
 3      fake         68
 4   country         54
 5     media         44
 6       u.s         42
 7  election         40
 8      jobs         37
 9       bad         36
10 democrats         35
# ... with 2,170 more rows

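When I inner_join it with the AFINN lexicon, only 364 of the 2,180 words are scored. Is this because the words in the AFINN lexicon don't appear in my data frame? I'm worried that, if so, this could introduce bias into my analysis. Should I use a different lexicon? Is something else going on?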

library(tidytext)
library(tidyverse)

afinn <- get_sentiments("afinn")

most_used_words %>%
  inner_join(afinn)

      word times_used score
     <chr>      <int> <int>
 1    fake         68    -3
 2     bad         36    -3
 3     win         24     4
 4 failing         21    -2
 5    hard         20    -1
 6  united         19     1
 7 illegal         17    -3
 8    cuts         15    -1
 9   badly         13    -3
10 strange         13    -1
# ... with 354 more rows

Answer 1

"Is this because the words in the AFINN lexicon don't appear in my data frame?"

Yes.

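An inner join returns only the rows (words) that have a match in both data frames. You can certainly try a different lexicon, but that probably won't help with the nouns. Nouns identify people, animals, places, things, or ideas. In your example above, "u.s", "people", "country", "news", and "democrats" are all nouns that do not exist in AFINN. None of them carries any sentiment without context. Welcome to the world of text analysis.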
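To see exactly which words the inner join drops, an anti_join against the same lexicon can help. The following is a minimal sketch, not the asker's actual data: it uses a few rows from the output above and a three-word toy_lexicon (a made-up stand-in for the full AFINN table, so the example runs without downloading anything).

```r
library(dplyr)
library(tibble)

# Toy stand-ins (assumptions): a handful of rows from the question's output
# and a three-word lexicon in place of the full AFINN table.
most_used_words <- tibble(
  word       = c("people", "news", "fake", "country", "bad", "win"),
  times_used = c(70L, 69L, 68L, 54L, 36L, 24L)
)
toy_lexicon <- tibble(
  word  = c("fake", "bad", "win"),
  score = c(-3L, -3L, 4L)
)

# inner_join keeps only the words present in both tables
scored <- inner_join(most_used_words, toy_lexicon, by = "word")

# anti_join keeps the words that inner_join dropped
unscored <- anti_join(most_used_words, toy_lexicon, by = "word")

# Share of all word occurrences (not distinct words) that received a score
coverage <- sum(scored$times_used) / sum(most_used_words$times_used)

unscored
coverage
```

Inspecting unscored shows whether the dropped words are mostly neutral nouns, and coverage gives a rough sense of how much of the text the lexicon actually reaches.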

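However, based on the output your analysis shows, I think you can conclude that the sentiment of your word column is overwhelmingly "negative". The word "fake" occurs almost twice as often as the next most-used scored word, "bad".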
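One way to put a number on that impression is to weight each score by how often the word is used. The sketch below uses only the ten scored rows printed in the question; the remaining 354 rows are not shown, so the result describes this excerpt and not the full data set. The names top_scored and totals are illustrative.

```r
library(dplyr)
library(tibble)

# The ten scored rows shown in the question's output
top_scored <- tribble(
  ~word,     ~times_used, ~score,
  "fake",    68, -3,
  "bad",     36, -3,
  "win",     24,  4,
  "failing", 21, -2,
  "hard",    20, -1,
  "united",  19,  1,
  "illegal", 17, -3,
  "cuts",    15, -1,
  "badly",   13, -3,
  "strange", 13, -1
)

# Frequency-weighted sentiment: each score counts once per occurrence
totals <- top_scored %>%
  summarise(
    weighted_sum  = sum(times_used * score),        # -377 for these ten rows
    weighted_mean = weighted_sum / sum(times_used)  # average score per occurrence
  )

totals
```

For these ten rows the weighted sum is negative, which is consistent with the reading above; running the same summary on the full joined table would be needed before drawing a firm conclusion.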

If you have complete sentences, you can use the sentimentr R package to take context into account. Take a look:

install.packages("sentimentr")
library(sentimentr)
?sentiment
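As a rough illustration of what sentence-level scoring looks like, here is a minimal sketch. The two sentences are invented for the example and are not taken from the asker's data.

```r
library(sentimentr)

# Hypothetical sentences, invented for illustration only
text <- c(
  "The news coverage was not bad at all.",
  "The news coverage was bad."
)

# Split the text into sentences, then score each sentence; sentimentr
# accounts for valence shifters such as negators when scoring
sentences <- get_sentences(text)
scores <- sentiment(sentences)
scores

# One averaged score per input element
by_element <- sentiment_by(sentences)
by_element
```

Unlike a word-by-word join, the negated sentence and the plain one are scored differently here, even though both contain "bad".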

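This will take more work than what you have done here, and it will produce richer results. In the end, though, the conclusions will most likely be the same. Good luck.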

r

text-analysis

lexicon

tidyverse

tidytext