Twitter Data Sentiment Analysis

I'm a beginner, so I apologize if my question is trivial. I'm trying to run a sentiment analysis on some Twitter data I downloaded, but I've run into a problem. I'm trying to follow this example:

It creates a bar chart showing counts of positive/negative sentiment. The code for that example is here:

original_books %>% 
  unnest_tokens(output = word, input = text) %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(book, index, sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n) %>% 
  mutate(sent_score = positive - negative) %>% 
  ggplot() + 
  geom_col(aes(x = index, y = sent_score,
               fill = book),
           show.legend = F) +
  facet_wrap(~book, scales = "free_x")

Here is the code I'm currently using for my own analysis:

#twitter scraping
ref <- search_tweets(
  "#refugee", n = 18000, include_rts = FALSE,lang = "en"
)

 
data(stop_words)


new_stops <- tibble(word = c("https", "t.co", "1", "refugee", "#refugee", "amp", "refugees",
                             "day", "2022", "dont", "0", "2", "@refugees", "4", "2021") ,lexicon = "sabs")
 
full_stop <- stop_words %>% 
  bind_rows(new_stops) #bind_rows adds more rows (way to merge data)

Now I want to make a bar chart similar to the one above, but I get an error because I don't have a column called "index". I tried creating one, but it didn't work. Here is the code I tried:

ref %>% 
  unnest_tokens(word, text, token = "tweets") %>% 
  anti_join(full_stop) %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(word, index, sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n) %>% 
  mutate(sent_score = positive - negative) %>% 
  ggplot() + # plot the overall sentiment (pos - neg) versus index
  geom_col(aes(x = index, y = sent_score), show.legend = F)

Here is a picture of the error:

Any advice is much appreciated! Thanks.

Contents of ref: (screenshots)

In the example, index just refers to a set of lines from the book, taken in order (i.e. 1, 2, 3, ...). It's a way of grouping the text; you can think of it like a page, which is also in numerical order. The text is simply split into groups of some kind so that sentiment can be counted within each group.

Tweets are natural groups of words, and you want to count sentiment within individual tweets, so you don't need to split them any further. In the example, each "page" gets one bar in the plot; here, each tweet gets one bar. You need to assign consecutive numbers to the tweets, since they have no natural order. I did this below with rowid_to_column() and named the new column "tweet". It just contains the row number of the tweet, so once the ref data frame is split into words, each word is still associated with the original tweet that number belongs to.
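As a toy illustration (a hypothetical two-row tibble, not the real ref data), here is how the row number assigned by rowid_to_column() survives tokenization, so every word stays linked to its tweet:

```r
library(tibble)
library(dplyr)
library(tidytext)

# two fake "tweets"
toy <- tibble(text = c("helping people is good", "a sad bad day"))

toy %>%
  rowid_to_column("tweet") %>%  # tweet = 1, 2 (the row number)
  unnest_tokens(word, text)     # one row per word; the tweet number is repeated
```

Each of the eight word rows carries the tweet number of the row it came from, which is what later lets you count sentiment per tweet.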

Note that many tweets don't have enough words with an associated sentiment to be scored, so I re-assigned consecutive numbers to the ones that do have a score; that column is called "index".

I also added the argument values_fill = 0 to the pivot_wider() call, because tweets with only positive (or only negative) sentiment were being dropped: the missing value came through as NA rather than 0.
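A minimal sketch of what values_fill = 0 changes, using made-up counts rather than the real data: a tweet with only positive words has no "negative" row, so without the fill its negative cell becomes NA and positive - negative is NA:

```r
library(tibble)
library(tidyr)
library(dplyr)

counts <- tribble(
  ~tweet, ~sentiment, ~n,
  1,      "positive", 2,
  1,      "negative", 1,
  2,      "positive", 1   # tweet 2 has no negative words
)

# without values_fill: tweet 2's negative is NA, so sent_score is NA
counts %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(sent_score = positive - negative)

# with values_fill = 0: tweet 2 scores 1 - 0 = 1
counts %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sent_score = positive - negative)
```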

There are a couple of places along the way where I simply stop and look at the data; that's very helpful for understanding errors.

library(tidyverse)
library(rtweet)
library(tidytext)

#twitter scraping
ref <- search_tweets(
  "#refugee", n = 18000, include_rts = FALSE,lang = "en"
)

data(stop_words)

new_stops <- tibble(word = c("https", "t.co", "1", "refugee", "#refugee", "amp", "refugees",
                             "day", "2022", "dont", "0", "2", "@refugees", "4", "2021") ,lexicon = "sabs")

full_stop <- stop_words %>% 
  bind_rows(new_stops) #bind_rows adds more rows (way to merge data)

ref_w_sentiments <- ref %>% 
  rowid_to_column("tweet") %>% 
  unnest_tokens(word, text, token = "tweets") %>% 
  anti_join(full_stop) %>% 
  inner_join(get_sentiments("bing")) 

# look at what the data looks like
select(ref_w_sentiments, tweet, word, sentiment)

#> # A tibble: 811 × 3
#>   tweet word      sentiment
#>   <int> <chr>     <chr>    
#> 1     2 helping   positive 
#> 2     3 inspiring positive 
#> 3     4 support   positive

ref_w_scores <- ref_w_sentiments %>% 
  group_by(tweet) %>% 
  count(sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n, values_fill = 0) %>% 
  mutate(sent_score = positive - negative) %>% 
  # not all tweets were scored, so create a new index
  rowid_to_column("index")

# look at the data again
ref_w_scores

#> # A tibble: 418 × 5
#> # Groups:   tweet [418]
#>   index tweet positive negative sent_score
#>   <int> <int>    <int>    <int>      <int>
#> 1     1     2        1        0          1
#> 2     2     3        1        0          1
#> 3     3     4        1        0          1

ggplot(ref_w_scores) + #plot the overall sentiment (pos - neg) versus index, 
  geom_col(aes(x = index, y = sent_score), show.legend = F)