Apply Sentimentr on Dataframe with Multiple Sentences in 1 String Per Row

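I have a dataset and I am trying to get the sentiment by article. I have about 1000 articles. Each article is a single string, and that string contains multiple sentences. Ideally, I would like to add another column that summarizes the sentiment of each article. Is there an efficient way to do this using dplyr?

Below is an example dataset with just 2 articles.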

date<- as.Date(c('2020-06-24', '2020-06-24'))
text <- c('3 more cops recover as PNP COVID-19 infections soar to 519', 'QC suspends processing of PWD IDs after reports of abuse in issuance of cards')
link<- c('https://newsinfo.inquirer.net/1296981/3-more-cops-recover-as-pnps-covid-19-infections-soar-to-519,3,10,4,11,9,8', 'https://newsinfo.inquirer.net/1296974/qc-suspends-processing-of-pwd-ids-after-reports-of-abuse-in-issuance-of-cards')
V4 <-c('MANILA, Philippines — Three more police officers have recovered from the new coronavirus disease, increasing the total number of recoveries in the Philippine National Police to (PNP) 316., This developed as the total number of COVID-19 cases in the PNP rose to 519 with one new infection and nine deaths recorded., In a Facebook post on Wednesday, the PNP also recorded 676 probable and 876 suspects for the disease., PNP chief Gen. Archie Gamboa previously said the force would will intensify its health protocols among its personnel after recording a recent increase in deaths., The latest fatality of the ailment is a police officer in Cebu City, which is under enhanced community quarantine as COVID-19 cases continued to surge there., ATM, \r\n\r\nFor more news about the novel coronavirus click here.\r\nWhat you need to know about Coronavirus.\r\n\r\n\r\n\r\nFor more information on COVID-19, call the DOH Hotline: (02) 86517800 local 1149/1150.\r\n\r\n \r\n \r\n \r\n\r\n  \r\n , The Inquirer Foundation supports our healthcare frontliners and is still accepting cash donations to be deposited at Banco de Oro (BDO) current account #007960018860 or donate through PayMaya using this  link  .',
   'MANILA, Philippines — Quezon City will halt the processing of identification cards to persons with disability for two days starting Thursday, June 25, so it could tweak its guidelines after reports that unqualified persons had issued with the said IDs., In a statement on Wednesday, Quezon City Mayor Joy Belmonte said the suspension would the individual who issued PWD ID cards to six members of a family who were not qualified but who paid P2,000 each to get the IDs., Belmonte said the suspect, who is a local government employee, was already issued with a show-cause order to respond to the allegation., According to city government lawyer Nino Casimir, the suspect could face a grave misconduct case that could result in dismissal., The IDs are issued to only to persons qualified under the Act Expanding the Benefits and Privileges of Persons with Disability (Republic Act No. 10754)., The IDs entitle PWDs to a 20 percent discount and VAT exemption on goods and services., /atm')

df<-data.frame(date, text, link, V4)

head(df)

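So I have been looking at how to do this with the sentimentr package and came up with the code below. However, this only outputs the sentiment of each sentence (I get the sentences with a strsplit on "., "), and I would like to aggregate everything at the article level after applying this strsplit.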

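library(dplyr)      # %>%, group_by(), mutate()
library(tidyr)      # unnest()
library(sentimentr)

# Split each article on "., " and score every sentence separately
full <- df %>%
  group_by(V4) %>%
  mutate(V2 = strsplit(as.character(V4), "[.],")) %>%
  unnest(V2) %>%
  get_sentences() %>%
  sentiment()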

The desired output is simply an extra column in my df data frame containing the sum(sentiment) for each article.

Additional information based on the answer below:

df %>%
  group_by(V4) %>% # group by not really needed
  mutate(V4 = gsub("[.],", ".", V4), 
         sentiment_score = sentiment_by(V4)) 

# A tibble: 2 x 5
# Groups:   V4 [2]
  date       text                      link                                V4                                                  sentiment_score$e~ $word_count   $sd $ave_sentiment
  <date>     <chr>                     <chr>                               <chr>                                                            <int>       <int> <dbl>          <dbl>
1 2020-06-24 3 more cops recover as P~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Three more police officers ~                  1         172 0.204       -0.00849
2 2020-06-24 QC suspends processing o~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Quezon City will halt the p~                  1         161 0.329       -0.174  
Warning message:
Can't combine <sentiment_by> and <sentiment_by>; falling back to <data.frame>.
x Some attributes are incompatible.
i The author of the class should implement vctrs methods.
i See <https://vctrs.r-lib.org/reference/faq-error-incompatible-attributes.html>. 

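If you need the sentiment over the whole text, there is no need to split the text into sentences first; the sentiment functions take care of this. I replaced the "., " in your text with plain periods, because the sentiment functions need them to find sentence boundaries. The sentiment functions recognize that an abbreviation such as "Mr." is not the end of a sentence. If you call get_sentences() first and then sentiment(), you get the sentiment per sentence and not over the whole text.

The function sentiment_by handles the sentiment over the whole text and averages the sentence scores. If you need to change how they are averaged, check the help for the averaging.function option. The by part of the function can deal with any grouping you want to apply.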
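As a rough illustration of the averaging.function option mentioned above, the sketch below scores two made-up articles twice: once with the default averaging and once with a plain mean via average_mean, which sentimentr exports alongside the default. The articles vector is hypothetical and stands in for the V4 column; the sentences are split once with get_sentences() and then passed to sentiment_by(), which still returns one row per article.

```r
library(sentimentr)

# Hypothetical stand-in for the article column: one string per article
articles <- c(
  "The recovery was fast. Doctors were pleased. Sadly, one patient died.",
  "The city halted the program. Officials promised a fair review."
)

# Split into sentences once; each sentence keeps the id of its article
sentences <- get_sentences(articles)

# Default averaging (down-weights sentences with a score of zero)
default_avg <- sentiment_by(sentences)

# Plain arithmetic mean of the sentence scores instead
mean_avg <- sentiment_by(sentences, averaging.function = average_mean)

# One value per article for each averaging choice
default_avg$ave_sentiment
mean_avg$ave_sentiment
```

Both results have one row per article, so either ave_sentiment vector can be attached to the original data frame as a column.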

library(dplyr)
library(sentimentr)

df %>%
  group_by(V4) %>% # group_by is not really needed here
  mutate(V4 = gsub("[.],", ".", V4), # turn the "., " separators back into periods
         sentiment_score = sentiment_by(V4))

# A tibble: 2 x 5
# Groups:   V4 [2]
  date       text               link                      V4                            sentiment_score$~ $word_count   $sd $ave_sentiment
  <date>     <chr>              <chr>                     <chr>                                     <int>       <int> <dbl>          <dbl>
1 2020-06-24 3 more cops recov~ https://newsinfo.inquire~ "MANILA, Philippines — Three~                 1         172 0.204       -0.00849
2 2020-06-24 QC suspends proce~ https://newsinfo.inquire~ "MANILA, Philippines — Quezo~                 1         161 0.329       -0.174
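The output above stores the whole sentiment_by result as a nested data frame column. If a single plain numeric column per article is preferred, one possible approach is to pull out only ave_sentiment, and to sum the sentence-level scores separately if a total is wanted as in the question. This is a minimal sketch on made-up data: the small df, the id column and the new column names ave_sentiment and sum_sentiment are assumptions for illustration, not part of the original answer.

```r
library(sentimentr)

# Hypothetical stand-in for the question's data, keeping the "., " separators
df <- data.frame(
  id = 1:2,
  V4 = c("Three officers recovered., Nine deaths were recorded.",
         "The city suspended the IDs., The suspect faces dismissal."),
  stringsAsFactors = FALSE
)

# Turn the "., " separators back into plain periods
clean_text <- gsub("[.],", ".", df$V4)

# Split once; every sentence stays tied to its article through element_id
sentences <- get_sentences(clean_text)

# One row per article, in the same order as df: keep only the average score
by_article <- sentiment_by(sentences)
df$ave_sentiment <- by_article$ave_sentiment

# Sentence-level scores, summed per article
per_sentence <- sentiment(sentences)
df$sum_sentiment <- as.numeric(
  tapply(per_sentence$sentiment, per_sentence$element_id, sum)
)

df[, c("id", "ave_sentiment", "sum_sentiment")]
```

Because only a numeric vector is assigned to each new column, no sentiment_by object ends up inside the data frame, so the vctrs warning about combining sentiment_by objects should not come up with this approach.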