Better and simpler way to find who spoke the top 10 anger words in conversation text

I have a data frame with the variables 'AgentID', 'Type', 'Date' and 'Text'; a subset is as follows:

df <- structure(list(AgentID = c("AA0101", "AA0101", "AA0101", "AA0101", 
                            "AA0101"), Type = c("PS", "PS", "PS", "PS", "PS"), Date = c("4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019"),  Text = c("I am on social security XXXX and I understand it can not be garnished by Paypal credit because it's federally protected.I owe paypal {00.00} I would like them to cancel this please.", 
                        "My XXXX account is being reported late 6 times for XXXX per each loan I was under the impression that I was paying one loan but it's split into three so one payment = 3 or one missed payment would be three missed on my credit,. \n\nMy account is being reported wrong by all credit bureaus because I was in forbearance at the time that these late payments have been reported Section 623 ( a ) ( 2 ) States : If at any time a person who regularly and in the ordinary course of business furnishes information to one or more CRAs determines that the information provided is not complete or accurate, the furnisher must promptly provide complete and accurate information to the CRA. In addition, the furnisher must notify all CRAs that received the information of any corrections, and must thereafter report only the complete and accurate information. \n\nIn this case, I was in forbearance during that tie and document attached proves this. By law, credit need to be reported as of this time with all information and documentation",
                        "A few weeks ago I started to care for my credit and trying to build it up since I have never used my credit in the past, while checking my I discover some derogatory remarks in my XXXX credit report stating the amount owed of {00.00} to XXXX from XX/XX/2015 and another one owed to XXXX for {00.00} I would like to address this immediately and either pay off this debt or get this negative remark remove from my report.", 
                        "I disputed this XXXX  account with all three credit bureaus, the reported that it was closed in XXXX, now its reflecting closed XXXX once I paid the {0.00} which I dont believe I owed this amount since it was an fee for a company trying to take money out of my account without my permission, I was charged the fee and my account was closed. I have notified all 3 bureaus to have this removed but they keep saying its correct. One bureau is showing XXXX closed and the other on shows XXXX according to XXXX XXXX, XXXX shows a XXXX, this account has been on my report for seven years", 
                        "On XX/XX/XXXX I went on XXXX XXXX  and noticed my score had gone down, went to check out why and seen something from XXXX XXXX  and enhanced recovery company ... I also seen that it had come from XXXX and XXXX dated XX/XX/XXXX, XX/XX/XXXX, and XX/XX/XXXX ... I didnt have neither one before, I called and it the rep said it had come from an address Im XXXX XXXX, Florida I have never lived in Florida ever ... .I have also never had XXXX XXXX  nor XXXX XXXX  ... I need this taken off because it if affecting my credit score ... This is obviously identify theft and fraud..I have never received bills from here which proves that is was not done by me, I havent received any notifications ... if it was not for me checking my score I wouldnt have known nothing of this" )), row.names = c(NA, 5L), class = "data.frame")

First, I found the top 10 anger words using the following method:

library(tm)
library(tidytext)
library(tidyverse)
library(sentimentr)
library(wordcloud)
library(ggplot2)

CS <- function(txt){
  MC <- Corpus(VectorSource(txt))
  SW <- stopwords('english')
  # wrap tolower in content_transformer() so the corpus class is preserved
  MC <- tm_map(MC, content_transformer(tolower))
  MC <- tm_map(MC, removePunctuation)
  MC <- tm_map(MC, removeNumbers)
  MC <- tm_map(MC, removeWords, SW)
  MC <- tm_map(MC, stripWhitespace)
  # term frequencies summed across all documents, sorted descending
  myTDM <- as.matrix(TermDocumentMatrix(MC))
  v <- sort(rowSums(myTDM), decreasing = TRUE)
  FM <- data.frame(word = names(v), freq = v)
  row.names(FM) <- NULL
  FM <- FM %>%
    mutate(word = tolower(word)) %>%
    filter(str_count(word, "x") <= 1)  # drop the anonymised "XXXX" placeholders
  return(FM)
}

DF <- CS(df$Text)

# using nrc
nrc <- get_sentiments("nrc")
# create final dataset
DF_nrc <- DF %>% inner_join(nrc, by = "word")
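
One practical note: depending on your tidytext version, get_sentiments("nrc") may prompt you to download the NRC lexicon through the textdata package the first time you run it (this is an assumption about your setup; older tidytext versions bundled the lexicon directly):

# one-time setup if get_sentiments("nrc") asks for a download
# (assumes a tidytext version that delegates lexicons to textdata)
# install.packages("textdata")
nrc <- get_sentiments("nrc")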

Then I created a vector containing the top 10 anger words, like this:

TAW <- DF_nrc %>%
  filter(sentiment=="anger") %>%
  group_by(word) %>%
  summarize(freq = mean(freq)) %>%
  arrange(desc(freq)) %>% 
  top_n(10) %>%
  select(word)
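
(Note: as written, TAW is actually a one-column data frame rather than a vector, which is why the answer below indexes it as TAW$word. If you want a plain character vector instead, a minor variation using dplyr's pull() would do it, a sketch:)

TAW_vec <- DF_nrc %>%
  filter(sentiment == "anger") %>%
  group_by(word) %>%
  summarize(freq = mean(freq)) %>%
  arrange(desc(freq)) %>%
  top_n(10) %>%
  pull(word)  # returns a true character vector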

Next, what I want to do is find out which 'Agent'(s) frequently speak these words, and then rank them. But I'm confused about how to do this. Should I search for the words one by one and group by agent, or is there a better way? The result I'm looking for would look like this:

AgentID  Words_Spoken             Rank
A0001  theft, dispute, money    1
A0001  theft, fraud,            2
.......

It's not the most elegant solution, but here is how you could count the words based on row number:

library(stringr)

# write a new data.frame retaining the AgentID and Date from the original table
new.data <- data.frame(Agent = df$AgentID, Date = df$Date) 

# use a for-loop to go through every row of text in the df provided

for(i in seq_len(nrow(new.data))){ # i is the row number in the original df

  # build a temporary object (e101) that:
  ##  checks whether the text in df[i, "Text"] contains each TAW$word,
  ##  using stringr::str_detect inside sapply for the element-wise check,
  ##  then keeps only the matching words via TAW$word[...]
  e101 <- TAW$word[sapply(TAW$word, function(x) str_detect(df[i, "Text"], x))]

  # record the number of matched words in the new data.frame
  new.data[i, "number_of_TAW"] <- length(e101)

  # concatenate the matched words into a single comma-separated string
  new.data[i, "Words_Spoken"] <- ifelse(length(e101) == 0, "", paste(e101, collapse = ","))
}

new.data

#    Agent     Date number_of_TAW      Words_Spoken
# 1 AA0101 4/1/2019             0                  
# 2 AA0101 4/1/2019             0                  
# 3 AA0101 4/1/2019             2 derogatory,remove
# 4 AA0101 4/1/2019             3  fee,money,remove
# 5 AA0101 4/1/2019             1             theft
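
As a side note, if you'd rather avoid the explicit loop, one common alternative is to collapse the anger words into a single alternation pattern and extract all matches per row. This is only a sketch: like the str_detect() call above it does substring matching, and unlike the loop it counts repeated occurrences rather than distinct words:

# build one regex from the anger words and extract matches row by row
pattern <- paste(TAW$word, collapse = "|")
matches <- str_extract_all(df$Text, pattern)

new.data$number_of_TAW <- lengths(matches)
new.data$Words_Spoken  <- sapply(matches, paste, collapse = ",")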

If you're more of a dplyr/tidyverse person, you can take an approach using a few dplyr verbs, after converting the text data to a tidy format.

First, let's set up some example data with multiple speakers, one of whom speaks no anger words. You can use unnest_tokens() to handle most of the text cleaning steps with its defaults, such as splitting out tokens and removing punctuation. Then use anti_join() to remove stop words. I show finding the anger words with inner_join() as a separate step, but you could connect these in one big pipe if you prefer.

library(tidyverse)
library(tidytext)

my_df <- tibble(AgentID = c("AA0101", "AA0101", "AA0102", "AA0103"),
                Text = c("I want to report a theft and there has been fraud.",
                         "I have taken great offense when there was theft and also poison. It is distressing.",
                         "I only experience soft, fluffy, happy feelings.",
                         "I have a dispute with the hateful scorpion, and also, I would like to report a fraud."))

my_df
#> # A tibble: 4 x 2
#>   AgentID Text                                                             
#>   <chr>   <chr>                                                            
#> 1 AA0101  I want to report a theft and there has been fraud.               
#> 2 AA0101  I have taken great offense when there was theft and also poison.…
#> 3 AA0102  I only experience soft, fluffy, happy feelings.                  
#> 4 AA0103  I have a dispute with the hateful scorpion, and also, I would li…

tidy_words <- my_df %>%
  unnest_tokens(word, Text) %>%
  anti_join(get_stopwords()) 
#> Joining, by = "word"

anger_words <- tidy_words %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment == "anger"))
#> Joining, by = "word"

anger_words
#> # A tibble: 10 x 3
#>    AgentID word        sentiment
#>    <chr>   <chr>       <chr>    
#>  1 AA0101  theft       anger    
#>  2 AA0101  fraud       anger    
#>  3 AA0101  offense     anger    
#>  4 AA0101  theft       anger    
#>  5 AA0101  poison      anger    
#>  6 AA0101  distressing anger    
#>  7 AA0103  dispute     anger    
#>  8 AA0103  hateful     anger    
#>  9 AA0103  scorpion    anger    
#> 10 AA0103  fraud       anger

Now that you know which anger words each person used, the next step is to count them up and rank the people. The dplyr package has excellent support for this kind of work. First you want to group_by() the person identifier, then compute a couple of summary quantities:

  • the total number of words (so you can arrange by it)
  • a pasted-together string of the words spoken

Then arrange by the number of words, and create a new column giving the rank.

anger_words %>%
  group_by(AgentID) %>%
  summarise(TotalWords = n(),
            WordsSpoken = paste0(word, collapse = ", ")) %>%
  arrange(-TotalWords) %>%
  mutate(Rank = row_number())
#> # A tibble: 2 x 4
#>   AgentID TotalWords WordsSpoken                                       Rank
#>   <chr>        <int> <chr>                                            <int>
#> 1 AA0101           6 theft, fraud, offense, theft, poison, distressi…     1
#> 2 AA0103           4 dispute, hateful, scorpion, fraud                    2

Notice that with this approach, you don't have zero entries for people who spoke no anger words; they were dropped at the inner_join(). If you would like them in the final data set, you will probably need to join back to an earlier dataset and use replace_na(), as in the sketch below.
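
Here is a minimal sketch of that join-back step, assuming tidyr's replace_na() (tidyr is already loaded by library(tidyverse)):

anger_ranked <- anger_words %>%
  group_by(AgentID) %>%
  summarise(TotalWords = n(),
            WordsSpoken = paste0(word, collapse = ", "))

# join back to the full set of agents so zero-anger speakers reappear
my_df %>%
  distinct(AgentID) %>%
  left_join(anger_ranked, by = "AgentID") %>%
  mutate(TotalWords = replace_na(TotalWords, 0L),
         WordsSpoken = replace_na(WordsSpoken, "")) %>%
  arrange(-TotalWords) %>%
  mutate(Rank = row_number())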

Created on 2019-09-11 by the reprex package (v0.3.0)