Split Player and Chat from Chat Log (text-mining)

Question

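I have a chat log that includes 4 players (A, B, C, D) and their chat, stored in a single row of my data frame (across multiple groups). I want to split each phrase into its own row and identify the speaker of that phrase in a separate column.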

I have tried many things with the following packages, with no luck: psych, dplyr, splitstackshape, tidytext, stringr, tidyr.

The data frame is not a txt document, but I assume it needs to be?

For example, this is what the chat log looks like. All of this is in a single row of my dataset.

  [1] " *** D has joined the chat ***"
  [2] " *** B has joined the chat ***"
  [3] " *** A has joined the chat ***"
  [4] "D: hi"
  [5] "B: hello!"
  [6] "A: Hi!"
  [7] "D: i think oxygen is most important"
  [8] "A: I do too"
  [9] " *** C has joined the chat ***"
 [10] "B: agreed, that was my #1"
 [11] "A: I didnt at first but then on second guess"
 [12] "A: oxygen then water"
 [13] "C: hi hi"

I want the following (with each row of these columns being a new phrase):

Player ID	Phrase
A	hi!
B	hello!

Ultimately, I want to use this to count the number of words/characters for each player.

Answer 1

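library(dplyr)
library(tidyr)

d %>%
  t() %>%                                            # 1 x 13 matrix -> 13 x 1
  as.data.frame() %>%                                # the single column is named V1 by default
  filter(!grepl("***", V1, fixed = TRUE)) %>%        # drop the "has joined the chat" notices
  # split on the first ": " only, so a colon inside a phrase is kept
  separate(V1, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge") %>%
  mutate(Count = nchar(Phrase))                      # characters per phrase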

Result:

#>   PlayerID                                    Phrase Count
#> 1        D                                        hi     2
#> 2        B                                    hello!     6
#> 3        A                                       Hi!     3
#> 4        D          i think oxygen is most important    32
#> 5        A                                  I do too     8
#> 6        B                    agreed, that was my #1    22
#> 7        A I didnt at first but then on second guess    41
#> 8        A                         oxygen then water    17
#> 9        C                                     hi hi     5

You can add the following to the dplyr chain to count the total number of characters per player:

group_by(PlayerID) %>%
summarize(Total = sum(Count))

#>   PlayerID Total
#>   <chr>    <int>
#> 1 A           69
#> 2 B           28
#> 3 C            5
#> 4 D           34
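
The question also asks about word counts, which the chain above does not cover. As a rough sketch (not part of the original answer), the same approach could count whitespace-separated tokens per phrase with stringr::str_count(). The names chat, word_totals, Words and TotalWords are made up for this illustration, and treating every run of non-whitespace characters as one word is an assumption that may not match your definition of a word.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same sample messages as in the answer's data, as a plain character vector
# (the name `chat` is hypothetical)
chat <- c(" *** D has joined the chat ***", " *** B has joined the chat ***",
          " *** A has joined the chat ***", "D: hi", "B: hello!", "A: Hi!",
          "D: i think oxygen is most important", "A: I do too",
          " *** C has joined the chat ***", "B: agreed, that was my #1",
          "A: I didnt at first but then on second guess",
          "A: oxygen then water", "C: hi hi")

word_totals <- data.frame(V1 = chat) %>%
  filter(!grepl("***", V1, fixed = TRUE)) %>%          # drop join notices
  separate(V1, into = c("PlayerID", "Phrase"),
           sep = ": ", extra = "merge") %>%            # speaker / phrase
  mutate(Words = str_count(Phrase, "\\S+")) %>%        # assumption: word = run of non-whitespace
  group_by(PlayerID) %>%
  summarize(TotalWords = sum(Words))

word_totals
```

With the sample data this should give 16 words for A, 6 for B, 2 for C and 7 for D. Note that "#1" counts as one word under this definition.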

Data:

d <- structure(c(" *** D has joined the chat ***", " *** B has joined the chat ***", 
                 " *** A has joined the chat ***", "D: hi", "B: hello!", "A: Hi!", 
                 "D: i think oxygen is most important", "A: I do too", " *** C has joined the chat ***", 
                 "B: agreed, that was my #1", "A: I didnt at first but then on second guess", 
                 "A: oxygen then water", "C: hi hi"), .Dim = c(1L, 13L))

Created on 2022-05-25 by the reprex package (v2.0.1)

