Split Player and Chat from Chat Log (text-mining)
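I have a chat log that includes 4 players (A, B, C, D) and their chat messages, all stored in a single row of my data frame (there are multiple groups). I would like to split each phrase into its own row and identify the speaker of that phrase in a separate column.
I have tried a lot of things with the following packages, without success:
psych
dplyr
splitstackshape
tidytext
stringr
tidyr
The data frame is not a txt.document, but I think it has to be?
For example, this is what the chat log looks like. All of this is in one row of my dataset.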
[1] " *** D has joined the chat ***"
[2] " *** B has joined the chat ***"
[3] " *** A has joined the chat ***"
[4] "D: hi"
[5] "B: hello!"
[6] "A: Hi!"
[7] "D: i think oxygen is most important"
[8] "A: I do too"
[9] " *** C has joined the chat ***"
[10] "B: agreed, that was my #1"
[11] "A: I didnt at first but then on second guess"
[12] "A: oxygen then water"
[13] "C: hi hi"
I would like the following (with each row of these columns being a new phrase):
Player ID | Phrase |
---|---|
A | hi! |
B | hello! |
Ultimately, I want to use this to count the number of words/characters per player.
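library(dplyr)
library(tidyr)
d %>%
  t() %>%                                     # 1 x 13 matrix -> 13 x 1, one message per row
  as.data.frame() %>%                         # single column, named V1 by default
  filter(!grepl("***", V1, fixed = TRUE)) %>% # drop the "*** X has joined the chat ***" notices
  separate(V1, into = c("PlayerID", "Phrase"), sep = ": ") %>% # split speaker from message
  mutate(Count = nchar(Phrase))               # number of characters in each phrase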
Result:
#> PlayerID Phrase Count
#> 1 D hi 2
#> 2 B hello! 6
#> 3 A Hi! 3
#> 4 D i think oxygen is most important 32
#> 5 A I do too 8
#> 6 B agreed, that was my #1 22
#> 7 A I didnt at first but then on second guess 41
#> 8 A oxygen then water 17
#> 9 C hi hi 5
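One caveat with splitting on ": " is that a message which itself contains ": " would be cut into more than two pieces. As a minimal base R sketch of the same idea (not part of the original answer), the speaker can be taken from the start of the line with a regular expression, leaving the rest of the message intact. The object names `chat`, `msgs` and `out` and the last sample line are hypothetical, and the pattern assumes player IDs are the single letters A to D from the question.

```r
# Sample lines in the same format as the chat log; the last one is a
# hypothetical message (not in the original data) that contains ": " itself.
chat <- c(" *** D has joined the chat ***",
          "D: hi",
          "B: hello!",
          "A: note: bring water")

# Keep only lines shaped like "<player>: <message>"; join notices do not match.
# The player IDs A-D are taken from the question.
msgs <- chat[grepl("^[A-D]: ", chat)]

out <- data.frame(
  PlayerID = sub("^([A-D]): .*$", "\\1", msgs),  # letter before the first ": "
  Phrase   = sub("^[A-D]: ", "", msgs),          # everything after the first ": "
  stringsAsFactors = FALSE
)
out$Count <- nchar(out$Phrase)  # characters per phrase, as in the pipeline above
out
```

If you prefer to stay with tidyr, passing `extra = "merge"` to `separate()` should have a similar effect, since it keeps everything after the first split in the last column.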
You can count the number of characters per player by adding this to the dplyr chain:
group_by(PlayerID) %>%
summarize(Total = sum(Count))
#> PlayerID Total
#> <chr> <int>
#> 1 A 69
#> 2 B 28
#> 3 C 5
#> 4 D 34
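The question also asks about words. A rough sketch of a word count per player, assuming a "word" is any run of non-whitespace characters (so "#1" and "agreed," each count as one word), could look like this. The names `word_counts`, `Words` and `TotalWords` are made up for illustration, and `extra = "merge"` is an addition that keeps any later ": " inside the phrase.

```r
library(dplyr)
library(tidyr)

# Same data as in the "Data" section of this answer
d <- structure(c(" *** D has joined the chat ***", " *** B has joined the chat ***",
" *** A has joined the chat ***", "D: hi", "B: hello!", "A: Hi!",
"D: i think oxygen is most important", "A: I do too", " *** C has joined the chat ***",
"B: agreed, that was my #1", "A: I didnt at first but then on second guess",
"A: oxygen then water", "C: hi hi"), .Dim = c(1L, 13L))

word_counts <- d %>%
  t() %>%
  as.data.frame() %>%
  filter(!grepl("***", V1, fixed = TRUE)) %>%
  separate(V1, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge") %>%
  # split each phrase on whitespace and count the pieces
  mutate(Words = lengths(strsplit(trimws(Phrase), "\\s+"))) %>%
  group_by(PlayerID) %>%
  summarize(TotalWords = sum(Words))

word_counts
# For the sample data this gives A = 16, B = 6, C = 2, D = 7
```

Whether punctuation or tokens like "#1" should count as words depends on your analysis; a tokenizer such as the one in tidytext would make different choices.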
Data:
d <- structure(c(" *** D has joined the chat ***", " *** B has joined the chat ***",
" *** A has joined the chat ***", "D: hi", "B: hello!", "A: Hi!",
"D: i think oxygen is most important", "A: I do too", " *** C has joined the chat ***",
"B: agreed, that was my #1", "A: I didnt at first but then on second guess",
"A: oxygen then water", "C: hi hi"), .Dim = c(1L, 13L))
Created on 2022-05-25 by the reprex package (v2.0.1)