How to extract only person A's statements in a conversation between two persons A and B
I have transcripts of conversations between any two persons, A and B.
c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
c2 <- "Person A: again blabla Person B: blabla something else Person A: thanks blabla"
The data frame looks like this:
df <- data.frame(id = rbind(123, 345), conversation = rbind(c1, c2))
df
id conversation
c1 123 Person A: blabla...something Person B: blabla something else Person A: OK blabla
c2 345 Person A: again blabla Person B: blabla something else Person A: thanks blabla
Now I want to extract only person A's part and put it in a data frame. The result should be:
id person_A
1 123 blabla...something OK blabla
2 345 again blabla thanks blabla
This might not work for all your cases, especially if a conversation starts with Person B. Let me know if that is the case. Otherwise try:
df$person_A <- gsub("Person B.*:|Person A:", "", df$conversation)
# Name the columns explicitly; otherwise they come out as df.id and df.person_A
df <- data.frame(id = df$id, person_A = df$person_A)
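Removing the labels this way leaves the spaces that surrounded them, so the result can carry a leading space and doubled spaces. A minimal sketch of one way to tidy that up, assuming a plain whitespace cleanup is acceptable for your data (the names `raw` and `clean` are only illustrative):

```r
c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"

# Same substitution as in the answer above.
raw <- gsub("Person B.*:|Person A:", "", c1)

# Collapse runs of whitespace into one space, then strip leading/trailing spaces.
clean <- trimws(gsub("\\s+", " ", raw))
print(clean)
```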
Using the stringr package.
First we split the string using "Person A: " as the separator:
library(stringr)
conv.split <- str_split(df$conversation, "Person A: ")
This gives us all the turns started by A, each followed by B's (optional) reply.
We now remove B's replies:
conv.split <- lapply(conv.split, function(x){str_split(x, "Person B:.*")})
Finally we unlist each element and collapse the pieces into a single string:
sapply(conv.split, function(x){x <- unlist(x); paste(x, collapse = "")})
Result:
[1] "blabla...something OK blabla" "again blabla thanks blabla"
This also works when B starts the conversation, provided only one of the two speaks at a time, and it also works for long conversations.
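The split-then-drop idea can also be sketched in base R without stringr. This is only an illustration: it assumes every turn begins with a label of the form "Person X:", and `extract_speaker` and the conversation `c_b` are hypothetical names and data made up for the example.

```r
# Hypothetical helper: keep only the turns spoken by `speaker`.
# Assumes every turn starts with a label of the form "Person X:".
extract_speaker <- function(conversation, speaker = "Person A") {
  # Split on the whitespace before each label; the lookahead keeps the label.
  turns <- strsplit(conversation, "\\s+(?=Person [A-Z]:)", perl = TRUE)
  vapply(turns, function(x) {
    mine <- x[startsWith(x, paste0(speaker, ":"))]
    # Drop the label and surrounding whitespace, then join the turns.
    mine <- trimws(sub("^[^:]+:", "", mine))
    paste(mine, collapse = " ")
  }, character(1))
}

c1  <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
c_b <- "Person B: hello Person A: hi there"  # made-up conversation started by B
result <- extract_speaker(c(c1, c_b))
print(result)
```

Because the speaker is a parameter, the same helper could return B's turns with `extract_speaker(x, "Person B")`.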
Here is my attempt. I also added a third conversation, one that ends with B, just to cover that case:
c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
c2 <- "Person A: again blabla Person B: blabla something else Person A: thanks blabla"
c3 <- "Person A: again blabla Person B: blabla something else"
df <- data.frame(id = rbind(123, 345, 567), conversation = rbind(c1, c2, c3))
df$PersonA <- gsub("(Person A: |Person B: .+? (?<= Person A: )|Person B: .+?\\Z)", "", df$conversation, perl = TRUE)
df$PersonA
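What I do with gsub is remove:
- the "Person A: " label
- B's sentences that are followed by a sentence from A
- B's sentence at the end of the conversation (\Z)
I used perl = TRUE because life is too short not to use the rear-view mirror... um... the lookbehind operator.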
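As a quick check of the three alternatives, here is a small sketch that runs the same pattern over one conversation of each shape. The third conversation (started by B) is made up for the illustration, and the `trimws()` call is an addition to drop the trailing space left when B speaks last.

```r
# Same pattern as above; backslashes are doubled for an R string literal.
pattern <- "(Person A: |Person B: .+? (?<= Person A: )|Person B: .+?\\Z)"

conversations <- c(
  "Person A: blabla...something Person B: blabla something else Person A: OK blabla",
  "Person A: again blabla Person B: blabla something else",  # ends with B
  "Person B: hello Person A: hi there"                       # made-up: starts with B
)

# perl = TRUE is needed for the lazy quantifier, the lookbehind and \Z.
person_a <- trimws(gsub(pattern, "", conversations, perl = TRUE))
print(person_a)
```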
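I really like to approach this type of problem in a way that gives you access to all the data (B's utterances too). I like tidyr's extract for this kind of column splitting. I used to use the do.call(rbind, strsplit()) approach, but I love how clean the extract approach is.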
c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
c2 <- "Person A: again blabla Person B: blabla something else Person A: thanks blabla"
c3 <- "Person A: again blabla Person B: blabla something else"
df <- data.frame(id = rbind(123, 345, 567), conversation = rbind(c1, c2, c3))
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr)
conv <- strsplit(as.character(df[["conversation"]]), "\\s+(?=Person\\s)", perl=TRUE)
df2 <- df[rep(1:nrow(df), sapply(conv, length)), ,drop=FALSE]
rownames(df2) <- NULL
df2[["conversation"]] <- unlist(conv)
df2 %>%
  extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)")
## id Person Conversation
## 1 123 Person A blabla...something
## 2 123 Person B blabla something else
## 3 123 Person A OK blabla
## 4 345 Person A again blabla
## 5 345 Person B blabla something else
## 6 345 Person A thanks blabla
## 7 567 Person A again blabla
## 8 567 Person B blabla something else
df2 %>%
  extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)") %>%
  filter(Person == "Person A")
## id Person Conversation
## 1 123 Person A blabla...something
## 2 123 Person A OK blabla
## 3 345 Person A again blabla
## 4 345 Person A thanks blabla
## 5 567 Person A again blabla
Or collapse them the way you showed in your desired output:
df2 %>%
  extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)") %>%
  filter(Person == "Person A") %>%
  group_by(id) %>%
  select(-Person) %>%
  summarise(Person_A = paste(Conversation, collapse=" "))
## id Person_A
## 1 123 blabla...something OK blabla
## 2 345 again blabla thanks blabla
## 3 567 again blabla
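EDIT: Actually, I suspect your data has real names, e.g. "John Smith" rather than "Person A". If so, this initial regex split will capture first and last names that start with a capital letter and are followed by a colon: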
c1 <- "Greg Smith: blabla...something Sue Williams: blabla something else Greg Smith: OK blabla"
c2 <- "Greg Smith: again blabla Sue Williams: blabla something else Greg Smith: thanks blabla"
c3 <- "Greg Smith: again blabla Sue Williams: blabla something else"
df <- data.frame(id = rbind(123, 345, 567), conversation = rbind(c1, c2, c3))
conv <- strsplit(as.character(df[["conversation"]]), "\\s+(?=([A-Z][a-z]+\\s+[A-Z][a-z]+:))", perl=TRUE)
df2 <- df[rep(1:nrow(df), sapply(conv, length)), ,drop=FALSE]
rownames(df2) <- NULL
df2[["conversation"]] <- unlist(conv)
df2 %>%
  extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)")
## id Person Conversation
## 1 123 Greg Smith blabla...something
## 2 123 Sue Williams blabla something else
## 3 123 Greg Smith OK blabla
## 4 345 Greg Smith again blabla
## 5 345 Sue Williams blabla something else
## 6 345 Greg Smith thanks blabla
## 7 567 Greg Smith again blabla
## 8 567 Sue Williams blabla something else
Using data.table and gsub from base R:
require(data.table)
setDT(df)[, Person_A := gsub(".*Person A:[ ]*(.*)[ ]*Person B.*:[ ]*(.*)$",
                             "\\1\\2", conversation)][, conversation := NULL]
df
# id Person_A
# 1: 123 blabla...something OK blabla
# 2: 345 again blabla thanks blabla
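The single pattern above has two capture groups, so it appears to assume the exact shape A, B, A. For conversations with more turns, one hedged alternative is to match every "Person A" turn with base R's gregexpr() and regmatches(). The helper name `a_turns` and the five-turn conversation are made up for this sketch, and the labels are assumed to be literally "Person A:" and "Person B:".

```r
# Hypothetical helper: collect every turn spoken by Person A.
a_turns <- function(conversation) {
  # Lazily match from "Person A:" up to the next label or the end of the string.
  m <- gregexpr("Person A:.*?(?=\\s*Person [AB]:|\\s*$)", conversation, perl = TRUE)
  vapply(regmatches(conversation, m), function(x) {
    # Strip the label and surrounding whitespace, then join A's turns.
    paste(trimws(sub("^Person A:", "", x)), collapse = " ")
  }, character(1))
}

long_conv <- "Person A: a1 Person B: b1 Person A: a2 Person B: b2 Person A: a3"  # made up
ends_with_b <- "Person A: again blabla Person B: blabla something else"
out <- a_turns(c(long_conv, ends_with_b))
print(out)
```

The result is a plain character vector, so it can be assigned inside the same `:=` call, for example `setDT(df)[, Person_A := a_turns(as.character(conversation))]`.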