How to extract conversational utterances from a single string

I have a conversation between several speakers recorded as a single string:

convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"

I also have a vector of the speakers' names:

speakers <- c("Peter", "Mary", "al hamshi")

Using this vector as a component of my regex pattern, I get fairly close with this extraction:

library(stringr)
str_extract_all(convers, 
                paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya"                                        "hi how wz your weekend"                      "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff"          "yeah i know thats erm al"                    "hey guys how s it goin"                     
[7] "Great"                                       "where ve you been last week"

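However, the first part of the third speaker's name (al) is included in one of the extracted utterances (yeah i know thats erm al), and the last utterance by the speaker al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances are matched and extracted correctly?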

What about another approach?

Remove all speakers from the text and split the string on '\\s*:\\s*':

strsplit(gsub(paste(speakers, collapse = "|"), '', convers), '\\s*:\\s*')[[1]]

# [1] ""                                            "hiya"                                       
# [3] "hi how wz your weekend"                      "ahh still got a headache An you party a lot"
# [5] "nuh you know my kid s sick n stuff"          "yeah i know thats erm"                      
# [7] "hey guys how s it goin"                      "Great"                                      
# [9] "where ve you been last week"                 "ah you know camping with my girl friend"   

You can clean up the output a little by removing the first, empty value from it.
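One way to do that cleanup, as a minimal sketch: keep only the non-empty elements with base R's nzchar(). The shortened convers string and the names parts and utterances are assumptions made for illustration only.

```r
speakers <- c("Peter", "Mary", "al hamshi")
# Shortened sample conversation (illustrative, not the full string above)
convers <- "Peter: hiya Mary: hi al hamshi: hey guys"

# Remove the speaker names, then split on the colon and surrounding whitespace
parts <- strsplit(gsub(paste(speakers, collapse = "|"), "", convers), "\\s*:\\s*")[[1]]

# nzchar() is TRUE for non-empty strings, so this drops the leading ""
utterances <- parts[nzchar(parts)]
print(utterances)
```

Filtering with nzchar() removes every empty element rather than only the first one, which is usually what is wanted here; parts[-1] would drop only the first element.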

A correct splitting approach would look like this:

p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")
strsplit(sub("^\\W+", "", gsub(p2, "", convers, perl=TRUE)), "\\s*:\\s*")[[1]]
# => [1] "hiya"                                       
# => [2] "hi how wz your weekend"                     
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"         
# => [5] "yeah i know thats erm"                      
# => [6] "hey guys how s it goin"                     
# => [7] "Great"                                      
# => [8] "where ve you been last week"                
# => [9] "ah you know camping with my girl friend"    

The regex that removes the speakers from the string looks like this:

\s*\b(?:Peter|Mary|al hamshi)(?=:)

It matches:

  • \s* - 0+ whitespace characters
  • \b - a word boundary
  • (?:Peter|Mary|al hamshi) - one of the speaker names
  • (?=:) - which must be followed by a : character.

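Then the leading non-word characters are removed with the sub("^\\W+", "", ...) call, and the whole string is split with the \s*:\s* regex, which matches a : surrounded by 0+ whitespace characters.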
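The same pipeline can be broken into separate steps to see what each call contributes. This is only a sketch on a shortened conversation; the names step1, step2 and step3 are assumptions introduced for illustration.

```r
speakers <- c("Peter", "Mary", "al hamshi")
# Shortened sample conversation (illustrative, not the full string above)
convers <- "Peter: hiya Mary: hi al hamshi: hey guys"

# Pattern: optional whitespace, word boundary, a speaker name, then a lookahead for ":"
p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")

# Step 1: delete the speaker names (perl = TRUE is needed for the lookahead)
step1 <- gsub(p2, "", convers, perl = TRUE)

# Step 2: strip the leading non-word characters left behind by the first speaker
step2 <- sub("^\\W+", "", step1)

# Step 3: split on each colon together with any surrounding whitespace
step3 <- strsplit(step2, "\\s*:\\s*")[[1]]

print(step1)
print(step2)
print(step3)
```

Because the names are removed only when a : follows them, a word such as "al" appearing inside an utterance without a following colon is left untouched.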

Alternatively, you can use

(?<=(?:Peter|Mary|al hamshi):\s).*?(?=\s*(?:Peter|Mary|al hamshi):|\z)

Details:

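  • (?<=(?:Peter|Mary|al hamshi):\s) - a position immediately preceded by any speaker name, a : and a whitespace character
  • .*? - any 0+ characters (other than line break characters; put (?s) at the start of the pattern to make it match any character), as few as possible
  • (?=\s*(?:Peter|Mary|al hamshi):|\z) - a position immediately followed by 0+ whitespace characters and then any speaker name and a :, or by the end of the string.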

In R, you can use

library(stringr)
speakers <- c("Peter", "Mary", "al hamshi")
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
p <- paste0("(?<=(?:",paste(speakers, collapse="|"),"):\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)")
str_extract_all(convers, p)
# => [[1]]
# => [1] "hiya"                                       
# => [2] "hi how wz your weekend"                     
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"         
# => [5] "yeah i know thats erm"                      
# => [6] "hey guys how s it goin"                     
# => [7] "Great"                                      
# => [8] "where ve you been last week"                
# => [9] "ah you know camping with my girl friend"