How to extract conversational utterances from a single string
I have a conversation between several speakers recorded as a single string:
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
I also have a vector with the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
Using this vector as a component of my regex pattern, I get fairly close with this extraction:
library(stringr)
str_extract_all(convers,
                paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya" "hi how wz your weekend" "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff" "yeah i know thats erm al" "hey guys how s it goin"
[7] "Great" "where ve you been last week"
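However, the first part of the third speaker's name (al) is included in one of the extracted utterances (yeah i know thats erm al), and the last utterance by al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances are matched and extracted correctly?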
What about a different approach? Remove all speakers from the text and split the string on the regex \s*:\s*
strsplit(gsub(paste(speakers, collapse = "|"), '', convers), '\\s*:\\s*')[[1]]
# [1] "" "hiya"
# [3] "hi how wz your weekend" "ahh still got a headache An you party a lot"
# [5] "nuh you know my kid s sick n stuff" "yeah i know thats erm"
# [7] "hey guys how s it goin" "Great"
# [9] "where ve you been last week" "ah you know camping with my girl friend"
You can tidy up the output a little by dropping the first, empty element.
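One way to drop that empty element, as a minimal sketch: keep only the non-empty pieces with nzchar(). The shortened convers below is made up for illustration and is not the string from the question.

```r
speakers <- c("Peter", "Mary", "al hamshi")
# Shortened, made-up conversation in the same format as the question
convers <- "Peter: hiya Mary: hi there al hamshi: hey guys"

# Delete every speaker name, then split on the colon and surrounding whitespace
parts <- strsplit(gsub(paste(speakers, collapse = "|"), "", convers), "\\s*:\\s*")[[1]]

# The string now starts with ": ", so the first piece is empty; filter it out
utterances <- parts[nzchar(parts)]
print(utterances)
```

Filtering with nzchar() would also drop any empty piece in the middle of the result, which may or may not be what you want if a speaker label can be followed by nothing.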
A more robust way to split looks like this:
p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")
strsplit(sub("^\\W+", "", gsub(p2, "", convers, perl=TRUE)), "\\s*:\\s*")[[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
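The regex that removes the speakers from the string looks like this:
\s*\b(?:Peter|Mary|al hamshi)(?=:)
It matches:
\s* - 0+ whitespace characters
\b - a word boundary
(?:Peter|Mary|al hamshi) - one of the speaker names
(?=:) - which must be followed by a : character.
Then the sub("^\\W+", "", ...) call removes the leading non-word characters, and the whole string is split with the \s*:\s* regex, which matches a : enclosed in 0+ whitespace characters.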
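To see what the (?=:) lookahead buys you, here is a small sketch comparing the two approaches on a made-up string (convers2 is hypothetical, not from the question) in which speaker names also occur inside the utterances.

```r
speakers <- c("Peter", "Mary", "al hamshi")
# Made-up example: "Mary" and "Peter" are also mentioned inside utterances
convers2 <- "Peter: hi Mary how are you Mary: fine thanks Peter"

# Plain removal deletes every occurrence of a name, including mentions in the text
naive <- strsplit(gsub(paste(speakers, collapse = "|"), "", convers2), "\\s*:\\s*")[[1]]

# The lookahead only removes names that are immediately followed by a colon
p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")
clean <- strsplit(sub("^\\W+", "", gsub(p2, "", convers2, perl = TRUE)), "\\s*:\\s*")[[1]]

print(naive)
print(clean)
```

Note that both variants still split on every colon, so an utterance that itself contains a : would be cut in two.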
Alternatively, you can use
(?<=(?:Peter|Mary|al hamshi):\s).*?(?=\s*(?:Peter|Mary|al hamshi):|\z)
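Details:
(?<=(?:Peter|Mary|al hamshi):\s) - a position immediately preceded by any speaker name, a colon, and a whitespace character
.*? - any 0+ characters (other than line break characters; add (?s) at the start of the pattern to make it match any character), as few as possible
(?=\s*(?:Peter|Mary|al hamshi):|\z) - a position immediately followed by 0+ whitespace characters and then any speaker name and a :, or by the end of the string.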
In R, you can use
library(stringr)
speakers <- c("Peter", "Mary", "al hamshi")
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
p <- paste0("(?<=(?:", paste(speakers, collapse = "|"), "):\\s).*?(?=\\s*(?:", paste(speakers, collapse = "|"), "):|\\z)")
str_extract_all(convers, p)
# => [[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
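If you also need to know who said what, the same building blocks can be combined into a speaker/utterance table. This is only a sketch in base R under two assumptions: every speaker label is followed by a non-empty utterance, and no utterance contains a colon. The names label, who, said and turns, and the shortened convers, are made up for illustration.

```r
speakers <- c("Peter", "Mary", "al hamshi")
# Shortened, made-up conversation in the same format as the question
convers <- "Peter: hiya Mary: hi there al hamshi: hey guys"

# A speaker label is one of the names immediately followed by a colon
label <- paste0("\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")

# Collect the labels in order of appearance
who <- regmatches(convers, gregexpr(label, convers, perl = TRUE))[[1]]

# Remove the labels, trim the leading ": ", and split on the remaining colons
stripped <- gsub(paste0("\\s*", label), "", convers, perl = TRUE)
said <- strsplit(sub("^\\W+", "", stripped), "\\s*:\\s*")[[1]]

# One row per turn; this relies on who and said having the same length
turns <- data.frame(speaker = who, utterance = said, stringsAsFactors = FALSE)
print(turns)
```

Checking length(who) == length(said) before building the data frame is a cheap way to catch input that breaks the assumptions above.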