Formulate capture groups for inconsistently present substrings
I have interview transcripts with partly irregular formatting:
tst <- c("In: ja COOL; #00:04:24-6# ",
" in den vier, FÜNF wochen, #00:04:57-8# ",
"In: jah, #00:02:07-8# ",
"In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
" also jA:h; #00:03:16-6# (1.1)",
"Bz: [E::hm; ] #00:03:51-4# (3.0) ",
"Bz: [mhmh, ]",
" in den bilLIE da war;")
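What I need to do is structure this data by extracting the key elements into columns of a data frame. There are four such key elements:
Role: the speaker's role in the interview, interviewee or interviewer
Utterance: the interview partners' speech
Timestamp: delimited by # on both ends
Gap: a decimal number in parentheses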
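The problem is that Timestamp and Gap are not provided consistently. While I can make the last capture group for Gap optional, the strings that have neither a Timestamp nor a Gap are not rendered correctly.
I am using extract from tidyr for the extraction: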
library(tidyr)
data.frame(tst) %>%
  extract(col = tst,
          into = c("Role", "Utterance", "Timestamp", "Gap"),
          regex = "^(\\w{2}:\\s|\\s+)([\\S\\s]+?)\\s*#([^#]+)?#\\s*(\\([0-9.]+\\))?\\s*")
  Role  Utterance                  Timestamp   Gap
1 In:   ja COOL;                   00:04:24-6
2       in den vier, FÜNF wochen,  00:04:57-8
3 In:   jah,                       00:02:07-8
4 In:   [ja; ]                     00:03:25-5
5       also jA:h;                 00:03:16-6  (1.1)
6 Bz:   [E::hm; ]                  00:03:51-4  (3.0)
7 <NA>  <NA>                       <NA>        <NA>
8 <NA>  <NA>                       <NA>        <NA>
How can the regex be improved to obtain the desired output:
  Role  Utterance                  Timestamp   Gap
1 In:   ja COOL;                   00:04:24-6
2       in den vier, FÜNF wochen,  00:04:57-8
3 In:   jah,                       00:02:07-8
4 In:   [ja; ]                     00:03:25-5
5       also jA:h;                 00:03:16-6  (1.1)
6 Bz:   [E::hm; ]                  00:03:51-4  (3.0)
7 Bz:   [mhmh, ]
8       in den bilLIE da war;
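You can update your pattern to keep your 4 capture groups and make the last part optional: wrap the Timestamp (group 3) and the Gap (group 4) in an optional non-capturing group, and assert the end of the string: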
library(tidyr)
tst <- c("In: ja COOL; #00:04:24-6# ",
" in den vier, FÜNF wochen, #00:04:57-8# ",
"In: jah, #00:02:07-8# ",
"In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
" also jA:h; #00:03:16-6# (1.1)",
"Bz: [E::hm; ] #00:03:51-4# (3.0) ",
"Bz: [mhmh, ]",
" in den bilLIE da war;")
data.frame(tst) %>%
  extract(col = tst,
          into = c("Role", "Utterance", "Timestamp", "Gap"),
          regex = "^(\\w{2}:\\s|\\s+)([\\s\\S]*?)(?:\\s*#([^#]+)(?:#\\s*(\\([0-9.]+\\))?\\s*)?)?$")
Output:
  Role  Utterance                   Timestamp   Gap
1 In:   ja COOL;                    00:04:24-6
2       in den vier, FÜNF wochen,   00:04:57-8
3 In:   jah,                        00:02:07-8
4 In:   [ja; ] #00:03:25-5# [ja; ]  00:03:26-1
5       also jA:h;                  00:03:16-6  (1.1)
6 Bz:   [E::hm; ]                   00:03:51-4  (3.0)
7 Bz:   [mhmh, ]
8       in den bilLIE da war;
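In row 4 of this output the Utterance runs across the first timestamp, whereas the desired output in the question keeps only the first one. The following is a minimal sketch of one possible variant, not the pattern from the answer above: it assumes that only the first timestamp per line is wanted and that anything after it may be discarded. The variant pattern, the shortened vector `x`, and the use of base R `regexec()`/`regmatches()` with `perl = TRUE` in place of `extract()` are all assumptions made for illustration.

```r
# Hypothetical variant: the utterance group is [^#]*?, so it cannot run past
# the first '#', and a trailing .* absorbs whatever follows the first
# timestamp (and gap, if present).
pattern <- "^(\\w{2}:\\s|\\s+)([^#]*?)\\s*(?:#([^#]+)#\\s*(\\([0-9.]+\\))?.*)?$"

# A few of the lines from the question (ASCII-only ones, for portability).
x <- c("In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
       " also jA:h; #00:03:16-6# (1.1)",
       "Bz: [mhmh, ]",
       " in den bilLIE da war;")

# regexec() returns the full match plus one entry per capture group;
# groups that did not take part in the match come back as "".
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the full-match column and keep the four capture groups.
res <- as.data.frame(do.call(rbind, m)[, -1, drop = FALSE],
                     stringsAsFactors = FALSE)
names(res) <- c("Role", "Utterance", "Timestamp", "Gap")
print(res)
```

The same pattern string could presumably be passed as the `regex` argument of `extract()`, since it uses the same four capture groups.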
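An alternative to one complex regular expression is to chain several extract calls, each with a simpler regular expression, then convert any NA to "" and trim the unwanted whitespace.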
library(dplyr)
library(tidyr)
data.frame(tst) %>%
  extract(tst, "Gap", "(\\(.*?\\))", remove = FALSE) %>%
  extract(tst, "Timestamp", "(#.*?#)", remove = FALSE) %>%
  extract(tst, c("Role", "Utterance"), "^(\\S+:|)([^#]*)") %>%
  # replace NA with "" in every column, then trim the utterance
  mutate(across(everything(), ~ coalesce(.x, "")), Utterance = trimws(Utterance))
giving:
  Role  Utterance                  Timestamp     Gap
1 In:   ja COOL;                   #00:04:24-6#
2       in den vier, FÜNF wochen,  #00:04:57-8#
3 In:   jah,                       #00:02:07-8#
4 In:   [ja; ]                     #00:03:25-5#
5       also jA:h;                 #00:03:16-6#  (1.1)
6 Bz:   [E::hm; ]                  #00:03:51-4#  (3.0)
7 Bz:   [mhmh, ]
8       in den bilLIE da war;
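The idea of running several simple extractions can also be sketched in base R. This is only an illustration under stated assumptions: `first_capture()` is a hypothetical helper (not part of tidyr or dplyr), the vector `x` is a two-line subset of the data, and the Timestamp pattern here puts the capture group inside the `#` delimiters so that they are dropped, which differs from the output shown above.

```r
# Hypothetical helper: return the first capture group of `pattern` for each
# element of `x`, or "" where the pattern does not match at all.
first_capture <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(m, function(g) if (length(g) >= 2) g[2] else "",
         character(1), USE.NAMES = FALSE)
}

x <- c(" also jA:h; #00:03:16-6# (1.1)",
       "Bz: [mhmh, ]")

# One simple pattern per column instead of a single complex one.
res <- data.frame(
  Role      = first_capture(x, "^(\\S+:)"),
  Utterance = trimws(first_capture(x, "^(?:\\S+:)?([^#]*)")),
  Timestamp = first_capture(x, "#(.*?)#"),
  Gap       = first_capture(x, "(\\(.*?\\))"),
  stringsAsFactors = FALSE
)
print(res)
```

Because each column has its own pattern, a missing Timestamp or Gap only affects that column and never causes the whole row to fail.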