为不一致的子字符串制定捕获组

formulate capture groups for inconsistently present substrings

我有部分不规则格式的采访记录:

tst <- c("In: ja COOL;  #00:04:24-6#  ",           
         "  in den vier, FÜNF wochen, #00:04:57-8# ",
         "In: jah,  #00:02:07-8# ",
         "In:     [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
         "    also jA:h; #00:03:16-6# (1.1)",
         "Bz:        [E::hm;    ]  #00:03:51-4#  (3.0)  ",
         "Bz:    [mhmh,      ]",
         "  in den bilLIE da war;")

我需要做的是通过将关键元素提取到数据框的列中来构造此数据。有四个这样的关键要素:

问题是 TimestampGap 的提供不一致。虽然我可以将 Gap 的最后一个捕获组设为可选,但那些既没有 Timestamp 也没有 Gap 的字符串无法正确呈现:

我正在使用 tidyr 中的 extract 进行提取:

library(tidyr)
data.frame(tst) %>%
  extract(col = tst,
          into = c("Role", "Utterance", "Timestamp", "Gap"),
          regex = "^(\w{2}:\s|\s+)([\S\s]+?)\s*#([^#]+)?#\s*(\([0-9.]+\))?\s*")
  Role                 Utterance  Timestamp   Gap
1 In:                   ja COOL; 00:04:24-6      
2      in den vier, FÜNF wochen, 00:04:57-8      
3 In:                       jah, 00:02:07-8      
4 In:                     [ja; ] 00:03:25-5      
5                     also jA:h; 00:03:16-6 (1.1)
6 Bz:               [E::hm;    ] 00:03:51-4 (3.0)
7 <NA>                      <NA>       <NA>  <NA>
8 <NA>                      <NA>       <NA>  <NA>

如何改进正则表达式以获得所需的输出:

  Role                 Utterance  Timestamp   Gap
1 In:                   ja COOL; 00:04:24-6      
2      in den vier, FÜNF wochen, 00:04:57-8      
3 In:                       jah, 00:02:07-8      
4 In:                     [ja; ] 00:03:25-5      
5                     also jA:h; 00:03:16-6 (1.1)
6 Bz:               [E::hm;    ] 00:03:51-4 (3.0)
7 Bz:              [mhmh,      ]
8          in den bilLIE da war;

您可以更新您的模式以使用您的 4 个捕获组,并通过可选地匹配第 3 组然后匹配第 4 组并断言字符串的结尾来使最后一部分成为可选的:

library(tidyr)

tst <- c("In: ja COOL;  #00:04:24-6#  ",           
         "  in den vier, FÜNF wochen, #00:04:57-8# ",
         "In: jah,  #00:02:07-8# ",
         "In:     [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
         "    also jA:h; #00:03:16-6# (1.1)",
         "Bz:        [E::hm;    ]  #00:03:51-4#  (3.0)  ",
         "Bz:    [mhmh,      ]",
         "  in den bilLIE da war;")     

data.frame(tst) %>%
  extract(col = tst,
          into = c("Role", "Utterance", "Timestamp", "Gap"),
          regex = "^(\w{2}:\s|\s+)([\s\S]*?)(?:\s*#([^#]+)(?:#\s*(\([0-9.]+\))?\s*)?)?$")

输出

  Role                      Utterance  Timestamp   Gap
1 In:                        ja COOL; 00:04:24-6      
2           in den vier, FÜNF wochen, 00:04:57-8      
3 In:                            jah, 00:02:07-8      
4 In:      [ja; ] #00:03:25-5# [ja; ] 00:03:26-1      
5                          also jA:h; 00:03:16-6 (1.1)
6 Bz:                    [E::hm;    ] 00:03:51-4 (3.0)
7 Bz:                   [mhmh,      ]                 
8               in den bilLIE da war; 

复杂正则表达式的替代方法是使用具有更简单正则表达式的多个提取。然后将任何 NA 转换为 "" 并去除不需要的空格。

library(dplyr)
library(tidyr)

data.frame(tst) %>%
  extract(tst, "Gap", "(\(.*?\))", remove = FALSE) %>%
  extract(tst, "Timestamp", "(#.*?#)", remove = FALSE) %>%
  extract(tst, c("Role", "Utterance"), "^(\S+:|)([^#]*)") %>%
  mutate(across(, coalesce, ""), Utterance = trimws(Utterance))

给予:

  Role                 Utterance    Timestamp   Gap
1  In:                  ja COOL; #00:04:24-6#      
2      in den vier, FÜNF wochen, #00:04:57-8#      
3  In:                      jah, #00:02:07-8#      
4  In:                    [ja; ] #00:03:25-5#      
5                     also jA:h; #00:03:16-6# (1.1)
6  Bz:              [E::hm;    ] #00:03:51-4# (3.0)
7  Bz:             [mhmh,      ]                   
8          in den bilLIE da war;