Filter rows where mirror-image delimiters are not paired

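I have speech transcriptions with "mirror-image" delimiters, i.e. paired symbols that mark a beginning and an end respectively, such as ( ) or < >. The delimiters in this data are square brackets: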

df <- data.frame(
  id = 1:9,
  Utterance = c("[but if I came !ho!me",                         # <- closing square bracket is missing
                "=[ye::a:h]",                                    # OK!
                "=[yeah] I mean [does it",                       # <- closing square bracket is missing
                "bu[t if (.) you know",                          # <- closing square bracket is missing
                "=ye::a:h]",                                     # <- opening square bracket is missing
                "[that's right] YEAH (laughs)] [ye::a:h]",        # <- opening square bracket is missing
                "cos I've [heard] very sketchy stories",         # OK!
                "[cos] I've [heard very sketchy [stories]",      # <- closing square bracket is missing 
                "oh well] that's great"                          # <- opening square bracket is missing       
))

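I want to filter those rows where at least one opening or at least one closing delimiter is missing (as this indicates a transcription error). I'm actually doing okay with this str_count method: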

library(stringr)
library(dplyr)
df %>% 
   filter(str_count(Utterance, "\\[|\\]") %in% c(1,3,5,7,9))
  id                                Utterance
1  1                    [but if I came !ho!me
2  3                  =[yeah] I mean [does it
3  4                     bu[t if (.) you know
4  5                                =ye::a:h]
5  6  [that's right] YEAH (laughs)] [ye::a:h]
6  8 [cos] I've [heard very sketchy [stories]
7  9                    oh well] that's great
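
As a side note, the odd-count test does not have to enumerate 1, 3, 5, and so on. The following is only a sketch of the same idea using the remainder after division by 2; the vector x is made up for illustration and is not part of the original data frame.

```r
library(stringr)

# Illustrative sample strings (an assumption, not the original df)
x <- c("[but if I came",
       "=[ye::a:h]",
       "[cos] I've [heard very sketchy [stories]")

# Count every square bracket, opening or closing
n <- str_count(x, "\\[|\\]")
n
#> [1] 1 2 5

# An odd total means at least one bracket has no partner
n %% 2 == 1
#> [1]  TRUE FALSE  TRUE
```

Like the %in% version, this only looks at the total, so a string with an even but mispaired count, such as "a] b [c", would not be flagged.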

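But I was wondering whether regex patterns could be devised to detect the strings with missing elements directly. I've been experimenting with this regex for a missing opening bracket: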

p_op <- "(?<!.{0,10}\\[.{0,10})\\].*$"
df %>%
  filter(str_detect(Utterance, p_op))

which works well, and this one for a missing closing bracket, which fails to capture all matches:

p_cl <- "\\[(?!.*\\]).*$"
df %>%
  filter(str_detect(Utterance, p_cl))

How can the pattern or patterns be better formulated?

We could use the pattern \[[^\]]+(\[|$)|(^|\])[^\[]+\] in str_detect (with each backslash doubled inside the R string):

library(dplyr)
library(stringr)

df %>%
   filter(str_detect(Utterance, "\\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\]"))
  id                                Utterance
1  1                    [but if I came !ho!me
2  3                  =[yeah] I mean [does it
3  4                     bu[t if (.) you know
4  5                                =ye::a:h]
5  6  [that's right] YEAH (laughs)] [ye::a:h]
6  8 [cos] I've [heard very sketchy [stories]
7  9                    oh well] that's great

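Here we check for an opening bracket [ followed by one or more characters that are not a ], followed by either another [ or the end of the string ($), or for the mirror-image pattern for the closing bracket: the start of the string (^) or a ], followed by one or more characters that are not a [, followed by a ].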
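
To make the two halves of the alternation easier to read, they can be tried separately. This is only a sketch: the names missing_close and missing_open and the sample vector x are introduced here for illustration and do not appear in the answer above.

```r
library(stringr)

# Hypothetical names for the two alternatives of the pattern
missing_close <- "\\[[^\\]]+(\\[|$)"   # "[", non-"]" characters, then "[" or end of string
missing_open  <- "(^|\\])[^\\[]+\\]"   # start or "]", non-"[" characters, then "]"

# Illustrative sample strings (an assumption)
x <- c("[but if I came", "=[ye::a:h]", "oh well] that's great")

str_detect(x, missing_close)
#> [1]  TRUE FALSE FALSE

str_detect(x, missing_open)
#> [1] FALSE FALSE  TRUE
```

Note that both halves assume the brackets are not nested: a string such as "[a [b] c]" would be flagged by the first half even though its brackets are balanced.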

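Another possible solution, using purrr::map_dfr.

Explanation

Below I provide an explanation of my solution, as requested by @ChrisRuehlemann:

  1. With str_extract_all(df$Utterance, "\\[|\\]"), we extract all the [ and ] of each utterance into a list, in the order in which they appear in the utterance.

  2. We iterate over the list created above, one utterance at a time. Each element, however, is a vector of individual square brackets, so we first need to collapse it into a single string of square brackets (str_c(.x, collapse = "")).

  3. We compare each utterance's string of square brackets with a string of the form [][][]... (str_c(rep("[]", length(.x)/2), collapse = "")). If the two strings are not equal, a square bracket is missing!

  4. When map_dfr finishes, we end up with a column of TRUE and FALSE values, which we can use to filter the original data frame as needed.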

library(tidyverse)    

str_extract_all(df$Utterance, "\\[|\\]") %>% 
  map_dfr(~ list(OK = str_c(.x, collapse = "") != 
            str_c(rep("[]", length(.x)/2), collapse = ""))) %>% 
  filter(df,.)

#>   id                                Utterance
#> 1  1                    [but if I came !ho!me
#> 2  3                  =[yeah] I mean [does it
#> 3  4                     bu[t if (.) you know
#> 4  5                                =ye::a:h]
#> 5  6  [that's right] YEAH (laughs)] [ye::a:h]
#> 6  8 [cos] I've [heard very sketchy [stories]
#> 7  9                    oh well] that's great
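
To see what the numbered steps do, here is a trace on a single utterance taken from the data. It is a sketch for illustration only; the variable names u, brackets, found and expected are not part of the solution above.

```r
library(stringr)

u <- "=[yeah] I mean [does it"

# Step 1: extract the brackets in order of appearance
brackets <- str_extract_all(u, "\\[|\\]")[[1]]
brackets
#> [1] "[" "]" "["

# Step 2: collapse them into a single string
found <- str_c(brackets, collapse = "")
found
#> [1] "[]["

# Step 3: build the expected string; length(brackets) / 2 is 1.5 here,
# and rep() truncates a fractional count, so "[]" is repeated once
expected <- str_c(rep("[]", length(brackets) / 2), collapse = "")
expected
#> [1] "[]"

# Step 4: TRUE means the brackets are not properly paired
found != expected
#> [1] TRUE
```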

If you need a function that validates (possibly nested) brackets, here is a stack-based one.

valid_delim <- function(x, delim = c(open = "[", close = "]"), max_stack_size = 10L){
  f <- function(x, delim, max_stack_size){
    if(is.null(names(delim))) {
      names(delim) <- c("open", "close")
    }
    if(nchar(x) > 0L){
      valid <- TRUE
      stack <- character(max_stack_size)
      i_stack <- 0L
      y <- unlist(strsplit(x, ""))
      for(i in seq_along(y)){
        if(y[i] == delim["open"]){
          i_stack <- i_stack + 1L
          stack[i_stack] <- delim["close"]
        } else if(y[i] == delim["close"]) {
          # test the stack depth first so that stack[0] is never compared
          valid <- (i_stack > 0L) && (stack[i_stack] == delim["close"])
          if(valid)
            i_stack <- i_stack - 1L
          else break
        }
      }
      valid && (i_stack == 0L)
    } else NULL
  }
  x <- as.character(x)
  y <- sapply(x, f, delim = delim, max_stack_size = max_stack_size)
  unname(y)
}

library(dplyr)

valid_delim(df$Utterance)
#[1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

df %>% filter(valid_delim(Utterance))
#  id                             Utterance
#1  2                            =[ye::a:h]
#2  7 cos I've [heard] very sketchy stories
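
With a single pair of delimiters, the stack only ever holds copies of the same closing symbol, so a running depth counter gives the same answer. The following is a minimal sketch of that idea in base R, not a replacement for the function above; the name balanced and the sample strings are assumptions made for illustration.

```r
# Hypothetical helper: TRUE when every opening delimiter is closed later
# and no closing delimiter appears before its opener
balanced <- function(x, open = "[", close = "]") {
  vapply(strsplit(as.character(x), "", fixed = TRUE), function(chars) {
    # +1 for an opening delimiter, -1 for a closing one, 0 otherwise
    steps <- (chars == open) - (chars == close)
    depth <- cumsum(steps)
    # the depth must never drop below zero and must end at zero
    all(depth >= 0L) && sum(steps) == 0L
  }, logical(1))
}

balanced(c("[a [b] c]", "[a] b]", "[a [b]", "no brackets"))
#> [1]  TRUE FALSE FALSE  TRUE
```

Unlike the regex approach, this treats nested pairs such as "[a [b] c]" as valid. It would not extend to several delimiter types mixed in one string, which is where an explicit stack is needed.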