Filter rows where mirror-image delimiters are not paired

Question

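I have speech transcriptions with "mirror-image" delimiters, i.e. paired symbols that mark a start and an end, such as ( and ) or < and >. In this data the delimiters are square brackets: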

df <- data.frame(
  id = 1:9,
  Utterance = c("[but if I came !ho!me",                         # <- closing square bracket is missing
                "=[ye::a:h]",                                    # OK!
                "=[yeah] I mean [does it",                       # <- closing square bracket is missing
                "bu[t if (.) you know",                          # <- closing square bracket is missing
                "=ye::a:h]",                                     # <- opening square bracket is missing
                "[that's right] YEAH (laughs)] [ye::a:h]",        # <- opening square bracket is missing
                "cos I've [heard] very sketchy stories",         # OK!
                "[cos] I've [heard very sketchy [stories]",      # <- closing square bracket is missing 
                "oh well] that's great"                          # <- opening square bracket is missing       
))

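I want to filter those rows where at least one opening or at least one closing delimiter is missing (since this indicates a transcription error). I am actually getting good results with this str_count method:

library(stringr)
library(dplyr)
df %>% 
   filter(str_count(Utterance, "\\[|\\]") %in% c(1,3,5,7,9))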
  id                                Utterance
1  1                    [but if I came !ho!me
2  3                  =[yeah] I mean [does it
3  4                     bu[t if (.) you know
4  5                                =ye::a:h]
5  6  [that's right] YEAH (laughs)] [ye::a:h]
6  8 [cos] I've [heard very sketchy [stories]
7  9                    oh well] that's great
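
The hard-coded set c(1,3,5,7,9) can be replaced by a parity test. The following is a minimal sketch in base R (so that it runs without extra packages); `count_brackets` is a hypothetical helper name introduced here, and the third test string is made up to show a limitation: a count only sees how many brackets there are, not their order, so a string with an even number of misplaced brackets is not flagged.

```r
# Hypothetical helper: number of square brackets in each string (base R only).
count_brackets <- function(x) {
  lengths(regmatches(x, gregexpr("\\[|\\]", x)))
}

x <- c("[but if I came !ho!me",   # one bracket: flagged
       "=[ye::a:h]",              # two brackets, correctly paired
       "a] b [c")                 # two brackets, wrong order: NOT flagged

n   <- count_brackets(x)
odd <- n %% 2 == 1                # TRUE where the bracket count is odd
data.frame(x, n, odd)
```

With stringr loaded, the same idea would presumably read filter(str_count(Utterance, "\\[|\\]") %% 2 == 1).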

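But I would like to know whether a regex can be designed that directly detects the strings with missing elements. I have been trying this regex for missing opening brackets:

p_op <- "(?<!.{0,10}\\[.{0,10})\\].*$"       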
df %>%
  filter(str_detect(Utterance, p_op))

which works well, and this one for missing closing brackets, which does not capture all matches:

p_cl<- "\\[(?!.*\\]).*$"    
df %>%
  filter(str_detect(Utterance, p_cl))

How can the pattern or patterns be better formulated?

Answer 1

You can use the pattern \[[^\]]+(\[|$)|(^|\])[^\[]+\] in str_detect (the backslashes are doubled inside the R string):

library(dplyr)
library(stringr)

df %>%
   filter(str_detect(Utterance, "\\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\]"))
  id                                Utterance
1  1                    [but if I came !ho!me
2  3                  =[yeah] I mean [does it
3  4                     bu[t if (.) you know
4  5                                =ye::a:h]
5  6  [that's right] YEAH (laughs)] [ye::a:h]
6  8 [cos] I've [heard very sketchy [stories]
7  9                    oh well] that's great

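Here we check for an opening bracket [ followed by one or more characters that are not ], followed by either another [ or the end of the string ($). The second alternative is the analogous pattern for a closing bracket: the start of the string (^) or a ], followed by one or more characters that are not [, followed by a ].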
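
To see what each half of the pattern contributes, the two alternatives can be tested separately. This is a sketch using base R's grepl with perl = TRUE rather than stringr, so that it runs without extra packages; the names `p_missing_close` and `p_missing_open` are made up for this illustration, and the last test string is not from the original data.

```r
# First alternative: "[" ... then another "[" or end of string (no "]" in between)
p_missing_close <- "\\[[^\\]]+(\\[|$)"
# Second alternative: start of string or "]" ... then "]" (no "[" in between)
p_missing_open  <- "(^|\\])[^\\[]+\\]"

x <- c("[but if I came !ho!me",
       "=[ye::a:h]",
       "=ye::a:h]",
       "[cos] I've [heard very sketchy [stories]",
       "abc [")                       # made-up edge case, see note below

miss_close <- grepl(p_missing_close, x, perl = TRUE)
miss_open  <- grepl(p_missing_open,  x, perl = TRUE)
data.frame(x, miss_close, miss_open)
```

The last string is not flagged by either half: because of the + quantifier, a [ that is the final character of the string (or a ] that is the first) has no character to match. If such cases can occur in the data, replacing + with * in both alternatives appears to cover them.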

Answer 2

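Another possible solution, using purrr::map_dfr.

Explanation

Below I provide an explanation of my solution, as requested by @ChrisRuehlemann:

With str_extract_all(df$Utterance, "\\[|\\]"), we extract all [ and ] of each utterance into a list, in the order in which they appear in the utterance.
We iterate over all the lists created for the utterances. Each one, however, is a list of square brackets, so we first need to collapse it into a single string of square brackets (str_c(.x, collapse = "")).
We compare each utterance's bracket string with a string of the form [][][]... (str_c(rep("[]", length(.x)/2), collapse = "")). If the two strings are not equal, a square bracket is missing!
When map_dfr finishes, we end up with a column of TRUE and FALSE values, which we can use to filter the original data frame as needed.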

library(tidyverse)    

str_extract_all(df$Utterance, "\\[|\\]") %>% 
  map_dfr(~ list(OK = str_c(.x, collapse = "") != 
            str_c(rep("[]", length(.x)/2), collapse = ""))) %>% 
  filter(df,.)

#>   id                                Utterance
#> 1  1                    [but if I came !ho!me
#> 2  3                  =[yeah] I mean [does it
#> 3  4                     bu[t if (.) you know
#> 4  5                                =ye::a:h]
#> 5  6  [that's right] YEAH (laughs)] [ye::a:h]
#> 6  8 [cos] I've [heard very sketchy [stories]
#> 7  9                    oh well] that's great
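
The steps described in the explanation can be traced on a few utterances. This is a sketch in base R (regmatches, paste and strrep stand in for the stringr and purrr calls, so it is not the answer's exact code); the variable names and the last, nested test string are assumptions added for illustration.

```r
x <- c("=[ye::a:h]",
       "=[yeah] I mean [does it",
       "[that's right] YEAH (laughs)] [ye::a:h]",
       "[a [b] c]")                   # made-up nested example

# Step 1: extract the brackets of each utterance, in order of appearance
br <- regmatches(x, gregexpr("\\[|\\]", x))

# Step 2: collapse each bracket vector into a single string
seen <- vapply(br, paste, character(1), collapse = "")

# Step 3: build the reference string "[][]..." with length(b) %/% 2 pairs
expected <- vapply(br, function(b) strrep("[]", length(b) %/% 2), character(1))

# Step 4: utterances whose bracket string differs from the reference
bad <- seen != expected
data.frame(x, seen, expected, bad)
```

Because the reference string is flat, correctly nested brackets such as the last example are flagged as well. Whether that is desirable depends on whether the transcription convention allows nesting.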

Answer 3

If you need a function that validates (possibly nested) brackets, here is a stack-based one.

valid_delim <- function(x, delim = c(open = "[", close = "]"), max_stack_size = 10L){
  f <- function(x, delim, max_stack_size){
    if(is.null(names(delim))) {
      names(delim) <- c("open", "close")
    }
    if(nchar(x) > 0L){
      valid <- TRUE
      stack <- character(max_stack_size)
      i_stack <- 0L
      y <- unlist(strsplit(x, ""))
      for(i in seq_along(y)){
        if(y[i] == delim["open"]){
          i_stack <- i_stack + 1L
          stack[i_stack] <- delim["close"]
        } else if(y[i] == delim["close"]) {
          # check the stack depth first so that stack[0] is never compared
          valid <- (i_stack > 0L) && (stack[i_stack] == delim["close"])
          if(valid)
            i_stack <- i_stack - 1L
          else break
        }
      }
      valid && (i_stack == 0L)
    } else NULL
  }
  x <- as.character(x)
  y <- sapply(x, f, delim = delim, max_stack_size = max_stack_size)
  unname(y)
}

library(dplyr)

valid_delim(df$Utterance)
#[1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

df %>% filter(valid_delim(Utterance))
#  id                             Utterance
#1  2                            =[ye::a:h]
#2  7 cos I've [heard] very sketchy stories
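
When only a single pair of delimiters is involved, the stack can be reduced to a depth counter: +1 for every opening delimiter, -1 for every closing one. A string is balanced if the running depth never drops below zero and ends at zero. The following is a sketch of that simplification, not the function above; `balanced` is a hypothetical name and the last two test strings are made up.

```r
# Hypothetical helper: TRUE where every open delimiter has a matching close.
balanced <- function(x, open = "[", close = "]") {
  vapply(strsplit(x, "", fixed = TRUE), function(chars) {
    step  <- (chars == open) - (chars == close)  # +1, -1 or 0 per character
    depth <- cumsum(step)                        # running nesting depth
    all(depth >= 0L) && sum(step) == 0L
  }, logical(1))
}

x <- c("=[ye::a:h]",
       "=ye::a:h]",
       "[cos] I've [heard very sketchy [stories]",
       "[a [b] c]",                   # nested, balanced
       "a] b [c")                     # even count, wrong order

ok <- balanced(x)
data.frame(x, ok)
```

Unlike the stack version, this shortcut cannot tell different delimiter types apart, so it would not catch a mix-up such as "[a)" if several pairs had to be validated together.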


regex

r

stringr

dplyr