Filter rows where mirror-image delimiters are not paired
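I have speech transcriptions with "mirror-image" delimiters, i.e. paired symbols that mark the start and the end of a span, such as ( and ) or < and >. The delimiters in this data are square brackets: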
df <- data.frame(
  id = 1:9,
  Utterance = c("[but if I came !ho!me",                    # <- closing square bracket is missing
                "=[ye::a:h]",                               # OK!
                "=[yeah] I mean [does it",                  # <- closing square bracket is missing
                "bu[t if (.) you know",                     # <- closing square bracket is missing
                "=ye::a:h]",                                # <- opening square bracket is missing
                "[that's right] YEAH (laughs)] [ye::a:h]",  # <- opening square bracket is missing
                "cos I've [heard] very sketchy stories",    # OK!
                "[cos] I've [heard very sketchy [stories]", # <- closing square bracket is missing
                "oh well] that's great"                     # <- opening square bracket is missing
  ))
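I want to filter the rows in which at least one opening delimiter or at least one closing delimiter is missing (as this indicates a transcription error).
I am actually doing fine with this str_count method: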
library(stringr)
library(dplyr)
df %>%
  filter(str_count(Utterance, "\\[|\\]") %in% c(1,3,5,7,9))
id Utterance
1 1 [but if I came !ho!me
2 3 =[yeah] I mean [does it
3 4 bu[t if (.) you know
4 5 =ye::a:h]
5 6 [that's right] YEAH (laughs)] [ye::a:h]
6 8 [cos] I've [heard very sketchy [stories]
7 9 oh well] that's great
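The odd-count test above can be written without listing the odd numbers explicitly. The following is a minimal sketch on a small hypothetical sample vector (utt is not the full data set from the question); it uses the modulo operator instead of %in%.

```r
library(stringr)

# Hypothetical sample utterances, shortened from the data in the question
utt <- c("[but if I came", "=[ye::a:h]", "=[yeah] I mean [does it", "oh well] that's great")

# Count every square bracket; an odd total means at least one is unpaired
n_brackets <- str_count(utt, "\\[|\\]")
odd <- n_brackets %% 2 == 1
odd
```

Note that a parity check only looks at the total: a string such as "] oops [" contains two brackets, so it would not be flagged even though neither bracket is paired.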
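But I would like to know whether a regex can be devised that directly detects the strings with missing elements. I have been experimenting with this regex for missing opening brackets:
p_op <- "(?<!.{0,10}\\[.{0,10})\\].*$"
df %>%
  filter(str_detect(Utterance, p_op))
which works well, and with this one for missing closing brackets, which fails to capture all matches:
p_cl <- "\\[(?!.*\\]).*$"
df %>%
  filter(str_detect(Utterance, p_cl))
How can the pattern or patterns be formulated better?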
We can use the following pattern (written here as an R string) in str_detect:
\\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\]
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(Utterance, "\\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\]"))
id Utterance
1 1 [but if I came !ho!me
2 3 =[yeah] I mean [does it
3 4 bu[t if (.) you know
4 5 =ye::a:h]
5 6 [that's right] YEAH (laughs)] [ye::a:h]
6 8 [cos] I've [heard very sketchy [stories]
7 9 oh well] that's great
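Here we check for an opening bracket [ followed by one or more characters that are not ], followed by another [ or the end of the string ($); or for the mirror-image pattern for the closing bracket, i.e. the start of the string (^) or a ], followed by one or more characters that are not [, followed by ].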
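To see which half of the alternation fires for a given row, the pattern can be split into its two branches. This is only an illustrative sketch: the names missing_close and missing_open and the three sample strings are assumptions introduced here, not part of the answer.

```r
library(stringr)

# Hypothetical sample utterances taken from the question's data
utt <- c("=[yeah] I mean [does it", "=ye::a:h]", "=[ye::a:h]")

# "[" followed by non-"]" characters, then another "[" or the end of the string
missing_close <- "\\[[^\\]]+(\\[|$)"
# start of the string or "]", followed by non-"[" characters, then "]"
missing_open <- "(^|\\])[^\\[]+\\]"

res <- data.frame(
  utt,
  missing_close = str_detect(utt, missing_close),
  missing_open  = str_detect(utt, missing_open)
)
res
```

The first string is caught by the first branch only, the second by the second branch only, and the correctly bracketed third string by neither.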
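Another possible solution, using purrr::map_dfr.
Explanation
Below, as requested by @ChrisRuehlemann, I explain my solution:
With str_extract_all(df$Utterance, "\\[|\\]"), we extract all the [ and ] of each utterance as a list, in the order in which they appear in the utterance.
We then iterate over the lists created for the utterances. Each of them is a list of square brackets, so we first need to collapse it into a single string of square brackets (str_c(.x, collapse = "")).
We compare the bracket string of each utterance with a string of the form [][][]... (str_c(rep("[]", length(.x)/2), collapse = "")). If the two strings are not equal, a square bracket is missing!
When map_dfr finishes, we end up with a column of TRUE and FALSE values, which we can use to filter the original data frame as needed.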
library(tidyverse)
str_extract_all(df$Utterance, "\\[|\\]") %>%
  map_dfr(~ list(OK = str_c(.x, collapse = "") !=
    str_c(rep("[]", length(.x)/2), collapse = ""))) %>%
  filter(df, .)
#> id Utterance
#> 1 1 [but if I came !ho!me
#> 2 3 =[yeah] I mean [does it
#> 3 4 bu[t if (.) you know
#> 4 5 =ye::a:h]
#> 5 6 [that's right] YEAH (laughs)] [ye::a:h]
#> 6 8 [cos] I've [heard very sketchy [stories]
#> 7 9 oh well] that's great
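The steps of this approach can be inspected one at a time. The sketch below is a hedged illustration on two utterances from the question; it swaps map_dfr for map_chr so that each intermediate result is a plain character vector, and the names brackets, found and ideal are introduced here for illustration only.

```r
library(stringr)
library(purrr)

# Two utterances from the question: one correct, one with a missing "]"
utt <- c("=[ye::a:h]", "[cos] I've [heard very sketchy [stories]")

# Step 1: keep only the brackets, in order of appearance
brackets <- str_extract_all(utt, "\\[|\\]")

# Step 2: collapse each set of brackets into a single string
found <- map_chr(brackets, ~ str_c(.x, collapse = ""))

# Step 3: build the reference string "[][]..." with floor(n / 2) pairs
ideal <- map_chr(brackets, ~ str_c(rep("[]", length(.x) %/% 2), collapse = ""))

# Step 4: a mismatch flags the utterance
flagged <- found != ideal
flagged
```

Because the reference string is a flat sequence of [] pairs, correctly nested brackets such as "[a [b] c]" would also be flagged by this comparison, which is fine for data where nesting is not expected.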
If you need a function that validates (nested) brackets, here is a stack-based one.
valid_delim <- function(x, delim = c(open = "[", close = "]"), max_stack_size = 10L){
  # f() validates a single string
  f <- function(x, delim, max_stack_size){
    if(is.null(names(delim))) {
      names(delim) <- c("open", "close")
    }
    if(nchar(x) > 0L){
      valid <- TRUE
      stack <- character(max_stack_size)
      i_stack <- 0L
      y <- unlist(strsplit(x, ""))
      for(i in seq_along(y)){
        if(y[i] == delim["open"]){
          # push the closing delimiter we now expect
          i_stack <- i_stack + 1L
          stack[i_stack] <- delim["close"]
        } else if(y[i] == delim["close"]) {
          # test for an empty stack first, then compare with the top element
          valid <- (i_stack > 0L) && (stack[i_stack] == delim["close"])
          if(valid)
            i_stack <- i_stack - 1L
          else break
        }
      }
      # valid only if every opened delimiter has been closed
      valid && (i_stack == 0L)
    } else NULL
  }
  x <- as.character(x)
  y <- sapply(x, f, delim = delim, max_stack_size = max_stack_size)
  unname(y)
}
library(dplyr)
valid_delim(df$Utterance)
#[1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
df %>% filter(valid_delim(Utterance))
# id Utterance
#1 2 =[ye::a:h]
#2 7 cos I've [heard] very sketchy stories
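When only one pair of delimiters is involved, the stack can be replaced by a running depth counter. The following base R sketch is an alternative technique, not the function above; balanced_depth is a hypothetical helper name and the sample strings are made up for illustration.

```r
# Hypothetical helper: tracks nesting depth instead of keeping a stack
balanced_depth <- function(x, open = "[", close = "]") {
  vapply(strsplit(as.character(x), ""), function(chars) {
    # +1 for each opening delimiter, -1 for each closing one, 0 otherwise
    step <- (chars == open) - (chars == close)
    depth <- cumsum(step)
    # balanced if the depth never drops below zero and ends at zero
    all(depth >= 0) && sum(step) == 0
  }, logical(1))
}

utt <- c("=[ye::a:h]", "oh well] that's great", "[a [nested] pair]", "[unclosed")
ok <- balanced_depth(utt)
ok
```

A counter is enough here because there is only one kind of bracket; validating several delimiter types at once (for example square and round brackets together) would still need a stack to check that each closer matches the most recent opener.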