RegEx for matching all commas unless they are enclosed between parentheses or brackets
Consider the following code in R:
x <- "A, B (C, D, E), F, G [H, I, J], K (L (M, N), O), P (Q (R, S (T, U)))"
strsplit(x, split = "some regex here")
I would like this to return something like a list containing the following character vector:
"A"
"B (C, D, E)"
"F"
"G [H, I, J]"
"K (L (M, N), O)"
"P (Q (R, S (T, U)))"
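Edit: The proposed duplicate does not answer my question, because nested parentheses and brackets are allowed, and the nesting can be n levels deep (more than 2).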
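This looks more like a job for a custom parser than for a single regex. I would love to be proven wrong, but while we wait, here is a very simple parsing function that does the job.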
parse_nested <- function(string) {
  chars <- strsplit(string, "")[[1]]

  # Running nesting depth of parentheses at each character
  parentheses <- numeric(length(chars))
  parentheses[chars == "("] <- 1
  parentheses[chars == ")"] <- -1
  parentheses <- cumsum(parentheses)

  # Running nesting depth of square brackets at each character
  brackets <- numeric(length(chars))
  brackets[chars == "["] <- 1
  brackets[chars == "]"] <- -1
  brackets <- cumsum(brackets)

  # Split only on commas that sit outside every parenthesis and bracket
  split_on <- which(brackets == 0 & parentheses == 0 & chars == ",")
  split_on <- c(0, split_on, length(chars) + 1)

  # substring() is vectorised and yields "" for an empty piece (e.g. ",,"),
  # where indexing chars with a descending a:b range would misbehave
  result <- substring(string, head(split_on, -1) + 1, tail(split_on, -1) - 1)
  trimws(result)
}
This gives:
parse_nested(x)
#> [1] "A" "B (C, D, E)" "F"
#> [4] "G [H, I, J]" "K (L (M, N), O)" "P (Q (R, S (T, U)))"
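The depth-counting idea behind parse_nested can be seen on a shorter string. The snippet below is only an illustrative sketch: the input `s` and the names `depth` and `top_level_commas` are made up for this example, and it tracks parentheses only.

```r
# Hypothetical short input, chosen only to keep the result small
s <- "A, B (C, D), E"
chars <- strsplit(s, "")[[1]]

# +1 at each "(", -1 at each ")", 0 elsewhere; cumsum() turns that
# into the nesting depth at every character position
depth <- cumsum((chars == "(") - (chars == ")"))

# A comma is a split point only where the nesting depth is zero
top_level_commas <- which(chars == "," & depth == 0)
top_level_commas
#> [1]  2 12
```

The comma at position 8 sits at depth 1, so it is skipped; only the commas at positions 2 and 12 are used as split points.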
Using regex only. Since stringr does not support recursion, we need to use base R.
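x <- "A, B (C, D, E), F, G [H, I, J], K (L (M, N), O), P (Q (R, S (T, U)))"
# Backslashes must be doubled inside an R string literal;
# (?R) recurses into the whole pattern and requires perl = TRUE
regmatches(x,
           gregexpr("([A-Z] )*([\\(\\[](?>[^()\\[\\]]|(?R))*[\\)\\]])|[A-Z]",
                    x, perl = TRUE))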
#> [[1]]
#> [1] "A" "B (C, D, E)" "F"
#> [4] "G [H, I, J]" "K (L (M, N), O)" "P (Q (R, S (T, U)))"
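The question asks for a pattern to pass to strsplit, whereas the call above extracts the pieces instead of splitting on the commas. A possible splitting variant is sketched below. It is not from the original answer: `split_pattern` and `pieces` are hypothetical names, and the sketch assumes the PCRE backtracking verbs `(*SKIP)(*F)` are available with `perl = TRUE`. Like the pattern above, it treats `(` and `[` interchangeably and does not check that each opener is closed by the matching kind of bracket.

```r
x <- "A, B (C, D, E), F, G [H, I, J], K (L (M, N), O), P (Q (R, S (T, U)))"

# Assumed pattern (not from the original answer):
#   group 1 matches a nested (...) or [...] run, recursing via (?1);
#   (*SKIP)(*F) discards that run, so commas inside it never split;
#   the last alternative, ",\\s*", matches the remaining top-level commas.
split_pattern <- "([(\\[](?:[^()\\[\\]]|(?1))*[)\\]])(*SKIP)(*F)|,\\s*"

pieces <- strsplit(x, split_pattern, perl = TRUE)[[1]]
pieces
```

Because the trailing whitespace is consumed by `,\\s*`, the pieces should need no extra trimws() call for input shaped like `x`.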