Regex to replace a specific character only outside parentheses

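I am looking for a regex (preferably in R) that replaces every occurrence of a specific character, say ;, with, say, ;;, but only where that character is not inside parentheses () within the text string.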

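Notes:

1. The character to be replaced may also occur more than once inside the parentheses.

2. There are no nested parentheses in the data/vector.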

Example

I will try to explain further if clarification is needed, though.

in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag", "zvc;dfasdf;asdga;asd(asd;hsfd)", "adsg;(asdg;ASF;DFG;ASDF;);sdafdf", "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")

in_vec
#> [1] "abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag"
#> [2] "zvc;dfasdf;asdga;asd(asd;hsfd)"             
#> [3] "adsg;(asdg;ASF;DFG;ASDF;);sdafdf"           
#> [4] "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa"

Expected output (computed manually)

[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag" 
[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"             
[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"            
[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"

You can use gsub with the pattern ;(?![^(]*\)) (the backslash has to be doubled inside an R string):

gsub(";(?![^(]*\\))", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"                   
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"                  
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"       

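; matches ;, (?!...) is a negative lookahead (the replacement happens only when it does not match), [^(] is any character except (, * repeats the preceding item 0 to n times, and \) matches a literal ).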

Or with a positive lookahead:

gsub(";(?=[^)]*($|\\())", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"                   
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"                  
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"       

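; matches ;, (?=...) is a positive lookahead (the replacement happens only when it matches), [^)] is any character except ), * repeats the preceding item 0 to n times, and ($|\() matches either the end of the string $ or a literal (.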
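As a quick sanity check, both lookahead patterns can be compared against the expected output from the question. This is only a minimal sketch: the names expected, neg and pos are introduced here for illustration, and the input is repeated so the snippet runs on its own.

```r
in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag",
            "zvc;dfasdf;asdga;asd(asd;hsfd)",
            "adsg;(asdg;ASF;DFG;ASDF;);sdafdf",
            "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")

# Expected result, copied from the question
expected <- c("abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag",
              "zvc;;dfasdf;;asdga;;asd(asd;hsfd)",
              "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf",
              "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa")

# Negative lookahead: the ; must NOT be followed by "non-( characters, then )"
neg <- gsub(";(?![^(]*\\))", ";;", in_vec, perl = TRUE)

# Positive lookahead: the ; must be followed by "non-) characters, then ( or end"
pos <- gsub(";(?=[^)]*($|\\())", ";;", in_vec, perl = TRUE)

# Both checks pass silently when the patterns behave as described
stopifnot(identical(neg, expected), identical(pos, expected))
```

Both patterns decide "inside or outside" purely by looking at which parenthesis comes next, which is why they rely on the parentheses being balanced and not nested.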

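Or use gregexpr and regmatches to extract the parts between ( and ) and make the replacement only in the remaining, unmatched substrings:

x <- gregexpr("\\(.*?\\)", in_vec)  # locate each (...) part, non-greedily
mapply(function(a, b) {
  # a: the parenthesised parts (kept as-is); b: the text around them
  # interleave b1, a1, b2, a2, ... by filling a 2-row matrix row-wise and reading it column-wise
  paste(matrix(c(gsub(";", ";;", b), a, ""), 2, byrow=TRUE), collapse = "")
}, regmatches(in_vec, x), regmatches(in_vec, x, TRUE))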
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"                   
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"                  
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"       
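To make the gregexpr/regmatches approach easier to follow, here is a small sketch of the intermediate pieces for a single string. The names s, m, inside, outside and rebuilt are illustrative, and the paste0 interleave is a simplified stand-in for the matrix trick above, not a replacement for it.

```r
s <- "adsg;(asdg;ASF;DFG;ASDF;);sdafdf"

# Match data for every "(...)" part of s
m <- gregexpr("\\(.*?\\)", s)

# The matched parts (inside parentheses) and the leftovers around them
inside  <- regmatches(s, m)[[1]]
outside <- regmatches(s, m, invert = TRUE)[[1]]

print(inside)   # "(asdg;ASF;DFG;ASDF;)"
print(outside)  # "adsg;"   ";sdafdf"

# Replace only in the outside pieces, then interleave: out1, in1, out2, ""
# (outside always has exactly one more element than inside)
rebuilt <- paste0(gsub(";", ";;", outside), c(inside, ""), collapse = "")
print(rebuilt)  # "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
```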

But all of these only work for simple, non-nested pairs of an opening ( and a closing ).
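The limitation can be seen on a made-up nested input (the string below is hypothetical and not part of the question's data). A sketch with the negative-lookahead pattern:

```r
# Hypothetical input with one level of nesting
nested <- "a;(b;(c;d);e);f"

# What a nesting-aware solution should return
wanted <- "a;;(b;(c;d);e);;f"

# The lookahead only inspects the NEXT parenthesis, so the ; after "b"
# (which is followed by "(") is wrongly treated as being outside
naive <- gsub(";(?![^(]*\\))", ";;", nested, perl = TRUE)

print(naive)            # "a;;(b;;(c;d);e);;f"
print(naive == wanted)  # FALSE
```

The semicolon after b is doubled even though it sits inside the outer parentheses, so the lookahead approach should only be used when nesting cannot occur.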

Although the problem can be solved with a regex, a simple function may be more direct and easier to understand.

replace_semicolons_outside_parentheses <- function(raw_string) {
    # Replace ; with ;; outside of parentheses

    processed_string <- ""
    n_open_parentheses <- 0

    # Loops over characters in raw_string
    for (char in strsplit(raw_string, "")[[1]]) {

        # Update the net number of open parentheses
        if (char == "(") {
            n_open_parentheses <- n_open_parentheses + 1
        } else if (char == ")") {
            n_open_parentheses <- n_open_parentheses - 1
        }

        # Replace ; with ;; outside of parentheses
        if (char == ";" && n_open_parentheses == 0) {
            processed_string <- paste0(processed_string, ";;")
        } else {
            processed_string <- paste0(processed_string, char)
        }      
    }
    return(processed_string)
}

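Note that the function above also works with nested parentheses: semicolons inside nested parentheses are not replaced! The desired output can be obtained in one line: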

out_vec <- lapply(in_vec, replace_semicolons_outside_parentheses)

# 1. 'abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag'
# 2. 'zvc;;dfasdf;;asdga;;asd(asd;hsfd)'
# 3. 'adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf'
# 4. 'asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa'
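The same counting idea can also be written without an explicit loop. The following is a sketch under the assumption that the parentheses are balanced; the name replace_outside is made up for this example. It uses cumsum to compute the net number of open parentheses at every character and vapply so that the result is a character vector rather than a list.

```r
replace_outside <- function(x) {
  vapply(strsplit(x, ""), function(chars) {
    # Net number of open parentheses at each character position
    depth <- cumsum(chars == "(") - cumsum(chars == ")")
    # Double only the semicolons that sit at depth 0
    chars[chars == ";" & depth == 0] <- ";;"
    paste(chars, collapse = "")
  }, character(1))
}

in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag",
            "zvc;dfasdf;asdga;asd(asd;hsfd)",
            "adsg;(asdg;ASF;DFG;ASDF;);sdafdf",
            "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")

out_vec <- replace_outside(in_vec)
print(out_vec)

# Nested parentheses are handled as well (hypothetical input)
print(replace_outside("a;(b;(c;d);e);f"))  # "a;;(b;(c;d);e);;f"
```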

If there are no nested parentheses, use the following (backslashes doubled for the R string):

gsub("\\([^()]*\\)(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)

See regex proof

Explanation

--------------------------------------------------------------------------------
  \(                       '(' char
--------------------------------------------------------------------------------
    [^()]*                   any character except: '(', ')' (0 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \)                       ')' char
--------------------------------------------------------------------------------
  (*SKIP)(*FAIL)           skip current match, search for new one from here
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  ;                        ';'
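To see which semicolons the (*SKIP)(*FAIL) pattern actually matches, here is a minimal sketch on a short made-up string (the names pat, s and hits are illustrative):

```r
pat <- "\\([^()]*\\)(*SKIP)(*FAIL)|;"
s   <- "x;(y;z);w"

# Start positions of all matches: only the semicolons outside "(y;z)"
hits <- as.integer(gregexpr(pat, s, perl = TRUE)[[1]])
print(hits)  # 2 8

# The replacement therefore leaves the parenthesised part untouched
print(gsub(pat, ";;", s, perl = TRUE))  # "x;;(y;z);;w"
```

The parenthesised group is consumed by the first alternative and then discarded, so the semicolon at position 5 is never offered to the second alternative.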

If there are nested parentheses:

gsub("(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)

See regex proof

Explanation

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \(                       '('
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more times):
--------------------------------------------------------------------------------
      [^()]++                  any character except: '(', ')' (1 or more times
                               (matching the most amount possible, no backtracking))
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      (?1)                     recurse into the first group's pattern
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
    \)                       ')'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (*SKIP)(*FAIL)           skip the current match, search for the next one from here
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  ;                        ';'
--------------------------------------------------------------------------------
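A short sketch of how the recursive pattern behaves on a nested input, compared with the non-nested pattern. The test string is hypothetical, and the names flat_pat, nested_pat and s are introduced only for this example.

```r
flat_pat   <- "\\([^()]*\\)(*SKIP)(*FAIL)|;"
nested_pat <- "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;"

# Hypothetical input with one level of nesting
s <- "a;(b;(c;d);e);f"

# The recursive pattern skips the whole balanced group "(b;(c;d);e)"
res_nested <- gsub(nested_pat, ";;", s, perl = TRUE)
print(res_nested)  # "a;;(b;(c;d);e);;f"

# The flat pattern only skips the innermost "(c;d)", so the semicolons
# after "b" and before "e" are doubled by mistake
res_flat <- gsub(flat_pat, ";;", s, perl = TRUE)
print(res_flat)    # "a;;(b;;(c;d);;e);;f"
```

On input without nesting the two patterns give the same result, so the recursive one is only needed when nesting can occur.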