Regex for replacement of specific character outside parenthesis only
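I am looking for a regex (preferably in R) that can replace (any number of occurrences of) a specific character, say ;, with, say, ;;, but only when it is not inside parentheses () in the text string.
Notes:
1. The character to be replaced may occur more than once inside the parentheses.
2. There are no nested parentheses in the data/vector.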
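Example
text;othertext is to be replaced with text;;othertext
- but text;other(texttt;some;someother);more is to be replaced with text;;other(texttt;some;someother);;more (i.e. ; is replaced with the replacement text only outside ()).
Still, if clarification is needed, I'll try to explain.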
in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag", "zvc;dfasdf;asdga;asd(asd;hsfd)", "adsg;(asdg;ASF;DFG;ASDF;);sdafdf", "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")
in_vec
#> [1] "abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag"
#> [2] "zvc;dfasdf;asdga;asd(asd;hsfd)"
#> [3] "adsg;(asdg;ASF;DFG;ASDF;);sdafdf"
#> [4] "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa"
Expected output (computed manually)
[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
You can use gsub with the pattern ;(?![^(]*\)) (the backslash has to be doubled inside an R string):
gsub(";(?![^(]*\\))", ";;", in_vec, perl=TRUE)
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
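; finds ;, (?!) is a negative lookahead (the replacement is made when it does not match), [^(] matches anything but (, * repeats the previous item 0 to n times, and \) requires a following ).
Or
gsub(";(?=[^)]*($|\\())", ";;", in_vec, perl=TRUE)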
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
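; finds ;, (?=) is a positive lookahead (the replacement is made when it matches), [^)] matches anything but ), * repeats the previous item 0 to n times, and ($|\() matches either the end of the string $ or a (.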
Or use gregexpr and regmatches to extract the parts between ( and ) and make the replacement in the non-matching substrings:
x <- gregexpr("\\(.*?\\)", in_vec) # Find the parts between ( and )
mapply(function(a, b) {
  paste(matrix(c(gsub(";", ";;", b), a, ""), 2, byrow=TRUE), collapse = "")
}, regmatches(in_vec, x), regmatches(in_vec, x, TRUE))
#[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
#[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"
#[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"
#[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"
But all of these work only for simple pairs of an opening ( and a closing ).
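To make that limitation concrete, here is a small sketch on a made-up nested input (the string `nested` is an assumption for illustration and is not part of the question's data). Both lookahead patterns only look for the next parenthesis, so a `;` at depth 1 that sits just before an inner `(` is treated as if it were outside:

```r
# Hypothetical nested input, not from the original data
nested <- "a;(b;(c;d);e);f"

# Negative-lookahead pattern from above
res_neg <- gsub(";(?![^(]*\\))", ";;", nested, perl = TRUE)

# Positive-lookahead pattern from above
res_pos <- gsub(";(?=[^)]*($|\\())", ";;", nested, perl = TRUE)

res_neg
#[1] "a;;(b;;(c;d);e);;f"
res_pos
#[1] "a;;(b;;(c;d);e);;f"
```

The `;` after `b` is doubled although it lies inside the outer parentheses, which is why these patterns should only be relied on when the data has no nesting.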
Although the problem can be solved with a regex, a simple function may be more direct and easier to understand.
replace_semicolons_outside_parentheses <- function(raw_string) {
  # Replace ; with ;; outside of parentheses
  processed_string <- ""
  n_open_parentheses <- 0
  # Loop over the characters in raw_string
  for (char in strsplit(raw_string, "")[[1]]) {
    # Update the net number of open parentheses
    if (char == "(") {
      n_open_parentheses <- n_open_parentheses + 1
    } else if (char == ")") {
      n_open_parentheses <- n_open_parentheses - 1
    }
    # Replace ; with ;; outside of parentheses
    if (char == ";" && n_open_parentheses == 0) {
      processed_string <- paste0(processed_string, ";;")
    } else {
      processed_string <- paste0(processed_string, char)
    }
  }
  return(processed_string)
}
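Note that the function above also works for nested parentheses: semicolons inside nested parentheses are not replaced! The desired output can be obtained in a single line: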
out_vec <- lapply(in_vec, replace_semicolons_outside_parentheses)
# 1. 'abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag'
# 2. 'zvc;;dfasdf;;asdga;;asd(asd;hsfd)'
# 3. 'adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf'
# 4. 'asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa'
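The same depth-counting idea can also be written without an explicit loop. The following is only a sketch, and `replace_outside_depth` is a hypothetical name, not something from the answer above. It computes the nesting depth of every character with cumsum() and doubles each `;` found at depth 0; vapply() is used so that the result is a character vector rather than a list:

```r
# Hypothetical helper: depth counting via cumsum() instead of a loop
replace_outside_depth <- function(raw_string) {
  chars <- strsplit(raw_string, "")[[1]]
  # Depth at each character: "(" seen so far minus ")" seen so far
  depth <- cumsum(chars == "(") - cumsum(chars == ")")
  # Double every ";" that sits at depth 0
  chars[chars == ";" & depth == 0] <- ";;"
  paste(chars, collapse = "")
}

in_vec <- c("abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag",
            "zvc;dfasdf;asdga;asd(asd;hsfd)",
            "adsg;(asdg;ASF;DFG;ASDF;);sdafdf",
            "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa")

# vapply() returns a plain character vector
out_vec <- vapply(in_vec, replace_outside_depth, character(1), USE.NAMES = FALSE)
out_vec
```

Like the loop version, this assumes the parentheses are balanced; a stray `)` would push the depth below 0 and every later `;` would be left unchanged.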
If there are no nested parentheses, use the following:
gsub("\\([^()]*\\)(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)
Explanation
--------------------------------------------------------------------------------
\( '(' char
--------------------------------------------------------------------------------
[^()]* any character except: '(', ')' (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\) ')' char
--------------------------------------------------------------------------------
(*SKIP)(*FAIL) skip current match, search for new one from here
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
; ';'
If there are nested parentheses:
gsub("(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;", ";;", in_vec, perl=TRUE)
See regex proof.
Explanation
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
[^()]++ any character except: '(', ')' (1 or more times
(matching the most amount possible, no backtracking))
--------------------------------------------------------------------------------
| or
--------------------------------------------------------------------------------
(?1) recursing first group pattern
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(*SKIP)(*FAIL) skip the match, search for next
--------------------------------------------------------------------------------
| or
--------------------------------------------------------------------------------
; ';'
--------------------------------------------------------------------------------
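As a quick check of the difference between the two (*SKIP)(*FAIL) patterns, the sketch below runs both on a made-up nested string (`nested`, `flat_pat` and `rec_pat` are names chosen here for illustration). The non-recursive pattern can only skip the innermost (...) group, while the recursive one skips the whole balanced group:

```r
# Hypothetical nested input, not from the original data
nested <- "a;(b;(c;d);e);f"

# Pattern for data without nested parentheses
flat_pat <- "\\([^()]*\\)(*SKIP)(*FAIL)|;"
# Recursive pattern for nested parentheses
rec_pat <- "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*FAIL)|;"

flat_res <- gsub(flat_pat, ";;", nested, perl = TRUE)
rec_res <- gsub(rec_pat, ";;", nested, perl = TRUE)

flat_res
#[1] "a;;(b;;(c;d);;e);;f"
rec_res
#[1] "a;;(b;(c;d);e);;f"
```

Only the recursive pattern leaves every `;` inside the outer parentheses untouched.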