Extend gsub and grepl to ignore substrings between given delimiters
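I would like to be able to use grepl() and gsub() only outside of a given set of delimiters; for example, I would like to be able to ignore text between quotes.
This is my desired output: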
grepl2("banana", "'banana' banana \"banana\"", escaped =c('""', "''"))
#> [1] TRUE
grepl2("banana", "'banana' apple \"banana\"", escaped =c('""', "''"))
#> [1] FALSE
grepl2("banana", "{banana} banana {banana}", escaped = "{}")
#> [1] TRUE
grepl2("banana", "{banana} apple {banana}", escaped = "{}")
#> [1] FALSE
gsub2("banana", "potatoe", "'banana' banana \"banana\"")
#> [1] "'banana' potatoe \"banana\""
gsub2("banana", "potatoe", "'banana' apple \"banana\"")
#> [1] "'banana' apple \"banana\""
gsub2("banana", "potatoe", "{banana} banana {banana}", escaped = "{}")
#> [1] "{banana} potatoe {banana}"
gsub2("banana", "potatoe", "{banana} apple {banana}", escaped = "{}")
#> [1] "{banana} apple {banana}"
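Real cases might have quoted substrings in different numbers and orders.
I have written the following functions, which work for these cases, but they are clunky, and gsub2() in particular is not robust at all: it temporarily replaces the delimited content with placeholders, and those placeholders can be affected by the subsequent operations.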
regex_escape <-
  function(string, n = 1) {
    # Prefix every regex metacharacter with a backslash, n times over.
    for(i in seq_len(n)){
      string <- gsub("([][{}().+*^$|\\\\?])", "\\\\\\1", string)
    }
    string
  }
grepl2 <-
  function(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE,
           useBytes = FALSE, escaped = c('""', "''")){
    escaped <- strsplit(escaped, "")
    # TODO check that "escaped" delimiters are balanced and don't cross each other
    for(i in 1:length(escaped)){
      close <- regex_escape(escaped[[i]][[2]])
      open <- regex_escape(escaped[[i]][[1]])
      pattern_i <- sprintf("%s.*?%s", open, close)
      x <- gsub(pattern_i, "", x)
    }
    grepl(pattern, x, ignore.case, perl, fixed, useBytes)
  }
gsub2 <- function(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
                  fixed = FALSE, useBytes = FALSE, escaped = c('""', "''")){
  escaped <- strsplit(escaped, "")
  # TODO check that "escaped" delimiters are balanced and don't cross each other
  matches <- character()
  for(i in 1:length(escaped)){
    close <- regex_escape(escaped[[i]][[2]])
    open <- regex_escape(escaped[[i]][[1]])
    pattern_i <- sprintf("%s.*?%s", open, close)
    ind <- gregexpr(pattern_i, x)
    matches_i <- regmatches(x, ind)[[1]]
    regmatches(x, ind)[[1]] <- paste0("((", length(matches) + seq_along(matches_i), "))")
    matches <- c(matches, matches_i)
  }
  x <- gsub(pattern, replacement, x, ignore.case, perl, fixed, useBytes)
  # Restore the hidden substrings by looking up each "((i))" placeholder.
  for(i in seq_along(matches)){
    pattern <- sprintf("\\(\\(%s\\)\\)", i)
    x <- gsub(pattern, matches[[i]], x)
  }
  x
}
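To make the robustness concern concrete, here is a minimal sketch of the three steps gsub2() performs, written out by hand for a single quoted substring. The input string and the replacement are made up for illustration; the point is only that a replacement which touches the digit inside the "((1))" placeholder prevents the hidden text from being restored.

```r
x <- "'banana' 1"

# Step 1: hide the quoted part behind a numbered placeholder, as gsub2() does.
hidden <- sub("'.*?'", "((1))", x)

# Step 2: the user's replacement also rewrites the digit inside the placeholder.
replaced <- gsub("1", "one", hidden)

# Step 3: restoring searches for "((1))", which no longer exists.
restored <- gsub("\\(\\(1\\)\\)", "'banana'", replaced)

hidden    # "((1)) 1"
replaced  # "((one)) one"
restored  # "((one)) one"
```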
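Is there a solution that uses regex and no placeholders? Note that my current functions support multiple pairs of delimiters, but I would be satisfied with a solution that supports only one pair, and that would not try to match substrings between simple quotes.
It is also acceptable to impose different delimiters, for example { and } rather than two " or two ', if that helps.
I am also fine with imposing perl = TRUE.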
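Here is a simple regex solution that uses the negation operator inside a character class. It only satisfies your simple case. I was unable to make it satisfy the request for multiple paired delimiters: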
grepl2 <- function(patt, escape = "'", arg = NULL) {
  grepl(pattern = paste0("[^", escape, "]",
                         patt,
                         "[^", escape, "]"), arg)
}
grepl2("banana", "'banana' apple \"banana\"", escape =c( "'"))
#[1] TRUE
grepl2("banana", "'banana' apple ", escape =c( "'"))
#[1] FALSE
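One caveat worth checking, as far as I can tell: a negated character class has to consume a real character, so this pattern cannot match when the word sits at the very start or end of the string. The sketch below shows that, along with one possible workaround that offers the anchors as alternatives. The variable names `strict` and `relaxed` are only for illustration.

```r
strict  <- "[^']banana[^']"
relaxed <- "(^|[^'])banana([^']|$)"

at_start_strict  <- grepl(strict,  "banana split")    # FALSE: nothing precedes "banana"
at_start_relaxed <- grepl(relaxed, "banana split")    # TRUE
quoted_relaxed   <- grepl(relaxed, "'banana' split")  # FALSE: still ignored inside quotes
```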
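My opinion is that you probably need to separate the opening and closing delimiters for the code to work.
Here I am using the regex lookaround feature. It may not be generally available outside of R (especially the ?< lookbehind operator).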
grepl2 = function(pattern, x, escapes = c(open = "\"'{", close = "\"'}")){
  grepl(paste0("(?<![", escapes[[1]], "])",
               pattern,
               "(?![", escapes[[2]], "])"),
        x, perl = TRUE)
}
grepl2("banana", "'banana' banana \"banana\"")
#> [1] TRUE
grepl2("banana", "'banana' apple \"banana\"")
#> [1] FALSE
grepl2("banana", "{banana} banana {banana}")
#> [1] TRUE
grepl2("banana", "{banana} apple {banana}")
#> [1] FALSE
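Note that the lookarounds only inspect the characters immediately next to the match. A short check, using the same pattern that the function above builds from its default `escapes`, suggests that a word sitting deeper inside a delimited span is still found (the second test string is made up):

```r
pattern <- "(?<![\"'{])banana(?![\"'}])"

adjacent <- grepl(pattern, "{banana}", perl = TRUE)            # FALSE
interior <- grepl(pattern, "{ripe banana here}", perl = TRUE)  # TRUE, although it is inside braces
```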
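I took a stab at grepl2, but I have not cracked (or thought of a clear solution for) gsub2. In any case, this simply removes all characters (excluding newlines) between the shortest pair of the supplied escaped characters. It should also scale fairly well. If you use this solution, you may want to build in a check to make sure the pairs of escaped characters contain no spaces (or otherwise adapt the use of substr()). Hope it helps!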
grepl3 <-
  function(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE,
           useBytes = FALSE, escaped = c('""', "''")){
    # Escape regex metacharacters in the opening and closing delimiters.
    new_esc1 <- gsub("([][{}().+*^$|\\\\?])", "\\\\\\1", substr(escaped, 1, 1))
    new_esc2 <- gsub("([][{}().+*^$|\\\\?])", "\\\\\\1", substr(escaped, 2, 2))
    # One lazy "open ... close" pattern per pair, joined with alternation.
    rm_pat <- paste0(new_esc1, ".*?", new_esc2, collapse = "|")
    new_arg <- gsub(rm_pat, "", x)
    grepl(pattern, new_arg)
  }
grepl3(pattern = "banana", x = "'banana' apple \"banana\" {banana}", escaped =c("''", '""', "{}"))
#> [1] FALSE
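For reference, this is roughly what the combined removal pattern looks like for three delimiter pairs. The sketch hard-codes the already-escaped delimiters (an assumption made to keep it short) instead of running the escaping step, and the names `open_delim`, `close_delim` and `stripped` are illustrative.

```r
# Delimiters after escaping; only the braces need a backslash here.
open_delim  <- c("'", "\"", "\\{")
close_delim <- c("'", "\"", "\\}")

# One lazy "open ... close" pattern per pair, joined with alternation.
rm_pat <- paste0(open_delim, ".*?", close_delim, collapse = "|")
cat(rm_pat, "\n")  # prints the regex: '.*?'|".*?"|\{.*?\}

# Everything between delimiters is dropped before the real test runs.
stripped <- gsub(rm_pat, "", "'banana' apple \"banana\" {banana}")
stripped                   # " apple  "
grepl("banana", stripped)  # FALSE
```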
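You can use the start_escape/end_escape arguments to supply the LHS and RHS of matched delimiters (such as { and }) without matching them in the wrong place (} as an LHS delimiter).
perl = TRUE allows lookaround assertions. These evaluate the validity of the statement inside them without capturing it in the pattern. This post covers them well.
You will get an error with perl = FALSE, because TRE, R's default regex engine, does not support them.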
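A quick way to see the difference between the two engines is to try a lookbehind with and without perl = TRUE. This is only a sketch; the error is caught and reduced to the string "error" so that the snippet runs to the end.

```r
# Lookbehind under the default (TRE) engine: expected to fail when the
# pattern is compiled.
tre_result <- tryCatch(
  suppressWarnings(grepl("(?<!')banana", "a banana")),
  error = function(e) "error"
)

# The same pattern under PCRE.
pcre_result <- grepl("(?<!')banana", "a banana", perl = TRUE)

tre_result   # "error"
pcre_result  # TRUE
```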
gsub3 <- function(pattern, replacement, x, escape = NULL, start_escape = NULL, end_escape = NULL) {
  # Default to no lookaround so the function also works when no delimiters are supplied.
  left_escape <- right_escape <- ""
  if (!is.null(escape) || !is.null(start_escape))
    left_escape <- paste0("(?<![", paste0(escape, paste0(start_escape, collapse = ""), collapse = ""), "])")
  if (!is.null(escape) || !is.null(end_escape))
    right_escape <- paste0("(?![", paste0(escape, paste0(end_escape, collapse = ""), collapse = ""), "])")
  patt <- paste0(left_escape, "(", pattern, ")", right_escape)
  gsub(patt, replacement, x, perl = TRUE)
}
gsub3("banana", "potatoe", "'banana' banana \"banana\"", escape = "'\"")
#> [1] "'banana' potatoe \"banana\""
gsub3("banana", "potatoe", "'banana' apple \"banana\"", escape = '"\'')
#> [1] "'banana' apple \"banana\""
gsub3("banana", "potatoe", "{banana} banana {banana}", escape = "{}")
#> [1] "{banana} potatoe {banana}"
gsub3("banana", "potatoe", "{banana} apple {banana}", escape = "{}")
#> [1] "{banana} apple {banana}"
Below is grepl3. Note that it does not need perl = TRUE, because we do not care what the pattern captures, only whether it matches.
grepl3 <- function(pattern, x, escape = "'", start_escape = NULL, end_escape = NULL) {
  if (!is.null(escape) || !is.null(start_escape))
    start_escape <- paste0("[^", paste0(escape, paste0(start_escape, collapse = ""), collapse = ""), "]")
  if (!is.null(escape) || !is.null(end_escape))
    end_escape <- paste0("[^", paste0(escape, paste0(end_escape, collapse = ""), collapse = ""), "]")
  patt <- paste0(start_escape, pattern, end_escape)
  grepl(patt, x)
}
grepl3("banana", "'banana' banana \"banana\"", escape =c('"', "'"))
#> [1] TRUE
grepl3("banana", "'banana' apple \"banana\"", escape =c('""', "''"))
#> [1] FALSE
grepl3("banana", "{banana} banana {banana}", escape = "{}")
#> [1] TRUE
grepl3("banana", "{banana} apple {banana}", escape = "{}")
#> [1] FALSE
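The remark about grepl3 not needing perl = TRUE comes down to consuming versus zero-width matching. As a small illustration (the input string is made up), a negated class swallows the neighbouring characters, which is harmless for a TRUE/FALSE test but would damage a replacement, whereas lookarounds leave those characters alone:

```r
x <- "a banana b"

# Negated classes are part of the match, so the surrounding spaces are replaced too.
consumed <- gsub("[^']banana[^']", "potatoe", x)

# Lookarounds are zero-width, so only "banana" itself is replaced.
kept <- gsub("(?<!')banana(?!')", "potatoe", x, perl = TRUE)

consumed  # "apotatoeb"
kept      # "a potatoe b"
```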
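Edit:
This should solve gsub without the problem Andrew mentioned, as long as you can work with a single set of paired operators. I think you could modify it to allow multiple delimiters. Thanks for the great question, I found a new gem in regmatches!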
gsub4 <-
  function(pattern,
           replacement,
           x,
           left_escape = "{",
           right_escape = "}") {
    # `regmatches()` takes a character vector and
    # output of `gregexpr` and friends and returns
    # the matching (or unmatching, as here) substrings
    string_pieces <-
      regmatches(x,
                 gregexpr(
                   paste0(
                     "\\Q", # Begin quote, regex will treat everything after as fixed.
                     left_escape,
                     "\\E(?>[^", # \E ends the quoted section.
                     left_escape,
                     right_escape,
                     "]|(?R))*", # Recurses, allowing nested escape characters
                     "\\Q",
                     right_escape,
                     "\\E",
                     collapse = ""
                   ),
                   x,
                   perl = TRUE
                 ), invert = NA) # even indices match pattern (so are escaped),
                                 # odd indices we want to perform replacement on.
    for (k in seq_along(string_pieces)) {
      n_pieces <- length(string_pieces[[k]])
      # Due to the structure of regmatches(invert = NA), we know that it will always
      # return unmatched strings at odd values, padding with "" as needed.
      to_replace <- seq(from = 1, to = n_pieces, by = 2)
      string_pieces[[k]][to_replace] <- gsub(pattern, replacement, string_pieces[[k]][to_replace])
    }
    sapply(string_pieces, paste0, collapse = "")
  }
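The recursive pattern that gsub4() assembles can also be inspected on its own. The sketch below writes it out for "{" and "}" with the braces escaped by hand instead of via \Q...\E (a simplification), and shows how regmatches(invert = NA) interleaves unmatched and matched pieces for a nested input. The test string is made up.

```r
x <- "{outer {banana} still inside} banana"

# Balanced-brace pattern: an opening brace, then any mix of non-brace
# characters or a recursive match of the whole pattern, then a closing brace.
balanced <- "\\{(?>[^{}]|(?R))*\\}"
m <- gregexpr(balanced, x, perl = TRUE)

# The nested span is matched as a single unit.
protected <- regmatches(x, m)[[1]]

# invert = NA returns unmatched, matched, unmatched, ... in order.
pieces <- regmatches(x, m, invert = NA)[[1]]

protected  # "{outer {banana} still inside}"
pieces     # ""  "{outer {banana} still inside}"  " banana"
```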
gsub4('banana', 'apples', "{banana's} potatoes {banana} banana", left_escape = "{", right_escape = "}")
#> [1] "{banana's} potatoes {banana} apples"
gsub4('banana', 'apples', "banana's potatoes", left_escape = "{", right_escape = "}")
#> [1] "apples's potatoes"
gsub4('banana', 'apples', "{banana's} potatoes", left_escape = "{", right_escape = "}")
#> [1] "{banana's} potatoes"
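The answer suggests the approach could be adapted to several delimiter pairs. One way this might look, assuming nesting is not required, is sketched below. `gsub_outside()` and its `protect` argument are hypothetical names that do not appear in the answers above; each pair is matched by a simple non-nested pattern and the alternatives are joined with `|`.

```r
# Hypothetical helper: replace `pattern` only outside the spans matched by `protect`.
gsub_outside <- function(pattern, replacement, x,
                         protect = "'[^']*'|\"[^\"]*\"|\\{[^{}]*\\}") {
  # Odd positions hold text outside the delimiters, even positions hold
  # the protected spans (see ?regmatches, invert = NA).
  pieces <- regmatches(x, gregexpr(protect, x, perl = TRUE), invert = NA)
  vapply(pieces, function(p) {
    outside <- seq(from = 1, to = length(p), by = 2)
    p[outside] <- gsub(pattern, replacement, p[outside])
    paste0(p, collapse = "")
  }, character(1))
}

gsub_outside("banana", "potatoe", "'banana' banana \"banana\" {banana}")
# "'banana' potatoe \"banana\" {banana}"
```

Because the protected spans are never passed to gsub(), no placeholder has to survive the replacement step.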