Can quantifiers be used in regex replacement in R?

Question

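If my pattern is "...(*)...", my replacement for what (*) captures would be something like x\q1 or {\q1}x, so that I would get as many x as characters captured by (*).

Is this possible?

I am mainly thinking of sub and gsub, but you can answer with other libraries such as stringi or stringr. Feel free to use perl = TRUE or perl = FALSE and any other options.

I think the answer may be no, since the options seem very limited (?gsub):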

a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
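To make the quoted documentation concrete, here is a minimal sketch of what a replacement string can do in base R. The input strings are made up for illustration. A backreference re-inserts the captured text itself; the documentation quoted above lists nothing that turns the length of a capture into a repeat count.

```r
# Backreference: \\1 re-inserts the text captured by the first group.
r1 <- sub("(a+)", "<\\1>", "baaad")
r1
## [1] "b<aaa>d"

# With perl = TRUE, \\U ... \\E upper-cases the backreference in between.
r2 <- gsub("(\\w+)@(\\w+)", "\\U\\1\\E at \\2", "user@example", perl = TRUE)
r2
## [1] "USER at example"
```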

The main quantifiers are (?base::regex):

?
    The preceding item is optional and will be matched at most once.

*
    The preceding item will be matched zero or more times.

+
    The preceding item will be matched one or more times.

{n}
    The preceding item is matched exactly n times.

{n,}
    The preceding item is matched n or more times.

{n,m}
    The preceding item is matched at least n times, but not more than m times.
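As a quick illustration of the quantifiers listed above, the sketch below applies each one to the letter b inside an anchored pattern. The test vector x is invented for this example. Note that quantifiers live in the pattern argument only.

```r
# Hypothetical test strings with 0, 1, 2 and 3 occurrences of "b".
x <- c("ac", "abc", "abbc", "abbbc")

q_opt   <- grepl("^ab?c$", x)      # at most once
q_star  <- grepl("^ab*c$", x)      # zero or more times
q_plus  <- grepl("^ab+c$", x)      # one or more times
q_n     <- grepl("^ab{2}c$", x)    # exactly 2 times
q_n_up  <- grepl("^ab{2,}c$", x)   # 2 or more times
q_n_m   <- grepl("^ab{1,2}c$", x)  # between 1 and 2 times

q_opt
## [1]  TRUE  TRUE FALSE FALSE
q_n_m
## [1] FALSE  TRUE  TRUE FALSE
```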

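OK, but there seems to be an option (not in PCRE, and I am not sure whether it exists in Perl or anywhere else) in which (*) captures the number of characters the star quantifier was able to match (I found it at https://www.rexegg.com/regex-quantifier-capture.html), so that \q1 (same reference) can be used to refer to the first captured quantifier (and \q2, etc.). I also read that (*) is equivalent to {0,}, but I am not sure that is really the fact I am interested in.

EDIT, update:

Following a commenter's request, I have updated my question with a concrete example taken from this interesting question. I modified the example a bit. Suppose we have a <- "I hate extra spaces elephant". We want to keep a single space between words and the first 5 characters of each word (up to here this is the original question), and then one dot for every remaining character (I am not sure whether that was intended in the original question, but it does not matter), so the resulting string would be "I hate extra space. eleph..." (one . for the last s in spaces and 3 dots for the 3 letters ant at the end of elephant). So I started by keeping the first 5 characters with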

gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"

How should I replace the exact number of characters matched by \S* with dots or any other symbol?

Answer 1

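Quantifiers cannot be used in the replacement pattern, and neither can the number of characters they matched.

You need a \G-based PCRE pattern to find consecutive matches after a specific position in the string: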

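a <- "I hate extra spaces elephant"
# Replace, one at a time, every non-whitespace character after the first five of a word
gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)

See the R demo and the regex demo.

Details

(?:\G(?!^)|(?<!\S)\S{5}) - either the end of the previous successful match (\G(?!^)) or five non-whitespace characters that are not preceded by a non-whitespace character, i.e. the first five characters of a word
\K - the match reset operator, which discards the text matched so far
\S - any single non-whitespace character.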
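The pattern can be wrapped in a small helper so that the number of kept characters and the filler symbol become parameters. This is only a sketch: the function name mask_after and its arguments n and symbol are hypothetical and not part of the answer, and symbol is assumed not to contain a backslash, since gsub would interpret it in the replacement.

```r
# Hypothetical helper: keep the first n characters of each word and
# replace every further non-whitespace character with `symbol`.
mask_after <- function(x, n = 5, symbol = ".") {
  # Same pattern as in the answer, with the count n spliced in.
  pattern <- sprintf("(?:\\G(?!^)|(?<!\\S)\\S{%d})\\K\\S", n)
  gsub(pattern, symbol, x, perl = TRUE)
}

a <- "I hate extra spaces elephant"
res5 <- mask_after(a)
res5
## [1] "I hate extra space. eleph..."

res3 <- mask_after(a, n = 3, symbol = "*")
res3
## [1] "I hat* ext** spa*** ele*****"
```

Each match consumes exactly one character, which is why the number of replacement symbols equals the number of characters beyond the first n.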

Answer 2

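gsubfn is like gsub, except that the replacement can be a function that takes the match as input and returns the replacement. The function can optionally be written as a formula, as we do here, so that each word is replaced by the output of the function applied to that word. No complex regular expression is needed.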

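library(gsubfn)

# a is the string defined in the question; x is each matched word
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), strrep(".", max(0, nchar(x) - 5))), a)
## [1] "I hate extra space. eleph..."

Almost the same, except that the function is slightly different:

# gsub(".", ".", x) turns every character of x into a literal dot
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), substring(gsub(".", ".", x), 6)), a)
## [1] "I hate extra space. eleph..."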
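If installing gsubfn is not an option, the same idea of computing each replacement from its match can be sketched in base R with gregexpr and the replacement form of regmatches. This is an alternative under the assumption that words are runs of non-whitespace characters; it is not part of the answer above, and the variable names are illustrative.

```r
a <- "I hate extra spaces elephant"
b <- a  # work on a copy so that a stays unchanged

# Locate every run of non-whitespace characters (the "words").
m <- gregexpr("\\S+", b, perl = TRUE)
words <- regmatches(b, m)[[1]]

# Keep the first 5 characters and add one dot per remaining character.
extra <- pmax(nchar(words) - 5, 0)
regmatches(b, m) <- list(paste0(substr(words, 1, 5), strrep(".", extra)))
b
## [1] "I hate extra space. eleph..."
```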

regex

pcre

r

character-replacement