How to replace specific trailing characters but block the first two letters

Question

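I want to harmonize strings by replacing all trailing X and Y characters with underscores. Because the strings vary in length, I wrote the regex below, which works. However, the first two characters should always stay as they are. I know I could use substr() and paste0() as a workaround, but how can I build this "do not replace the first two characters" rule into the regex itself?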

x <- c("AXZ", "AZXYYX", "HZX_Y", "BXX", "XYX_")

# replaces all trailing X / Y
gsub("[XY](?=[XY_]*$)", "_", x, perl = TRUE)
#> [1] "AXZ"    "AZ____" "HZ___"  "B__"    "____"

# blocks first character
gsub("(?<!^)[XY](?=[XY_]*$)", "_", x, perl = TRUE)
#> [1] "AXZ"    "AZ____" "HZ___"  "B__"    "X___"

# desired output
c("AXZ", "AZ____", "HZ___", "BX_", "XY__")
#> [1] "AXZ"    "AZ____" "HZ___"  "BX_"    "XY__"

I have already managed to exclude the first letter, so I assume this should be easy to solve.

Answer 1

The following seems to work:

gsub("(?<=.{2})[XY](?=[XY_]*$)", "_", x, perl=TRUE)

[1] "AXZ"    "AZ____" "HZ___"  "BX_"    "XY__"

Here is an explanation of the regex pattern, which uses lookarounds to enforce the correct replacement:

(?<=.{2})        lookbehind and assert there exist at least 2 preceding characters;
                 this ensures replacement will never be made on first 2 characters
[XY]             match any of X or Y
(?=[XY_]*$)      lookahead and assert that the matched X/Y is followed only
                 by more X/Y/_ until the end of the string

Note that each match replaces a single character with an underscore, but because we use gsub, all of the necessary replacements are made.
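
As a quick check, the sketch below compares the lookbehind pattern with the substr()/paste0() workaround mentioned in the question. The variable names (desired, via_lookbehind, head2, rest, via_substr) are only illustrative, and the workaround is one plausible way to write it, not necessarily what the asker had in mind.

```r
x <- c("AXZ", "AZXYYX", "HZX_Y", "BXX", "XYX_")
desired <- c("AXZ", "AZ____", "HZ___", "BX_", "XY__")

# Lookbehind version: only touch X/Y that have at least two characters before them
via_lookbehind <- gsub("(?<=.{2})[XY](?=[XY_]*$)", "_", x, perl = TRUE)

# Workaround: split off the first two characters, apply the original
# pattern to the remainder, then glue the two pieces back together
head2 <- substr(x, 1, 2)
rest <- substr(x, 3, nchar(x))
via_substr <- paste0(head2, gsub("[XY](?=[XY_]*$)", "_", rest, perl = TRUE))

identical(via_lookbehind, desired)
identical(via_substr, desired)
```

Both approaches give the same result on this sample; the lookbehind version simply moves the "skip two characters" rule into the pattern.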

Answer 2

You can use (*SKIP)(*FAIL) to skip the first two characters:

x <- c("AXZ", "AZXYYX", "HZX_Y", "BXX", "XYX_")

gsub("^.{2}(*SKIP)(*FAIL)|[XY](?=[XY_]*$)", "_", x, perl = TRUE)

This yields

[1] "AXZ"    "AZ____" "HZ___"  "BX_"    "XY__"

See a demo on regex101.com.
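
If the number of protected characters may change, the same idea can be wrapped in a small function. This is only a sketch: replace_trailing_xy and its keep argument are hypothetical names that do not come from the answer, and keep is assumed to be a whole number of at least 1.

```r
x <- c("AXZ", "AZXYYX", "HZX_Y", "BXX", "XYX_")

# Hypothetical helper: protect the first `keep` characters (assumes keep >= 1)
replace_trailing_xy <- function(x, keep = 2) {
  # First alternative consumes the protected prefix and discards it via
  # (*SKIP)(*FAIL); second alternative is the original trailing X/Y pattern
  pattern <- sprintf("^.{%d}(*SKIP)(*FAIL)|[XY](?=[XY_]*$)", as.integer(keep))
  gsub(pattern, "_", x, perl = TRUE)
}

res_keep2 <- replace_trailing_xy(x, keep = 2)
res_keep1 <- replace_trailing_xy(x, keep = 1)
```

With keep = 2 this reproduces the desired output, and with keep = 1 it reproduces the "blocks first character" result from the question.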

Answer 3

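One approach is to capture the two preceding characters and put them back in the replacement. In the replacement string you can use "\\1", "\\2", and so on to refer to the first, second, etc. capture group (inside an R string literal the backslash itself must be escaped). Here there is only one capture group.

sub("(..)[XY]+$", "\\1_", x, perl = TRUE)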
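
To see how the capture-group approach behaves on the sample data, here is a short sketch. It assumes the call is written with a balanced pattern and an escaped backreference, i.e. sub("(..)[XY]+$", "\\1_", ...); the names desired and via_capture are illustrative.

```r
x <- c("AXZ", "AZXYYX", "HZX_Y", "BXX", "XYX_")
desired <- c("AXZ", "AZ____", "HZ___", "BX_", "XY__")

# (..) captures two characters, [XY]+$ matches a run of X/Y at the very end,
# and "\\1_" writes the captured pair back followed by a single underscore
via_capture <- sub("(..)[XY]+$", "\\1_", x, perl = TRUE)

via_capture
#> [1] "AXZ"   "AZ_"   "HZX__" "BX_"   "XYX_"

# Elements where the result matches the desired output
via_capture == desired
#> [1]  TRUE FALSE FALSE  TRUE FALSE
```

The whole trailing run is collapsed into one underscore, and underscores already present in the tail are not treated as part of it, so this variant agrees with the desired output only for some inputs. The lookaround-based patterns in the other answers replace character by character and therefore preserve the string length.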
