Remove text before an array of subtexts

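I have a set of strings to process. For each one, if it ends with one of a given set of substrings, I want to keep only that substring; otherwise the string should stay unchanged.

Here is an example: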

keep <- c("USA","UNITED STATES")
keep <- paste0(paste0(" ",keep,"$"),collapse="|")

data <- c("DETROIT","DETROIT USA","DETROIT UNITED STATES")
expected_result <- c("DETROIT","USA","UNITED STATES")

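You can use str_extract to extract the pattern if it is present. It returns NA when the pattern is missing, and you can then replace those NA values with the original data.
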
keep <- c("USA","UNITED STATES")
keep <- paste0(paste0(" ",keep,"$"),collapse="|")

result <- stringr::str_extract(data, keep)
result[is.na(result)] <- data[is.na(result)]
trimws(result)
#[1] "DETROIT"       "USA"           "UNITED STATES"
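
If you would rather avoid the stringr dependency, the same extract-then-fall-back idea can be sketched in base R. This is a minimal sketch, not part of the answer above: it assumes the entries in keep contain no regex metacharacters, and the variable names pattern and m are introduced here for illustration only.

```r
data <- c("DETROIT", "DETROIT USA", "DETROIT UNITED STATES")
keep <- c("USA", "UNITED STATES")

# Same idea as the stringr version: a leading space, one of the
# keep values, then end of string.
pattern <- paste0(" (", paste(keep, collapse = "|"), ")$")

# regexpr() returns -1 for elements with no match.
m <- regexpr(pattern, data)

# Start from the original data, then overwrite only the matching elements.
# regmatches() returns one string per matched element, in order.
result <- data
result[m != -1] <- trimws(regmatches(data, m))
result
```

Elements without a match are never touched, so there is no NA step to patch up afterwards.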

Alternatively, you can use sub with a capturing group:

data <- c("DETROIT","DETROIT USA","DETROIT UNITED STATES")
keep <- c("USA","UNITED STATES")

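# Backslashes must be doubled inside R string literals
regex <- paste0(".*\\s*\\b(", paste0(keep, collapse="|"), ")\\b")
# Replace the whole match with the contents of capture group 1
sub(regex, "\\1", data)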
## => [1] "DETROIT"       "USA"           "UNITED STATES"

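See the R demo online.

The regex is .*\s*\b(USA|UNITED STATES)\b; see its online demo.

Details:

  • .* - any zero or more characters, as many as possible
  • \s* - zero or more whitespace characters
  • \b(USA|UNITED STATES)\b - the whole word USA or UNITED STATES, captured into Group 1 (\1 in the replacement pattern).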
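
To see what Group 1 actually captures for each input, one option is to inspect the match with regexec and regmatches. This is only an illustrative sketch using base R; the names m and group1 are introduced here and do not come from the answer.

```r
data <- c("DETROIT", "DETROIT USA", "DETROIT UNITED STATES")
regex <- ".*\\s*\\b(USA|UNITED STATES)\\b"

# regexec() records the full match and each capture group;
# regmatches() turns that into a list with one character vector per input:
# c(full_match, group1) on a match, character(0) otherwise.
m <- regmatches(data, regexec(regex, data))

# Pull out group 1, using NA where the regex did not match.
group1 <- vapply(
  m,
  function(x) if (length(x) >= 2) x[2] else NA_character_,
  character(1),
  USE.NAMES = FALSE
)
group1
```

Inputs for which group1 is NA are exactly the ones that sub leaves unchanged, which is why "DETROIT" passes through as-is.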