R - Delete string in character vector that begins with capital letter

I have a character vector, df:

df <- c("hello goodbye Delete Me", "Another Sentence good program", "hello world The End")

I want this:

c("hello goodbye", "good program", "hello world")

I tried:

df <- grep("^[A-Z]", df, invert = TRUE, value = TRUE)

But this drops every element that begins with a capital letter, leaving:

c("hello goodbye Delete Me", "hello world The End")

How can I do this?

You can use:

trimws(gsub('[A-Z]\\w+', '', df))
#[1] "hello goodbye" "good program"  "hello world" 

You can use the following regex pattern and replace each match with a single space:

\s*[A-Z]\w+\s*

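This captures every word that begins with a capital letter, along with any whitespace on either side of it. The outer call to trimws() removes any spaces that the replacement leaves at the start or end of the string.

x <- c("nice to meet You however", "cat Ran away", "Cat", "Dog")
trimws(gsub('\\s*[A-Z]\\w+\\s*', ' ', x))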

[1] "nice to meet however" "cat away"             ""                    
[4] ""
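To make the role of trimws() concrete, here is a minimal sketch that splits the call above into two steps and prints the intermediate result. The variable names `replaced` and `cleaned` are only illustrative.

```r
x <- c("nice to meet You however", "cat Ran away", "Cat", "Dog")

# Step 1: replace each capitalized word, together with the whitespace
# around it, by a single space.
replaced <- gsub("\\s*[A-Z]\\w+\\s*", " ", x)
print(replaced)

# Step 2: strip the space that is left behind when the match sat at the
# start or end of the string (or was the whole string).
cleaned <- trimws(replaced)
print(cleaned)
```

Elements that consist only of a capitalized word, such as "Cat", become a lone space after step 1 and an empty string after step 2.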

How about:

library(stringr)
str_extract(df, "[^ ]+ [^ ]+")

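Output (this keeps only the first two words of each string, so the second element is still "Another Sentence"):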

[1] "hello goodbye"    "Another Sentence" "hello world" 

You can use any of the following three solutions:

df <- c("hello goodbye Delete Me", "Another Sentence good program", "hello world The End", "an iPhone", "Ещё Одно слово")

## Base R gsub with default TRE regex engine:
trimws(gsub("\\s*\\b[[:upper:]][[:alpha:]]*\\b", "", df))

## Base R gsub with PCRE regex engine:
trimws(gsub("(*UCP)\\s*\\b\\p{Lu}\\p{L}*\\b", "", df, perl = TRUE))

## stringr::str_replace_all with ICU regex engine:
library(stringr)
str_trim(str_replace_all(df, "\\s*\\b\\p{Lu}\\p{L}*\\b", ""))

The output of all three is [1] "hello goodbye" "good program" "hello world" "an iPhone" "слово". Note that the word boundaries are essential for handling words like iPhone correctly.
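As a small illustration of why the word boundary matters, the sketch below runs the same pattern with and without \b. It uses perl = TRUE and a shortened ASCII-only input for simplicity; the names `with_boundary` and `without_boundary` are hypothetical.

```r
words <- c("an iPhone", "hello world The End")

# With \\b, a match can only start at the beginning of a word,
# so the "P" inside "iPhone" is left alone.
with_boundary <- trimws(
  gsub("\\s*\\b[[:upper:]][[:alpha:]]*\\b", "", words, perl = TRUE)
)
print(with_boundary)

# Without \\b, the uppercase "P" in the middle of "iPhone" starts a match
# and "Phone" is removed, leaving a stray "i".
without_boundary <- trimws(
  gsub("\\s*[[:upper:]][[:alpha:]]*", "", words, perl = TRUE)
)
print(without_boundary)
```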

See the online R demo. Also, see the PCRE regex demo showing how the regex works (you can go here to watch the internals of the regex engine).

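Regex details:

  • \s* - zero or more whitespace characters
  • \b - a word boundary
  • [[:upper:]] / \p{Lu} - any Unicode uppercase letter
  • [[:alpha:]]* / \p{L}* - zero or more letters
  • \b - a word boundary

(*UCP) in the PCRE regex makes the shorthand classes and the \b word boundary Unicode-aware.

trimws is needed to remove any leading/trailing spaces that remain after the replacement.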
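If this is needed in more than one place, the PCRE variant could be wrapped in a small helper. The following is only a sketch: the function name `drop_capitalized` is made up for illustration and is not part of base R or stringr.

```r
# Hypothetical helper: removes every word that starts with an uppercase
# letter (plus the whitespace before it), then trims what is left.
drop_capitalized <- function(x) {
  trimws(gsub("(*UCP)\\s*\\b\\p{Lu}\\p{L}*\\b", "", x, perl = TRUE))
}

df <- c("hello goodbye Delete Me",
        "Another Sentence good program",
        "hello world The End")

result <- drop_capitalized(df)
print(result)
```

Applied to the vector from the question, this returns the three lowercase phrases that were asked for.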