I want to write a regex in R to remove all words of a string containing numbers

Question

For example:

x<-"Saint  A/74/PV.46 12/12/2019 4/66 19-40538 Lucia"

should give me "Saint Lucia".

I tried

trimws(gsub("\\w*[0-9]+\\w*\\s*", "", x))

which gives me

Saint  A//PV.///-Lucia

Any help is greatly appreciated.

Answer 1

You can use a replacement approach:

x<-"Saint  A/74/PV.46 12/12/2019 4/66 19-40538 Lucia"
gsub("\\s*(?<!\\S)(?!\\p{L}+(?!\\S))\\S+", "", x, perl=TRUE)
## => [1] "Saint Lucia"
library(stringr)
str_replace_all(x, "\\s*(?<!\\S)(?!\\p{L}+(?!\\S))\\S+", "")
## => [1] "Saint Lucia"

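See the R demo and the regex demo. Details:

\s* - zero or more whitespace characters
(?<!\S) - the start of the string, or a position immediately preceded by whitespace
(?!\p{L}+(?!\S)) - the next non-whitespace chunk must not be a letter-only word
\S+ - one or more non-whitespace characters.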
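
As a quick check of the pattern above, here is a minimal sketch that wraps the replacement in a helper and applies it to two strings. The function name keep_letter_words, the added trimws() call, and the second test string are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper (the name is an assumption): drop every whitespace-delimited
# chunk that is not made up of letters only, along with the whitespace before it.
keep_letter_words <- function(s) {
  # trimws() removes the leading space left behind when the first chunk is dropped
  trimws(gsub("\\s*(?<!\\S)(?!\\p{L}+(?!\\S))\\S+", "", s, perl = TRUE))
}

x <- "Saint  A/74/PV.46 12/12/2019 4/66 19-40538 Lucia"
res1 <- keep_letter_words(x)
res2 <- keep_letter_words("19-40538 Saint Lucia 4/66")  # made-up input
print(res1)
print(res2)
```

Both calls should leave only the letter-only words, separated by single spaces.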

Alternatively, you can match all letter-only words between whitespace boundaries and join the matches with a space:

paste(unlist(regmatches(x, gregexpr("(?<!\\S)\\p{L}+(?!\\S)", x, perl=TRUE))), collapse=" ")

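See the R demo online. Also, see the regex demo, which matches

(?<!\S) - the start of the string, or a position preceded by whitespace
\p{L}+ - one or more Unicode letters
(?!\S) - immediately to the right, there must be whitespace or the end of the string.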
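
The extraction approach above uses unlist(), which suits a single input string. If x were a character vector, one possible variation (a sketch only; the second element of the vector is made up for illustration) is to collapse the matches per element with vapply():

```r
# Sketch: apply the letter-only-word extraction to each element of a vector.
# The second string is a hypothetical input, not from the original question.
x <- c("Saint  A/74/PV.46 12/12/2019 4/66 19-40538 Lucia",
       "Costa 4/66 Rica")

# regmatches() returns one character vector of matches per input element
m <- regmatches(x, gregexpr("(?<!\\S)\\p{L}+(?!\\S)", x, perl = TRUE))

# Join the matches of each element with a single space
res <- vapply(m, paste, character(1), collapse = " ")
print(res)
```

Keeping the list structure returned by regmatches() avoids mixing words from different input strings.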

Answer 2

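We can use gsub to match runs of uppercase letters, digits, and the characters /, . and - that span from one word boundary (\b) to the next, replace them with an empty string (""), and then collapse the leftover runs of whitespace into a single space: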

gsub("\\s{2,}", " ", gsub("\\b[A-Z/0-9.-]+\\b", "", x))
#[1] "Saint Lucia"

Or use str_extract_all

library(stringr)
str_c(str_extract_all(x, "(?<= |^)[[:alpha:]]+(?= |$)")[[1]], collapse = " ")
#[1] "Saint Lucia"

Answer 3

You can use gsub to replace everything from the first space (" ") through the last space with a single space.

x <- "Saint  A/74/PV.46 12/12/2019 4/66 19-40538 Lucia"
gsub(" .+ ", " ", x)
[1] "Saint Lucia"
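
Because .+ is greedy, this pattern keeps only what comes before the first space and after the last one. A small sketch with a made-up string (the input y is an assumption, not from the question) shows what happens when a word to keep sits in the middle:

```r
# Sketch: the greedy " .+ " spans from the first space to the last space,
# so any word in between is removed as well. The input is hypothetical.
y <- "Saint 12 Lucia 19 Island"
res <- gsub(" .+ ", " ", y)
print(res)
```

Here "Lucia" is dropped along with the numbers, so this approach fits inputs shaped like the example in the question, where the words to keep are the first and last ones.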

regex

r

gsub

stringr