How do I remove numeric patterns of a certain length from a string in R

Question

Suppose I have the string:

some_string <- "this is a string with some numbers 9639998 21057535 1000 2021 2022"

I want to remove the numeric patterns that are 7, 8, or 4 characters long, unless the number is 1000. So basically I want the following result:

"this is a string with some numbers 1000"

Answer 1

Use gsub here with the regex pattern \b(?:\d{7,8}|(?!1000\b)\d{4})\b (each backslash has to be doubled inside the R string literal):

some_string <- "this is a string with some numbers 9639998 21057535 1000 2021 2022"
output <- gsub("\\b(?:\\d{7,8}|(?!1000\\b)\\d{4})\\b", "", some_string, perl=TRUE)
output

[1] "this is a string with some numbers   1000  "

Actually, a better version that also tidies up the stray whitespace would be:

some_string <- "this is a string with some numbers 9639998 21057535 1000 2021 2022"
output <- gsub("\\s*(?:\\d{7,8}|(?!1000\\b)\\d{4})\\s*", " ", some_string, perl=TRUE)
output <- gsub("^\\s+|\\s+$", "", gsub("\\s{2,}", " ", output))
output

[1] "this is a string with some numbers 1000"
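If the lengths or the exception change, the same idea can be wrapped in a small function. The following is only a sketch under stated assumptions: `remove_numbers`, `lengths`, and `keep` are hypothetical names that do not appear in the answer, and `keep` is assumed to hold at least one purely numeric token.

```r
# Hypothetical helper (not from the original answer): remove whole-number
# tokens whose length is in `lengths`, except the exact tokens in `keep`.
# Assumes `keep` is non-empty and contains only digit strings.
remove_numbers <- function(x, lengths = c(4, 7, 8), keep = "1000") {
  # One \d{n} alternative per requested length, e.g. \d{4}|\d{7}|\d{8}
  len_alt <- paste0("\\d{", lengths, "}", collapse = "|")
  # Alternation of protected tokens, used inside a negative lookahead
  keep_alt <- paste0(keep, collapse = "|")
  pattern <- sprintf("\\b(?!(?:%s)\\b)(?:%s)\\b", keep_alt, len_alt)
  out <- gsub(pattern, "", x, perl = TRUE)
  # Collapse runs of whitespace, then trim both ends
  trimws(gsub("\\s+", " ", out))
}

some_string <- "this is a string with some numbers 9639998 21057535 1000 2021 2022"
remove_numbers(some_string)
```

Because the pattern keeps the word boundaries on both sides, a five-digit token such as 20022 should be left alone instead of losing its first four digits.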

Answer 2

A stringr option that keeps 1000 and any number whose length is not 4, 7, or 8. (The sample data includes one number of length 5.)

library(stringr)

"this is a string with some numbers 9639998 21057535 1000 2021 20022 2022" |> 
  str_remove_all("(?!1000)\\b(\\d{7,8}|\\d{4})\\b") |> 
  str_squish()
#> [1] "this is a string with some numbers 1000 20022"

Created on 2022-05-17 by the reprex package (v2.0.1)
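One detail worth checking is how the lookahead behaves on longer numbers that merely start with 1000. The sketch below compares a lookahead without a trailing word boundary against one with it. It uses base R `gsub` with `perl = TRUE` rather than stringr so that it runs without extra packages; the sample string and the names `loose`, `strict`, and `tidy` are made up for illustration.

```r
# Illustrative input (not from the question): 1000123 is a 7-digit number
# that happens to start with 1000.
x <- "keep 1000 but drop 1000123 and 2021"

# Lookahead without \b: protects anything that starts with 1000
loose  <- "(?!1000)\\b(\\d{7,8}|\\d{4})\\b"
# Lookahead with \b: protects only the exact token 1000
strict <- "(?!1000\\b)\\b(\\d{7,8}|\\d{4})\\b"

# Small helper to collapse and trim whitespace after removal
tidy <- function(s) trimws(gsub("\\s+", " ", s))

loose_out  <- tidy(gsub(loose,  "", x, perl = TRUE))
strict_out <- tidy(gsub(strict, "", x, perl = TRUE))

loose_out
#> [1] "keep 1000 but drop 1000123 and"
strict_out
#> [1] "keep 1000 but drop and"
```

If 7- or 8-digit numbers beginning with 1000 can occur in the data and should also be removed, the variant with the word boundary inside the lookahead is probably the safer choice.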


r

stringr