How to replace all values in a column with NA if the string contains more than x digits OR more than x letters?

Question

My data looks something like this:

col1   col2
1      "1042AZ"
2      "9523 pa"
3      "dog"
4      "New York"
5      "20000 (usa)"
6      "Outside the country"
7      "1052"

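I want to keep all values that are:

exactly 4 digits
any combination of exactly 4 digits and two letters, with or without spaces

I currently have this code: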

library(dplyr)
library(stringr)

df$col2 <- gsub('\\s+', '', df$col2)
df$col2 <- toupper(df$col2)
# Blank out all rows that do not start with 4 digits and make the PC4 column
df <- df %>% 
  mutate(col3 = str_extract(col2, "^[0-9]{4,}"), 
         col4 = str_extract(col2, "[A-Z].*$"),
         across(c(col2, col3, col4), ~ ifelse(grepl("^[0-9]{4}", col2), .x, "")))

I want this result:

col1    col2       col3   col4
1       "1042AZ"   1042   "AZ"
2       "9523PA"   9523   "PA"
3       NA         NA     NA
4       NA         NA     NA
5       NA         NA     NA
6       NA         NA     NA
7       "1052"     1052   NA

The problem is that the number in row 5 is still there after running my code.

Answer 1

Building on your code, you can set the values to NA whenever col3 does not have exactly 4 characters:

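library(dplyr)
library(stringr)

df %>% 
  mutate(col2 = gsub('\\s+', '', toupper(col2)),   # strip whitespace, upper-case
         col3 = str_extract(col2, "^[0-9]{4,}"),   # leading run of 4 or more digits
         col4 = str_extract(col2, "[A-Za-z].*$"),  # from the first letter to the end
         # keep a row's values only when the leading digit run is exactly 4 long
         across(c(col2, col3, col4), ~ ifelse(nchar(col3) == 4, .x, NA)))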

  col1   col2 col3 col4
1    1 1042AZ 1042   AZ
2    2 9523PA 9523   PA
3    3   <NA> <NA> <NA>
4    4   <NA> <NA> <NA>
5    5   <NA> <NA> <NA>
6    6   <NA> <NA> <NA>
7    7   1052 1052 <NA>
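
For context on why row 5 survived the original filter: "^[0-9]{4}" only asks whether the string starts with four digits, which "20000(USA)" does, and the greedy "{4,}" then extracts all five leading digits. The base R sketch below illustrates this on the cleaned values; the vector `x` and the stricter pattern in `exactly_4` are illustrative and not part of the original answer.

```r
# Cleaned values as they look after whitespace removal and toupper()
x <- c("1042AZ", "9523PA", "DOG", "NEWYORK",
       "20000(USA)", "OUTSIDETHECOUNTRY", "1052")

# Original test: only checks that the string STARTS with four digits
starts_with_4 <- grepl("^[0-9]{4}", x)

# Greedy {4,} returns every leading digit, so row 5 gives "20000"
leading_digits <- regmatches(x, regexpr("^[0-9]{4,}", x))

# Stricter test (illustrative): four digits followed by a non-digit or the end
exactly_4 <- grepl("^[0-9]{4}([^0-9]|$)", x)

print(data.frame(x, starts_with_4, exactly_4))
print(leading_digits)
```

Checking nchar(col3) == 4, as the answer does, has the same effect as the stricter pattern here, because col3 holds the full leading run of digits.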

Data

df <- read.table(header = TRUE, text = 'col1   col2
1      "1042AZ"
2      "9523 pa"
3      "dog"
4      "New York"
5      "20000 (usa)"
6      "Outside the country"
7      "1052"')
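
Note that the length check on col3 only constrains the digits, so a value such as "1042ABC" would still be kept. If the rule is meant to be strict, one possible alternative is to validate the whole cleaned string with a single anchored pattern. The following is a minimal base R sketch, assuming valid values are exactly four digits optionally followed by exactly two letters, in that order; the names `clean`, `pattern`, and `ok` are made up for illustration.

```r
# Same sample data as above, built without external packages
df <- data.frame(
  col1 = 1:7,
  col2 = c("1042AZ", "9523 pa", "dog", "New York",
           "20000 (usa)", "Outside the country", "1052"),
  stringsAsFactors = FALSE
)

# Normalise: remove all whitespace, then upper-case
clean <- toupper(gsub("\\s+", "", df$col2))

# Assumed rule: exactly 4 digits, optionally followed by exactly 2 letters
pattern <- "^[0-9]{4}([A-Z]{2})?$"
ok <- grepl(pattern, clean)

df$col2 <- ifelse(ok, clean, NA_character_)
df$col3 <- ifelse(ok, substr(clean, 1, 4), NA_character_)
# Letters exist only when the valid string is 6 characters long
df$col4 <- ifelse(ok & nchar(clean) == 6, substr(clean, 5, 6), NA_character_)

print(df)
```

Here col3 stays a character column, as in the output above; wrap it in as.integer() if a numeric column is preferred.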

Tags: string, r, postal-code, na, data-cleaning