Renaming multiple columns using regexp

Question:

I want to rename a large number of columns by replacing certain recurring strings.

Reprex:

library(dplyr)
library(stringr)

code <- c(round(runif(26, 0, 100),0))
names <- letters
AIYN <- stringi::stri_rand_strings(26, 2)
SIYN <- stringi::stri_rand_strings(26, 2)


df <- bind_cols(code, names, AIYN, SIYN)
colnames(df) <- c("code (2021)", "names (2021)", "all the info you need (AIYN) from A to Z", 
                  "some info you need (SIYN) from A to Z")
View(df)

Attempted solution

colnames(df) <- str_replace_all(colnames(df), "[(2021)]", "")
colnames(df) <- str_replace_all(colnames(df), "all the info you need (AIYN) from A to Z", "AIYN")
colnames(df) <- str_replace_all(colnames(df), "some info you need (SIYN) from A to Z", "SIYN")

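Goal

I want to remove the parentheses that contain digits (e.g. "(2019)") and keep the characters inside parentheses that contain only letters (e.g. "(AIYN)", "(SIYN)"). My solution is verbose, since my data frame has over a hundred columns.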

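To remove the parentheses that contain digits, use any one of the following:

stringr::str_replace_all(colnames(df), "\\s*\\(\\d+\\)", "")
stringr::str_remove_all(colnames(df), "\\s*\\(\\d+\\)")
gsub("\\s*\\(\\d+\\)", "", colnames(df))

If the number inside the parentheses must consist of exactly 4 digits, replace \\d+ with \\d{4}.

Wrap the code above in trimws(...) to strip any leading/trailing whitespace.

See the regex demo.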
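As a minimal sketch of the two notes above, the following applies the \\d{4} variant and trimws() to a few sample names. The vector x is made up for illustration and is not the question's data frame; it includes a leading "(2021)" and a three-digit "(123)" to show both effects.

```r
# Hypothetical column names (assumption, not from the question's data)
x <- c("code (2021)", "(2021) names", "id (123)",
       "all the info you need (AIYN) from A to Z")

# \\d{4} matches exactly four digits, so "(123)" is left alone
removed <- gsub("\\s*\\(\\d{4}\\)", "", x)

# Removing a leading "(2021)" leaves a space behind; trimws() strips it
cleaned <- trimws(removed)
cleaned
```

Here "(2021) names" becomes " names" after the gsub call and "names" after trimws, while "id (123)" is unchanged because it has only three digits.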

To keep the first letters-only value inside parentheses, use

stringr::str_extract(colnames(df), '(?<=\\()[A-Za-z]+(?=\\))') # ASCII only
stringr::str_extract(colnames(df), '(?<=\\()\\p{L}+(?=\\))')   # Any Unicode
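To illustrate the difference between the two character classes, here is a small sketch on made-up names (the second name, with non-ASCII letters written as \u escapes, is an assumption added for the comparison and does not come from the question).

```r
library(stringr)

# Hypothetical names; "\u00c4\u00d6" is the two-letter string "ÄÖ"
x <- c("all the info you need (AIYN) from A to Z",
       "Gr\u00f6\u00dfe (\u00c4\u00d6) in cm",
       "code (2021)")

# ASCII letters only: the non-ASCII abbreviation is not matched
ascii_only <- str_extract(x, "(?<=\\()[A-Za-z]+(?=\\))")

# Any Unicode letter: the non-ASCII abbreviation is matched as well
any_letter <- str_extract(x, "(?<=\\()\\p{L}+(?=\\))")
```

Both calls return NA for "code (2021)", since the parentheses there contain digits rather than letters.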

Combining the two:

colnames(df) <- coalesce(str_extract(colnames(df), '(?<=\\()[A-Za-z]+(?=\\))'), str_replace_all(colnames(df), "\\s*\\(\\d+\\)", ""))

R test

library(dplyr)
library(stringr)

x <-  c("code (2021)", "names (2021)", "all the info you need (AIYN) from A to Z", 
        "some info you need (SIYN) from A to Z")

z <- str_replace_all(x, "\\s*\\(\\d+\\)", "")
# => [1] "code" "names" "all the info you need (AIYN) from A to Z"
#    [4] "some info you need (SIYN) from A to Z"
y <- str_extract(z, '(?<=\\()[A-Za-z]+(?=\\))')
# => [1] NA     NA     "AIYN" "SIYN"
coalesce(y, z)
# => "code"  "names" "AIYN"  "SIYN" 

You can try:

library(magrittr)

names(df) <- sub('\\s\\(\\d+\\)', '', names(df)) %>%
                sub('.*\\(([A-Z]+)\\).*', '\\1', .)
names(df)
#[1] "code"  "names" "AIYN"  "SIYN" 

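The first sub removes the parenthesized number together with the whitespace character before it.

The second sub extracts the run of [A-Z] characters inside the parentheses and replaces the whole name with it.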
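The two steps can be traced separately, as in the sketch below. The first four names are the ones from the question; the last one, "mixed case (Abc) label", is a hypothetical name added to show that [A-Z]+ only matches all-uppercase content.

```r
x <- c("code (2021)", "names (2021)",
       "all the info you need (AIYN) from A to Z",
       "some info you need (SIYN) from A to Z",
       "mixed case (Abc) label")  # hypothetical, not from the question

# Step 1: drop one whitespace character plus the parenthesized digits
step1 <- sub("\\s\\(\\d+\\)", "", x)

# Step 2: where an all-uppercase group in parentheses exists,
# replace the whole name with that group (\\1); otherwise leave it as is
step2 <- sub(".*\\(([A-Z]+)\\).*", "\\1", step1)
step2
```

Names without a match pass through sub unchanged, which is why "code" and "names" survive the second step. If lowercase or mixed-case abbreviations should also be extracted, [A-Za-z]+ would be the class to use instead.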


To use this with dplyr and pipes:

library(dplyr)
df %>% 
    rename_with(~sub('\\s\\(\\d+\\)', '', .) %>% 
                 sub('.*\\(([A-Z]+)\\).*', '\\1', .))

#    code names AIYN  SIYN 
#   <dbl> <chr> <chr> <chr>
# 1     1 a     1A    NR   
# 2    96 b     Dq    hi   
# 3    46 c     28    AQ   
# 4    78 d     Y8    xH   
# 5    76 e     ps    ES   
# 6    56 f     m5    gQ   
# 7    51 g     vV    8u   
# 8    72 h     Hw    JV   
# 9    24 i     0T    7A   
#10    76 j     mq    Qy   
# … with 16 more rows