Subset column names with specific string

Question

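I am trying to subset a data frame based on column names that start with a particular string. I have some columns named like ABC_1, ABC_2, ABC_3 and some like ABC_XYZ_1, ABC_XYZ_2, ABC_XYZ_3.

How can I subset my data frame so that it contains only the columns ABC_1, ABC_2, ABC_3, ..., ABC_n and not ABC_XYZ_1, ABC_XYZ_2, ...?

I have tried this: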

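set.seed(1)
df <- data.frame( ABC_1 = sample(0:1, 3, replace = TRUE),
                  ABC_2 = sample(0:1, 3, replace = TRUE),
                  ABC_XYZ_1 = sample(0:1, 3, replace = TRUE),
                  ABC_XYZ_2 = sample(0:1, 3, replace = TRUE) )

# Keeps every column whose name contains "ABC", including the ABC_XYZ_* ones
df1 <- df[ , grepl( "ABC" , names( df ) ) ]

# Rows in which at least one of the selected columns is greater than 0
ind <- apply( df1 , 1 , function(x) any( x > 0 ) )

df1[ ind , ]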

But this returns both sets of columns, ABC_1 ... ABC_n and ABC_XYZ_1 ... ABC_XYZ_n. I am not interested in the ABC_XYZ_1 columns, only in the ABC_1, ... columns. Any suggestions are much appreciated.

Answer 1

To match "ABC_" followed by one or more digits (i.e. \d+ or [0-9]+), you can use

# Inside an R string literal the backslash must be doubled: "\\d" is the regex \d
df1 <- df[ , grepl("ABC_\\d+", names( df ), perl = TRUE ) ]
# df1 <- df[ , grepl("ABC_[0-9]+", names( df ), perl = TRUE ) ] # another option

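To require that the column name starts with "ABC_", add ^ to the regular expression, so that "ABC_\d+" matches only at the beginning of the string rather than anywhere inside it.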

df1 <- df[ , grepl("^ABC_\\d+", names( df ), perl = TRUE ) ]
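
To make the effect of the anchor concrete, here is a small sketch comparing the patterns on a plain character vector. The names in nms are made up for illustration (only ABC_1, ABC_2 and ABC_XYZ_1 come from the question), and the trailing $ in the last pattern is an extra assumption that the whole name should be "ABC_" plus digits.

```r
# Hypothetical column names, chosen only to illustrate the patterns
nms <- c("ABC_1", "ABC_2", "ABC_XYZ_1", "X_ABC_3", "ABC_4b")

# "ABC_<digits>" anywhere in the name
unanchored <- grepl("ABC_\\d+", nms, perl = TRUE)
# the name must start with "ABC_<digits>"
anchored <- grepl("^ABC_\\d+", nms, perl = TRUE)
# the whole name must be "ABC_<digits>" (the $ is an addition, not in the answer)
exact <- grepl("^ABC_\\d+$", nms, perl = TRUE)

print(nms[unanchored])
print(nms[anchored])
print(nms[exact])
```

Which variant fits depends on what other column names can occur in the data; for the data frame in the question, all three select the same two columns.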

If dplyr is more to your liking, you could try

library(dplyr)
select(df, matches("^ABC_\\d+"))
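
For completeness, a minimal base R sketch of the whole task from the question: select the columns, then keep the rows with at least one value above 0. The data values here are fixed and made up so the result is reproducible. The drop = FALSE argument and the use of rowSums instead of apply are choices made for this sketch, not part of the answer above.

```r
# Small data frame with fixed (made-up) values so the result is reproducible
df <- data.frame(ABC_1 = c(0, 1, 0),
                 ABC_2 = c(0, 0, 1),
                 ABC_XYZ_1 = c(1, 1, 1),
                 ABC_XYZ_2 = c(0, 0, 0))

# Select only columns named "ABC_" followed by digits
keep <- grepl("^ABC_\\d+$", names(df), perl = TRUE)
# drop = FALSE keeps a data frame even if only one column matches
df1 <- df[, keep, drop = FALSE]

# Keep rows in which at least one selected column is greater than 0
ind <- rowSums(df1 > 0) > 0
res <- df1[ind, , drop = FALSE]
print(res)
```

Without drop = FALSE, a single matching column would be returned as a plain vector, and the row filter would then fail.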

Answer 2

Another simple solution is to use substr:

df1 <- df[,substr(names(df),5,7) != 'XYZ']
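
Note that the substr approach depends on "XYZ" sitting at exactly characters 5 to 7, and it keeps any column that lacks "XYZ" there, whether or not the name starts with "ABC_". The sketch below illustrates this on a few names; "id" and "ABCD_XYZ_1" are hypothetical and do not appear in the question.

```r
# Hypothetical names; only ABC_1 and ABC_XYZ_1 come from the question
nms <- c("ABC_1", "ABC_XYZ_1", "id", "ABCD_XYZ_1")

# Characters 5 to 7 of each name (empty string if the name is too short)
sub <- substr(nms, 5, 7)
keep_substr <- sub != "XYZ"

# Regex-based selection for comparison
keep_regex <- grepl("^ABC_\\d+$", nms, perl = TRUE)

print(data.frame(name = nms, chars_5_7 = sub, keep_substr, keep_regex))
```

When every column follows the ABC_... naming scheme, as in the question, both approaches select the same columns.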
