R subset data.frame by column names using partial string match from another list

Question

I have a data frame (called "myfile") that looks like this:

      P3170.Tp2  P3189.Tn10 C453.Tn7 F678.Tc23 P3170.Tn10
gene1 0.3035130  0.5909081 0.8918271 0.2623648 0.13392672
gene2 0.2542919  0.5797730 0.4226669 0.9091961 0.96056308
gene3 0.9923911  0.4318736 0.7020107 0.1936181 0.58723105
gene4 0.4113318  0.1239206 0.4091794 0.8196982 0.54791214
gene5 0.4095719  0.6392045 0.4416208 0.8853356 0.01008299

I have a list of interesting strings (called "interesting.list") as follows:

interesting.list <- c("P3170", "C453")

I would like to use this interesting.list to subset myfile by partial string match against the column headers.

# %like% is not base R; it is provided by packages such as data.table
ss.file <- NULL
for (i in 1:length(interesting.list)){
    ss.file[[i]] <- myfile[,colnames(myfile) %like% interesting.list[[i]]]
}

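However, this loop does not keep the column headers in its result. Since I have a huge dataset (more than 30,000 rows), setting the colnames manually would be difficult. Is there a better way to do this?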

Answer 1

# Specify `interesting.list` items manually
df[,grep("P3170|C453", x=names(df))]
#>   P3170.Tp2 C453.Tn7 P3170.Tn10
#> 1         1        3          5

# Use paste to create pattern from lots of items in `interesting.list`
il <- c("P3170", "C453")
df[,grep(paste(il, collapse = "|"), x=names(df))]
#>   P3170.Tp2 C453.Tn7 P3170.Tn10
#> 1         1        3          5

Example data:

n <- c("P3170.Tp2" , "P3189.Tn10" ,"C453.Tn7" ,"F678.Tc23" ,"P3170.Tn10")
df <- data.frame(1,2,3,4,5)
names(df) <- n
Created on 2021-10-20 by the reprex package (v2.0.1)
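
To connect this back to the question, here is a minimal sketch of the same paste/grep idea applied to a data frame shaped like myfile. The values are made up and only two genes are included; the use of grepl() and drop = FALSE is an assumption on my part rather than part of the answer above. drop = FALSE is worth considering because selecting a single column with `[` otherwise returns a plain vector, and the column header is lost.

```r
# Small stand-in for the question's data frame (values are illustrative only)
myfile <- data.frame(
  P3170.Tp2  = c(0.30, 0.25),
  P3189.Tn10 = c(0.59, 0.58),
  C453.Tn7   = c(0.89, 0.42),
  F678.Tc23  = c(0.26, 0.91),
  P3170.Tn10 = c(0.13, 0.96),
  row.names  = c("gene1", "gene2")
)
interesting.list <- c("P3170", "C453")

# Build one regular expression that matches any of the items: "P3170|C453"
pattern <- paste(interesting.list, collapse = "|")

# grepl() gives a logical vector over the column names;
# drop = FALSE keeps a data frame (and its headers) even for a single match
ss.file <- myfile[, grepl(pattern, colnames(myfile)), drop = FALSE]

colnames(ss.file)
#> [1] "P3170.Tp2"  "C453.Tn7"   "P3170.Tn10"
```

The selected columns come back in their original order in the data frame, not in the order of interesting.list.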

Answer 2

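Beyond the question itself, there are several things you need to consider: what if an item in interesting.list returns more than one match, what if no match is found, and so on.

Based on your data, here is one approach: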

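nms <- colnames(myfile)

matchIdx <- unlist(lapply(interesting.list, function(pattern) {
  # fixed = TRUE treats the pattern as a literal string, not a regular expression
  matches <- which(grepl(pattern, nms, fixed = TRUE))

  # If more than one match is found, only return the first;
  # if nothing matches, `matches` is empty and unlist() drops it
  if (length(matches) > 1) matches[1] else matches
}))

# drop = FALSE keeps the result a data frame even if only one column is selected
myfile[, matchIdx, drop = FALSE]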
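
If you would rather keep every match and get one subset per pattern, as the loop in the question tries to do, something along these lines may work. This is only a sketch under assumptions: the data values are made up, "X999" is a hypothetical item added to show the no-match case, and ss.list is a name introduced here for illustration.

```r
# Small stand-in for the question's data frame (values are illustrative only)
myfile <- data.frame(
  P3170.Tp2  = c(0.30, 0.25),
  P3189.Tn10 = c(0.59, 0.58),
  C453.Tn7   = c(0.89, 0.42),
  F678.Tc23  = c(0.26, 0.91),
  P3170.Tn10 = c(0.13, 0.96),
  row.names  = c("gene1", "gene2")
)

# "X999" is a hypothetical pattern that matches no column
interesting.list <- c("P3170", "C453", "X999")

# One data frame per pattern; drop = FALSE preserves the column headers
# for single matches and yields a zero-column data frame for no match
ss.list <- lapply(interesting.list, function(pattern) {
  hits <- grepl(pattern, colnames(myfile), fixed = TRUE)
  myfile[, hits, drop = FALSE]
})
names(ss.list) <- interesting.list

# Number of matched columns per pattern
sapply(ss.list, ncol)
#> P3170  C453  X999 
#>     2     1     0
```

Naming the list elements after the patterns makes it easy to pull out one subset later, for example ss.list[["C453"]], and the zero-column entries show which patterns found nothing.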
