How to apply gsub or similar to change column names, but only if the column name contains a specific word

I have a proteomics dataset that was exported with absurdly long, automatically generated column names.

  PG.BiologicalProcess PG.MolecularFunction X.1..20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity X.2..20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity
1                  NaN                  NaN                                                             642500.0                                                             174625.3
2                  NaN                  NaN                                                             790875.8                                                             910906.9
  X.3..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity X.4..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity
1                                                                       866325                                                                300197.3
2 

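If a column name contains CSF, gsub or similar should extract whatever sits between CSF- and the first . or _. The extracted numbers and letters should stay separated by -.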

So X.1..20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity becomes 01-R (it is important that it is 01 and not just 1).

X.4..20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity becomes 01-Sr.

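I tried different approaches, for example gsub(".*CSF|\\s.*", ".", a), but none of them solved the problem. All columns that do not contain the word CSF should remain unchanged.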

Expected output

  PG.BiologicalProcess PG.MolecularFunction     01-R     02-R  01-30R    01-Sr
1                  NaN                  NaN 642500.0 174625.3  866325 300197.3
2                  NaN                  NaN 790875.8 910906.9 2164413 682274.3

Sample data

a <- structure(list(PG.BiologicalProcess = c(NaN, NaN), PG.MolecularFunction = c(NaN, 
NaN), `[1] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity` = c(642500, 
790875.75), `[2] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity` = c(174625.3281, 
910906.875), `[3] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity` = c(866325, 
2164413), `[4] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity` = c(300197.3125, 
682274.3125)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", 
"data.frame"))

You can use

colnames(a) <- sub(".*CSF-([^._]*).*", "\\1", colnames(a))

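See the regex demo. Details:

  • .* - any zero or more characters, as many as possible
  • CSF- - the literal text CSF-
  • ([^._]*) - capturing group 1 (\\1 in the replacement pattern refers to this group's value): any zero or more characters other than . and _
  • .* - the rest of the string.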
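To see how the pattern pieces above behave together, here is a minimal sketch that applies the same sub() call to a plain character vector holding the column names from the question's sample data. The variable names nms, new_nms, has_csf and nms2 are made up for this illustration. Because sub() returns its input untouched when the pattern does not match, names without CSF- pass through as they are; the second variant with grepl() is an optional, more explicit way to state that condition and should give the same result here.

```r
# Column names copied from the question's sample data
nms <- c(
  "PG.BiologicalProcess",
  "PG.MolecularFunction",
  "[1] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-01-R_49.raw.PG.MS1Quantity",
  "[2] 20210427_EXPL2_Evo1_CML_DIA44min_CSF-02-R_50.raw.PG.MS1Quantity",
  "[3] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-30R_109.raw.PG.MS1Quantity",
  "[4] 20210608_EXPL2_Evo3_CML_DIA44m_60k15k_CSF-01-Sr.raw.PG.MS1Quantity"
)

# Same call as in the answer: keep only what follows "CSF-" up to the first "." or "_".
# Names that do not match the pattern are returned unchanged by sub().
new_nms <- sub(".*CSF-([^._]*).*", "\\1", nms)
print(new_nms)

# Optional, more explicit variant: rename only the names that contain "CSF-"
has_csf <- grepl("CSF-", nms, fixed = TRUE)
nms2 <- nms
nms2[has_csf] <- sub(".*CSF-([^._]*).*", "\\1", nms[has_csf])
print(identical(nms2, new_nms))
```

On the real data frame, the same vector would come from colnames(a), and the result would be assigned back with colnames(a) <- new_nms.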