dplyr: how to pivot_wider with multiple columns while creating new covariates at the same time?

Question

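I have a proteomics dataset that currently has roughly 60 columns (the patients, plus information such as protein names) and roughly 1800 rows (specific proteins).

I need to reshape it from long to wide format so that each row corresponds to a patient and each column represents a protein. I can do (very) simple pivots, but this example has many columns and also requires some data management, because new covariates need to be created/extracted from the raw proteomics output below. I do not know how to proceed, and I have not found a solution in the many available walkthroughs on reshaping large datasets.

I would prefer dplyr-based input, hints, or solutions.

The raw output from the proteomics software looks like this: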

> head(Heat_BT)
# A tibble: 11 x 6
   protein                                        gene   Intensity_10 Intensity_11 Intensity_MB_1 Intensity_Ref1
   <chr>                                          <chr>  <chr>        <chr>        <chr>          <chr>         
 1 NA                                             NA     Bruschi      Bruschi      Reichl         Reichl        
 2 NA                                             NA     Ctrl         Ctrl         Tumor          Ctrl          
 3 NA                                             NA     Hydro        Hydro        Malignant      Hydro         
 4 NA                                             NA     Ctrl         Ctrl         MB             Ctrl          
 5 von Willebrand factor                          VWF    0.674627721  0.255166769  0.970489979    0.215972215   
 6 Sex hormone-binding globulin                   SHBG   0.516914487  0.476843655  0.88173753     0.306484252   
 7 Glyceraldehyde-3-phosphate dehydrogenase       GAPDH  0.622163594  0.231107563  0.71856463     0.204625234   
 8 Nestin                                         NES    0.868476391  0.547319174  0.832109928    0.440162212   
 9 Heat shock 70 kDa protein 13                   HSPA13 0.484973907  0.435322136  0.539334834    0.28678757    
10 Isocitrate dehydrogenase [NADP], mitochondrial IDH2   1.017596364  0.107395157  0.710225344    0.251976997   
11 Mannan-binding lectin serine protease 1        MASP1  0.491321206  0.434995681  0.812500775    0.403583705

Expected output:

              id     lab malig      diag       VWF      SHBG     GAPDH       NES    HSPA13      IDH2     MASP1
1   Intensity_10 Bruschi  Ctrl     Hydro 0.6746277 0.5169145 0.6221636 0.8684764 0.4849739 1.0175964 0.4913212
2   Intensity_11 Bruschi  Ctrl     Hydro 0.2551668 0.4768437 0.2311076 0.5473192 0.4353221 0.1073952 0.4349957
3 Intensity_MB_1  Reichl Tumor Malignant 0.9704900 0.8817375 0.7185646 0.8321099 0.5393348 0.7102253 0.8125008
4 Intensity_Ref1  Reichl  Ctrl     Hydro 0.2159722 0.3064843 0.2046252 0.4401622 0.2867876 0.2519770 0.4035837

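The proteomics software automatically prints the first four rows, which hold the categories each patient belongs to.

Based on the first four rows:

Four new covariates must be added in the wide format: (1) Heat_BT$id, the study name of each patient; (2) Heat_BT$lab, the lab that produced the data; (3) Heat_BT$malig, whether the patient is a control case or a tumor case; and finally (4) Heat_BT$diag, the underlying diagnosis.

Data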

Heat_BT <- structure(list(protein = c(NA, NA, NA, NA, "von Willebrand factor", 
                           "Sex hormone-binding globulin", "Glyceraldehyde-3-phosphate dehydrogenase", 
                           "Nestin", "Heat shock 70 kDa protein 13", "Isocitrate dehydrogenase [NADP], mitochondrial", 
                           "Mannan-binding lectin serine protease 1"), gene = c(NA, NA, 
                                                                                NA, NA, "VWF", "SHBG", "GAPDH", "NES", "HSPA13", "IDH2", "MASP1"
                           ), Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721", 
                                               "0.516914487", "0.622163594", "0.868476391", "0.484973907", "1.017596364", 
                                               "0.491321206"), Intensity_11 = c("Bruschi", "Ctrl", "Hydro", 
                                                                                "Ctrl", "0.255166769", "0.476843655", "0.231107563", "0.547319174", 
                                                                                "0.435322136", "0.107395157", "0.434995681"), Intensity_MB_1 = c("Reichl", 
                                                                                                                                                 "Tumor", "Malignant", "MB", "0.970489979", "0.88173753", "0.71856463", 
                                                                                                                                                 "0.832109928", "0.539334834", "0.710225344", "0.812500775"), 
               Intensity_Ref1 = c("Reichl", "Ctrl", "Hydro", "Ctrl", "0.215972215", 
                                  "0.306484252", "0.204625234", "0.440162212", "0.28678757", 
                                  "0.251976997", "0.403583705")), row.names = c(NA, -11L), class = c("tbl_df", 
                                                                                                     "tbl", "data.frame"))

Answer 1

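You can do this:

# Label the three covariate rows in the gene column
Heat_BT$gene[1:3] <- c('lab', 'malig', 'diag')
# Transpose everything except the protein column; the gene column supplies the new column names
data.table::transpose(Heat_BT[,-1],keep.names = 'gene',make.names = 'gene')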

            gene     lab malig      diag   NA         VWF        SHBG       GAPDH         NES      HSPA13        IDH2       MASP1
1   Intensity_10 Bruschi  Ctrl     Hydro Ctrl 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2   Intensity_11 Bruschi  Ctrl     Hydro Ctrl 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1  Reichl Tumor Malignant   MB 0.970489979  0.88173753  0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1  Reichl  Ctrl     Hydro Ctrl 0.215972215 0.306484252 0.204625234 0.440162212  0.28678757 0.251976997 0.403583705
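
The transposed result above still carries a column from the fourth, unlabeled header row, and every intensity is still a character string. A minimal sketch of one way to tidy this up, assuming the fourth header row is not needed (the expected output in the question has no column for it). The object `raw` and its two-protein toy values are stand-ins for illustration, not the full dataset:

```r
library(dplyr)

# Toy stand-in for Heat_BT without the protein column (illustrative values only)
raw <- data.frame(
  gene = c(NA, NA, NA, NA, "VWF", "SHBG"),
  Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721", "0.516914487"),
  Intensity_MB_1 = c("Reichl", "Tumor", "Malignant", "MB", "0.970489979", "0.88173753"),
  stringsAsFactors = FALSE
)

# Label the three covariate rows, then drop the unlabeled fourth header row
raw$gene[1:3] <- c("lab", "malig", "diag")
raw <- raw[-4, ]

# Transpose: old column names go into `id`, the gene column becomes the header
wide <- data.table::transpose(raw, keep.names = "id", make.names = "gene") %>%
  mutate(across(-c(id, lab, malig, diag), as.numeric))  # intensities to numeric
```

Dropping the row before transposing avoids having to remove an oddly named column afterwards.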

Answer 2

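Here is a dplyr solution for you. It takes two steps, because you first need to gather the intensity variables.

library(dplyr)
library(tidyr)  # pivot_longer() and pivot_wider() come from tidyr

# Drop the rows that contain NA (the four header rows)
Heat_BT <- Heat_BT %>% na.exclude()

Heat_BT[,-1] %>% pivot_longer(
        cols = Intensity_10:Intensity_Ref1,
        names_to = "id"
) %>% pivot_wider(
        names_from = gene
) %>% mutate(
        across(.cols = -"id", as.numeric)
)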

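This gives the following output (the printout shows the protein columns as <chr>; with the as.numeric step applied they are <dbl>):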

# A tibble: 4 x 8
  id             VWF         SHBG        GAPDH       NES         HSPA13      IDH2        MASP1      
  <chr>          <chr>       <chr>       <chr>       <chr>       <chr>       <chr>       <chr>      
1 Intensity_10   0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11   0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 0.970489979 0.88173753  0.71856463  0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757  0.251976997 0.403583705

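I can't see the connection between the variables you want to add and the data, so I assume that once you are able to pivot your data correctly, you will be able to fill in the rest.

I'm happy to revise my answer if you can explain more clearly how these variables are related.

Best

Edit: Note that I removed the first four rows from the data, because I did not immediately see how they connect to the variables you want to add.

Edit 2: I assume the first 3 rows are the covariates you want to add, so that the first three rows are lab, malig, and diag, respectively.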

# Extract the relevant information from the data.
# Run this on the original Heat_BT, before the header rows are removed.
id_cols <- bind_cols(
        var = c("lab", "malig", "diag"),
        Heat_BT[1:3,-c(1,2)] 
) %>% pivot_longer(
        cols = Intensity_10:Intensity_Ref1, names_to = "id"
) %>% pivot_wider(
        names_from = var
)

# Remove these identifiers;
Heat_BT <- Heat_BT %>% na.exclude() 

# Pivot the table;
pivoted_table <- Heat_BT[,-1] %>% pivot_longer(
        cols = Intensity_10:Intensity_Ref1,names_to = "id"
) %>% pivot_wider(
        names_from = gene
) %>% mutate(
        across(.cols = -"id", as.numeric)
)

# Join with the ID columns
left_join(
        id_cols,
        pivoted_table,
        by = "id"
)

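This gives the output (again, the printout shows the protein columns as <chr>; with the as.numeric step applied they are <dbl>):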

# A tibble: 4 x 11
  id             lab     malig diag      VWF         SHBG        GAPDH       NES         HSPA13      IDH2        MASP1      
  <chr>          <chr>   <chr> <chr>     <chr>       <chr>       <chr>       <chr>       <chr>       <chr>       <chr>      
1 Intensity_10   Bruschi Ctrl  Hydro     0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11   Bruschi Ctrl  Hydro     0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 Reichl  Tumor Malignant 0.970489979 0.88173753  0.71856463  0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 Reichl  Ctrl  Hydro     0.215972215 0.306484252 0.204625234 0.440162212 0.28678757  0.251976997 0.403583705

This will work for the data you have, regardless of its size. You can make the approach more robust by replacing cols = Intensity_10:Intensity_Ref1 with cols = contains("intensity").
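
A small sketch of that selection on toy data (the tibble `toy` and its values are made up for illustration). `contains()` matches case-insensitively by default, so the lowercase pattern still picks up the `Intensity_*` columns:

```r
library(dplyr)
library(tidyr)

# Toy table with two intensity columns (illustrative values only)
toy <- tibble(
  gene = c("VWF", "SHBG"),
  Intensity_10 = c("0.67", "0.52"),
  Intensity_Ref1 = c("0.22", "0.31")
)

# Select the columns to gather by name pattern instead of by position
long <- toy %>%
  pivot_longer(cols = contains("intensity"), names_to = "id")
```

Selecting by pattern keeps working if the software adds or reorders patient columns, as long as their names still contain "Intensity".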

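Edit 3

You have more variables than the ones provided here, and those are not picked up by the column range used in the pivot above.

We can therefore take a more robust approach: assuming all variables not shown here are similar to the ones provided, change the cols argument accordingly.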

# Extract the relevant information from the data.
# Run this on the original Heat_BT, before the header rows are removed.
id_cols <- bind_cols(
        var = c("lab", "malig", "diag"),
        Heat_BT[1:3,-c(1,2)] 
) %>% pivot_longer(
        cols = -"var", names_to = "id"
) %>% pivot_wider(
        names_from = var
)

# Remove these identifiers;
Heat_BT <- Heat_BT[-(1:4),]

# Pivot the table;
pivoted_table <- Heat_BT[,-1] %>% pivot_longer(
        cols = -"gene",
        names_to = "id"
) %>% pivot_wider(
        names_from = gene
) %>% mutate(
        across(.cols = -"id", as.numeric)
)

# Join with the ID columns
left_join(
        id_cols,
        pivoted_table,
        by = "id"
)

This gives the same output as above.
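
The extract, pivot, and join steps can also be folded into a single pipeline by labeling the header rows first, so that covariates and proteins are pivoted together. This is only a sketch under assumptions: the toy tibble `raw`, the vector `header_names`, and the placeholder label "extra" for the fourth header row are all made up for illustration, and the fourth row is assumed to be disposable.

```r
library(dplyr)
library(tidyr)

# Toy stand-in with the same layout as Heat_BT (illustrative values only)
raw <- tibble(
  protein = c(NA, NA, NA, NA, "von Willebrand factor", "Sex hormone-binding globulin"),
  gene = c(NA, NA, NA, NA, "VWF", "SHBG"),
  Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721", "0.516914487"),
  Intensity_MB_1 = c("Reichl", "Tumor", "Malignant", "MB", "0.970489979", "0.88173753")
)

# Hypothetical labels for the four header rows; "extra" marks the row to discard
header_names <- c("lab", "malig", "diag", "extra")

wide <- raw %>%
  select(-protein) %>%
  # Give the header rows a name so they can be pivoted like the proteins
  mutate(gene = if_else(row_number() <= 4, header_names[row_number()], gene)) %>%
  pivot_longer(cols = -gene, names_to = "id") %>%
  pivot_wider(names_from = gene, values_from = value) %>%
  select(-extra) %>%
  # Everything except the id and covariate columns is an intensity
  mutate(across(-c(id, lab, malig, diag), as.numeric))
```

The trade-off is that the covariates travel through the pivot as character values alongside the intensities, which is why the numeric conversion has to come last.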

pivot

r

dataframe

dplyr