dplyr: how to pivot_wider with multiple columns while creating new covariates at the same time?
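I have a proteomics dataset that currently has about 60 columns (patients, plus information such as protein names) and about 1800 rows (specific proteins).
I need to convert it from long to wide format so that each row corresponds to a patient and every column represents a protein. I can do (very) simple transformations, but this example has many columns and, in addition, needs some data management, because new covariates have to be created/extracted from the raw proteomics output below. I simply do not know how to proceed, and I have not found a solution in the many available walkthroughs on transforming large datasets.
I would prefer dplyr input, hints, or solutions.
The raw output from the proteomics software looks like this: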
> head(Heat_BT)
# A tibble: 11 x 6
protein gene Intensity_10 Intensity_11 Intensity_MB_1 Intensity_Ref1
<chr> <chr> <chr> <chr> <chr> <chr>
1 NA NA Bruschi Bruschi Reichl Reichl
2 NA NA Ctrl Ctrl Tumor Ctrl
3 NA NA Hydro Hydro Malignant Hydro
4 NA NA Ctrl Ctrl MB Ctrl
5 von Willebrand factor VWF 0.674627721 0.255166769 0.970489979 0.215972215
6 Sex hormone-binding globulin SHBG 0.516914487 0.476843655 0.88173753 0.306484252
7 Glyceraldehyde-3-phosphate dehydrogenase GAPDH 0.622163594 0.231107563 0.71856463 0.204625234
8 Nestin NES 0.868476391 0.547319174 0.832109928 0.440162212
9 Heat shock 70 kDa protein 13 HSPA13 0.484973907 0.435322136 0.539334834 0.28678757
10 Isocitrate dehydrogenase [NADP], mitochondrial IDH2 1.017596364 0.107395157 0.710225344 0.251976997
11 Mannan-binding lectin serine protease 1 MASP1 0.491321206 0.434995681 0.812500775 0.403583705
Expected output:
id lab malig diag VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
1 Intensity_10 Bruschi Ctrl Hydro 0.6746277 0.5169145 0.6221636 0.8684764 0.4849739 1.0175964 0.4913212
2 Intensity_11 Bruschi Ctrl Hydro 0.2551668 0.4768437 0.2311076 0.5473192 0.4353221 0.1073952 0.4349957
3 Intensity_MB_1 Reichl Tumor Malignant 0.9704900 0.8817375 0.7185646 0.8321099 0.5393348 0.7102253 0.8125008
4 Intensity_Ref1 Reichl Ctrl Hydro 0.2159722 0.3064843 0.2046252 0.4401622 0.2867876 0.2519770 0.4035837
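- The proteomics software automatically prints the first four rows as the categories each patient belongs to.
Based on the first four rows:
- Four new covariates must be added to the wide format: (1) Heat_BT$id, the study name of each patient; (2) Heat_BT$lab, the laboratory that produced the data; (3) Heat_BT$malig, whether the patient is a control or a tumor case; and (4) Heat_BT$diag, the underlying diagnosis.
Data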
Heat_BT <- structure(list(protein = c(NA, NA, NA, NA, "von Willebrand factor",
"Sex hormone-binding globulin", "Glyceraldehyde-3-phosphate dehydrogenase",
"Nestin", "Heat shock 70 kDa protein 13", "Isocitrate dehydrogenase [NADP], mitochondrial",
"Mannan-binding lectin serine protease 1"), gene = c(NA, NA,
NA, NA, "VWF", "SHBG", "GAPDH", "NES", "HSPA13", "IDH2", "MASP1"
), Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721",
"0.516914487", "0.622163594", "0.868476391", "0.484973907", "1.017596364",
"0.491321206"), Intensity_11 = c("Bruschi", "Ctrl", "Hydro",
"Ctrl", "0.255166769", "0.476843655", "0.231107563", "0.547319174",
"0.435322136", "0.107395157", "0.434995681"), Intensity_MB_1 = c("Reichl",
"Tumor", "Malignant", "MB", "0.970489979", "0.88173753", "0.71856463",
"0.832109928", "0.539334834", "0.710225344", "0.812500775"),
Intensity_Ref1 = c("Reichl", "Ctrl", "Hydro", "Ctrl", "0.215972215",
"0.306484252", "0.204625234", "0.440162212", "0.28678757",
"0.251976997", "0.403583705")), row.names = c(NA, -11L), class = c("tbl_df",
"tbl", "data.frame"))
You can do this:
# Label the first three annotation rows; `$gene[1:3]` works for both tibbles and data.frames
Heat_BT$gene[1:3] <- c('lab', 'malig', 'diag')
data.table::transpose(Heat_BT[,-1],keep.names = 'gene',make.names = TRUE)
gene lab malig diag NA VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
1 Intensity_10 Bruschi Ctrl Hydro Ctrl 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11 Bruschi Ctrl Hydro Ctrl 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 Reichl Tumor Malignant MB 0.970489979 0.88173753 0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 Reichl Ctrl Hydro Ctrl 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757 0.251976997 0.403583705
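The transposed result above still holds the intensities as character strings and has a column named NA for the unlabeled fourth row. A minimal sketch of one way to tidy that up, assuming a cut-down copy of the question's data (two proteins, two samples) and a hypothetical name, `subtype`, for the fourth annotation row:

```r
library(data.table)

# Cut-down copy of the question's data (an assumption for illustration).
Heat_BT <- data.frame(
  gene = c(NA, NA, NA, NA, "VWF", "SHBG"),
  Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721", "0.516914487"),
  Intensity_MB_1 = c("Reichl", "Tumor", "Malignant", "MB", "0.970489979", "0.88173753"),
  stringsAsFactors = FALSE
)

# Label all four annotation rows; "subtype" is a hypothetical name for row 4.
Heat_BT$gene[1:4] <- c("lab", "malig", "diag", "subtype")

# Transpose: the old column names go into `id`, the `gene` values become column names.
wide <- as.data.frame(
  data.table::transpose(Heat_BT, keep.names = "id", make.names = "gene")
)

# Convert the protein columns from character to numeric.
protein_cols <- c("VWF", "SHBG")
wide[protein_cols] <- lapply(wide[protein_cols], as.numeric)
str(wide)
```

With the full dataset, `protein_cols` could be taken as every column that is not `id` or one of the annotation names.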
Here is a dplyr solution for you. It takes two steps, because you first need to gather the intensity variables.
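library(dplyr)
library(tidyr)

# Drop the four annotation rows (they have NA in protein and gene)
Heat_BT <- Heat_BT %>% na.exclude()
# Drop the protein column, gather the intensity columns, then spread by gene
Heat_BT[,-1] %>% pivot_longer(
  cols = Intensity_10:Intensity_Ref1,
  names_to = "id"
) %>% pivot_wider(
  names_from = gene
) %>% mutate(
  across(.cols = -"id", as.numeric)
)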
This gives the following output:
# A tibble: 4 x 8
id VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Intensity_10 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 0.970489979 0.88173753 0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757 0.251976997 0.403583705
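I can't see the connection between the variables you want to add and the data, so I assume that once you are able to pivot your data correctly, you will be able to fill in the rest.
I'm happy to revise my answer if you can explain more clearly how these variables are related.
Best
Edit: Note that I removed the first four rows from the data, because I did not immediately see the connection between the variables you want to add.
Edit 2: I assume the first 3 rows are the covariates you want to add, so that the first three rows are lab, malig, and diag, respectively.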
# Start from the original Heat_BT (all 11 rows).
# Extract the relevant information (rows 1-3: lab, malig, diag)
# from the data.
id_cols <- bind_cols(
  var = c("lab", "malig", "diag"),
  Heat_BT[1:3,-c(1,2)]
) %>% pivot_longer(
  cols = Intensity_10:Intensity_Ref1, names_to = "id"
) %>% pivot_wider(
  names_from = var
)
# Remove these identifiers;
Heat_BT <- Heat_BT %>% na.exclude()
# Pivot the table;
pivoted_table <- Heat_BT[,-1] %>% pivot_longer(
  cols = Intensity_10:Intensity_Ref1, names_to = "id"
) %>% pivot_wider(
  names_from = gene
) %>% mutate(
  across(.cols = -"id", as.numeric)
)
# Join with the ID columns
left_join(
  id_cols,
  pivoted_table,
  by = "id"
)
This gives the output:
# A tibble: 4 x 11
id lab malig diag VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Intensity_10 Bruschi Ctrl Hydro 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11 Bruschi Ctrl Hydro 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 Reichl Tumor Malignant 0.970489979 0.88173753 0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 Reichl Ctrl Hydro 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757 0.251976997 0.403583705
This will work for the data you have, regardless of its size. You can make the approach more robust by replacing cols = Intensity_10:Intensity_Ref1 with contains("intensity").
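As a minimal sketch of that replacement, assuming a cut-down copy of the question's data with the annotation rows already removed (the object names `proteins` and `pivoted` are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Cut-down copy of the protein rows (an assumption for illustration).
proteins <- tibble(
  gene = c("VWF", "SHBG"),
  Intensity_10 = c("0.674627721", "0.516914487"),
  Intensity_MB_1 = c("0.970489979", "0.88173753")
)

pivoted <- proteins %>%
  # contains() ignores case by default, so "intensity" matches "Intensity_10" etc.
  pivot_longer(cols = contains("intensity"), names_to = "id") %>%
  pivot_wider(names_from = gene, values_from = value) %>%
  # Everything except the sample id is a measured intensity.
  mutate(across(-id, as.numeric))

pivoted
```

Selecting by name pattern means the call does not have to change when further Intensity_ columns are added between the first and last one.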
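Edit 3
You have more variables than the ones provided here, and as written they would not be touched by the pivot.
We can therefore take a more robust approach: assume that all variables not shown here are similar to the ones provided, and change the cols argument accordingly.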
# Start from the original Heat_BT (all 11 rows).
# Extract the relevant information (rows 1-3: lab, malig, diag)
# from the data.
id_cols <- bind_cols(
  var = c("lab", "malig", "diag"),
  Heat_BT[1:3,-c(1,2)]
) %>% pivot_longer(
  cols = -"var", names_to = "id"
) %>% pivot_wider(
  names_from = var
)
# Remove these identifiers (the four annotation rows);
Heat_BT <- Heat_BT[-(1:4),]
# Pivot the table;
pivoted_table <- Heat_BT[,-1] %>% pivot_longer(
  cols = -"gene",
  names_to = "id"
) %>% pivot_wider(
  names_from = gene
) %>% mutate(
  across(.cols = -"id", as.numeric)
)
# Join with the ID columns
left_join(
  id_cols,
  pivoted_table,
  by = "id"
)
This gives the same output as above.
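For comparison, the same shape can also be reached in a single pipeline by naming the annotation rows first and pivoting everything at once. This is a sketch of an alternative, not the method used above; it assumes a cut-down copy of the question's data, and `subtype` is a hypothetical name for the fourth annotation row.

```r
library(dplyr)
library(tidyr)

# Cut-down copy of the question's data (an assumption for illustration).
Heat_BT <- tibble(
  protein = c(NA, NA, NA, NA, "von Willebrand factor", "Sex hormone-binding globulin"),
  gene = c(NA, NA, NA, NA, "VWF", "SHBG"),
  Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721", "0.516914487"),
  Intensity_MB_1 = c("Reichl", "Tumor", "Malignant", "MB", "0.970489979", "0.88173753")
)

# Names for the four annotation rows; "subtype" is hypothetical.
annotation_names <- c("lab", "malig", "diag", "subtype")

wide <- Heat_BT %>%
  select(-protein) %>%
  # Give the annotation rows a name so they can become columns too.
  mutate(gene = replace(gene, 1:4, annotation_names)) %>%
  pivot_longer(cols = -gene, names_to = "id") %>%
  pivot_wider(names_from = gene, values_from = value) %>%
  # Only the protein columns hold numbers; leave id and the annotations as text.
  mutate(across(-c(id, all_of(annotation_names)), as.numeric))

wide
```

Because annotations and intensities travel through the same pivot, no join is needed, at the cost of everything staying character until the final mutate().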