How to join multiple columns at once based on is.na() and match criteria?
I have a large proteomics dataset and am looking for a dplyr solution. `a` was merged from two independent datasets, one of which is `b`.
> tail(a)
# A tibble: 6 x 5
Majority_protein_IDs Majority_protein_IDs_ Protein_IDs Protein_names Gene_names
<chr> <chr> <chr> <chr> <chr>
1 NA Q9Y2X3 NA NA NA
2 NA Q9Y3B4 NA NA NA
3 NA Q9Y3I0 NA NA NA
4 NA Q9Y4P9 NA NA NA
5 NA Q9Y696 NA NA NA
6 NA Q9Y6C9 NA NA NA
and
> tail(b)
# A tibble: 6 x 5
Majority_protein_I… Majority_protein_I… Protein_IDs Protein_names Gene_names
<chr> <chr> <chr> <chr> <chr>
1 Q9Y617 Q9Y617 Q9Y617 Phosphoserine aminotransferase PSAT1
2 Q9Y646 Q9Y646 Q9Y646 Carboxypeptidase Q CPQ
3 Q9Y696 Q9Y696 Q9Y696 Chloride intracellular channe… CLIC4
4 Q9Y6C9 Q9Y6C9 Q9Y6C9 Mitochondrial carrier homolog… MTCH2
5 Q9Y6N7 Q9Y6N7 Q9Y6N7;Q9HC… Roundabout homolog 1 ROBO1
6 Q9Y6R7 Q9Y6R7 Q9Y6R7 IgGFc-binding protein FCGBP
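As you can see, `a` contains many NAs; the only known information is `a$Majority_protein_IDs_`. I want to pull the missing information from `b` so that the NA rows in all columns of `a` are filled in from `b`.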
Something like:
- if `is.na(a$Majority_protein_IDs)` AND `a$Majority_protein_IDs_` matches `b$Majority_protein_IDs_`, then
- fill in all the corresponding NA rows in `a` with `b$Majority_protein_IDs`, `b$Protein_IDs`, `b$Protein_names` and `b$Gene_names`
- keep all rows in `a`, whether they are still NA or were matched from `b`
I have tried several variants of `left_join` and `ifelse()`, but without success so far.
Expected output
> tail(a)
# A tibble: 6 x 5
Majority_protein_IDs Majority_protein_IDs_ Protein_IDs Protein_names Gene_names
<chr> <chr> <chr> <chr> <chr>
1 NA Q9Y2X3 NA NA NA
2 NA Q9Y3B4 NA NA NA
3 NA Q9Y3I0 NA NA NA
4 Q9Y6C9 Q9Y6C9 Q9Y6C9 Mitochondrial carrier homolog… MTCH2
5 Q9Y696 Q9Y696 Q9Y696 Chloride intracellular channe… CLIC4
6 NA Q9Y6C9 NA NA NA
Data
a <- structure(list(Majority_protein_IDs = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
Majority_protein_IDs_ = c("Q9Y2X3", "Q9Y3B4", "Q9Y3I0", "Q9Y4P9",
"Q9Y696", "Q9Y6C9"), Protein_IDs = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), Protein_names = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), Gene_names = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
and
b <- structure(list(Majority_protein_IDs = c("Q9Y617", "Q9Y646", "Q9Y696",
"Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Majority_protein_IDs_ = c("Q9Y617",
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Protein_IDs = c("Q9Y617",
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7;Q9HCK4", "Q9Y6R7"), Protein_names = c("Phosphoserine aminotransferase",
"Carboxypeptidase Q", "Chloride intracellular channel protein 4",
"Mitochondrial carrier homolog 2", "Roundabout homolog 1", "IgGFc-binding protein"
), Gene_names = c("PSAT1", "CPQ", "CLIC4", "MTCH2", "ROBO1",
"FCGBP")), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
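Update
The two datasets `a` and `b` do not necessarily have the same number of columns, so the solution must work when `a` and `b` have different numbers of columns.
Suppose `a` looks like this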
> tail(a)
# A tibble: 6 x 7
Intensity_CH1 Intensity_CH10 Majority_protei… Majority_protei… Protein_IDs Protein_names
<chr> <chr> <chr> <chr> <chr> <chr>
1 NaN NaN NA Q9Y2X3 NA NA
2 NaN NaN NA Q9Y3B4 NA NA
3 NaN NaN NA Q9Y3I0 NA NA
4 NaN NaN NA Q9Y4P9 NA NA
5 NaN NaN NA Q9Y696 NA NA
6 NaN NaN NA Q9Y6C9 NA NA
# … with 1 more variable: Gene_names <chr>
while `b` is unchanged from before
> tail(b)
# A tibble: 6 x 5
Majority_protein_I… Majority_protein_I… Protein_IDs Protein_names Gene_names
<chr> <chr> <chr> <chr> <chr>
1 Q9Y617 Q9Y617 Q9Y617 Phosphoserine aminotransferase PSAT1
2 Q9Y646 Q9Y646 Q9Y646 Carboxypeptidase Q CPQ
3 Q9Y696 Q9Y696 Q9Y696 Chloride intracellular channe… CLIC4
4 Q9Y6C9 Q9Y6C9 Q9Y6C9 Mitochondrial carrier homolog… MTCH2
5 Q9Y6N7 Q9Y6N7 Q9Y6N7;Q9HC… Roundabout homolog 1 ROBO1
6 Q9Y6R7 Q9Y6R7 Q9Y6R7 IgGFc-binding protein FCGBP
Data
a <- structure(list(Intensity_CH1 = c("NaN", "NaN", "NaN", "NaN",
"NaN", "NaN"), Intensity_CH10 = c("NaN", "NaN", "NaN", "NaN",
"NaN", "NaN"), Majority_protein_IDs = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
Majority_protein_IDs_ = c("Q9Y2X3", "Q9Y3B4", "Q9Y3I0", "Q9Y4P9",
"Q9Y696", "Q9Y6C9"), Protein_IDs = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), Protein_names = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), Gene_names = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
and
b <- structure(list(Majority_protein_IDs = c("Q9Y617", "Q9Y646", "Q9Y696",
"Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Majority_protein_IDs_ = c("Q9Y617",
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Protein_IDs = c("Q9Y617",
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7;Q9HCK4", "Q9Y6R7"), Protein_names = c("Phosphoserine aminotransferase",
"Carboxypeptidase Q", "Chloride intracellular channel protein 4",
"Mitochondrial carrier homolog 2", "Roundabout homolog 1", "IgGFc-binding protein"
), Gene_names = c("PSAT1", "CPQ", "CLIC4", "MTCH2", "ROBO1",
"FCGBP")), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
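I think a base R solution may be more helpful in this case. dplyr's convenience functions are of little use here and would probably make the solution more complicated.
Try this: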
library(dplyr)  # only needed for bind_rows()

# Protein IDs that are missing in a (missing here means NA in the column Protein_IDs)
missing_prot_ids <- unique(a$Majority_protein_IDs_[is.na(a$Protein_IDs)])
# Select every row in b that carries one of those protein IDs
b_selected <- b[b$Majority_protein_IDs_ %in% missing_prot_ids, ]
# Append the selected rows to a and keep the resulting data frame.
# bind_rows() can bind data frames with different columns;
# columns that are absent from one of the frames are filled with NA
res_df <- bind_rows(a, b_selected)
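Apart from the column `additional_col`, the result now contains the information from your expected output, although the matching rows from `b` are appended below `a` rather than filled in place. I added that column to `a` to demonstrate the behavior of `bind_rows`: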
# A tibble: 8 x 6
Majority_protein_IDs Majority_protein_IDs_ Protein_IDs Protein_names Gene_names additional_col
<chr> <chr> <chr> <chr> <chr> <dbl>
1 NA Q9Y2X3 NA NA NA 2
2 NA Q9Y3B4 NA NA NA 2
3 NA Q9Y3I0 NA NA NA 2
4 NA Q9Y4P9 NA NA NA 2
5 NA Q9Y696 NA NA NA 2
6 NA Q9Y6C9 NA NA NA 2
7 Q9Y696 Q9Y696 Q9Y696 Chloride intracellular channel protein 4 CLIC4 NA
8 Q9Y6C9 Q9Y6C9 Q9Y6C9 Mitochondrial carrier homolog 2 MTCH2 NA
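If the aim is to fill the NA cells of `a` in place rather than append rows, one possible base R sketch uses `match()` on the key column. The data below is a trimmed-down stand-in for the question's `a` and `b`, the names `key`, `shared`, `idx` and `hit` are purely illustrative, and the sketch assumes that each key occurs only once in `b`.

```r
# Trimmed-down stand-ins for the question's data
# (assumption: Majority_protein_IDs_ is unique within b)
a <- data.frame(
  Intensity_CH1         = rep("NaN", 4),
  Majority_protein_IDs  = NA_character_,
  Majority_protein_IDs_ = c("Q9Y2X3", "Q9Y4P9", "Q9Y696", "Q9Y6C9"),
  Gene_names            = NA_character_,
  stringsAsFactors      = FALSE
)
b <- data.frame(
  Majority_protein_IDs  = c("Q9Y696", "Q9Y6C9", "Q9Y6N7"),
  Majority_protein_IDs_ = c("Q9Y696", "Q9Y6C9", "Q9Y6N7"),
  Gene_names            = c("CLIC4", "MTCH2", "ROBO1"),
  stringsAsFactors      = FALSE
)

key    <- "Majority_protein_IDs_"
# Columns present in both frames, excluding the key: these get filled
shared <- setdiff(intersect(names(a), names(b)), key)

# For every row of a, the matching row number in b (NA if there is no match)
idx <- match(a[[key]], b[[key]])

for (col in shared) {
  # Fill only cells that are NA in a AND whose key was found in b
  hit <- is.na(a[[col]]) & !is.na(idx)
  a[[col]][hit] <- b[[col]][idx[hit]]
}

a
```

Because only the columns shared by both frames are touched, extra columns in `a` such as `Intensity_CH1` are left as they are, and rows without a match in `b` stay NA.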
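Now redundant, but kept because it matters for the context of the comments:
Could you comment on why your expected output has no values for the ID Q9Y696? I think it should meet your criteria.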
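Since the question asks for dplyr, the same in-place fill can also be sketched with `left_join()` followed by `coalesce()`. This is only one way to do it, not necessarily the most idiomatic; the toy data and the names `key`, `fill_cols`, `joined` and `filled` are assumptions made for illustration, and duplicate keys in `b` would duplicate rows of `a`.

```r
library(dplyr)

# Trimmed-down stand-ins for the question's data
# (assumption: Majority_protein_IDs_ is unique within b)
a <- tibble(
  Intensity_CH1         = rep("NaN", 4),
  Majority_protein_IDs  = NA_character_,
  Majority_protein_IDs_ = c("Q9Y2X3", "Q9Y4P9", "Q9Y696", "Q9Y6C9"),
  Gene_names            = NA_character_
)
b <- tibble(
  Majority_protein_IDs  = c("Q9Y696", "Q9Y6C9", "Q9Y6N7"),
  Majority_protein_IDs_ = c("Q9Y696", "Q9Y6C9", "Q9Y6N7"),
  Gene_names            = c("CLIC4", "MTCH2", "ROBO1")
)

key       <- "Majority_protein_IDs_"
fill_cols <- setdiff(intersect(names(a), names(b)), key)

# Keep every row of a; b's copies of the shared columns get the suffix ".b"
joined <- left_join(a, b, by = key, suffix = c("", ".b"))

# Keep a's value where it exists, otherwise fall back to b's value
for (col in fill_cols) {
  joined[[col]] <- coalesce(joined[[col]], joined[[paste0(col, ".b")]])
}

# Drop the helper ".b" columns and restore a's original column order
filled <- select(joined, all_of(names(a)))
filled
```

Rows of `b` whose key does not occur in `a` (here Q9Y6N7) are simply dropped by the left join, so the result has the same rows and columns as `a`.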