Identify rows corresponding to values in df$columnA that occur twice, then assign a value in df$columnB

Question

I have a data frame, pedigree, that looks like this:

FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0                   0                   1   
0189     F359_8005441285    0                   0                   2
.
.
.

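Some FamilyIDs (for example, 0189) occur twice. These correspond to sibling pairs for which the parents' information was not recorded.

I need to add a dummy FatherID / MotherID that is shared between the two siblings of each such pair, for downstream analysis.

Specifically, I would like to identify the samples whose FamilyID occurs twice and assign them a shared MotherID / FatherID value, so that the example above would look like this:

Expected output: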

FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0189_mother         0189_father         1   
0189     F359_8005441285    0189_mother         0189_father         2   
.
.
.

So far, I have tried starting with mutate to add a column indicating how many times each FamilyID is observed, but this does not work:

pedigree %>% 
  mutate(FamilySize = count(Family_ID))

Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'groups' applied to an object of class "character".

Any help would be much appreciated.

Answer 1

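To count the family size, we group the rows by FamilyID and then count the rows in each group with n(). We can then use mutate and if_else to replace the value of MotherID or FatherID where needed. We can leave the table grouped by FamilyID for this step: FamilySize and FamilyID are constant within a group, and if_else is vectorized, so the condition is still evaluated row by row against MotherID and FatherID. Switching to rowwise would only be necessary if the mutate call used a function that is not vectorized and therefore has to be applied to each row individually.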

pedigree %>%
    group_by(FamilyID) %>%
    mutate(FamilySize = n()) %>%
    mutate(MotherID = if_else(FamilySize == 2 & MotherID == 0,
                              paste0(FamilyID, '_mother'),
                              MotherID),
           FatherID = if_else(FamilySize == 2 & FatherID == 0,
                              paste0(FamilyID, '_father'),
                              FatherID))

# A tibble: 7 x 6
  FamilyID SampleID           MotherID           FatherID             Sex FamilySize
  <chr>    <chr>              <chr>              <chr>              <int>      <int>
1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1          3
2 F1961    F1961-2_8005116603 0                  0                      2          3
3 F1961    F1961-3_8005116421 0                  0                      1          3
4 0450     F350_8005441283    0                  0                      1          1
5 0006     F355_8005441353    0                  0                      1          1
6 0189     F359_8005441284    0189_mother        0189_father            1          2
7 0189     F359_8005441285    0189_mother        0189_father            2          2
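
As a side note on the attempt in the question: count() expects a data frame as its first argument, so it cannot be called on a single column inside mutate. If a per-row count is all that is needed, dplyr's add_count() appends it without an explicit group_by. The following is a minimal sketch, assuming the sample data from the question and the column name FamilyID; the names with_size and result are only illustrative.

```r
library(dplyr)

# Sample data copied from the question; IDs are read as character
# so that leading zeros (e.g. "0189") are preserved.
pedigree <- read.table(text = "
FamilyID SampleID MotherID FatherID Sex
F1961 F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603 1
F1961 F1961-2_8005116603 0 0 2
F1961 F1961-3_8005116421 0 0 1
0450 F350_8005441283 0 0 1
0006 F355_8005441353 0 0 1
0189 F359_8005441284 0 0 1
0189 F359_8005441285 0 0 2",
  header = TRUE, stringsAsFactors = FALSE,
  colClasses = c("character", "character", "character", "character", "integer"))

# add_count() adds the number of rows sharing each FamilyID as a new column
# and returns an ungrouped data frame.
with_size <- pedigree %>%
  add_count(FamilyID, name = "FamilySize")

# Replace missing parents ("0") only in families with exactly two rows.
result <- with_size %>%
  mutate(MotherID = if_else(FamilySize == 2 & MotherID == "0",
                            paste0(FamilyID, "_mother"), MotherID),
         FatherID = if_else(FamilySize == 2 & FatherID == "0",
                            paste0(FamilyID, "_father"), FatherID))

print(result)
```

Because the result is not grouped, there is no need to call ungroup() before passing it on to downstream steps.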

Answer 2

You can use dplyr to group by FamilyID and update the MotherID / FatherID columns for groups where n() == 2.

Option #1: get the result in the OP's expected format

library(dplyr)
df %>% group_by(FamilyID) %>%
  # if/else instead of ifelse(): with a scalar test, ifelse() returns a single
  # value, which would overwrite the parents of every row in larger families
  mutate(MotherID = if (n() == 2) paste(FamilyID, "mother", sep = "_") else MotherID,
         FatherID = if (n() == 2) paste(FamilyID, "father", sep = "_") else FatherID)

# FamilyID SampleID           MotherID           FatherID             Sex
# <chr>    <chr>              <chr>              <chr>              <int>
# 1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1
# 2 F1961    F1961-2_8005116603 0                  0                      2
# 3 F1961    F1961-3_8005116421 0                  0                      1
# 4 0450     F350_8005441283    0                  0                      1
# 5 0006     F355_8005441353    0                  0                      1
# 6 0189     F359_8005441284    0189_mother        0189_father            1
# 7 0189     F359_8005441285    0189_mother        0189_father            2

Option #2: if the OP is happy with dummy IDs of the form FamilyID_dummy, mutate_at allows a more compact solution:

library(dplyr)

df %>% group_by(FamilyID) %>%
  # funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job.
  # if/else keeps the original column for families that do not have two rows.
  mutate_at(vars(c("MotherID","FatherID")),
            ~ if (n() == 2) paste(FamilyID, "dummy", sep = "_") else .)

# # A tibble: 7 x 5
# # Groups: FamilyID [4]
# FamilyID SampleID           MotherID           FatherID             Sex
# <chr>    <chr>              <chr>              <chr>              <int>
# 1 F1961    F1961-1_8005116592 F1961-3_8005116421 F1961-2_8005116603     1
# 2 F1961    F1961-2_8005116603 0                  0                      2
# 3 F1961    F1961-3_8005116421 0                  0                      1
# 4 0450     F350_8005441283    0                  0                      1
# 5 0006     F355_8005441353    0                  0                      1
# 6 0189     F359_8005441284    0189_dummy         0189_dummy             1
# 7 0189     F359_8005441285    0189_dummy         0189_dummy             2
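
One detail worth keeping in mind when writing a grouped condition such as n() == 2: base R's ifelse() returns a result shaped like its test, so a length-one test yields a length-one result, which is then recycled across the group. The sketch below uses made-up values (n_rows and ids are hypothetical stand-ins for n() and a parent ID column within one group) to show the difference from a plain if/else.

```r
# Hypothetical stand-ins for one family group with three rows
n_rows <- 3
ids <- c("A", "0", "0")

# ifelse() with a scalar test: only the first element of `ids` is returned
scalar <- ifelse(n_rows == 2, "dummy", ids)

# if/else with a scalar test: the whole vector is returned unchanged
branch <- if (n_rows == 2) rep("dummy", n_rows) else ids

print(scalar)
print(branch)
```

When the condition is the same for every row of a group, if/else (or a vectorized test, as with FamilySize == 2 in Answer 1) avoids collapsing the column to its first value.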

Data:

df <- read.table(text = 
"FamilyID SampleID           MotherID            FatherID            Sex        
F1961    F1961-1_8005116592 F1961-3_8005116421  F1961-2_8005116603  1
F1961    F1961-2_8005116603 0                   0                   2   
F1961    F1961-3_8005116421 0                   0                   1   
0450     F350_8005441283    0                   0                   1   
0006     F355_8005441353    0                   0                   1   
0189     F359_8005441284    0                   0                   1   
0189     F359_8005441285    0                   0                   2",
header = TRUE, stringsAsFactors = FALSE)


r

bioinformatics

dplyr

tidyr