BASH - 汇总一列(ped 文件)中 2 个基因型数据列中存在的信息

BASH - Summarising information present in 2 genotype data columns in one column (ped file)

我有一个包含 36 列 (6+30) 的 PLINK ped 文件,如下所示:

FID IID PID MID SEX PHENO SNP_1a SNP_1b SNP_2a SNP_2b SNP_3a SNP_3b SNP_4a SNP_4b SNP_5a SNP_5b SNP_6a SNP_6b SNP_7a SNP_7b SNP_8a SNP_8b SNP_9a SNP_9b SNP_10a SNP_10b SNP_11a SNP_11b SNP_12a SNP_12b SNP_13a SNP_13b SNP_14a SNP_14b SNP_15a SNP_15b
A1  A1  0   0   1   1   0   0   0   0   2   2   1   2   1   2   1   2   2   1   2   1   1   1   1   2   0   0   0   0   0   0   2   1   2   2
A2  A2  0   0   1   1   1   1   1   1   0   0   1   2   2   2   2   2   1   1   0   0   2   1   2   2   0   0   0   0   0   0   1   1   0   0
A3  A3  0   0   1   2   1   1   1   1   0   0   2   2   2   2   2   2   1   1   0   0   1   1   2   2   0   0   0   0   0   0   1   1   0   0

我有兴趣修改基因型列(从第 7 列开始),以便:

以上示例的输出将包含 21 列 (6+15),如下所示:

FID IID PID MID SEX PHENO SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10 SNP_11 SNP_11 SNP_12 SNP_13 SNP_14 SNP_15
A1  A1  0   0   1   1   NA  NA  2   2   2   2   2   2   1   2   NA  NA  NA  2   2
A2  A2  0   0   1   1   1   1   NA  2   2   2   1   NA  2   2   NA  NA  NA  1   NA
A3  A3  0   0   1   2   1   1   NA  2   2   2   1   NA  1   2   NA  NA  NA  1   NA

希望有人能帮帮我,先谢谢了!

你试过什么?我们不是来为您编写代码的。

但我很慷慨,这里有一个非常基本的 Perl 解决方案,似乎适用于您的数据。我不打算解释它,因为我认为你应该付出一些努力:-)

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

# Skip header
<DATA>;

while (<DATA>) {
  chomp;
  my @data = split;
  my @fixed = @data[0..5];
  my @snps = @data[6..$#data];
  my @new_snps;
  while (my ($s1, $s2) = splice @snps,0, 2) {
    push @new_snps, summarise($s1, $s2);
  }
  say join ' ', @fixed, @new_snps;
}

sub summarise {
  my ($s1, $s2) = @_;

  return 2 if $s1 == 2 or $s2 == 2;
  return 1 if $s1 == 1 and $s2 == 1;
  return 'NA' if $s1 == 0 and $s2 == 0;
  return '?';
}

__DATA__
FID IID PID MID SEX PHENO SNP_1a SNP_1b SNP_2a SNP_2b SNP_3a SNP_3b SNP_4a SNP_4b SNP_5a SNP_5b SNP_6a SNP_6b SNP_7a SNP_7b SNP_8a SNP_8b SNP_9a SNP_9b SNP_10a SNP_10b SNP_11a SNP_11b SNP_12a SNP_12b SNP_13a SNP_13b SNP_14a SNP_14b SNP_15a SNP_15b
A1  A1  0   0   1   1   0   0   0   0   2   2   1   2   1   2   1   2   2   1   2   1   1   1   1   2   0   0   0   0   0   0   2   1   2   2
A2  A2  0   0   1   1   1   1   1   1   0   0   1   2   2   2   2   2   1   1   0   0   2   1   2   2   0   0   0   0   0   0   1   1   0   0
A3  A3  0   0   1   2   1   1   1   1   0   0   2   2   2   2   2   2   1   1   0   0   1   1   2   2   0   0   0   0   0   0   1   1   0   0
$ cat > test.awk
NR>1{
    for(i=j=7; i<NF; i+=2)                                              # for fields 7-(NF-1)
        $(j++) = ($i$(i+1)~/2/) ? "2" : (($i$(i+1)=="11") ? "1" : "NA") # see below *)
    for (i=1; i<=21; i++)                                               # reduced to 21 fields
        printf "%-2s%s", $i,(i<21?OFS:ORS)                              # print
} 
$ awk -f test.awk test.in
A1 A1 0  0  1  1  NA NA 2  2  2  2  2  2  1  2  NA NA NA 2  2
A2 A2 0  0  1  1  1  1  NA 2  2  2  1  NA 2  2  NA NA NA 1  NA
A3 A3 0  0  1  2  1  1  NA 2  2  2  1  NA 1  2  NA NA NA 1  NA

如果规则 1(2 或 2)或规则 2(1 和 1)失败,则 returns NA

*) 连接 a 和 b 字段($i$(i+1),在每次迭代时将 2 添加到 i)并检查它们是否为 2 或 11,并将结果写入已处理的列(即结果来自字段 7 和 8 存储到字段 7、9 和 10 到 8 等。每次迭代时 j 增加 1)。

这是另一个神秘的 awk headers

$ awk '{for(i=1;i<7;i++) printf "%s ",$i; 
        for(i=7;i<=NF;i+=2) 
           {c=$i^2+$(i+1)^2; 
            r=(c>1)+(c>3); 
            printf "%2s ",(NR>1)?(r?r:"NA"):substr($i,1,length($i)-1)}; 
            print ""}' file

FID IID PID MID SEX PHENO SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10 SNP_11 SNP_12 SNP_13 SNP_14 SNP_15
A1 A1 0 0 1 1 NA NA  2  2  2  2  2  2  1  2 NA NA NA  2  2
A2 A2 0 0 1 1  1  1 NA  2  2  2  1 NA  2  2 NA NA NA  1 NA
A3 A3 0 0 1 2  1  1 NA  2  2  2  1 NA  1  2 NA NA NA  1 NA