Match between the columns

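I have two files. First, I want to look at the "Variant_Type" column in the first file. If it is DEL, I need to look for a match on three columns in both files (Chromosome, vcf_pos, Reference_Allele) and append the AF and AC columns from the second file to the first file. If "Variant_Type" is INS, I look for a match on a different set of three columns (Chromosome, vcf_pos, Tumor_Seq_Allele2) and append the corresponding AF and AC columns from the second file. If it is SNP, I again match on those same three columns (Chromosome, vcf_pos, Tumor_Seq_Allele2) and append the corresponding AF and AC columns from the second file.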

Here is a snippet of file 1:

Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2 Variant_Type vcf_pos 
TMEM80      chr11      704605         704605       A                -                 DEL          704604 
OR52P1P     chr11      5726537        5726537      T                -                 DEL          5726536
UBTFL1      chr11      90086720       90086721     -                T                 INS          90086720
DCPS        chr11      126306583      126306584    -                TGGGGA            INS          126306583
DCPS        chr11      126306583      126306584    -                TGGGGAAA          INS          126306583

File 2:

Chromosome vcf_pos      AF       AC      Reference_Allele  Tumor_Seq_Allele2
chr11      704604       0.2      10      A                 - 
chr11      5726536      0.35     13      T                 -
chr11      90086720     0.25     16      -                 T
chr11      126306583    0.5      29      -                 TGGGGA 
chr11      126306583    0.3      39      -                 TGGGGAAA

Desired output:

Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2 Variant_Type vcf_pos   AF   AC 
TMEM80      chr11      704605         704605       A                -                 DEL          704604    0.2  10
OR52P1P     chr11      5726537        5726537      T                -                 DEL          5726536   0.35 13
UBTFL1      chr11      90086720       90086721     -                T                 INS          90086720  0.25 16
DCPS        chr11      126306583      126306584    -                TGGGGA            INS          126306583 0.5  29
DCPS        chr11      126306583      126306584    -                TGGGGAAA          INS          126306583 0.3  39

As a possible solution I was thinking about the merge function in R, but it might work better with awk.

As a possible solution I was thinking about merge function in R

Indeed, the R merge function together with ordinary indexing can do it.

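t1 <- read.table('File 1', header = TRUE)
t2 <- read.table('File 2', header = TRUE)
AFAC <- c('AF', 'AC')            # columns to copy from file 2
l <- t1$Variant_Type == 'DEL'    # rows to process
# merge() joins on the column names shared by both data frames;
# this assumes every selected row has exactly one match in file 2
# and that merge() returns the rows in t1's order
t1[l, AFAC] <- merge(t1[l, c('Chromosome', 'vcf_pos', 'Reference_Allele')], t2, sort = FALSE)[AFAC]
l <- t1$Variant_Type %in% c('INS', 'SNP')
t1[l, AFAC] <- merge(t1[l, c('Chromosome', 'vcf_pos', 'Tumor_Seq_Allele2')], t2, sort = FALSE)[AFAC]
write.table(t1, 'output', quote = FALSE, row.names = FALSE)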

Since the matching columns are the same for INS and SNP, the processing of these two Variant_Type values can be combined.

might work better with awk

Just for comparison, a solution with awk:

awk 'BEGIN { while ((getline <"File 2") > 0)    # make "dictionary" of AF, AC for ...
             { AFAC1[$1","$2","$5] = $3" "$4    # ... Chr., vcf. and Reference_Allele
               AFAC2[$1","$2","$6] = $3" "$4    # ... Chr., vcf. and Tumor_Seq_Allele2
             }
           }
     NR==1     { print $0" AF AC" }             # first line has column headers
     $7=="DEL" { print $0, AFAC1[$2","$8","$5] } # append the 1st stored AF, AC
     $7=="INS"||
     $7=="SNP" { print $0, AFAC2[$2","$8","$6] } # append the 2nd stored AF, AC
    ' "File 1"
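
If you want to try the awk approach on the sample data from the question, the following is a minimal, self-contained sketch. It is a variant, not the exact command above: it reads file 2 with the common NR==FNR idiom instead of getline, and it prints "NA NA" for rows with no match, which is an assumption of mine and not something the question asks for. The file names file1.txt and file2.txt and the array names by_ref and by_alt are made up for the example, and the input is assumed to be whitespace-delimited with the column order shown in the question.

```shell
#!/bin/sh
# Sketch only: file names, array names and the "NA NA" fallback are assumptions.
tmp=$(mktemp -d)

cat > "$tmp/file1.txt" <<'EOF'
Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2 Variant_Type vcf_pos
TMEM80 chr11 704605 704605 A - DEL 704604
OR52P1P chr11 5726537 5726537 T - DEL 5726536
UBTFL1 chr11 90086720 90086721 - T INS 90086720
DCPS chr11 126306583 126306584 - TGGGGA INS 126306583
DCPS chr11 126306583 126306584 - TGGGGAAA INS 126306583
EOF

cat > "$tmp/file2.txt" <<'EOF'
Chromosome vcf_pos AF AC Reference_Allele Tumor_Seq_Allele2
chr11 704604 0.2 10 A -
chr11 5726536 0.35 13 T -
chr11 90086720 0.25 16 - T
chr11 126306583 0.5 29 - TGGGGA
chr11 126306583 0.3 39 - TGGGGAAA
EOF

result=$(awk '
    # First file on the command line (file 2): store "AF AC" under both key types.
    NR == FNR {
        by_ref[$1 "," $2 "," $5] = $3 " " $4   # Chromosome, vcf_pos, Reference_Allele
        by_alt[$1 "," $2 "," $6] = $3 " " $4   # Chromosome, vcf_pos, Tumor_Seq_Allele2
        next
    }
    # Second file (file 1): extend the header line.
    FNR == 1 { print $0, "AF", "AC"; next }
    # Data rows: choose the key columns according to Variant_Type ($7).
    {
        hit = ""
        if ($7 == "DEL") {
            k = $2 "," $8 "," $5
            if (k in by_ref) hit = by_ref[k]
        } else if ($7 == "INS" || $7 == "SNP") {
            k = $2 "," $8 "," $6
            if (k in by_alt) hit = by_alt[k]
        }
        print $0, (hit == "" ? "NA NA" : hit)   # fallback for rows without a match
    }
' "$tmp/file2.txt" "$tmp/file1.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Testing membership with "in" before reading the array avoids creating empty entries for keys that are missing, and printing every row (matched or not) keeps the output the same length as file 1. Pipe the result through column -t if you want aligned columns like in the question.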