Match between the columns
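I have two files. First, I want to look at the "Variant_Type" column in the first file. If it is DEL, I need to check whether three columns (Chromosome, vcf_pos, Reference_Allele) match between the two files, and append the AC and AF columns from the second file to the first file. If "Variant_Type" is INS, I look for a match on a different set of three columns (Chromosome, vcf_pos, Tumor_Seq_Allele2) and append the corresponding AC and AF columns from the second file. If it is SNP, I again match on (Chromosome, vcf_pos, Tumor_Seq_Allele2) and append the corresponding AC and AF columns from the second file.
Here is a snippet of File 1: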
Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2 Variant_Type vcf_pos
TMEM80 chr11 704605 704605 A - DEL 704604
OR52P1P chr11 5726537 5726537 T - DEL 5726536
UBTFL1 chr11 90086720 90086721 - T INS 90086720
DCPS chr11 126306583 126306584 - TGGGGA INS 126306583
DCPS chr11 126306583 126306584 - TGGGGAAA INS 126306583
File 2:
Chromosome vcf_pos AF AC Reference_Allele Tumor_Seq_Allele2
chr11 704604 0.2 10 A -
chr11 5726536 0.35 13 T -
chr11 90086720 0.25 16 - T
chr11 126306583 0.5 29 - TGGGGA
chr11 126306583 0.3 39 - TGGGGAAA
Desired output:
Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2 Variant_Type vcf_pos AF AC
TMEM80 chr11 704605 704605 A - DEL 704604 0.2 10
OR52P1P chr11 5726537 5726537 T - DEL 5726536 0.35 13
UBTFL1 chr11 90086720 90086721 - T INS 90086720 0.25 16
DCPS chr11 126306583 126306584 - TGGGGA INS 126306583 0.5 29
DCPS chr11 126306583 126306584 - TGGGGAAA INS 126306583 0.3 39
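As a possible solution I was thinking about the merge function in R, but it might work better with awk.
"As a possible solution I was thinking about the merge function in R"
Indeed, R's merge function, combined with ordinary indexing, can do it.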
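t1 = read.table('File 1', header=TRUE)
t2 = read.table('File 2', header=TRUE)
AFAC = c('AF', 'AC')                # columns to copy
# Note: the assignments below assume that every selected row of t1 has exactly
# one match in t2 and that merge(sort=FALSE) returns the rows in t1's order.
l = t1$Variant_Type=='DEL'          # rows to process
t1[l, AFAC] = merge(t1[l, c('Chromosome', 'vcf_pos', 'Reference_Allele')], t2, sort=FALSE)[AFAC]
l = t1$Variant_Type %in% c('INS', 'SNP')
t1[l, AFAC] = merge(t1[l, c('Chromosome', 'vcf_pos', 'Tumor_Seq_Allele2')], t2, sort=FALSE)[AFAC]
# quote=FALSE and row.names=FALSE keep the output in the same plain layout as the input
write.table(t1, 'output', quote=FALSE, row.names=FALSE)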
Since INS and SNP are matched on the same columns, these two Variant_Type values can be handled together.
"might work better with awk"
Just for comparison, an awk solution:
awk 'BEGIN { while ((getline <"File 2") > 0)  # make "dictionary" of AF, AC for ...
             { AFAC1[$1","$2","$5] = $3" "$4   # ... Chr., vcf. and Reference_Allele
               AFAC2[$1","$2","$6] = $3" "$4   # ... Chr., vcf. and Tumor_Seq_Allele2
             }
           }
     NR==1     { print $0" AF AC" }             # first line has column headers
     $7=="DEL" { print $0, AFAC1[$2","$8","$5] } # append the 1st stored AF, AC
     $7=="INS"||
     $7=="SNP" { print $0, AFAC2[$2","$8","$6] } # append the 2nd stored AF, AC
    ' "File 1"
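If some rows of File 1 might have no counterpart in File 2, the same lookup idea can be written so that such rows get a placeholder instead of empty fields. The following is only a sketch under a few assumptions: the file names file1.txt, file2.txt and merged.txt are made up for the example, the data is a shortened copy of the snippets above, the NOMATCH row is invented purely to show the fallback, and "NA NA" is an arbitrary choice of placeholder.

```shell
# Recreate a shortened copy of the sample data (file names are hypothetical).
cat > file1.txt <<'EOF'
Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2 Variant_Type vcf_pos
TMEM80 chr11 704605 704605 A - DEL 704604
UBTFL1 chr11 90086720 90086721 - T INS 90086720
NOMATCH chr11 123 123 C G SNP 123
EOF
cat > file2.txt <<'EOF'
Chromosome vcf_pos AF AC Reference_Allele Tumor_Seq_Allele2
chr11 704604 0.2 10 A -
chr11 90086720 0.25 16 - T
EOF

awk '
NR == FNR {                          # first file on the command line: file2.txt
    if (FNR > 1) {                   # skip its header line
        by_ref[$1 "," $2 "," $5] = $3 " " $4   # key: Chr., vcf_pos, Reference_Allele
        by_alt[$1 "," $2 "," $6] = $3 " " $4   # key: Chr., vcf_pos, Tumor_Seq_Allele2
    }
    next
}
FNR == 1 { print $0, "AF", "AC"; next }        # header of file1.txt
{
    afac = "NA NA"                   # placeholder when nothing matches
    if ($7 == "DEL") {
        key = $2 "," $8 "," $5
        if (key in by_ref) afac = by_ref[key]
    } else if ($7 == "INS" || $7 == "SNP") {
        key = $2 "," $8 "," $6
        if (key in by_alt) afac = by_alt[key]
    }
    print $0, afac
}
' file2.txt file1.txt > merged.txt

cat merged.txt
```

Here File 2 is passed as the first argument and recognised by NR == FNR rather than read with getline in BEGIN. Using "key in array" for the test also avoids creating empty array entries for keys that are not found.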