VLOOKUP-like one-liner using awk

There are plenty of threads about using awk as a VLOOKUP, but none of them seemed to work when I tried them.

I have 2 files:

@BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ head GAGA_all_merged_k125_VS_Danio.blastp_results
Sequence name   Hit desc.   E-Value Similarity
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223 gnl|BL_ORD_ID|19336gi|50540432|ref|NP_001002682.1| calsequestrin-2 precursor [Danio rerio]  0.0 89
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240   gnl|BL_ORD_ID|42660gi|688610863|ref|XP_009294955.1| PREDICTED: band 4.1-like protein 1 isoform X1 [Danio rerio] 0.0 97
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901   gnl|BL_ORD_ID|39369gi|59858543|ref|NP_001012312.1| gelsolin [Danio rerio]   0.0 92
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023 gnl|BL_ORD_ID|30731gi|528504026|ref|XP_001345885.4| PREDICTED: protein Jumonji [Danio rerio]    0.0 91
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005   gnl|BL_ORD_ID|28851gi|688587725|ref|XP_009289915.1| PREDICTED: phosphatidylinositol binding clathrin assembly protein b isoform X6 [Danio rerio]    0.0 98
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179 gnl|BL_ORD_ID|45364gi|52219062|ref|NP_001004604.1| BCSC-1 [Danio rerio] 0.0 86
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266 gnl|BL_ORD_ID|10854gi|528479736|ref|XP_005165325.1| PREDICTED: cathepsin L1 isoform X1 [Danio rerio]    0.0 97
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912 gnl|BL_ORD_ID|39467gi|116004513|ref|NP_001070618.1| 3-oxoacid CoA transferase 1b [Danio rerio]  0.0 97
Locus_11_Transcript_7/7_Confidence_0.647_Length_1989    gnl|BL_ORD_ID|6732gi|528475412|ref|XP_005164328.1| PREDICTED: cerebellar degeneration-related protein 2-like isoform X2 [Danio rerio]   0.0 96

@BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ head GAGA_all_merged_k125.LocusList
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240
Locus_3_Transcript_1/1_Confidence_1.000_Length_417
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912

Notice how the second file lists every locus, counting up from 1, while the first file skips a few: 3 and 7.

I need the output to be file 2 with a column from file 1 (say, column 2) added whenever the Locus is present in file 1. If the Locus is not present in file 1, I want to see NA.

This is the closest I have gotten so far, but it does not show the column from file 1:

@BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ awk 'FNR == NR {keys[$1]; next} {if ($1 in keys) {print $1, $2} else {print $1, "NA"} }' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList | head
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223 
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240 
Locus_3_Transcript_1/1_Confidence_1.000_Length_417 NA
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901 
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023 
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005 
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222 NA
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179 
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266 
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912 

Notice that 3 and 7 have the desired NA, but how do I get the others to show the content from file 1? Thanks, Adrian

You are close. What is the problem? You do this:

FNR == NR {keys[$1]; next}

This creates the key but does not store any value under it in the associative array. Replace it with:

FNR == NR {keys[$1] = $2; next}
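To make the difference concrete, here is a minimal sketch on a made-up two-line input (the values `a x` and `b y` and the variable names `empty` and `stored` are hypothetical, not from the question). Merely referencing an array element creates the key with an empty value; assigning to it stores the second field.

```shell
# Referencing keys[$1] creates the key, but its value is the empty string.
empty=$(printf 'a x\nb y\n' | awk '{ keys[$1] } END { print "[" keys["a"] "]" }')

# Assigning stores the second field under that key.
stored=$(printf 'a x\nb y\n' | awk '{ keys[$1] = $2 } END { print "[" keys["a"] "]" }')

echo "$empty"    # prints []
echo "$stored"   # prints [x]
```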

And when printing, $2 does not exist, because the second file has only one column:

if ($1 in keys) {print $1, $2}

Instead, print what was previously saved in the associative array:

if ($1 in keys) {print $1, keys[$1]}

So it ends up like this:

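awk '
    # First file: remember column 2 under the locus name.
    FNR == NR {keys[$1] = $2; next}
    # Second file: print the stored value, or NA when the locus is unknown.
    { if ($1 in keys) { print $1, keys[$1] }
          else {print $1, "NA"}
        }
' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList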
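If you want to try the idea without the real data, the following sketch runs the same two-file lookup on miniature inputs. The file names `results.txt` and `list.txt`, the ids, and the variable `lookup` are all made up for illustration; only the awk logic mirrors the command above.

```shell
# Made-up miniature inputs standing in for the two real files.
tmp=$(mktemp -d)
printf 'id1 hitA 0.0 89\nid3 hitC 0.0 92\n' > "$tmp/results.txt"
printf 'id1\nid2\nid3\n' > "$tmp/list.txt"

lookup=$(awk '
    # First file: FNR equals NR only here, so store column 2 per key.
    FNR == NR { keys[$1] = $2; next }
    # Second file: print the stored value, or NA for an unknown key.
    { print $1, ($1 in keys ? keys[$1] : "NA") }
' "$tmp/results.txt" "$tmp/list.txt")

printf '%s\n' "$lookup"
rm -r "$tmp"
```

This should print `id1 hitA`, `id2 NA` and `id3 hitC`, one per line, in the order of the list file.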

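Update based on the comments: it is similar to the previous version. Just blank out the first field and then store the rest of the line in the array.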

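awk '
    # First file: save the key, blank field 1, and store the rebuilt line.
    FNR == NR {f1 = $1; $1 = ""; keys[f1] = $0; next}
    # Second file: print the stored line, or NA when the locus is unknown.
    { if ($1 in keys) { print $1, keys[$1] }
          else {print $1, "NA"}
        }
' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList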

It produces:

Locus_1_Transcript_1/1_Confidence_1.000_Length_2223  gnl|BL_ORD_ID|19336gi|50540432|ref|NP_001002682.1| calsequestrin-2 precursor [Danio rerio] 0.0 89
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240  gnl|BL_ORD_ID|42660gi|688610863|ref|XP_009294955.1| PREDICTED: band 4.1-like protein 1 isoform X1 [Danio rerio] 0.0 97
Locus_3_Transcript_1/1_Confidence_1.000_Length_417 NA
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901  gnl|BL_ORD_ID|39369gi|59858543|ref|NP_001012312.1| gelsolin [Danio rerio] 0.0 92
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023  gnl|BL_ORD_ID|30731gi|528504026|ref|XP_001345885.4| PREDICTED: protein Jumonji [Danio rerio] 0.0 91
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005  gnl|BL_ORD_ID|28851gi|688587725|ref|XP_009289915.1| PREDICTED: phosphatidylinositol binding clathrin assembly protein b isoform X6 [Danio rerio] 0.0 98
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222 NA
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179  gnl|BL_ORD_ID|45364gi|52219062|ref|NP_001004604.1| BCSC-1 [Danio rerio] 0.0 86
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266  gnl|BL_ORD_ID|10854gi|528479736|ref|XP_005165325.1| PREDICTED: cathepsin L1 isoform X1 [Danio rerio] 0.0 97
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912  gnl|BL_ORD_ID|39467gi|116004513|ref|NP_001070618.1| 3-oxoacid CoA transferase 1b [Danio rerio] 0.0 97
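The double space after each matched locus appears because assigning to the first field makes awk rebuild the line with the output separator, which leaves a leading space in the stored text. If the results file is tab-separated (an assumption; the listing in the question does not show which separator is used), a possible variation is to cut the line at the first tab instead, which also keeps the original tabs. The sample data, the file names and the variable `joined` below are hypothetical.

```shell
# Made-up tab-separated sample; the real files are assumed to look similar.
tmp=$(mktemp -d)
printf 'id1\thit A [x]\t0.0\t89\n' > "$tmp/results.tsv"
printf 'id1\nid2\n' > "$tmp/list.txt"

joined=$(awk -F '\t' -v OFS='\t' '
    FNR == NR {
        # Keep everything after the first tab, without touching the fields.
        keys[$1] = substr($0, index($0, "\t") + 1)
        next
    }
    # Second file: one column per line, so the first field is the whole key.
    { print $1, ($1 in keys ? keys[$1] : "NA") }
' "$tmp/results.tsv" "$tmp/list.txt")

printf '%s\n' "$joined"
rm -r "$tmp"
```

Because the line is never rebuilt, multi-word descriptions such as `hit A [x]` stay in a single tab-delimited column.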