VLOOKUP-like one-liner using awk
There are plenty of threads about using awk as a VLOOKUP, but none of them seem to work when I try them.
I have 2 files:
@BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ head GAGA_all_merged_k125_VS_Danio.blastp_results
Sequence name Hit desc. E-Value Similarity
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223 gnl|BL_ORD_ID|19336gi|50540432|ref|NP_001002682.1| calsequestrin-2 precursor [Danio rerio] 0.0 89
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240 gnl|BL_ORD_ID|42660gi|688610863|ref|XP_009294955.1| PREDICTED: band 4.1-like protein 1 isoform X1 [Danio rerio] 0.0 97
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901 gnl|BL_ORD_ID|39369gi|59858543|ref|NP_001012312.1| gelsolin [Danio rerio] 0.0 92
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023 gnl|BL_ORD_ID|30731gi|528504026|ref|XP_001345885.4| PREDICTED: protein Jumonji [Danio rerio] 0.0 91
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005 gnl|BL_ORD_ID|28851gi|688587725|ref|XP_009289915.1| PREDICTED: phosphatidylinositol binding clathrin assembly protein b isoform X6 [Danio rerio] 0.0 98
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179 gnl|BL_ORD_ID|45364gi|52219062|ref|NP_001004604.1| BCSC-1 [Danio rerio] 0.0 86
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266 gnl|BL_ORD_ID|10854gi|528479736|ref|XP_005165325.1| PREDICTED: cathepsin L1 isoform X1 [Danio rerio] 0.0 97
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912 gnl|BL_ORD_ID|39467gi|116004513|ref|NP_001070618.1| 3-oxoacid CoA transferase 1b [Danio rerio] 0.0 97
Locus_11_Transcript_7/7_Confidence_0.647_Length_1989 gnl|BL_ORD_ID|6732gi|528475412|ref|XP_005164328.1| PREDICTED: cerebellar degeneration-related protein 2-like isoform X2 [Danio rerio] 0.0 96
@BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ head GAGA_all_merged_k125.LocusList
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240
Locus_3_Transcript_1/1_Confidence_1.000_Length_417
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912
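Notice how the second file counts every Locus starting from 1, while the first file skips a few: 3 and 7.
I need the output to be file 2 with a column from file 1 (say, column 2) added whenever the Locus is present in file 1. If the Locus is not present in file 1, I would like to see NA.
This is the closest I have got so far, but it does not show the column from file 1:
@BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ awk 'FNR == NR {keys[$1]; next} {if ($1 in keys) {print $1, $2} else {print $1, "NA"} }' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList | head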
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240
Locus_3_Transcript_1/1_Confidence_1.000_Length_417 NA
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222 NA
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912
Notice that Locus 3 and Locus 7 have the NA they need, but how do I get the other lines to show the content from file 1? Thanks, Adrian
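You are close. What is the problem? You do this:
FNR == NR {keys[$1]; next}
This creates the key but does not save any value in the associative array. Replace it with:
FNR == NR {keys[$1] = $2; next}
And when printing, $2 does not exist, because the second file has only one column:
if ($1 in keys) {print $1, $2}
Instead, print what was previously saved in the associative array:
if ($1 in keys) {print $1, keys[$1]}
So, all together, it looks like this: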
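awk '
    # First file: store column 2 under the Locus name in column 1
    FNR == NR {keys[$1] = $2; next}
    # Second file: print the stored value, or NA if the Locus was not seen
    { if ($1 in keys) { print $1, keys[$1] }
      else {print $1, "NA"}
    }
' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList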
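To see the lookup pattern in isolation, here is a minimal sketch on toy data. The file names `lookup.txt` and `ids.txt` and their contents are made up for illustration; they are not the files from the question.

```shell
# Toy data; file names and contents are hypothetical.
tmp=$(mktemp -d)
printf 'id1 alpha\nid2 beta\nid4 delta\n' > "$tmp/lookup.txt"
printf 'id1\nid2\nid3\nid4\n' > "$tmp/ids.txt"

# First file: remember column 2 under the key in column 1.
# Second file: print the stored value, or NA when the key was never seen.
out=$(awk '
    FNR == NR { keys[$1] = $2; next }
    { if ($1 in keys) print $1, keys[$1]; else print $1, "NA" }
' "$tmp/lookup.txt" "$tmp/ids.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The `FNR == NR` test is only true while awk is reading the first file named on the command line, so the order of the two file arguments matters: the lookup table comes first, the list of keys second.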
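Update based on comments: it is similar to the previous one. Just blank out the first field and then save the whole line in the array.
awk '
    # First file: keep the key, blank it, and store the rest of the line
    # ($0 is rebuilt from the remaining fields once $1 is reassigned)
    FNR == NR {f1 = $1; $1 = ""; keys[f1] = $0; next}
    { if ($1 in keys) { print $1, keys[$1] }
      else {print $1, "NA"}
    }
' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList
It produces: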
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223 gnl|BL_ORD_ID|19336gi|50540432|ref|NP_001002682.1| calsequestrin-2 precursor [Danio rerio] 0.0 89
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240 gnl|BL_ORD_ID|42660gi|688610863|ref|XP_009294955.1| PREDICTED: band 4.1-like protein 1 isoform X1 [Danio rerio] 0.0 97
Locus_3_Transcript_1/1_Confidence_1.000_Length_417 NA
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901 gnl|BL_ORD_ID|39369gi|59858543|ref|NP_001012312.1| gelsolin [Danio rerio] 0.0 92
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023 gnl|BL_ORD_ID|30731gi|528504026|ref|XP_001345885.4| PREDICTED: protein Jumonji [Danio rerio] 0.0 91
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005 gnl|BL_ORD_ID|28851gi|688587725|ref|XP_009289915.1| PREDICTED: phosphatidylinositol binding clathrin assembly protein b isoform X6 [Danio rerio] 0.0 98
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222 NA
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179 gnl|BL_ORD_ID|45364gi|52219062|ref|NP_001004604.1| BCSC-1 [Danio rerio] 0.0 86
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266 gnl|BL_ORD_ID|10854gi|528479736|ref|XP_005165325.1| PREDICTED: cathepsin L1 isoform X1 [Danio rerio] 0.0 97
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912 gnl|BL_ORD_ID|39467gi|116004513|ref|NP_001070618.1| 3-oxoacid CoA transferase 1b [Danio rerio] 0.0 97
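If the results file happens to be tab-separated (an assumption; the question does not say which delimiter is used), a variant that splits on tabs keeps multi-word descriptions and the original spacing intact. The sketch below uses made-up toy files, `table.tsv` and `ids.txt`, and stores everything after the first tab instead of reassigning the first field.

```shell
# Toy tab-separated data; file names and contents are hypothetical.
tmp=$(mktemp -d)
printf 'id1\tfirst hit [Danio rerio]\t89\nid2\tsecond hit\t97\n' > "$tmp/table.tsv"
printf 'id1\nid2\nid3\n' > "$tmp/ids.txt"

out=$(awk -F'\t' -v OFS='\t' '
    # First file: store the text after the first tab, so no leading
    # separator ends up in the saved value.
    FNR == NR { keys[$1] = substr($0, index($0, FS) + 1); next }
    # Second file: print the key plus the stored columns, or NA.
    { print $1, (($1 in keys) ? keys[$1] : "NA") }
' "$tmp/table.tsv" "$tmp/ids.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With whitespace splitting, reassigning a field makes awk rebuild the line with single-space separators, which is usually harmless but does change the original spacing; taking a substring of `$0` avoids that.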