Extract positions 2-7 from a fasta sequence for each gene using Bash
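I have a file containing a subset of geneIDs, and a fasta file containing all geneIDs and their sequences. For each gene in the subset file, I want to get positions 2-7 from the start of its fasta sequence. Ideally, the output file would be 'pos 2-7' '\t' 'geneID'.
Example subset: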
mmu-let-7g-5p MIMAT0000121
mmu-let-7i-5p MIMAT0000122
Fasta file:
>mmu-let-7g-5p MIMAT0000121
UGAGGUAGUAGUUUGUACAGUU
>mmu-let-7i-5p MIMAT0000122
UGAGGUAGUAGUUUGUGCUGUU
>mmu-let-7f-5p MIMAT0000525
UGAGGUAGUAGAUUGUAUAGUU
Desired output:
GAGGUA mmu-let-7g-5p MIMAT0000121
GAGGUA mmu-let-7i-5p MIMAT0000122
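I have already done the first part (extracting the fasta sequences for the subset of genes) using grep -w -A 1 -f. I am not sure how to get pos 2-7 and make the output look like the above using Bash.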
Could you please try the following, written and tested with the shown samples in GNU awk only.
awk '
FNR==NR{
  a[$1]=$2
  next
}
/^>/{
  ind=substr($1,2)
}
/^>/ && (ind in a){
  found=1
  val=ind OFS a[ind]
  next
}
found{
  print substr($0,2,6) OFS val
  val=found=""
}
' gene fastafile
Explanation: adding a detailed explanation for the above.
awk '                             ##Start the awk program.
FNR==NR{                          ##FNR==NR is TRUE only while the first file (gene) is being read.
  a[$1]=$2                        ##Store $2 in array a under the index $1.
  next                            ##Skip all further statements for this line.
}
/^>/{                             ##If the line starts with > then do the following.
  ind=substr($1,2)                ##Set ind to the first field without its leading > character.
}
/^>/ && (ind in a){               ##If the line starts with > and ind is an index of array a then do the following.
  found=1                         ##Set found to 1.
  val=ind OFS a[ind]              ##Set val to ind, OFS and the value stored in a for ind.
  next                            ##Skip all further statements for this line.
}
found{                            ##If found is set then do the following.
  print substr($0,2,6) OFS val    ##Print the 2nd to 7th characters of the line, then OFS and val.
  val=found=""                    ##Reset val and found.
}
' gene fastafile                  ##Input file names.
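The question asks for a tab between the extracted positions and the gene ID, while the program above joins everything with the default OFS (a single space). The following is a minimal sketch of one way to get the tab, not the answerer's own code: it recreates the sample data in a temporary directory (the `tmpdir` and `result` names are illustrative assumptions) and prints a literal "\t" between the two parts.

```shell
# Recreate the sample inputs from the question in a scratch directory.
# The names tmpdir and result are illustrative, not from the original answer.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'mmu-let-7g-5p MIMAT0000121' \
  'mmu-let-7i-5p MIMAT0000122' > "$tmpdir/gene"
printf '%s\n' \
  '>mmu-let-7g-5p MIMAT0000121' 'UGAGGUAGUAGUUUGUACAGUU' \
  '>mmu-let-7i-5p MIMAT0000122' 'UGAGGUAGUAGUUUGUGCUGUU' \
  '>mmu-let-7f-5p MIMAT0000525' 'UGAGGUAGUAGAUUGUAUAGUU' > "$tmpdir/fastafile"

result=$(awk '
FNR==NR { a[$1]=$2; next }            # first file: remember each gene name and its ID
/^>/ {                                # header line in the fasta file
  found=0
  ind=substr($1,2)                    # first field without the leading >
  if (ind in a) { found=1; val=ind " " a[ind] }
  next
}
found {                               # sequence line that follows a wanted header
  print substr($0,2,6) "\t" val       # positions 2-7, a tab, then the gene ID
  found=0
}
' "$tmpdir/gene" "$tmpdir/fastafile")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Like the original, this assumes each sequence sits on a single line directly after its header.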
Another awk:
$ awk '
{
    gsub(/ +$/,"")                      # clean trailing space from sample data
}
NR==FNR {                               # process subset file as it is smaller
    a[$0]                               # hash keys
    next
}                                       # process fasta file
/^>/ && ((p=substr($0,2)) in a) {       # if string found in hash
    if(getline>0)                       # read next record
        print substr($0,2,6),p          # and print
}' subset fasta
Output:
GAGGUA mmu-let-7g-5p MIMAT0000121
GAGGUA mmu-let-7i-5p MIMAT0000122
Tested with GNU awk, but I think it will work with any awk:
$ awk 'NR==FNR{a[$0]; next}
       $1 in a{print substr($2, 2, 6), $1}
      ' gene.txt RS='>' FS='\n' OFS='\t' fasta.txt
GAGGUA mmu-let-7g-5p MIMAT0000121
GAGGUA mmu-let-7i-5p MIMAT0000122
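NR==FNR{a[$0]; next}
builds an array with the content of each line of the first file passed to awk as a key.
RS='>' FS='\n' OFS='\t'
These set the input record separator to >, the input field separator to newline and the output field separator to tab, for the second file only (because these variables are assigned after the first filename).
$1 in a{print substr($2, 2, 6), $1}
If the first field exists as a key in array a, print the required details.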
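To see why $1 is the header and $2 is the sequence under these separators, here is a small sketch that only prints how awk splits the fasta input. It uses two of the sample records from the question; the `records` variable name is an illustrative assumption, and -v is used instead of command-line assignments because there is only one input.

```shell
# Show how RS='>' and FS='\n' turn each fasta entry into one record
# whose first field is the header and whose second field is the sequence.
records=$(printf '%s\n' \
  '>mmu-let-7g-5p MIMAT0000121' 'UGAGGUAGUAGUUUGUACAGUU' \
  '>mmu-let-7i-5p MIMAT0000122' 'UGAGGUAGUAGUUUGUGCUGUU' |
  awk -v RS='>' -v FS='\n' '
    # The text before the first > is an empty record with NF equal to 0, so skip it.
    NF { printf "header=[%s] sequence=[%s]\n", $1, $2 }
  ')
printf '%s\n' "$records"
```

The empty record before the first > is also why the one-liner needs no special handling for it: an empty $1 is not a key in a unless the subset file contains a blank line.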
If lines can have trailing whitespace characters, use:
$ awk 'NR==FNR{sub(/[[:space:]]+$/, ""); a[$0]; next}
       $1 in a{print substr($2, 2, 6), $1}
      ' gene.txt RS='>' FS='[[:space:]]*\n' OFS='\t' fasta.txt