Extract positions 2-7 from a fasta sequence for each gene using Bash

Question

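I have a file with a subset of geneIDs and a fasta file with all geneIDs and their sequences. For each gene in the subset file, I want to get positions 2-7 from the start of its fasta sequence. Ideally, each line of the output file would be 'pos 2-7' '\t' 'geneID'.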

Example subset:

mmu-let-7g-5p MIMAT0000121  
mmu-let-7i-5p MIMAT0000122

Fasta file:

>mmu-let-7g-5p MIMAT0000121 
UGAGGUAGUAGUUUGUACAGUU
>mmu-let-7i-5p MIMAT0000122 
UGAGGUAGUAGUUUGUGCUGUU
>mmu-let-7f-5p MIMAT0000525 
UGAGGUAGUAGAUUGUAUAGUU

Desired output:

GAGGUA   mmu-let-7g-5p MIMAT0000121
GAGGUA   mmu-let-7i-5p MIMAT0000122

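I have already done the first part (extracting the fasta sequences for the subset of genes) using grep -w -A 1 -f. I am not sure how to get pos 2-7 and make the output look like the above using Bash.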

Answer 1

Could you please try the following, written and tested with the shown samples in GNU awk.

awk '
FNR==NR{
  a[$1]=$2
  next
}
/^>/{
  ind=substr($1,2)
}
/^>/ && (ind in a){
  found=1
  val=ind OFS a[ind]
  next
}
found{
  print substr($0,2,6) OFS val
  val=found=""
}
' gene fastafile

Explanation: Adding a detailed explanation for the above.

awk '                               ##Starting awk program from here.
FNR==NR{                            ##Checking condition FNR==NR which will be TRUE when gene Input_file is being read.
  a[$1]=$2                          ##Creating array a with index of $1 and value of $2 here.
  next                              ##next will skip all further statements from here.
}
/^>/{                               ##Checking condition if line starts with > then do following.
  ind=substr($1,2)                  ##Creating ind which has the substring of the first field from the 2nd character onward.
}
/^>/ && (ind in a){                 ##Checking if line starts with > and ind is present in array a then do following.
  found=1                           ##Setting found to 1 here.
  val=ind OFS a[ind]                ##Creating val which has ind, OFS and the value of a with index of ind.
  next                              ##next will skip all further statements from here.
}
found{                              ##Checking condition if found is NOT NULL then do following.
  print substr($0,2,6) OFS val      ##Printing substring from 2nd to 7th character, OFS and val here.
  val=found=""                      ##Nullifying val and found here.
}
' gene fastafile                    ##Mentioning Input_file names here.
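
The question asks for a tab between the 6-mer and the gene header, while the program above joins everything with the default OFS (a single space). A minimal sketch of one way to get the tab, assuming the sample data from the question; the scratch directory and the compacted header rule are my own illustration, not part of the answer above:

```shell
# Hypothetical scratch files holding the sample data from the question.
workdir=$(mktemp -d)
cat > "$workdir/gene" <<'EOF'
mmu-let-7g-5p MIMAT0000121
mmu-let-7i-5p MIMAT0000122
EOF
cat > "$workdir/fastafile" <<'EOF'
>mmu-let-7g-5p MIMAT0000121
UGAGGUAGUAGUUUGUACAGUU
>mmu-let-7i-5p MIMAT0000122
UGAGGUAGUAGUUUGUGCUGUU
>mmu-let-7f-5p MIMAT0000525
UGAGGUAGUAGAUUGUAUAGUU
EOF

result=$(awk '
FNR==NR { a[$1]=$2; next }          # first file: gene name -> accession
/^>/ {
  ind=substr($1,2)                  # header name without the leading ">"
  found=(ind in a)                  # 1 if this gene is wanted, else 0
  if (found) val=ind " " a[ind]     # rebuild "name accession" with a space
  next
}
found {
  print substr($0,2,6) "\t" val     # explicit tab between 6-mer and header
  found=0
}
' "$workdir/gene" "$workdir/fastafile")

printf '%s\n' "$result"
rm -rf "$workdir"
```

The tab is written literally in the print statement rather than through OFS, so the space inside the header is left alone.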

Answer 2

Another awk:

$ awk '
{
    gsub(/ +$/,"")                 # clean trailing space from sample data
}
NR==FNR {                          # process subset file first as it is smaller
    a[$0]                          # hash keys
    next
}                                  # process fasta file
/^>/ && ((p=substr($0,2)) in a) {  # if string found in hash
    if((getline)>0)                # read next record
        print substr($0,2,6),p     # and print
}' subset fasta

Output:

GAGGUA mmu-let-7g-5p MIMAT0000121
GAGGUA mmu-let-7i-5p MIMAT0000122
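
To see why the trailing-space cleanup matters here: the whole line is used as the hash key, so a stray space in either file makes the lookup miss. A small sketch, assuming sample data with the same trailing spaces as in the question; the scratch files and the variable names raw and cleaned are hypothetical:

```shell
workdir=$(mktemp -d)
# printf keeps the trailing spaces exactly as written in the arguments.
printf '%s\n' 'mmu-let-7g-5p MIMAT0000121  ' 'mmu-let-7i-5p MIMAT0000122' \
  > "$workdir/subset"
printf '%s\n' \
  '>mmu-let-7g-5p MIMAT0000121 ' 'UGAGGUAGUAGUUUGUACAGUU' \
  '>mmu-let-7i-5p MIMAT0000122 ' 'UGAGGUAGUAGUUUGUGCUGUU' \
  '>mmu-let-7f-5p MIMAT0000525 ' 'UGAGGUAGUAGAUUGUAUAGUU' \
  > "$workdir/fasta"

# Without the cleanup: keys and headers differ by trailing spaces.
raw=$(awk '
NR==FNR { a[$0]; next }
/^>/ && ((p=substr($0,2)) in a) { if ((getline) > 0) print substr($0,2,6), p }
' "$workdir/subset" "$workdir/fasta")

# With the cleanup: both sides are trimmed before the comparison.
cleaned=$(awk '
{ gsub(/ +$/,"") }
NR==FNR { a[$0]; next }
/^>/ && ((p=substr($0,2)) in a) { if ((getline) > 0) print substr($0,2,6), p }
' "$workdir/subset" "$workdir/fasta")

printf 'without cleanup: [%s]\n' "$raw"
printf 'with cleanup:\n%s\n' "$cleaned"
rm -rf "$workdir"
```

With this data the first run should print nothing between the brackets, and the second should print the two expected lines.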

Answer 3

Tested with GNU awk, but I think it will work with any awk:

$ awk 'NR==FNR{a[$0]; next}
       $1 in a{print substr($2, 2, 6), $1}
      ' gene.txt RS='>' FS='\n' OFS='\t' fasta.txt
GAGGUA  mmu-let-7g-5p MIMAT0000121
GAGGUA  mmu-let-7i-5p MIMAT0000122

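NR==FNR{a[$0]; next} builds an array with each line's content as a key, for the first file passed to awk
RS='>' FS='\n' OFS='\t' sets the input record separator to >, the input field separator to newline and the output field separator to tab, only for the second file (because these variables are assigned after the first filename)
$1 in a{print substr($2, 2, 6), $1} if the first field exists as a key in array a, print the required details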
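
To make the record splitting concrete, here is a small sketch that only prints what awk sees as $1 and $2 once RS is > and FS is newline. The input is two records from the question's sample; the bracketed output format and the variable name records are just for illustration:

```shell
records=$(printf '%s\n' \
  '>mmu-let-7g-5p MIMAT0000121' 'UGAGGUAGUAGUUUGUACAGUU' \
  '>mmu-let-7i-5p MIMAT0000122' 'UGAGGUAGUAGUUUGUGCUGUU' |
awk 'BEGIN { RS=">"; FS="\n" }
     # The text before the first ">" is an empty record, so skip it.
     $1 != "" { print "header=[" $1 "] seq=[" $2 "]" }')

printf '%s\n' "$records"
```

Each fasta entry becomes one record, with the header line as the first field and the sequence as the second, which is why $1 can be looked up directly in the array.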

If there can be trailing whitespace characters at the end of lines, use:

$ awk 'NR==FNR{sub(/[[:space:]]+$/, ""); a[$0]; next}
       $1 in a{print substr($2, 2, 6), $1}
      ' gene.txt RS='>' FS='[[:space:]]*\n' OFS='\t' fasta.txt
