Using awk to print a header name and a substring

Question

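I am trying to use this code to print a header with the gene name and then extract a substring based on its start and end positions, but it does not work: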

>output_file
cat input_file | while read row; do
        echo $row > temp
        geneName=`awk '{print $1}' tmp`
        startPos=`awk '{print $2}' tmp`
        endPOs=`awk '{print $3}' tmp`
                for i in temp; do
                echo ">${geneName}" >> genes_fasta ;
                echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta" >> genes_fasta
        done
done

input_file

nad5_exon1 250405 250551
nad5_exon2 251490 251884
nad5_exon3 195620 195641
nad5_exon4 154254 155469
nad5_exon5 156319 156548

fasta

atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc............

This is my incorrect output file:

>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta

The output should look like this:

>name1
atgcatgcatgcatgcatgcat
>name2
tgcatgcatgcatgcat
>name3
gcatgcatgcatgcatgcat
>namen....

Answer 1

If I understand correctly:

awk 'NR==FNR {fasta = fasta $0; next}
    {
        printf(">%s %s\n", $1, substr(fasta, $2, $3 - $2 + 1))
    }' fasta input_file > genes_fasta

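It first reads the fasta file and stores the sequence in the variable fasta.
Then it reads input_file line by line and extracts the substring of fasta that starts at $2 and has length $3 - $2 + 1. (Note that the third argument of substr is a length, not an end position.)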
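
To try the two-file idiom in isolation, here is a small self-contained sketch. The file names demo.fasta, demo_regions and demo_genes.fasta are made up for the illustration, the sequence is deliberately wrapped over two lines, and the header is printed on its own line to match the layout requested in the question:

```shell
# Hypothetical demo files; the names are illustrative, not from the question.
printf '%s\n' 'atgcatgcat' 'gcatgcatgc' > demo.fasta    # sequence wrapped over two lines
printf '%s\n' 'geneA 1 4' 'geneB 8 13' > demo_regions   # name, start, end (1-based, inclusive)

# First file (NR==FNR is true only there): append every line to one string.
# Second file: print ">name", then the substring of length end - start + 1.
awk 'NR==FNR {fasta = fasta $0; next}
     {printf(">%s\n%s\n", $1, substr(fasta, $2, $3 - $2 + 1))}' demo.fasta demo_regions > demo_genes.fasta

cat demo_genes.fasta
```

Because the lines are concatenated before any substring is taken, geneB can span the line break in demo.fasta without special handling.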

Hope this helps.

Answer 2

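You can do this with a single call to awk, which will be orders of magnitude more efficient than looping in a shell script and calling awk 4 times per iteration. Since you have bash, you can use command substitution to read the contents of fasta into an awk variable, then output the heading followed by the substring of fasta that runs from the start position through the end position.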

For example:

awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input

Or use getline in a BEGIN rule:

awk 'BEGIN{getline fasta<"fasta"}
{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
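
One caveat with the BEGIN version: a single getline reads only the first line of the file, so it relies on the whole sequence sitting on one line. If the file were wrapped, or started with a ">" header line, a loop along these lines could join it first. This is only a sketch; wrapped.fasta, regions and wrapped_genes.fasta are hypothetical names chosen for the demo:

```shell
# Hypothetical demo files; the names are illustrative only.
printf '%s\n' '>demo_sequence' 'atgcatgc' 'atgcatgc' > wrapped.fasta
printf '%s\n' 'geneA 3 10' > regions

awk 'BEGIN {
         # getline returns 1 for each line read, 0 at end of file, -1 on error,
         # so "> 0" stops the loop cleanly in both of the latter cases.
         while ((getline line < "wrapped.fasta") > 0)
             if (line !~ /^>/)          # skip the FASTA header line
                 fasta = fasta line
         close("wrapped.fasta")
     }
     {print ">" $1; print substr(fasta, $2, $3 - $2 + 1)}' regions > wrapped_genes.fasta

cat wrapped_genes.fasta
```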

Example Input Files

Note: the start and end values have been reduced to fit within the 129 characters of the example:

$ cat input
rad5_exon1 1 17
rad5_exon2 23 51
rad5_exon3 110 127
rad5_exon4 38 62
rad5_exon5 59 79

And the first 129 characters of your example fasta:

$ cat fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc

Example Use/Output

$ awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
>rad5_exon1
atgcatgcatgcatgca
>rad5_exon2
gcatgcatgcatgcatgcatgcatgcatg
>rad5_exon3
tgcatgcatgcatgcatg
>rad5_exon4
tgcatgcatgcatgcatgcatgcat
>rad5_exon5
gcatgcatgcatgcatgcatg

Look things over and let me know whether I understood your requirements. Also let me know if you have further questions about the solution.

Answer 3

It worked! Here is the script that extracts the substrings from the fasta file:

cat genes_and_bounderies1 | while read row; do
        echo $row > temp
        geneName=`awk '{print $1}' temp`
        startPos=`awk '{print $2}' temp`
        endPos=`awk '{print $3}' temp`
        length=$(expr $endPos - $startPos + 1)   # substr takes a length; +1 keeps the end position
                for i in temp; do
                echo ">${geneName}" >> genes_fasta
                awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' unwraped_${fasta} >> genes_fasta
        done
done
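
The temporary file and the three field-extracting awk calls are not strictly needed, because read can split each row into fields by itself. A minimal sketch of that variation, assuming a one-line sequence file and made-up file names (seq.fasta, bounds, out.fasta), with one awk call per row left in place:

```shell
# Hypothetical demo files; the names are illustrative only.
printf '%s\n' 'atgcatgcatgcatgc' > seq.fasta
printf '%s\n' 'geneA 1 5' 'geneB 6 12' > bounds

# read splits each row on whitespace into the three variables directly.
while read -r geneName startPos endPos; do
    printf '>%s\n' "$geneName"
    # substr takes (string, start, length); the length includes both endpoints.
    awk -v s="$startPos" -v e="$endPos" '{print substr($0, s, e - s + 1)}' seq.fasta
done < bounds > out.fasta

cat out.fasta
```

Redirecting the whole loop once also avoids reopening the output file on every row. For large inputs, the single-awk approaches in the other answers should still be faster.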
