Move newline character 5 positions downstream in a text (FASTA) file

I am trying to convert a text file like this (FASTA format):

>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG

The objective is to move the newline character 5 positions downstream, except on lines that begin with >:

>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG

I would like to use AWK, but I am not sure how to proceed. I was thinking of something along these lines:

awk '{for(i=1;i<=NR;i++){ if( ~ /^>/){¿?¿?¿?}}}'

Do you know how I could solve this?

I would do it the following way. Let the content of file.txt be

>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG

then

awk 'BEGIN{width=24}/>/&&x{print x;x=""}/>/{print;next}{x = x $0}length(x)>=width{print substr(x,1,width);x=substr(x,width+1)}END{print x}' file.txt

gives the output

>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG

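Explanation: I set width to 24, the desired number of characters per line. If a > is found and something is stored in x, print it and reset x to the empty string. If a line containing > is encountered, print it and go to the next line. For every other line, append the current line's content to x; if the length of x is equal to or greater than width, print the first width characters of x and remove those characters from x. After all lines are processed, print x. Disclaimer: this solution prints at most one chunk per input line, so it assumes that the current line width does not exceed the desired width.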

(GNU Awk 5.0.1)
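
If the input lines might be longer than the desired width, the single length(x)>=width test could be turned into a loop so that every full-width chunk is emitted. The following is only a minimal sketch of that variant, not the code above: the rewrap function name and the sample data are made up for illustration, and it reads from standard input.

```shell
# Hypothetical helper: rewrap FASTA sequence lines to a given width (default 24).
rewrap() {
  awk -v width="${1:-24}" '
    /^>/ { if (x != "") print x        # flush leftover sequence data
           x = ""
           print                       # print the header line unchanged
           next }
    { x = x $0                         # append sequence data to the buffer
      while (length(x) >= width) {     # emit every full-width chunk
        print substr(x, 1, width)
        x = substr(x, width + 1)
      } }
    END { if (x != "") print x }       # flush the final partial line, if any
  '
}

# Usage: a 30-character line rewrapped to width 8
printf '>seq1\nAAAAAAAAAACCCCCCCCCCGGGGGGGGGG\n' | rewrap 8
```

With this sample input the sketch prints the header followed by AAAAAAAA, AACCCCCC, CCCCGGGG and GGGGGG. Testing x against the empty string also avoids printing a blank line when a sequence length is an exact multiple of width.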

Assumptions:

  • all data lines are to be rewrapped to at most 24 characters

One awk idea:

awk -v width=24 '                               # pass width in as awk variable "width"
function print_sequence() {
    if (sequence)                               # if sequence is not blank
       while (sequence) {                       # while sequence is not blank
             print substr(sequence,1,width)     # print 1st 24 characters
             sequence=substr(sequence,width+1)  # remove 1st 24 characters
       }
}

/^>/ { print_sequence()                         # flush previous set of data to stdout
       print                                    # print current input line
       next                                     # process next input line
     }
     { sequence=sequence $0 }                   # append data to our "sequence" variable

END  { print_sequence() }                       # flush last set of data to stdout
' fasta.in > fasta.out

This generates:

$ cat fasta.out
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG

You can try another approach, using awk's field and record separators:

awk -v width=24 '
  BEGIN {
    FS="\n"                            # Set the Field separator to newline
    RS=">"                             # Set the Record separator to ">"
    ORS=OFS=""                         # Set the Output Record and Field separator to an empty string
  }

  NR>1 {                               # Using ">" as a record separator the first record is empty, so skip
    header=$1                          # Using "\n" as the Field separator, $1 contains the header, save it in a variable
    $1=OFS                             # Assign an empty string to $1 so the record gets recalculated and the body becomes $0
                                       # with all newlines removed, since OFS == ""
    gsub(".{" width "}", "&" FS)       # Append a newline (FS) after every "width" characters
    print RS header FS $0 FS           # Print a ">", the header, a newline, the body and a newline
  }
' fasta_in > fasta_out

Assuming lines starting with > are no longer than 24 characters:

$ awk '{printf "%s", (/^>/ ? sep $0 ORS : $0); sep=ORS} END{print ""}' file | fold -w24
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
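
Whichever variant is used, one way to gain some confidence in the result is to check that only the line breaks changed. A rough sketch of such a check follows. The unwrap helper, the temporary file names and the sample data are all assumptions made for illustration; the rewrapping step reuses the awk | fold pipeline with a width of 12.

```shell
# Hypothetical helper: print each header followed by its sequence on a single line.
unwrap() {
  awk '/^>/ { if (n++) print ""        # terminate the previous sequence line
              print                    # print the header line unchanged
              next }
       { printf "%s", $0 }             # sequence data without its newline
       END { if (n) print "" }' "$@"
}

tmp=$(mktemp -d)
printf '>seq1\nAAAAACCCCC\nGGGGGTTTTT\n>seq2\nACGTACGTAC\nGT\n' > "$tmp/in.fa"

# Rewrap to width 12 (illustrative value) with the awk | fold pipeline
awk '{printf "%s", (/^>/ ? sep $0 ORS : $0); sep=ORS} END{print ""}' "$tmp/in.fa" |
  fold -w12 > "$tmp/out.fa"

# The unwrapped forms should match if only the line breaks moved
if [ "$(unwrap "$tmp/in.fa")" = "$(unwrap "$tmp/out.fa")" ]; then
  echo "sequences identical"
else
  echo "sequences differ"
fi
rm -r "$tmp"
```

For this sample the check should report that the sequences are identical, since both files unwrap to the same headers and sequence strings.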