Move newline character 5 positions downstream in a text (FASTA) file
I am trying to convert a text file like this (FASTA format):
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG
The objective is to move the newline character 5 positions downstream, except on lines that start with >:
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
I would like to use AWK, but I am not sure how to proceed. I was thinking of something like this:
awk '{for(i=1;i<=NR;i++){ if($i ~ /^>/){¿?¿?¿?}}}'
Do you know how I could solve this?
I would do it the following way. Let file.txt content be
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG
then
awk 'BEGIN{width=24}/>/&&x{print x;x=""}/>/{print;next}{x = x $0}length(x)>=width{print substr(x,1,width);x=substr(x,width+1)}END{print x}' file.txt
gives output
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
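Explanation: I set width to 24, the desired number of characters per line. If > is found and x holds something, print it and reset x to the empty string. If a line containing > is encountered, print it and go to the next line. For every other line, append the current line to x; if the length of x is equal to or greater than width, print the first width characters of x and remove those characters from x. After all lines are processed, print x. Disclaimer: this solution prints at most one chunk per input line, so it assumes that input lines are never longer than the desired width.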
(GNU Awk 5.0.1)
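The command above tests length(x)>=width only once per input line. A minimal sketch of a variant that loops instead, so it should also cope with input lines longer than the target width and does not print an empty last line when nothing is left over. The function name reflow and the file name sample.fa are made up for illustration:

```shell
# Hypothetical helper: reflow FASTA sequence lines of file $1 to width $2 (default 24).
reflow() {
    awk -v width="${2:-24}" '
        /^>/ {                              # header line
            if (x != "") print x            # flush leftover sequence of the previous record
            x = ""
            print
            next
        }
        {
            x = x $0                        # append the current sequence line
            while (length(x) >= width) {    # emit every full chunk, not just one
                print substr(x, 1, width)
                x = substr(x, width + 1)
            }
        }
        END { if (x != "") print x }        # flush the remainder, if any
    ' "$1"
}

# Sample input: the first record from the question.
printf '>seq1\nAAAAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAA\nATGATGATGGAATGAGGAT\nTTAGGAGGGAGGAAAATTC\n' > sample.fa
reflow sample.fa 24
```

Calling reflow sample.fa 10 exercises the case where the input lines (19 characters) are longer than the requested width.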
Assumptions:
- all data lines are to be extended to a maximum of 24 characters
One awk idea:
awk -v width=24 '                               # pass width in as awk variable "width"
function print_sequence() {
    if (sequence)                               # if sequence is not blank
       while (sequence) {                       # while sequence is not blank
             print substr(sequence,1,width)     # print 1st "width" characters
             sequence=substr(sequence,width+1)  # remove 1st "width" characters
       }
}
/^>/ { print_sequence()                         # flush previous set of data to stdout
       print                                    # print current input line
       next                                     # process next input line
     }
     { sequence=sequence $0 }                   # append data to our "sequence" variable
END  { print_sequence() }                       # flush last set of data to stdout
' fasta.in > fasta.out
This generates:
$ cat fasta.out
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
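One possible way to sanity-check a reflow like this is to join each record back into a single line and compare the result before and after; only the line breaks should differ. The sketch below is an illustration under assumptions: the helper name linearize and the toy files before.fa and after.fa are made up, and the first line of each file is assumed to be a header.

```shell
# Hypothetical helper: print each record as "header<TAB>sequence" on one line.
linearize() {
    awk '
        /^>/ { if (NR > 1) printf "\n"      # end the previous record
               printf "%s\t", $0            # header followed by a tab
               next }
             { printf "%s", $0 }            # sequence lines joined without newlines
        END  { if (NR > 0) printf "\n" }
    ' "$1"
}

printf '>seq1\nAAAA\nCCCC\nGG\n' > before.fa    # toy input, 4 characters per line
printf '>seq1\nAAAACC\nCCGG\n'   > after.fa     # same sequence reflowed to 6 per line

if [ "$(linearize before.fa)" = "$(linearize after.fa)" ]; then
    echo "sequences identical"
else
    echo "sequences differ"
fi
```

With the real files this would be called as linearize fasta.in and linearize fasta.out.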
You can try another approach, using awk's field and record separators:
awk -v width=24 '
    BEGIN {
        FS="\n"       # Set the field separator to newline
        RS=">"        # Set the record separator to ">"
        ORS=OFS=""    # Set the output record and field separators to an empty string
    }
    NR>1 {            # With ">" as the record separator the first record is empty, so skip it
        header=$1     # With "\n" as the field separator, $1 contains the header; save it in a variable
        $1=OFS        # Assign an empty string to $1 so the record gets recalculated and $0 becomes the body
                      # with all newlines removed, since OFS == ""
        gsub(".{" width "}", "&" FS)    # Append a newline (FS) after every "width" characters
        print RS header FS $0 FS        # Print a ">", the header, a newline, the body and a newline
    }
' fasta_in > fasta_out
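The step that does the real work here is the assignment to $1: changing any field makes awk rebuild $0 by joining all fields with OFS, and since OFS is empty the newlines between the sequence lines disappear. A small sketch of just that step on made-up toy input (the function name demo is hypothetical, and ORS is left at its default so the result ends with a newline):

```shell
# Hypothetical demo: show how assigning to $1 rebuilds $0 without newlines.
demo() {
    printf '>seq1\nAAAA\nCCCC\n' |
    awk 'BEGIN { FS="\n"; RS=">"; OFS="" }   # records split on ">", fields on newline
         NR>1  { header=$1                   # first field is the header text
                 $1=OFS                      # forces $0 to be rebuilt from the fields
                 print header ": " $0 }'     # $0 is now the joined sequence
}
demo    # prints: seq1: AAAACCCC
```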
Assuming that lines starting with > are no longer than 24 characters:
$ awk '{printf "%s", (/^>/ ? sep $0 ORS : $0); sep=ORS} END{print ""}' file | fold -w24
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
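To see why the header length matters, it may help to look at what the awk stage hands to fold: each header on its own line, followed by the whole sequence on a single line. fold then wraps every line longer than the width, headers included. A sketch on made-up toy data (the function name join_lines and the file toy.fa are hypothetical):

```shell
# Hypothetical wrapper around the awk stage of the pipeline above.
join_lines() {
    awk '{printf "%s", (/^>/ ? sep $0 ORS : $0); sep=ORS} END{print ""}' "$1"
}

printf '>seq1\nAAAA\nCCCC\n>seq2\nGGGG\nTT\n' > toy.fa

join_lines toy.fa               # headers, each followed by one long sequence line
join_lines toy.fa | fold -w6    # sequence lines wrapped at 6 characters
```

Here the headers are 5 characters long, so a width of 6 leaves them untouched; a width of 4 would split them as well.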