Trim multi fasta 文件中的前 N 个碱基,使用 awk 并以最大宽度格式打印
Trim first N bases in multi fasta file with awk and print with max width format
背景
multi fasta 格式包含多条序列记录,每条记录以单行描述开头,然后是几行序列(RNA、DNA、蛋白质)。描述行以大于号开头,“>”后面是序列的标识,该行的其余部分是记录的描述(两者都是可选的)。
在 fasta 文件中,行序列很常见 格式化为最大宽度
输入示例,最大宽度="70":
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
MMTYIVFILSTIFVVSFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGY
TTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEA
MGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYT
TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSI
GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYT
TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSI
GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
我的努力
我写了一个 gawk 脚本,它确实在序列中留下了 trim(切割前 N 个碱基),如果序列的长度小于 trim 则不打印它。
gawk -v left_trim=15 -v width=70 '
BEGIN{RS=">"}
NR==1{next}; #for blank record
{
split ([=11=], a, "\n");
sequence="";
for(i=2; i<=length(a); i++){
sequence=sequence a[i];
};
if(length(sequence)>left_trim) {
printf(">%s\n%s\n",a[1], substr(sequence, left_trim+1));
}
}' test.fasta
结果
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
预计
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
问题
如何实现宽度格式?我尝试使用 ">%s\n%.70s\n"
,打印前 70 个字母,">%s\n%70s\n"
打印所有
我该如何改进此连接?退出函数?
sequence="";
for(i=2; i<=length(a); i++){
sequence=sequence a[i];
};
要回答您的特定问题,您可以使用 *
格式修饰符指定输出字段的宽度:
$ awk 'BEGIN{printf "%s\n", "foo"}'
foo
$ awk 'BEGIN{printf "%*s\n", 10, "foo"}'
foo
不,没有 join
函数可以将数组重新组合成一个字符串(与 split()
相反)但是如果你只是让 awk 将记录分成字段而不是你手动将记录拆分为一个元素数组,然后您可以为任何字段分配一个新值,awk 会将这些字段重新编译为 $0,就像我在下面的第一个解决方案中删除第一个 [ 时故意产生的副作用一样=25=] 与 = ""
.
下面是我处理任务的方式:
$ cat tst.awk
BEGIN { RS=">"; FS="\n"; OFS="" }
NR>1 {
print RS
= ""
for ( start=left_trim+1; start<=length(); start+=width ) {
print substr([=11=],start,width)
}
}
$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
或者如果您愿意:
$ cat tst.awk
/^>/ { prtRec(); rec=""; print; next }
{ rec = rec [=12=] }
END { prtRec() }
function prtRec( start) {
for ( start=left_trim+1; start<=length(rec); start+=width ) {
print substr(rec,start,width)
}
}
$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
背景
multi fasta 格式包含多条序列记录,每条记录以单行描述开头,然后是几行序列(RNA、DNA、蛋白质)。描述行以大于号开头,“>”后面是序列的标识,该行的其余部分是记录的描述(两者都是可选的)。
在 fasta 文件中,行序列很常见 格式化为最大宽度
输入示例,最大宽度="70":
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus] MMTYIVFILSTIFVVSFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGY TTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEA MGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki] MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYT TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSI GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus] MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYT TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSI GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
我的努力
我写了一个 gawk 脚本,它确实在序列中留下了 trim(切割前 N 个碱基),如果序列的长度小于 trim 则不打印它。
gawk -v left_trim=15 -v width=70 '
BEGIN{RS=">"}
NR==1{next}; #for blank record
{
split ([=11=], a, "\n");
sequence="";
for(i=2; i<=length(a); i++){
sequence=sequence a[i];
};
if(length(sequence)>left_trim) {
printf(">%s\n%s\n",a[1], substr(sequence, left_trim+1));
}
}' test.fasta
结果
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus] SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki] FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus] FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
预计
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus] SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV VTGWSLLIGVLVIMEVTRGN >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki] FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV TGWTLLVGVYIVIEIARGN >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus] FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV TGWTLLVGVYIVIEIARGN
问题
如何实现宽度格式?我尝试使用
">%s\n%.70s\n"
,打印前 70 个字母,">%s\n%70s\n"
打印所有我该如何改进此连接?退出函数?
sequence=""; for(i=2; i<=length(a); i++){ sequence=sequence a[i]; };
要回答您的特定问题,您可以使用 *
格式修饰符指定输出字段的宽度:
$ awk 'BEGIN{printf "%s\n", "foo"}'
foo
$ awk 'BEGIN{printf "%*s\n", 10, "foo"}'
foo
不,没有 join
函数可以将数组重新组合成一个字符串(与 split()
相反)但是如果你只是让 awk 将记录分成字段而不是你手动将记录拆分为一个元素数组,然后您可以为任何字段分配一个新值,awk 会将这些字段重新编译为 $0,就像我在下面的第一个解决方案中删除第一个 [ 时故意产生的副作用一样=25=] 与 = ""
.
下面是我处理任务的方式:
$ cat tst.awk
BEGIN { RS=">"; FS="\n"; OFS="" }
NR>1 {
print RS
= ""
for ( start=left_trim+1; start<=length(); start+=width ) {
print substr([=11=],start,width)
}
}
$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
或者如果您愿意:
$ cat tst.awk
/^>/ { prtRec(); rec=""; print; next }
{ rec = rec [=12=] }
END { prtRec() }
function prtRec( start) {
for ( start=left_trim+1; start<=length(rec); start+=width ) {
print substr(rec,start,width)
}
}
$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN