vi regex: find sequences on specific chromosomes that contain the KSP protein fragment

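I have a FASTA file of protein sequences, and I want to find the sequences on chromosome 2 or 5 that contain the three amino acids KSP. How should I write the pattern?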

Here is a brief overview of the FASTA file:

>AT1G05230.1
MFEPNMLLAAMNNADSNNHNYNHEDNNNEGFLRDDEFDSPNTKSGSENQEGGSGNDQDPLHPNKKKRYHRHTQLQIQEME
. . . .
DFLRDENSRNEWDILSNGGVVQEMAHIANGRDTGNCVSLLRVNSANSSQSNMLILQESCTDPTASFVIYAPVDIVAMNIV
LNGGDPDYVALLPSGFAILPDGNANSGAPGGDGGSLLTVAFQILVDSVPTAKLSLGSVATVNNLIACTVERIKASMSCET
A*

>AT1G05230.2
MFEPNMLLAAMNNADSNNHNYNHEDNNNEGFLRDDEFDSPNTKSGSENQEGGSGNDQDPLHPNKKKRYHRHTQLQIQEME
. . . . . .
DFLRDENSRNEWDILSNGGVVQEMAHIANGRDTGNCVSLLRVNSANSSQSNMLILQESCTDPTASFVIYAPVDIVAMNIV
LNGGDPDYVALLPSGFAILPDGNANSGAPGGDGGSLLTVAFQILVDSVPTAKLSLGSVATVNNLIACTVERIKASMSCET
A*

>AT2G35940.1
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
..........
....KSP......TNYHMNPNHNGDLEGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
..........
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*

>AT2G35940.2
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
................................................................................
RAWLFEHFLHPYPKDSDKHMLAKQTGLTRSQVSNWFINARVRLWKPMVEEMYMEEMKEQAKNMGSMEKTPLDQSNEDSAS
.....KSP..................EGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
................................................................................
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*

>AT3G03660.1
MDQEQTPHSPTRHSRSPPSSASGSTSAEPVRSRWSPKPEQILILESIFHSGMVNPPKEETVRIRKMLEKFGAVGDANVFY
................................................................................
VPLPTDEFGFLMHSLQHGEAYFLVPRQT*

>AT3G11260.1
MSFSVKGRSLRGNNNGGTGTKCGRWNPTVEQLKILTDLFRAGLRTPTTDQIQKISTELSFYGKIESKNVFYWFQNHKARE
................................................................................
PYSSCGAEMEHPPPLDLRLSFL*

>AT3G61890.1
MEEGDFFNCCFSEISSGMTMNKKKMKKSNNQKRFSEEQIKSLELIFESETRLEPRKKVQVARELGLQPRQVAIWFQNKRA
...KSP..........................................................................
RLDQGSVLCNDGDYNNNIKTEYFGFEEETDHELMNIVEKADDSCLTSSENWGGFNSDSLLDQSSSNYPNWWEFWS*

................................................................................
(lots of sequences)
................................................................................

>AT5G11060.1
MAFHNNHFNHFTDQQQHQPPPPPQQQQQQHFQESAPPNWLLRSDNNFLNLHTAASAAATSSDSPSSAAANQWLSRSSSFL
................................................................................
SVLKSWWQSHSKWPYPTEEDKARLVQETGLQLKQINNWFINQRKRNWHSNPSSSTVSKNKRRSNAGENSGRDR*

>AT5G15150.1
MYMYEEERNNINNNQEGLRLEMAFPQHGFMFQQLHEDNAHHLPSPTSLPSCPPHLFYGGGGNYMMNRSMSFTGVSDHHHL
..KSP...........TTTNNMNDQDQVGEEDNLSDDGSHMMLGEKKKRLNLEQVRALEKSFELGNKLEPERKMQLAKAL
QNRRARWKTKQLERDYDSLKKQFDVLKSDNDSLLAHNKKLHAELVALKKHDRKESAKIKREFAEASWSNNGSTENNHNNN
SSDANHVSMIKDLFPSSIRSATATTTSTHIDHQIVQDQDQGFCNMFNGIDETTSASYWAWPDQQQQHHNHHQFN*

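First, I can write a pattern that matches chromosome 2 or 5, such as >AT[25]G. But when I wrote the pattern (>AT[25]G.*KSP.*) to match the sequences that meet both conditions, it failed.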

By the way, every sequence starts with a greater-than sign > and ends with an asterisk *, and all amino acids are uppercase.

For example, the expected result would be the sequences on chromosomes 2 and 5 that contain the three amino acids KSP:

>AT2G35940.1
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
..........
....KSP......TNYHMNPNHNGDLEGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
..........
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*

>AT2G35940.2
MAAYFHGNPPEISAGSDGGLQTLILMNPTTYVQYTQQDNDSNNNNNSNNSNNNNTNTNTNNNNSSFVFLDSHAPQPNASQ
................................................................................
RAWLFEHFLHPYPKDSDKHMLAKQTGLTRSQVSNWFINARVRLWKPMVEEMYMEEMKEQAKNMGSMEKTPLDQSNEDSAS
.....KSP..................EGVTGMQGSPKRLRTSDETMMQPINADFSSNEKLTMKILEERQGIRSDGGYPFM
................................................................................
NGGSSTTTAHSSAAAAAAYNGMNIQNQKRYVAQLLPDFVA*

>AT5G15150.1
MYMYEEERNNINNNQEGLRLEMAFPQHGFMFQQLHEDNAHHLPSPTSLPSCPPHLFYGGGGNYMMNRSMSFTGVSDHHHL
..KSP...........TTTNNMNDQDQVGEEDNLSDDGSHMMLGEKKKRLNLEQVRALEKSFELGNKLEPERKMQLAKAL
QNRRARWKTKQLERDYDSLKKQFDVLKSDNDSLLAHNKKLHAELVALKKHDRKESAKIKREFAEASWSNNGSTENNHNNN
SSDANHVSMIKDLFPSSIRSATATTTSTHIDHQIVQDQDQGFCNMFNGIDETTSASYWAWPDQQQQHHNHHQFN*

How do I write a regular expression in vim to match them? I hope someone can help. Thank you very much for reading my question.

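That is a multi-line search. Your pattern fails because in Vim a dot does not match a line break, so .* never gets past the header line. Try something like the following and modify it as needed. I included newlines, tabs, alphanumerics, and the literal dot in the matching character class. Since neither > nor * is in the class, a match cannot run past the end of one record into the next.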

^>AT[25]G[\t\n[:alnum:].]*KSP[\t\n[:alnum:].]*\*$
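
If you want to sanity-check the idea outside Vim, here is a minimal Python sketch of the same pattern. It is not Vim syntax: the POSIX class [:alnum:] is spelled out as A-Za-z0-9, and the names RECORD, find_ksp_records, and the sample records are made up for illustration rather than taken from the question's data.

```python
import re

# Python analogue of the Vim pattern above (an approximation, not Vim syntax).
# The class allows tabs, newlines, letters, digits and a literal dot, so the
# match can span several lines but cannot cross '*' or '>' into another record.
RECORD = re.compile(
    r"^>AT[25]G[\t\nA-Za-z0-9.]*KSP[\t\nA-Za-z0-9.]*\*$",
    re.MULTILINE,
)


def find_ksp_records(fasta_text):
    """Return each record on chromosome 2 or 5 whose body contains KSP."""
    return RECORD.findall(fasta_text)


# Tiny invented sample, shaped like the file in the question.
sample = """>AT1G00001.1
MKSPAA
TT*

>AT2G00002.1
MAAY
..KSP..
NGG*

>AT3G00003.1
MDQ
..KSP..*

>AT5G00004.1
MAFH
SVLK*

>AT5G00005.1
MYMY
..KSP...
QFN*
"""

for record in find_ksp_records(sample):
    # Print only the header line of each matching record.
    print(record.splitlines()[0])
```

Like the Vim pattern, this sketch only finds KSP when all three letters sit on the same line; a KSP that is split across a line wrap in the FASTA file would be missed.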