How to sort a FASTA file based on date?
I have a FASTA file that looks like this:
>Spike|hCoV-19/Wuhan/WIV04/2019|2019-12-30|EPI_ISL_402124|Original|hCoV-19^^Hubei|Human|Wuhan Jinyintan Hospital|Wuhan Institute of Virology|Shi|China
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
>Spike|hCoV-19/Philippines/PH-PGC-03696/2020|2020-12-23|EPI_ISL_2155626|Original|hCoV-19^^Central Luzon|Human|Research Institute for Tropical Medicine|Philippine Genome Center|Tablizo|Philippines
MFVFLVLLPLVFSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYYPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
>Spike|hCoV-19/Belgium/UZA-UA-8350/2021|2021-01-22|EPI_ISL_940774|Original|hCoV-19^^Berchem|Human|Platform BIS UZA/UAntwerpen|UAntwerp|Xavier|Belgium
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNTVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAQHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCT*
I need to sort these sequences by the date field in the header. I found this code on Stack Overflow, but it does not do the job:
from Bio.SeqIO.FastaIO import SimpleFastaParser
import pandas as pd

with open('F:/newone.fasta') as fasta_file:
    identifiers = []
    lengths = []
    seq = []
    for title, sequence in SimpleFastaParser(fasta_file):
        identifiers.append(title.split(None, 3)[0])
        lengths.append(len(sequence))
        seq.append(sequence)

#converting lists to pandas Series
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')
s3 = pd.Series(seq, name='seq')
Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])
This is the error raised by the second piece of code, and I don't know why it happens:
IndexError Traceback (most recent call last)
in <module>
12 SeqIO.write(records, output_file, "fasta")
13
---> 14 sort_fasta(input_file, output_file)
in sort_fasta(input_file, output_file)
8 def get_data(id_name):
9 return (id_name.split("|")[2], seguid(id_name))
---> 10 dict_fasta = SeqIO.index(input_file, "fasta", key_function=get_data)
11 records = (dict_fasta[i] for i in sorted(list(dict_fasta), reverse=True, key = lambda d: list(map(int, d[0].split('-')))))
12 SeqIO.write(records, output_file, "fasta")
~\anaconda3\envs\deeplearning\lib\site-packages\Bio\SeqIO\__init__.py in index(filename, format, alphabet, key_function)
873 key_function,
874 )
--> 875 return _IndexedSeqFileDict(
876 proxy_class(filename, format), key_function, repr, "SeqRecord"
877 )
~\anaconda3\envs\deeplearning\lib\site-packages\Bio\File.py in __init__(self, random_access_proxy, key_function, repr, obj_repr)
185 offset_iter = random_access_proxy
186 offsets = {}
--> 187 for key, offset, length in offset_iter:
188 # Note - we don't store the length because I want to minimise the
189 # memory requirements. With the SQLite backend the length is kept
~\anaconda3\envs\deeplearning\lib\site-packages\Bio\File.py in <genexpr>(.0)
181 self._obj_repr = obj_repr
182 if key_function:
--> 183 offset_iter = ((key_function(k), o, l) for (k, o, l) in random_access_proxy)
184 else:
185 offset_iter = random_access_proxy
in get_data(id_name)
7 def sort_fasta(input_file, output_file):
8 def get_data(id_name):
----> 9 return (id_name.split("|")[2], seguid(id_name))
10 dict_fasta = SeqIO.index(input_file, "fasta", key_function=get_data)
11 records = (dict_fasta[i] for i in sorted(list(dict_fasta), reverse=True, key = lambda d: list(map(int, d[0].split('-')))))
What should I do?
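A note on the IndexError in the traceback above. The header that triggers it is not shown, so this is an assumption: id_name.split("|")[2] fails whenever the string passed to get_data has fewer than three |-separated fields. As far as I know, SeqIO.index hands the key function only the first whitespace-delimited word of the header, so a header with a space before the date field (a place name such as "South Africa", for example) would be cut short before the date. The sketch below reproduces that situation in plain Python; date_by_split, date_by_regex and the truncated header are hypothetical names and data made up for illustration.

```python
import re


def date_by_split(header: str) -> str:
    # Approach from the question: assumes the date is always the third field.
    return header.split("|")[2]


def date_by_regex(header: str, default: str = "0001-01-01") -> str:
    # Looks for a YYYY-MM-DD pattern anywhere and falls back to a sentinel.
    match = re.search(r"\d{4}-\d{2}-\d{2}", header)
    return match.group() if match else default


full_id = "Spike|hCoV-19/Wuhan/WIV04/2019|2019-12-30|EPI_ISL_402124"
# Hypothetical id cut at the first space, before the date field is reached.
truncated_id = "Spike|hCoV-19/South"

print(date_by_split(full_id))
print(date_by_regex(truncated_id))
try:
    date_by_split(truncated_id)
except IndexError as err:
    print("IndexError:", err)
```

If this is the cause, a regex with a fallback avoids the crash, but records whose id is cut before the date would still end up without a usable date.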
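You can split the string on '\n>' and sort the records with a combination of sorted and re.search, using the extracted date as the key. Because the dates are in YYYY-MM-DD form, sorting them as plain strings gives chronological order.
Pass reverse=True to sorted to get the most recent dates first.
I assume the FASTA content is available as a string named fasta.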
import re

sorted_fasta = '>' + '\n>'.join(
    sorted(
        fasta[1:].strip().split('\n>'),
        key=lambda s: re.search(r'\|\d{4}-\d{2}-\d{2}\|', s).group(),
    )
)
Example input:
>xxx|2020-12-30|xxx
NNN
>yyy|2020-12-23|yyy
NNN
>zzz|2021-01-22|zzz
NNN
Matching output:
>yyy|2020-12-23|yyy
NNN
>xxx|2020-12-30|xxx
NNN
>zzz|2021-01-22|zzz
NNN
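For reuse, the same idea can be wrapped in a small function. This is only a sketch: sort_fasta_text and its newest_first parameter are names I made up, and sorting dateless records under an empty key (so they come first in ascending order) is my own choice rather than something the one-liner does.

```python
import re

# Date field delimited by pipes, as in the headers shown in the question.
DATE_FIELD = re.compile(r"\|(\d{4}-\d{2}-\d{2})\|")


def sort_fasta_text(fasta: str, newest_first: bool = False) -> str:
    """Return the FASTA text with records ordered by the date in each header."""
    # Each chunk is "header\nsequence..." without the leading '>'.
    records = [r for r in fasta.strip().lstrip(">").split("\n>") if r]

    def date_key(record: str) -> str:
        header = record.split("\n", 1)[0]
        match = DATE_FIELD.search(header)
        # ISO dates sort correctly as strings; dateless records get "".
        return match.group(1) if match else ""

    records.sort(key=date_key, reverse=newest_first)
    return ">" + "\n>".join(records) + "\n"


example = (
    ">xxx|2020-12-30|xxx\nNNN\n"
    ">yyy|2020-12-23|yyy\nNNN\n"
    ">zzz|2021-01-22|zzz\nNNN\n"
)
print(sort_fasta_text(example), end="")
```

With the example input above this prints the records in the order yyy, xxx, zzz. Passing newest_first=True reverses that order.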
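The following code uses the SeqIO index function to sort the FASTA entries of the input file and save them to the output file. Because the file is indexed rather than loaded in full, the function should also work for large files that do not fit in memory.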
import re
from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

input_file = "fasta.fasta"
output_file = "out.fasta"


def sort_fasta(input_file: str, output_file: str) -> None:
    def get_index_key(id_name: str) -> tuple:
        # Key is (date, checksum of the id); ids without a date get a sentinel date.
        try:
            key = (re.search(r'\d{4}-\d{2}-\d{2}', id_name).group(), seguid(id_name))
        except AttributeError:
            key = ('0001-01-01', seguid(id_name))
        return key

    dict_fasta = SeqIO.index(input_file, "fasta", key_function=get_index_key)
    # Newest first; dates are compared numerically as [year, month, day].
    sorted_keys_by_date = sorted(list(dict_fasta), reverse=True, key=lambda d: list(map(int, d[0].split('-'))))
    # Records carrying the sentinel date are left out of the output.
    records = (dict_fasta[i] for i in sorted_keys_by_date if i[0] != '0001-01-01')
    SeqIO.write(records, output_file, "fasta")


sort_fasta(input_file, output_file)
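To illustrate why an index helps with large files, here is a plain-Python sketch of the general idea: keep only a date and a byte offset per record in memory, sort those, then copy the records across one at a time. It is not how Biopython is implemented, just the same principle under my own assumptions. The names index_by_date and write_sorted are hypothetical. Unlike the function above, this sketch keeps dateless records and writes them last when sorting newest first. It also searches the whole header line for the date, not just the first word.

```python
import re
from typing import List, Tuple

DATE = re.compile(rb"\d{4}-\d{2}-\d{2}")


def index_by_date(path: str) -> List[Tuple[bytes, int, int]]:
    """Return (date, start_offset, length) per record; sequences are not kept."""
    entries = []
    with open(path, "rb") as handle:
        offset = 0
        start = None
        date = b""
        for line in handle:
            if line.startswith(b">"):
                if start is not None:
                    # Close the previous record at the current offset.
                    entries.append((date, start, offset - start))
                start = offset
                match = DATE.search(line)
                date = match.group() if match else b""
            offset += len(line)
        if start is not None:
            entries.append((date, start, offset - start))
    return entries


def write_sorted(path: str, out_path: str, newest_first: bool = True) -> None:
    """Copy records from path to out_path in date order using the offset index."""
    entries = sorted(index_by_date(path), key=lambda e: e[0], reverse=newest_first)
    with open(path, "rb") as src, open(out_path, "wb") as dst:
        for _, start, length in entries:
            src.seek(start)
            chunk = src.read(length)
            # The last record of a file may lack a trailing newline.
            dst.write(chunk if chunk.endswith(b"\n") else chunk + b"\n")
```

A call such as write_sorted("fasta.fasta", "out.fasta") would then play the same role as sort_fasta(input_file, output_file), with memory use growing with the number of records rather than with the total sequence length.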