如何根据日期对 FASTA 文件进行排序?

How to sort a FASTA file based on date?

我有一个如下所示的 FASTA 文件

>Spike|hCoV-19/Wuhan/WIV04/2019|2019-12-30|EPI_ISL_402124|Original|hCoV-19^^Hubei|Human|Wuhan Jinyintan Hospital|Wuhan Institute of Virology|Shi|China
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
>Spike|hCoV-19/Philippines/PH-PGC-03696/2020|2020-12-23|EPI_ISL_2155626|Original|hCoV-19^^Central Luzon|Human|Research Institute for Tropical Medicine|Philippine Genome Center|Tablizo|Philippines
MFVFLVLLPLVFSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYYPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
>Spike|hCoV-19/Belgium/UZA-UA-8350/2021|2021-01-22|EPI_ISL_940774|Original|hCoV-19^^Berchem|Human|Platform BIS UZA/UAntwerpen|UAntwerp|Xavier|Belgium
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNTVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAQHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCT*

我需要根据日期列对这些序列进行排序,我在堆栈溢出中发现了这段代码,但它无法完成

的工作
from Bio.SeqIO.FastaIO import SimpleFastaParser
import pandas as pd

with open('F:/newone.fasta') as fasta_file: 
    identifiers = []
    lengths = []
    seq = []
    for title, sequence in SimpleFastaParser(fasta_file):
        identifiers.append(title.split(None, 3)[0])  
        lengths.append(len(sequence))
        seq.append(sequence)
#converting lists to pandas Series    
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')
s3 = pd.Series(seq, name='seq')


Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])

这是第二个代码出现的错误,我不知道为什么会这样

IndexError                                Traceback (most recent call last)
in <module>
     12     SeqIO.write(records, output_file, "fasta")
     13 
---> 14 sort_fasta(input_file, output_file)

 in sort_fasta(input_file, output_file)
      8     def get_data(id_name):
      9         return (id_name.split("|")[2], seguid(id_name))
---> 10     dict_fasta = SeqIO.index(input_file, "fasta", key_function=get_data)
     11     records = (dict_fasta[i] for i in sorted(list(dict_fasta), reverse=True, key = lambda d: list(map(int, d[0].split('-')))))
     12     SeqIO.write(records, output_file, "fasta")

~\anaconda3\envs\deeplearning\lib\site-packages\Bio\SeqIO\__init__.py in index(filename, format, alphabet, key_function)
    873         key_function,
    874     )
--> 875     return _IndexedSeqFileDict(
    876         proxy_class(filename, format), key_function, repr, "SeqRecord"
    877     )

~\anaconda3\envs\deeplearning\lib\site-packages\Bio\File.py in __init__(self, random_access_proxy, key_function, repr, obj_repr)
    185             offset_iter = random_access_proxy
    186         offsets = {}
--> 187         for key, offset, length in offset_iter:
    188             # Note - we don't store the length because I want to minimise the
    189             # memory requirements. With the SQLite backend the length is kept

~\anaconda3\envs\deeplearning\lib\site-packages\Bio\File.py in <genexpr>(.0)
    181         self._obj_repr = obj_repr
    182         if key_function:
--> 183             offset_iter = ((key_function(k), o, l) for (k, o, l) in random_access_proxy)
    184         else:
    185             offset_iter = random_access_proxy

in get_data(id_name)
      7 def sort_fasta(input_file, output_file):
      8     def get_data(id_name):
----> 9         return (id_name.split("|")[2], seguid(id_name))
     10     dict_fasta = SeqIO.index(input_file, "fasta", key_function=get_data)
     11     records = (dict_fasta[i] for i in sorted(list(dict_fasta), reverse=True, key = lambda d: list(map(int, d[0].split('-')))))

我该怎么办?

您可以在 \n> 上拆分字符串,并使用 sortedre.search 的组合对提取的日期进行排序,将日期设置为 key

使用 reverse=True 作为 sorted 的选项以首先获取最近的日期。

我假设字符串 fasta 作为输入。

import re
sorted_fasta = ('>'+'\n>'.join(sorted(fasta[1:].strip().split('\n>'),
                                      key=lambda s: re.search(r'\|\d{4}-\d{2}-\d{2}\|',
                                                              s).group()
                                      )
                               )
               )

示例输入:

>xxx|2020-12-30|xxx
NNN
>yyy|2020-12-23|yyy
NNN
>zzz|2021-01-22|zzz
NNN

匹配输出:

>yyy|2020-12-23|yyy
NNN
>xxx|2020-12-30|xxx
NNN
>zzz|2021-01-22|zzz
NNN

使用以下代码,使用 SeqIO 索引函数对输入文件中的 fasta 条目进行排序并保存在输出文件中。因此,该函数也应该适用于内存无法容纳的大文件。

import re
from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

input_file = "fasta.fasta"
output_file = "out.fasta"

def sort_fasta(input_file: str, output_file: str) -> None:
    def get_index_key(id_name: str) -> tuple:
        try:
            key = (re.search(r'\d{4}-\d{2}-\d{2}', id_name).group(), seguid(id_name))
        except AttributeError:
            key = ('0001-01-01', seguid(id_name))
        return key
    dict_fasta = SeqIO.index(input_file, "fasta", key_function=get_index_key)
    sorted_keys_by_date = sorted(list(dict_fasta), reverse=True, key = lambda d: list(map(int, d[0].split('-'))))
    records = (dict_fasta[i] for i in sorted_keys_by_date if i[0] != '0001-01-01')
    SeqIO.write(records, output_file, "fasta")

sort_fasta(input_file, output_file)