How to replace seqIDs in a FASTA file with new seqIDs using Biopython

I have a FASTA file that looks like this:

>00009c1cc42953fb4702f6331325c7cc
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGTTGTTAAGTCAGTGGTGAAATCGTGTGGCTCAACCATACGGAGCCATTGAAACTGGCGACCTTGAGTGTAAACGAGGTAGGCGGAATGTGACGTGTAGCGGTGAAATGCTTAGATATGTCACAGAACCCCGATTGCGAAGGCAGCTTACCAGCATACAACTGAC
>000118a5e731455e942c61a82a40367a623088d0
AGAGTTTTATCCTGGCTCAGGATGAACGCTAGCGGCAGGCCTAATACATGCAAGTCGGACGGGATCTAAATTTAAGCTTGCTTAAGTTTAGTGAGAGTGGCGCACGGGTGCGTAACGCGTGAGCAACCTACCCATATCAGGGGGATAGCCCGAAGAAATTCGGATTAACACCGCATAACACAGCAATCTCGCATGAGATCACTGTTAAATATTTATAGGATATGGATGGGCTCGCGTGACATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCAACGATGTCTAGGGGCTCTGAGAGGAGAATCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTAAGGATTATTGGTCAATGGAGGGAACTCTGAACCAGCCATGCCGCGTGCAGGATGACTGCCCTATGGGTTGTAAACTGCTTTTGTCTGGGAATAAACCTTGATTCGTGAATCAAGCTGAATGTACCAGAAGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCTTTATAAGTCAGAGGTGAAAGACGGCAGCTTAACTGTCGCAGTGCCTTTGATACTGTATAGCTTGAATATCGTTGAAGATGGCGGAATGAGACAAGTAGCGGTGAAATGCATAGATATGTCTCAGAACTCCGATTGCGAAGGCAGCTGTCTAAGCGGCAATTGACGCTGATGCACGAAAGCGTGGGGATCAAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGATAACTGGATGTTGGCGATACACAGTCAGCGTCTTAGCGAAAGCGTTAAGTTATCCACCTGGGGAGTACGCCCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAAGTTAGTGAATGCGACAGAGACGTCTCAGTCCTTCGGGACACGAAACTAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATGTTTAGTTGCCAGCATGTAATGATGGGGACTCTAAACAGACTGCCTGCGTAAGCAGCGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGTCCGGGGCTACACACGTGCTACAATGGATGGTACAGCGGGCAGCTACACAGCAATGTGATGCTAATCTCTAAAAGCCATTCACAGTTCGGATAGGGGTCTGCAACTCGACCCCATGAAGTTGGATTCGCTAGTAATCGCGTATCAGCAATGACGCGGT

I basically want to add the microbial taxonomy to each seqID, like this:

d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidales_RF16_group; g__Bacteroidales_RF16_group; s__uncultured_bacterium|00009c1cc42953fb4702f6331325c7cc

d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Sphingobacteriales; f__Sphingobacteriaceae; g__Sphingobacterium; s__uncultured_bacterium|000118a5e731455e942c61a82a40367a623088d0

Here the original seqID is appended to the taxonomy, with | as the separator.

Here is my original code. I made a list of the new seqIDs with the taxonomy attached, which I named 'newids_list':

with open('allmergedrep-seqsf.fasta') as original, open('allmergedrep-seqsf2.fasta', 'w') as corrected:
    for seq_record in SeqIO.parse(original, 'fasta'):
        if seq_record.id in newids_list:
            seq_record.id = seq_record.description = newids_list[seq_record.id]
        SeqIO.write(seq_record, corrected, 'fasta')

I made newids_list from a taxonomy file that has the same seqIDs as the FASTA file, already in the same order. Any help would be much appreciated!

Edit:

This is what the new FASTA file looks like (only the first two sequences are shown):

>00009c1cc42953fb4702f6331325c7cc
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGTTGT
TAAGTCAGTGGTGAAATCGTGTGGCTCAACCATACGGAGCCATTGAAACTGGCGACCTTG
AGTGTAAACGAGGTAGGCGGAATGTGACGTGTAGCGGTGAAATGCTTAGATATGTCACAG
AACCCCGATTGCGAAGGCAGCTTACCAGCATACAACTGAC
>000118a5e731455e942c61a82a40367a623088d0
AGAGTTTTATCCTGGCTCAGGATGAACGCTAGCGGCAGGCCTAATACATGCAAGTCGGAC
GGGATCTAAATTTAAGCTTGCTTAAGTTTAGTGAGAGTGGCGCACGGGTGCGTAACGCGT
GAGCAACCTACCCATATCAGGGGGATAGCCCGAAGAAATTCGGATTAACACCGCATAACA
CAGCAATCTCGCATGAGATCACTGTTAAATATTTATAGGATATGGATGGGCTCGCGTGAC
ATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCAACGATGTCTAGGGGCTCTGAGAGG
AGAATCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTAAGG
ATTATTGGTCAATGGAGGGAACTCTGAACCAGCCATGCCGCGTGCAGGATGACTGCCCTA
TGGGTTGTAAACTGCTTTTGTCTGGGAATAAACCTTGATTCGTGAATCAAGCTGAATGTA
CCAGAAGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGA
GCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCTTTATAAGTCAGAGGTGA
AAGACGGCAGCTTAACTGTCGCAGTGCCTTTGATACTGTATAGCTTGAATATCGTTGAAG
ATGGCGGAATGAGACAAGTAGCGGTGAAATGCATAGATATGTCTCAGAACTCCGATTGCG
AAGGCAGCTGTCTAAGCGGCAATTGACGCTGATGCACGAAAGCGTGGGGATCAAACAGGA
TTAGATACCCTGGTAGTCCACGCCCTAAACGATGATAACTGGATGTTGGCGATACACAGT
CAGCGTCTTAGCGAAAGCGTTAAGTTATCCACCTGGGGAGTACGCCCGCAAGGGTGAAAC
TCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAGCATGTGGTTTAATTCGATGATAC
GCGAGGAACCTTACCCGGGCTTGAAAGTTAGTGAATGCGACAGAGACGTCTCAGTCCTTC
GGGACACGAAACTAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTA
AGTCCCGCAACGAGCGCAACCCCTATGTTTAGTTGCCAGCATGTAATGATGGGGACTCTA
AACAGACTGCCTGCGTAAGCAGCGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCT
TACGTCCGGGGCTACACACGTGCTACAATGGATGGTACAGCGGGCAGCTACACAGCAATG
TGATGCTAATCTCTAAAAGCCATTCACAGTTCGGATAGGGGTCTGCAACTCGACCCCATG
AAGTTGGATTCGCTAGTAATCGCGTATCAGCAATGACGCGGT

It looks the same as the original file, only reformatted (the sequences are line-wrapped, for example). The seqIDs are unchanged.

Here is also my newids_list (the first few new IDs) for reference:

['d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidales_RF16_group; g__Bacteroidales_RF16_group; s__uncultured_bacterium|00009c1cc42953fb4702f6331325c7cc', 'd__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Sphingobacteriales; f__Sphingobacteriaceae; g__Sphingobacterium; s__uncultured_bacterium|000118a5e731455e942c61a82a40367a623088d0', 'd__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridia_UCG-014; f__Clostridia_UCG-014; g__Clostridia_UCG-014; s__uncultured_bacterium|0001536d70650564fec0c62905eeb73c']

I am basically trying to put the taxonomy in front of the seqID, joined by "|". Thanks!

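The main problem with the code is that you treat a list (your newids_list) as if it were a dict, and the bare sequence IDs are not actually in that list: each element is the full "taxonomy|id" string. The condition is therefore never true, so the renaming never runs.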
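To make the failure concrete, here is a small sketch using two stand-in entries shaped like the list in the question (the taxonomy is abbreviated purely for illustration). It shows that `in` on a list compares whole elements, that indexing a list with a string would fail anyway, and that keying the names by the part after the last `|` gives a usable lookup.

```python
# Stand-ins for the question's newids_list (taxonomy abbreviated for illustration)
newids_list = [
    "d__Bacteria; p__Bacteroidota|00009c1cc42953fb4702f6331325c7cc",
    "d__Bacteria; p__Bacteroidota|000118a5e731455e942c61a82a40367a623088d0",
]
seq_id = "00009c1cc42953fb4702f6331325c7cc"

# `in` on a list tests equality with whole elements, not substring matches,
# so a bare ID is never found among "taxonomy|id" strings
found = seq_id in newids_list

# Even if the test passed, a list cannot be indexed with a string
try:
    newids_list[seq_id]
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Keying each full name by its original ID (the text after the last "|")
# turns the list into a dict that supports lookup by seq_record.id
new_ids = {name.rsplit("|", 1)[1]: name for name in newids_list}
```

With a dict like `new_ids`, the `if seq_record.id in ...` test and the `[...]` lookup from the question would both behave as intended.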

Here is an example of how I would do the renaming, to get you started:

from Bio import SeqIO

# new_list must be a dict mapping each sequence ID to its taxonomy string.
# LIST_OF_SEQ_IDS and LIST_OF_TAX are placeholders: you need to provide them somehow.
new_list = dict(zip(LIST_OF_SEQ_IDS, LIST_OF_TAX))

original = list(SeqIO.parse('allmergedrep-seqsf.fasta', 'fasta'))
corrected = []
for s in original:
    # build the requested header format: taxonomy|original_id
    # note that FASTA IDs usually do not contain spaces
    s.id = '{}|{}'.format(new_list[s.id], s.id)

    # Biopython also keeps the original header in s.description
    # (and in some cases the ID in s.name); clear it so it is not written again
    s.description = ''

    corrected.append(s)

SeqIO.write(corrected, 'allmergedrep-seqsf2.fasta', 'fasta')
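The question does not show the taxonomy file, so the following is only a sketch under an assumed layout: a tab-separated file with a header row and two columns, here called `Feature ID` and `Taxon` (both column names, and the helper `load_taxonomy`, are hypothetical). It builds the ID-to-taxonomy dict that the renaming loop expects, using an in-memory stand-in for the file.

```python
import csv
import io

def load_taxonomy(handle, id_col="Feature ID", tax_col="Taxon"):
    """Return {sequence_id: taxonomy}. Column names are assumptions; adjust to your file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: row[tax_col] for row in reader}

# In-memory stand-in for the taxonomy file (taxonomy abbreviated for illustration)
example = io.StringIO(
    "Feature ID\tTaxon\n"
    "00009c1cc42953fb4702f6331325c7cc\td__Bacteria; p__Bacteroidota\n"
    "000118a5e731455e942c61a82a40367a623088d0\td__Bacteria; p__Bacteroidota\n"
)
new_list = load_taxonomy(example)
```

For a real file, open it with `open(path, newline='')` and pass the handle to `load_taxonomy` in place of the `StringIO` object.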

If your new_list really is in the same order as the FASTA file and already contains the full headers you want, why not simply do this:

from Bio import SeqIO

# new_list: the full new headers, in the same order as the records in the FASTA file
with open('allmergedrep-seqsf.fasta') as original, open('allmergedrep-seqsf2.fasta', 'w') as corrected:
    for seq_record, new_name in zip(SeqIO.parse(original, 'fasta'), new_list):
        seq_record.id = new_name
        seq_record.description = ''  # do you need that taxonomy twice?
        SeqIO.write(seq_record, corrected, 'fasta')
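Pairing records and names with zip depends on both being in exactly the same order, and zip stops silently at the shorter input. A minimal sanity check in plain Python might look like the sketch below. It assumes every new name ends with `|` followed by the original ID, as in the question; `read_ids` and `check_order` are hypothetical helper names, and the tiny FASTA text is made up for the example.

```python
import io

def read_ids(handle):
    """Collect the ID (first whitespace-delimited token) from each FASTA header line."""
    return [line[1:].split()[0] for line in handle if line.startswith(">")]

def check_order(old_ids, new_names):
    """Return the positions where a new name does not end with '|<original id>'."""
    if len(old_ids) != len(new_names):
        raise ValueError("number of records and number of new names differ")
    return [
        i for i, (old, new) in enumerate(zip(old_ids, new_names))
        if not new.endswith("|" + old)
    ]

# Made-up two-record FASTA file held in memory
fasta = io.StringIO(">a1\nACGT\n>b2\nGGCC\n")
ids = read_ids(fasta)

# Names in the right order give no mismatches; swapped names are flagged
mismatches = check_order(ids, ["d__Bacteria|a1", "d__Bacteria|b2"])
swapped = check_order(ids, ["d__Bacteria|b2", "d__Bacteria|a1"])
```

Running such a check before writing the output file catches a misaligned list, which would otherwise attach the wrong taxonomy to a sequence without any error.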