How can I fix this error: "BiopythonWarning: Partial codon, len(sequence) not a multiple of three."?

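For an assignment, I need to write code that translates RNA sequences from a FASTA file into amino acid sequences. However, I keep getting the following warning: "BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future."

I tried adding trailing N's, but it still doesn't seem to work. I think there may be a mistake in my code, but I'm not sure where.

Here is my code: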

from Bio.Seq import Seq
from Bio import SeqIO
seq_records = SeqIO.parse('rna.fasta', 'fasta')
amino_acids1 = []
amino_acids2 = []
amino_acids3 = []

for record in seq_records:

    # starting from nucleotide 1
    if len(record) % 3 == 0:
        amino_acids1.append(record.translate())
    elif (len(record) + 1) % 3 == 0:
        recordN = record + Seq('N')
        amino_acids1.append(recordN.translate())
    elif (len(record) + 2) % 3 == 0:
        recordNN = record + Seq('N') + Seq('N')
        amino_acids1.append(recordNN.translate())
    print("FIRST")
    print(amino_acids1)
    with open('rna_out.fasta', 'w') as p_file:
        SeqIO.write(amino_acids1, p_file, 'fasta')

    # starting from nucleotide 2
    record2 = record[1:]
    if len(record2) % 3 == 0:
        amino_acids2.append(record2.translate())
    elif (len(record2) + 1) % 3 == 0:
        record2N = record + Seq('N')
        amino_acids2.append(record2N.translate())
    elif (len(record2) + 2) % 3 == 0:
        record2NN = record + Seq('N') + Seq('N')
        amino_acids2.append(record2NN.translate())
    print("SECOND")
    print(amino_acids2)
    with open('rna_out.fasta', 'w') as p_file:
        SeqIO.write(amino_acids2, p_file, 'fasta')

    # starting from nucleotide 3
    record3 = record[2:]
    if len(record3) % 3 == 0:
        amino_acids3.append(record3.translate())
    elif (len(record3) + 1) % 3 == 0:
        record3N = record + Seq('N')
        amino_acids3.append(record3N.translate())
    elif (len(record3) + 2) % 3 == 0:
        record3NN = record + Seq('N') + Seq('N')
        amino_acids3.append(record3NN.translate())
    print("THIRD")
    print(amino_acids3)
    with open('rna_out.fasta', 'w') as p_file:
        SeqIO.write(amino_acids3, p_file, 'fasta')

Normally, this should give 3 possible translations for each sequence in the FASTA file. However, the output doesn't look right.

These are the first 3 results, which should be 3 different translations of the first sequence in the FASTA file:

FIRST
[SeqRecord(seq=Seq('GAKRTDRTSVINKLSLLYTSCETIDCYIFFL', HasStopCodon(ExtendedIUPACProtein(), '')), id='', name='', description='', dbxrefs=[])]
SECOND
[SeqRecord(seq=Seq('GAKRTDRTSVINKLSLLYTSCETIDCYIFFL', HasStopCodon(ExtendedIUPACProtein(), '')), id='', name='', description='', dbxrefs=[])]
THIRD
[SeqRecord(seq=Seq('CQKNSDVVVGHQTVVALHVMRNDLLYLFP', HasStopCodon(ExtendedIUPACProtein(), '')), id='', name='', description='', dbxrefs=[])]

I don't know what went wrong, but this is definitely not a correct translation. If you can see where I made a mistake, I'd really appreciate your help!

Your approach might work, but you have a copy-and-paste error in your code:

record2 = record[1:]
if len(record2) %3 ==0:
     amino_acids2.append(record2.translate())
elif (len(record2)+1) %3 ==0:
    record2N = record + Seq('N')

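Note that record in the last line should be record2. You make this error at least four times. I believe the code @Chris_Rands points you to offers valuable insight into your problem, such as also translating the reverse complement, but I don't recommend the pad_seq() function in that code.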
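To make the effect of that mix-up concrete, here is a minimal sketch that uses plain strings in place of Seq objects. The helper names pad_buggy and pad_fixed and the toy sequence are hypothetical, made up for illustration; they are not part of your code or of Biopython.

```python
def pad_buggy(record: str, frame: str) -> str:
    """Mimic the mix-up: measure the shifted frame, but pad the full record."""
    missing = (3 - len(frame) % 3) % 3
    return record + "N" * missing  # bug: should pad `frame`, not `record`


def pad_fixed(frame: str) -> str:
    """Pad the shifted frame itself up to a multiple of three."""
    missing = (3 - len(frame) % 3) % 3
    return frame + "N" * missing


record = "AUGGCC"     # toy 6-nucleotide RNA, chosen only for illustration
record2 = record[1:]  # frame starting at nucleotide 2

buggy = pad_buggy(record, record2)
fixed = pad_fixed(record2)

print(buggy, len(buggy) % 3)
print(fixed, len(fixed) % 3)
```

In this toy case the buggy result still begins at nucleotide 1 and its length is no longer a multiple of three, which is the situation the warning describes.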

Here is a rework of pad_seq(), integrated into your code:

from Bio.Seq import Seq
from Bio import SeqIO

def pad_seq(sequence):
    """ Pad sequence to multiple of 3 with N """

    remainder = len(sequence) % 3

    return sequence if remainder == 0 else sequence + Seq('N' * (3 - remainder))

seq_records = SeqIO.parse('rna.fasta', 'fasta')

amino_acids1 = []
amino_acids2 = []
amino_acids3 = []

for record in seq_records:

    # starting from nucleotide 1
    amino_acids1.append(pad_seq(record).translate())
    print("FIRST")
    print(amino_acids1)
    # ...

    # starting from nucleotide 2
    record2 = record[1:]
    amino_acids2.append(pad_seq(record2).translate())
    print("SECOND")
    print(amino_acids2)
    # ...

    # starting from nucleotide 3
    record3 = record[2:]
    amino_acids3.append(pad_seq(record3).translate())
    print("THIRD")
    print(amino_acids3)
    # ...
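
The warning message offers two remedies: add trailing N or trim the sequence. Both can be checked without Biopython. The following is only a sketch under the assumption that plain strings stand in for Seq objects; pad_to_codon, trim_to_codon, and frames are hypothetical helper names, and the 10-nucleotide sequence is invented for the example.

```python
def pad_to_codon(seq: str) -> str:
    """Append N until the length is a multiple of three."""
    remainder = len(seq) % 3
    return seq if remainder == 0 else seq + "N" * (3 - remainder)


def trim_to_codon(seq: str) -> str:
    """Drop the trailing partial codon, if there is one."""
    return seq[: len(seq) - len(seq) % 3]


def frames(seq: str) -> list:
    """Return the three forward reading frames (offsets 0, 1, 2)."""
    return [seq[offset:] for offset in range(3)]


rna = "AUGGCCAUUG"  # invented 10-nucleotide example

padded = [pad_to_codon(frame) for frame in frames(rna)]
trimmed = [trim_to_codon(frame) for frame in frames(rna)]

for offset, (p, t) in enumerate(zip(padded, trimmed), start=1):
    print(f"frame {offset}: padded={p} trimmed={t}")
```

Looping over the offsets, rather than writing three near-identical blocks, leaves no room for the record/record2 mix-up. Padding keeps a final ambiguous codon in the result, while trimming drops it; which one suits the assignment is your call.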