将核苷酸序列转换为氨基酸序列

Question

我有一个脚本，它使用基因的位置和链信息（互补、正向）来提取核苷酸序列。提取后，脚本使用翻译 table 和密码子起始位置将核苷酸序列转换为氨基酸序列，并将其与原始氨基酸序列进行比较。理论上应该是匹配的，但是我还没匹配到。

作为示例，我将使用此大肠杆菌大肠杆菌 genbank 文件：http://www.ncbi.nlm.nih.gov/nuccore/AB011549 从第 396 行 on/near 开始如下：

CDS         complement(25341..26294)
/gene="repFIB"
/note="Replication protein RepFIB (P307 replicon);
similar to PIR accession number A32310"
/codon_start=1
/transl_table=11
/product="RepFIB"
/protein_id="BAA31780.2"
/db_xref="GI:4589743"
/translation="MTLTPNNNNTVQPVALMRLGVFVPTLKSLKNSKKNTLSRTDATEELTRLSLARAEGFDKVEITGPRLDMDNDFKTWVGIIHSFARHNVIGDKVELPFVEFAKLCGIPSSQSSRRLRERISPSLKRIAGTVISFSRTDEKHTREYITHLVQSAYYDTERDIVQLQADPRLFELYQFDRKVLLQLKAINALKRRESAQALYTFIESLPRDPAPVSLARLRARLNLKSPVFSQNQTVRRAMEQLREIGYLDYTEIQRGRTKLFCIHYRRPRLKAPNDESKENPLPPSPAEKVSPEMAEKLALLEKLGITLDDLEKLFKSR"

这告诉我翻译table是#11，该基因从25341位开始到26294位结束，该基因来自互补链，实际氨基酸序列是最后一个条目（例如，/translation).

import Bio
import os
from Bio import GenBank
from Bio import SeqIO
from Bio import SeqFeature

genome_file = 'GenbankFiles_E_coli/AB011549.gb'

#complement strand
nAA = 'MTLTPNNNNTVQPVALMRLGVFVPTLKSLKNSKKNTLSRTDATEELTRLSLARAEGFDKVEITGPRLDMDNDFKTWVGIIHSFARHNVIGDKVELPFVEFAKLCGIPSSQSSRRLRERISPSLKRIAGTVISFSRTDEKHTREYITHLVQSAYYDTERDIVQLQADPRLFELYQFDRKVLLQLKAINALKRRESAQALYTFIESLPRDPAPVSLARLRARLNLKSPVFSQNQTVRRAMEQLREIGYLDYTEIQRGRTKLFCIHYRRPRLKAPNDESKENPLPPSPAEKVSPEMAEKLALLEKLGITLDDLEKLFKSR'

#Forward strand
NA = 'MLLALLSSTDNFCLSSTELSERLDVSRTYITRACDSLEKFGFIKRMESKEDRRSKNIYLTSDGNLYLQRTTRIYGRYLKKYGATLQMMKSKHLK'

gene_nucleotide = ''

def match(sequence, index):
    print('match')
    print(index, sequence)
    exit()

def get_aminoAcid(nucleotide_sequence, amino_sequence, start_position, stop_position, string=''):
    index = -2
    calls = [nucleotide_sequence[start_position+index:stop_position+index].translate(table=11), nucleotide_sequence[start_position+index:stop_position+index].complement().translate(table=11), nucleotide_sequence[start_position+index:stop_position+index].reverse_complement().translate(table=11)]

    while index < 3:
        print(index)
        stringT = str(nucleotide_sequence[start_position+index:stop_position+index].translate(table=11))
        stringC = str(nucleotide_sequence[start_position+index:stop_position+index].complement().translate(table=11))
        stringRC = str(nucleotide_sequence[start_position+index:stop_position+index].reverse_complement().translate(table=11))

        if stringT == amino_sequence:
            match(stringT, index)
        if stringC  == amino_sequence:
            match(stringC, index)
        if stringRC  == amino_sequence:
            match(stringRC, index)

        #uncomment to see actual translations
        #print('translate:', stringT, '\n')
        #print('complement:', stringC, '\n')
        #print('reverse_complement:', stringRC, '\n')

        index+=1

record = next(SeqIO.parse(genome_file, 'genbank'))
sequence = record.seq
start  = 25341 
stop = 26294 
get_aminoAcid(sequence, nAA, start, stop)

#This is from a forward strand
start  = 23512
stop = 23796 
get_aminoAcid(sequence, NA, start, stop)

请注意，我的脚本着眼于以下内容：正向链 (.translate(table=11))、互补链 (.complement().translate( table=11)) 和反向补码 (*.reverse_complement().translate(table=11))。

尽管 genbank 文件指出该基因从位置 25341 开始并且在位置 1 有一个开放阅读框（例如，codon_start=1），但我检查了从阅读框开始的所有三个翻译 - 2、-1、0（这是原来的起始位置）、+1、+2……我仍然找不到互补链的匹配项。但是，此脚本确实适用于正向核苷酸链（请参阅使用不同编码序列信息的最后几行）。

这是我的输出：

/usr/local/anaconda3/lib/python3.4/site-packages/Bio/Seq.py:2041: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future. BiopythonWarning)
-2
-1
0
1
2
-2 #starting over using different CDS (forward)
-1
match
-1 MLLALLSSTDNFCLSSTELSERLDVSRTYITRACDSLEKFGFIKRMESKEDRRSKNIYLTSDGNLYLQRTTRIYGRYLKKYGATLQMMKSKHL

因此，将正向链核苷酸序列转换成匹配的氨基酸序列是可行的，但补码不行，我也不知道为什么（这可能是我在 Biopython 中犯的一个错误）。任何人都可以提供一些关于我做错了什么以及如何纠正它的见解吗？

关于核苷酸如何转化为氨基酸的非常的简要说明：codon/amino酸由三个核苷酸组成。看这里的氨基酸图表：How to complete getting substrings of a genome encoding a given amino acid sequence 如果我的核苷酸序列是GCG，从圆心开始到边缘，G（内圈）C（中圈）G（外圈）对应丙氨酸（A）。

Answer 1

我有点尴尬地说问题不是我的代码，而是我的解释。 CDS 特征（例如 CDS complement(xx...xxxx)）可能来自互补链，但我不需要先转换它（例如 reverse_complement(), complement()）。我只是直接翻译它（例如，translate(table=11)).

第一个或最后一个氨基酸可能存在一些问题（如评论中所述），但在大多数情况下，问题已解决。

将核苷酸序列转换为氨基酸序列

Converting nucleotide sequence to amino acid sequence

python

bioinformatics

biopython