从具有 Python 中 IUPAC 歧义的 MSA fasta 文件中获得共识

Question

我有一个与主题几乎相似的问题：https://www.biostars.org/p/154993/

我有一个带有对齐序列的 fasta 文件，我想使用 IUPAC 代码生成共识。

到目前为止我写了：

import sys
from Bio import AlignIO
from Bio.Align import AlignInfo

alignment = AlignIO.read("input.fasta", 'fasta')
summary_align = AlignInfo.SummaryInfo(alignment)
summary_align.dumb_consensus(0.3)

“0.3”是共识中要考虑的等位基因频率阈值。这意味着对于每一列，如果一个碱基至少出现在比对的 30% 中，它将被考虑在内；如果有多个碱基符合这一标准，则使用相应的 IUPAC 歧义代码。

但是“dumb_consensus”只生成最高代表碱基，实际上不使用 IUPAC 代码。

那么你有办法使用 Biopython（或一般的 Python）来达成这样的共识吗？

谢谢

Answer 1

原始求解。虽然，GitHub 上的 Biopython 代码看起来并不好。您可以根据自己的目标扩展它。

import sys
from typing import Optional
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio.Seq import Seq
import Bio

def getWitConsensus(alignment: Optional[Bio.Align.MultipleSeqAlignment], threshold=0.25) -> Bio.Seq.Seq:
    alphabet = {"A" : "A",
                "C" : "C",
                "G" : "G",
                "T" : "T",
                "AG" : "R",
                "CT" : "Y",
                "CG" : "S",
                "AT" : "W",
                "GT" : "K",
                "AC" : "M",
                "CGT" : "B",
                "AGT" : "D",
                "ACT" : "H",
                "ACG" : "V",
                "ACGT" : "N"}
    
    threshold *= len(alignment)
    consensus = list()
    
    for i in range(alignment.get_alignment_length()):
        frequences = {"A" : 0, "C" : 0, "T" : 0, "G" : 0}
        for record in alignment:
            frequences[record[i]] += 1
        print(frequences)
        consensus.append(alphabet["".join(sorted(key for key, value in frequences.items() if value >= threshold))])
    return Seq("".join(consensus))

alignment = AlignIO.read(r"input.fas", 'fasta')
getWitConsensus(alignment)

从具有 Python 中 IUPAC 歧义的 MSA fasta 文件中获得共识

Get consensus from a MSA fasta file with IUPAC ambiguities in Python

python

text-processing

bioinformatics

biopython