从具有 Python 中 IUPAC 歧义的 MSA fasta 文件中获得共识

Get consensus from a MSA fasta file with IUPAC ambiguities in Python

我有一个与主题几乎相似的问题:https://www.biostars.org/p/154993/

我有一个带有对齐序列的 fasta 文件,我想使用 IUPAC 代码生成共识。

到目前为止我写了:

import sys
from Bio import AlignIO
from Bio.Align import AlignInfo

alignment = AlignIO.read("input.fasta", 'fasta')
summary_align = AlignInfo.SummaryInfo(alignment)
summary_align.dumb_consensus(0.3)

“0.3”是共识中要考虑的等位基因频率阈值。这意味着对于每一列,如果一个碱基至少出现在比对的 30% 中,它将被考虑在内;如果有多个碱基符合这一标准,则使用相应的 IUPAC 歧义代码。

但是“dumb_consensus”只生成最高代表碱基,实际上不使用 IUPAC 代码。

那么你有办法使用 Biopython(或一般的 Python)来达成这样的共识吗?

谢谢

原始求解。虽然,GitHub 上的 Biopython 代码看起来并不好。您可以根据自己的目标扩展它。

import sys
from typing import Optional
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio.Seq import Seq
import Bio

def getWitConsensus(alignment: Optional[Bio.Align.MultipleSeqAlignment], threshold=0.25) -> Bio.Seq.Seq:
    alphabet = {"A" : "A",
                "C" : "C",
                "G" : "G",
                "T" : "T",
                "AG" : "R",
                "CT" : "Y",
                "CG" : "S",
                "AT" : "W",
                "GT" : "K",
                "AC" : "M",
                "CGT" : "B",
                "AGT" : "D",
                "ACT" : "H",
                "ACG" : "V",
                "ACGT" : "N"}
    
    threshold *= len(alignment)
    consensus = list()
    
    for i in range(alignment.get_alignment_length()):
        frequences = {"A" : 0, "C" : 0, "T" : 0, "G" : 0}
        for record in alignment:
            frequences[record[i]] += 1
        print(frequences)
        consensus.append(alphabet["".join(sorted(key for key, value in frequences.items() if value >= threshold))])
    return Seq("".join(consensus))

alignment = AlignIO.read(r"input.fas", 'fasta')
getWitConsensus(alignment)