Trim 序列基于比对

Question

我正在尝试使用 BioPython 将 ClustalW 生成的 MSA（多序列比对）文件编辑为共识序列之前的 trim 序列。 xxx指的是这里不相关的其他碱基

示例如下 I/O :

输入

ITS_primer_fw               --------------------------------CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence      CCACCCCAACAAGGGCGGCCACGCGGTCCGCTCGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxxxxxxxxx
PRL2010                     ACACCCCCGAAAGGGCGTCC------CCTGCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
BBF32_3                     ACACACCCACAAGGGCGAGCAGGCG----GCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxx
BBFCG32                     CAACACCACACCGGGCGAGCGGG-------CTCGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxxxxxxxx

预期输出

ITS_primer_fw               CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence      CGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxxxxxx
PRL2010                     CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxxxx
BBF32_3                     CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxx
BBFCG32                     CGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxxxxxx

AlignIO 的记录代码仅描述了一种通过将比对视为数组来提取序列的方法。在这个例子中

align = AlignIO.read(input_file, "clustal")
sub_alignment = align[:,20:]

我能够从第 20 个核苷酸开始提取由所有序列 (:) 组成的子比对。我正在寻找一种方法来将示例中的 20 替换为共有序列的第一个核苷酸的位置。

Answer 1

感谢 Biostars 用户找到了答案。

twitch 正在查看列以找到起点，这将在最后一个 '-' 之后符合预期。默认情况下，我的对齐第一行是最短的，并且在对齐之前以“-”开头。

代码如下：

aln = AlignIO.read(input_file, "clustal")
for col in range(aln.get_alignment_length()):  # search into column
    res = list(aln[:,col])
    if not '-' in res:
        position = col                         # find start point
        print('First full column is {}'.format(col))
        break
print(aln[:,position::])                       # print the whole alignment starting from the position variable found

Trim 序列基于比对

Trim sequences based on alignment

python

regex

numpy

bioinformatics

biopython