从基因组创建树状图
Create a Dendogram from Genome
我想玩转基因组数据:
Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag
我想根据上面的基因组序列,根据这些生物彼此之间的相关程度来创建一个树状图。我首先做的是计算每个物种的 a、c、t 和 g 的数量,然后我创建了一个数组,然后绘制了一个树状图:
gen_size1 = len(Species_A)
a1 = float(Species_A.count('a'))/float(gen_size1)
c1 = float(Species_A.count('c'))/float(gen_size1)
g1 = float(Species_A.count('g'))/float(gen_size1)
t1 = float(Species_A.count('t'))/float(gen_size1)
.
.
.
gen_size5 = len(Species_E)
a5 = float(Species_E.count('a'))/float(gen_size5)
c5 = float(Species_E.count('c'))/float(gen_size5)
g5 = float(Species_E.count('g'))/float(gen_size5)
t5 = float(Species_E.count('t'))/float(gen_size5)
my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]])
plt.subplot(1,2,1)
plt.title("Mononucleotide")
linkage_matrix = linkage(my_genes, "single")
print linkage_matrix
dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True)
plt.show()
物种 A 和 B 是同一生物体的变种,我希望它们都应该从根部的一个共同进化枝中分离出来,物种 C 和 D 也是如此,它们应该从根部的另一个共同进化枝中分离出来,然后是物种 E 从主根中分离出来,因为它与物种 A 到 D 无关。不幸的是,树状图结果与物种 A 和 E 从一个共同的进化枝中分离出来,然后是另一个进化枝中的物种 C、D 和 B(相当混乱) ).
我读过有关基因组序列的层次聚类,但我观察到它只适用于二维系统,不幸的是我有 4 个维度,即 a、c、t 和 g。还有其他策略吗?谢谢您的帮助!
嗯,使用 SciPy you could use a custom distance (my bet is on Needleman-Wunsch or Smith-Waterman as a start). Here is an example of how to play with your input data. You should also check how to define a custom distance in SciPy. Once you have it set, you can use a more advanced alignment approach like MAFFT。您可以提取基因组之间的关系,并在创建树状图时使用它们。
这是生物信息学中一个相当普遍的问题,因此您应该使用像 BioPython 这样内置了此功能的生物信息学库。
首先你用你的序列创建一个多 FASTA 文件:
import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_dna
sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag',
'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag',
'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag',
'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg',
'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag']
my_records = [SeqRecord(Seq(sequence, generic_dna),
id='Species_{}'.format(letter), description='Species_{}'.format(letter))
for sequence, letter in zip(sequences, 'ABCDE')]
root_dir = r"C:\Users\BioGeek\Documents\temp"
filename = 'my_sequences'
fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename))
SeqIO.write(my_records, fasta_path, "fasta")
这将创建如下所示的文件 C:\Users\BioGeek\Documents\temp\my_sequences.fasta
:
>Species_A
ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
>Species_B
ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
>Species_C
ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
>Species_D
ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
>Species_E
ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag
接下来,使用命令行工具ClustalW
进行多序列比对:
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe"
assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path)
stdout, stderr = clustalw_cline()
print stdout
这会打印:
CLUSTAL 2.1 Multiple Sequence Alignments
Sequence format is Pearson
Sequence 1: Species_A 50 bp
Sequence 2: Species_B 50 bp
Sequence 3: Species_C 50 bp
Sequence 4: Species_D 50 bp
Sequence 5: Species_E 50 bp
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 90
Sequences (1:3) Aligned. Score: 94
Sequences (1:4) Aligned. Score: 88
Sequences (1:5) Aligned. Score: 84
Sequences (2:3) Aligned. Score: 90
Sequences (2:4) Aligned. Score: 84
Sequences (2:5) Aligned. Score: 78
Sequences (3:4) Aligned. Score: 94
Sequences (3:5) Aligned. Score: 82
Sequences (4:5) Aligned. Score: 82
Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd]
There are 4 groups
Start of Multiple Alignment
Aligning...
Group 1: Sequences: 2 Score:912
Group 2: Sequences: 2 Score:921
Group 3: Sequences: 4 Score:865
Group 4: Sequences: 5 Score:855
Alignment Score 2975
CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]
my_sequences.dnd
文件ClustalW
创建,是一个标准的Newick tree file和Bio.Phylo
可以解析这些:
from Bio import Phylo
newick_path = os.path.join(root_dir, '{}.dnd'.format(filename))
tree = Phylo.read(newick_path, "newick")
Phylo.draw_ascii(tree)
打印:
____________ Species_A
____|
| |_____________________________________ Species_B
|
_| ____ Species_C
|_________|
| |_________________________ Species_D
|
|__________________________________________________________________ Species_E
或者,如果您安装了 matplotlib
或 pylab
,您可以使用 draw
函数创建图形:
tree.rooted = True
Phylo.draw(tree, branch_labels=lambda c: c.branch_length)
产生:
此树状图清楚地说明了您观察到的情况:物种 A 和 B 是同一生物体的变体,并且都从根的共同进化枝中分化出来。物种 C 和 D 也是如此,它们都从根部的另一个共同进化枝中分化出来。最后,物种 E 从主根分叉,因为它与物种 A 到 D 无关。
我想玩转基因组数据:
Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag
我想根据上面的基因组序列,根据这些生物彼此之间的相关程度来创建一个树状图。我首先做的是计算每个物种的 a、c、t 和 g 的数量,然后我创建了一个数组,然后绘制了一个树状图:
gen_size1 = len(Species_A)
a1 = float(Species_A.count('a'))/float(gen_size1)
c1 = float(Species_A.count('c'))/float(gen_size1)
g1 = float(Species_A.count('g'))/float(gen_size1)
t1 = float(Species_A.count('t'))/float(gen_size1)
.
.
.
gen_size5 = len(Species_E)
a5 = float(Species_E.count('a'))/float(gen_size5)
c5 = float(Species_E.count('c'))/float(gen_size5)
g5 = float(Species_E.count('g'))/float(gen_size5)
t5 = float(Species_E.count('t'))/float(gen_size5)
my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]])
plt.subplot(1,2,1)
plt.title("Mononucleotide")
linkage_matrix = linkage(my_genes, "single")
print linkage_matrix
dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True)
plt.show()
物种 A 和 B 是同一生物体的变种,我希望它们都应该从根部的一个共同进化枝中分离出来,物种 C 和 D 也是如此,它们应该从根部的另一个共同进化枝中分离出来,然后是物种 E 从主根中分离出来,因为它与物种 A 到 D 无关。不幸的是,树状图结果与物种 A 和 E 从一个共同的进化枝中分离出来,然后是另一个进化枝中的物种 C、D 和 B(相当混乱) ).
我读过有关基因组序列的层次聚类,但我观察到它只适用于二维系统,不幸的是我有 4 个维度,即 a、c、t 和 g。还有其他策略吗?谢谢您的帮助!
嗯,使用 SciPy you could use a custom distance (my bet is on Needleman-Wunsch or Smith-Waterman as a start). Here is an example of how to play with your input data. You should also check how to define a custom distance in SciPy. Once you have it set, you can use a more advanced alignment approach like MAFFT。您可以提取基因组之间的关系,并在创建树状图时使用它们。
这是生物信息学中一个相当普遍的问题,因此您应该使用像 BioPython 这样内置了此功能的生物信息学库。
首先你用你的序列创建一个多 FASTA 文件:
import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_dna
sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag',
'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag',
'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag',
'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg',
'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag']
my_records = [SeqRecord(Seq(sequence, generic_dna),
id='Species_{}'.format(letter), description='Species_{}'.format(letter))
for sequence, letter in zip(sequences, 'ABCDE')]
root_dir = r"C:\Users\BioGeek\Documents\temp"
filename = 'my_sequences'
fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename))
SeqIO.write(my_records, fasta_path, "fasta")
这将创建如下所示的文件 C:\Users\BioGeek\Documents\temp\my_sequences.fasta
:
>Species_A
ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
>Species_B
ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
>Species_C
ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
>Species_D
ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
>Species_E
ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag
接下来,使用命令行工具ClustalW
进行多序列比对:
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe"
assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path)
stdout, stderr = clustalw_cline()
print stdout
这会打印:
CLUSTAL 2.1 Multiple Sequence Alignments
Sequence format is Pearson
Sequence 1: Species_A 50 bp
Sequence 2: Species_B 50 bp
Sequence 3: Species_C 50 bp
Sequence 4: Species_D 50 bp
Sequence 5: Species_E 50 bp
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 90
Sequences (1:3) Aligned. Score: 94
Sequences (1:4) Aligned. Score: 88
Sequences (1:5) Aligned. Score: 84
Sequences (2:3) Aligned. Score: 90
Sequences (2:4) Aligned. Score: 84
Sequences (2:5) Aligned. Score: 78
Sequences (3:4) Aligned. Score: 94
Sequences (3:5) Aligned. Score: 82
Sequences (4:5) Aligned. Score: 82
Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd]
There are 4 groups
Start of Multiple Alignment
Aligning...
Group 1: Sequences: 2 Score:912
Group 2: Sequences: 2 Score:921
Group 3: Sequences: 4 Score:865
Group 4: Sequences: 5 Score:855
Alignment Score 2975
CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]
my_sequences.dnd
文件ClustalW
创建,是一个标准的Newick tree file和Bio.Phylo
可以解析这些:
from Bio import Phylo
newick_path = os.path.join(root_dir, '{}.dnd'.format(filename))
tree = Phylo.read(newick_path, "newick")
Phylo.draw_ascii(tree)
打印:
____________ Species_A
____|
| |_____________________________________ Species_B
|
_| ____ Species_C
|_________|
| |_________________________ Species_D
|
|__________________________________________________________________ Species_E
或者,如果您安装了 matplotlib
或 pylab
,您可以使用 draw
函数创建图形:
tree.rooted = True
Phylo.draw(tree, branch_labels=lambda c: c.branch_length)
产生:
此树状图清楚地说明了您观察到的情况:物种 A 和 B 是同一生物体的变体,并且都从根的共同进化枝中分化出来。物种 C 和 D 也是如此,它们都从根部的另一个共同进化枝中分化出来。最后,物种 E 从主根分叉,因为它与物种 A 到 D 无关。