如何使用 MDAnalysis 计算蛋白质的质心?
How to calculate center of mass of proteins using MDAnalysis?
我的处境有点不寻常。根据残基名称,有七种不同的蛋白质存储在一个文件中。每种蛋白质都有不同的序列长度。现在我需要计算每个蛋白质的质心并生成一个时间序列 data.I 知道如何处理单个蛋白质,但不知道如何处理多个蛋白质系统。对于单一蛋白质,我可以这样做:
import MDAnalysis as mda
import numpy as np
u = mda.Universe('lp400start.gro')
u1 = mda.Merge(u.select_atoms("not resname W and not resname WF and not resname ION"))
u1.load_new('lp400.xtc')
protein = u1.select_atoms("protein")
arr = np.empty((protein.n_residues, u1.trajectory.n_frames, 3))
for ts in u.trajectory:
arr[:, ts.frame] = protein.center_of_mass(compound='residues')
数据文件公开可用 here。可以使用grep "^ *1[A-Z]" -B1 lp400final.gro | grep -v "^ *1[A-Z]"
检查拓扑文件中的残基序列,输出:
38ALA BB 74 52.901 33.911 6.318
--
38ALA BB 148 41.842 29.381 7.211
--
137GLY BB 455 36.756 4.287 3.284
--
137GLY BB 762 44.721 60.377 3.112
--
252HIS SC3 1368 28.682 37.936 6.727
--
252HIS SC3 1974 18.533 46.506 6.314
--
576PHE SC3 3263 48.937 38.538 4.013
--
576PHE SC3 4552 18.513 25.948 3.800
--
1092PRO SC1 6470 42.510 40.992 6.775
--
1092PRO SC1 8388 14.709 4.759 6.370
--
1016LEU SC110524 57.264 56.308 2.632
--
1016LEU SC112660 50.716 14.698 2.728
--
1285LYS SC215345 0.793 33.529 1.509
第一个蛋白质的序列长度为 38 个残基,它有自己的副本,然后是第二个蛋白质,依此类推。现在我想在每个时间范围内获得每个蛋白质的 COM,并将其构建成一个时间序列。除了蛋白质拓扑文件外,还包含 DPPC 粒子。 Could someone help me how to do this?
谢谢!
为了确保输出轨迹是正确的,它看起来像这样enter link description here
我将从 TPR 文件加载系统以维护债券信息。然后 MDAnalysis 可以确定片段(即您的蛋白质)。然后遍历片段以确定COM时间序列:
import MDAnalysis as mda
import numpy as np
# files from https://doi.org/10.5281/zenodo.846428
TPR = "lp400.tpr"
XTC = "lp400.xtc"
# build reduced universe to match XTC
# (ignore warnings that no coordinates are found for the TPR)
u0 = mda.Universe(TPR)
u = mda.Merge(u0.select_atoms("not resname W and not resname WF and not resname ION"))
u.load_new(XTC)
# segments (exclude the last one, which is DPPC and not protein)
protein_segments = u.segments[:-1]
# build the fragments
# (a dictionary with the key as the protein name -- I am using an
# OrderedDict so that the order is the same as in the TPR)
from collections import OrderedDict
protein_fragments = OrderedDict((seg.segid[6:], seg.atoms.fragments) for seg in protein_segments)
# analyze trajectory (with a nice progress bar)
timeseries = []
for ts in mda.log.ProgressBar(u.trajectory):
coms = []
for name, proteins in protein_fragments.items():
# loop over all the different proteins;
# unwrap to get the true COM under PBC (double check!!)
coms.extend([p.center_of_mass(unwrap=True) for p in proteins])
timeseries.append(coms)
timeseries = np.array(timeseries)
备注
- 仔细检查
unwrap=True
是否在做正确的事情(这是必要的——我没有检查是否有任何蛋白质跨周期性边界分裂)。
- 展开很慢;如果您不需要它,它会 运行 更快。
- 生成的数组是一个形状为
(N_timesteps, M_proteins, 3)
的 3d 数组,即 (10001, 14, 3)
.
protein_fragments
的内容是
OrderedDict([('EPHA', (<AtomGroup with 74 atoms>, <AtomGroup with 74 atoms>)),
('OMPA', (<AtomGroup with 307 atoms>, <AtomGroup with 307 atoms>)),
('OMPG', (<AtomGroup with 606 atoms>, <AtomGroup with 606 atoms>)),
('BTUB', (<AtomGroup with 1289 atoms>, <AtomGroup with 1289 atoms>)),
('ATPS', (<AtomGroup with 1918 atoms>, <AtomGroup with 1918 atoms>)),
('GLPF', (<AtomGroup with 2136 atoms>, <AtomGroup with 2136 atoms>)),
('FOCA', (<AtomGroup with 2685 atoms>, <AtomGroup with 2685 atoms>))])
我的处境有点不寻常。根据残基名称,有七种不同的蛋白质存储在一个文件中。每种蛋白质都有不同的序列长度。现在我需要计算每个蛋白质的质心并生成一个时间序列 data.I 知道如何处理单个蛋白质,但不知道如何处理多个蛋白质系统。对于单一蛋白质,我可以这样做:
import MDAnalysis as mda
import numpy as np
u = mda.Universe('lp400start.gro')
u1 = mda.Merge(u.select_atoms("not resname W and not resname WF and not resname ION"))
u1.load_new('lp400.xtc')
protein = u1.select_atoms("protein")
arr = np.empty((protein.n_residues, u1.trajectory.n_frames, 3))
for ts in u.trajectory:
arr[:, ts.frame] = protein.center_of_mass(compound='residues')
数据文件公开可用 here。可以使用grep "^ *1[A-Z]" -B1 lp400final.gro | grep -v "^ *1[A-Z]"
检查拓扑文件中的残基序列,输出:
38ALA BB 74 52.901 33.911 6.318
--
38ALA BB 148 41.842 29.381 7.211
--
137GLY BB 455 36.756 4.287 3.284
--
137GLY BB 762 44.721 60.377 3.112
--
252HIS SC3 1368 28.682 37.936 6.727
--
252HIS SC3 1974 18.533 46.506 6.314
--
576PHE SC3 3263 48.937 38.538 4.013
--
576PHE SC3 4552 18.513 25.948 3.800
--
1092PRO SC1 6470 42.510 40.992 6.775
--
1092PRO SC1 8388 14.709 4.759 6.370
--
1016LEU SC110524 57.264 56.308 2.632
--
1016LEU SC112660 50.716 14.698 2.728
--
1285LYS SC215345 0.793 33.529 1.509
第一个蛋白质的序列长度为 38 个残基,它有自己的副本,然后是第二个蛋白质,依此类推。现在我想在每个时间范围内获得每个蛋白质的 COM,并将其构建成一个时间序列。除了蛋白质拓扑文件外,还包含 DPPC 粒子。 Could someone help me how to do this?
谢谢!
为了确保输出轨迹是正确的,它看起来像这样enter link description here
我将从 TPR 文件加载系统以维护债券信息。然后 MDAnalysis 可以确定片段(即您的蛋白质)。然后遍历片段以确定COM时间序列:
import MDAnalysis as mda
import numpy as np
# files from https://doi.org/10.5281/zenodo.846428
TPR = "lp400.tpr"
XTC = "lp400.xtc"
# build reduced universe to match XTC
# (ignore warnings that no coordinates are found for the TPR)
u0 = mda.Universe(TPR)
u = mda.Merge(u0.select_atoms("not resname W and not resname WF and not resname ION"))
u.load_new(XTC)
# segments (exclude the last one, which is DPPC and not protein)
protein_segments = u.segments[:-1]
# build the fragments
# (a dictionary with the key as the protein name -- I am using an
# OrderedDict so that the order is the same as in the TPR)
from collections import OrderedDict
protein_fragments = OrderedDict((seg.segid[6:], seg.atoms.fragments) for seg in protein_segments)
# analyze trajectory (with a nice progress bar)
timeseries = []
for ts in mda.log.ProgressBar(u.trajectory):
coms = []
for name, proteins in protein_fragments.items():
# loop over all the different proteins;
# unwrap to get the true COM under PBC (double check!!)
coms.extend([p.center_of_mass(unwrap=True) for p in proteins])
timeseries.append(coms)
timeseries = np.array(timeseries)
备注
- 仔细检查
unwrap=True
是否在做正确的事情(这是必要的——我没有检查是否有任何蛋白质跨周期性边界分裂)。 - 展开很慢;如果您不需要它,它会 运行 更快。
- 生成的数组是一个形状为
(N_timesteps, M_proteins, 3)
的 3d 数组,即(10001, 14, 3)
. protein_fragments
的内容是OrderedDict([('EPHA', (<AtomGroup with 74 atoms>, <AtomGroup with 74 atoms>)), ('OMPA', (<AtomGroup with 307 atoms>, <AtomGroup with 307 atoms>)), ('OMPG', (<AtomGroup with 606 atoms>, <AtomGroup with 606 atoms>)), ('BTUB', (<AtomGroup with 1289 atoms>, <AtomGroup with 1289 atoms>)), ('ATPS', (<AtomGroup with 1918 atoms>, <AtomGroup with 1918 atoms>)), ('GLPF', (<AtomGroup with 2136 atoms>, <AtomGroup with 2136 atoms>)), ('FOCA', (<AtomGroup with 2685 atoms>, <AtomGroup with 2685 atoms>))])