How to use RDKit to calculate molecular fingerprints and similarity for a list of SMILES structures?
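I am using RDKit to calculate the molecular similarity, based on the Tanimoto coefficient, between two lists of molecules given as SMILES structures.
I can already extract the SMILES structures from two separate csv files. How do I feed these structures into RDKit's fingerprint module, and how do I calculate the similarity one by one between the two lists of molecules?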
from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
ms = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO'), ... Chem.MolFromSmiles('COC')]
fps = [FingerprintMols.FingerprintMol(x) for x in ms]
DataStructs.FingerprintSimilarity(fps[0],fps[1])
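I want to put all the SMILES structures I have (more than 10,000) into the 'ms' list and get their fingerprints.
Then I want to compare the similarity between each pair of molecules from the two lists. Maybe a for loop is needed here?
Thanks in advance!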
I used a pandas DataFrame to select and print the lists of my structures, and saved them to list_1 and list_2. Running the ms1 line raises the following error:
TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string<wchar_t,
std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float
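I then checked the file, and the smiles column contains only SMILES. But when I manually put a few molecular structures into the list as a test, there is still an error about fpArgs['minSize'].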
For example, the SMILES for gadodiamide is "O=C1[O-][Gd+3]234567[O]=C(C[N]2(CC[N]3(CC([O-]4)=O)CC[N]5(CC(=[O]6)NC)CC(=O)[O-]7)C1)NC", and the error is as follows (when running the fps line):
ArgumentError: Python argument types in
rdkit.Chem.rdmolops.RDKFingerprint(NoneType, int, int, int, int, int, float, int)
did not match C++ signature:
RDKFingerprint(RDKit::ROMol mol, unsigned int minPath=1,
unsigned int maxPath=7, unsigned int fpSize=2048, unsigned int nBitsPerHash=2,
bool useHs=True, double tgtDensity=0.0, unsigned int minSize=128, bool branchedPaths=True,
bool useBondOrder=True, boost::python::api::object atomInvariants=0, boost::python::api::object fromAtoms=0,
boost::python::api::object atomBits=None, boost::python::api::object bitInfo=None).
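Both tracebacks above point at the inputs rather than at the fingerprint call itself. The first mentions a Python object of type float, which is what pandas produces for an empty cell (NaN). The second shows NoneType as the first argument, which is what Chem.MolFromSmiles returns when it cannot parse a SMILES string. The following is a minimal sketch of a pre-check, assuming a hypothetical column named smiles_col that stands in for the real CSV column:

```python
import pandas as pd
from rdkit import Chem

# Hypothetical stand-in for a SMILES column read from a CSV file.
# float('nan') mimics an empty cell; 'CCOCN(C)(C)C' has a neutral
# nitrogen with four bonds, so RDKit rejects it.
smiles_col = pd.Series(['CCO', float('nan'), 'CCOCN(C)(C)C', 'COC'])

valid_smiles, mols, rejected = [], [], []
for smi in smiles_col:
    # Non-string entries (such as NaN) are what trigger the TypeError.
    if not isinstance(smi, str):
        rejected.append(smi)
        continue
    mol = Chem.MolFromSmiles(smi)
    # MolFromSmiles returns None for input it cannot parse or sanitize;
    # passing that None to a fingerprint function raises the ArgumentError.
    if mol is None:
        rejected.append(smi)
        continue
    valid_smiles.append(smi)
    mols.append(mol)

print('kept:', valid_smiles)
print('rejected:', rejected)
```

Only the entries in mols would then be passed on to the fingerprint step.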
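If the original csv file looks like the following, how can I include the molecule names along with the similarity values in the output file: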
name,smiles,value,value2
molecule 1,CCOCN(C)(C),0.25,A
molecule 2,CCO,1.12,B
molecule 3,COC,2.25,C
I added this code to include the molecule names in the output file, but I get some array value errors for the names (especially for d2):
name_1 = df_1['id1']
name_2 = df_2['id2']
name_3 = pd.concat([name_1, name_2])

# create a list for the dataframe
d1, qu, d2, ta, sim = [], [], [], [], []
for n in range(len(fps)-1):
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:])
    #print(c_smiles[n], c_smiles[n+1:])
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
        d1.append(name_3[n])
        d2.append(name_3[n+1:][m])
#print()

d = {'ID_1':d1, 'query':qu, 'ID_2':d2, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
for index, row in df.iterrows():
    print (row["ID_1"], row["query"], row["ID_2"], row["target"], row["Similarity"])
print(df_final)

# save as csv
df_final.to_csv('RESULT_3.csv', index=False, sep=',')
Edited the answer to address all the comments.
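RDKit has a bulk similarity function, so you can compare one fingerprint against a whole list of fingerprints. Just loop over the list of fingerprints.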
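As a minimal sketch of what the bulk call does, the snippet below compares one fingerprint against the rest of a short list and checks the result against one-pair-at-a-time calls. The three SMILES are the ones from the question. Chem.RDKFingerprint with default settings is used here for brevity, which is an assumption and not necessarily the same parameters as FingerprintMols.FingerprintMol.

```python
from rdkit import Chem
from rdkit import DataStructs

# Example SMILES taken from the question; any valid SMILES would do.
smiles = ['CCOC', 'CCO', 'COC']
mols = [Chem.MolFromSmiles(s) for s in smiles]

# RDKit topological fingerprints with default settings (an assumption).
fps = [Chem.RDKFingerprint(m) for m in mols]

# One call compares the first fingerprint against all the others.
sims = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:])

# The same comparison, one pair at a time.
pairwise = [DataStructs.TanimotoSimilarity(fps[0], fp) for fp in fps[1:]]

for target, value in zip(smiles[1:], sims):
    print(smiles[0], target, value)
```

The bulk call returns a plain Python list with one value per target fingerprint, in the same order as the list that was passed in.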
If the CSVs look like this
First csv, with an invalid SMILES
smiles,value,value2
CCOCN(C)(C)C,0.25,A
CCO,1.12,B
COC,2.25,C
Second csv, with correct SMILES
smiles,value,value2
CCOCC,0.55,D
CCCO,2.58,E
CCCCO,5.01,F
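This is how to read the SMILES, drop the invalid ones, compute the fingerprint similarity for every pair without duplicates, and save the sorted values.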
from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
import pandas as pd

# read and concatenate the csv files
df_1 = pd.read_csv('first.csv')
df_2 = pd.read_csv('second.csv')
df_3 = pd.concat([df_1, df_2])

# check the SMILES and make a list of the valid ones
df_smiles = df_3['smiles']
c_smiles = []
for ds in df_smiles:
    try:
        cs = Chem.CanonSmiles(ds)
        c_smiles.append(cs)
    except Exception:
        print('Invalid SMILES:', ds)
print()

# make a list of mols
ms = [Chem.MolFromSmiles(x) for x in c_smiles]

# make a list of fingerprints (fp)
fps = [FingerprintMols.FingerprintMol(x) for x in ms]

# the lists for the dataframe
qu, ta, sim = [], [], []

# compare all fp pairwise without duplicates
for n in range(len(fps)-1): # -1 so the last fp is not used as a query
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:]) # +1 compare with the following fps up to the last one
    print(c_smiles[n], c_smiles[n+1:]) # which mol is compared with which group
    # collect the SMILES and values
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
print()

# build the dataframe and sort it
d = {'query':qu, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
print(df_final)

# save as csv
df_final.to_csv('third.csv', index=False, sep=',')
Printed output:
Invalid SMILES: CCOCN(C)(C)C
CCO ['COC', 'CCOCC', 'CCCO', 'CCCCO']
COC ['CCOCC', 'CCCO', 'CCCCO']
CCOCC ['CCCO', 'CCCCO']
CCCO ['CCCCO']
   query target  Similarity
9   CCCO  CCCCO    0.769231
2    CCO   CCCO    0.600000
1    CCO  CCOCC    0.500000
7  CCOCC   CCCO    0.466667
3    CCO  CCCCO    0.461538
8  CCOCC  CCCCO    0.388889
4    COC  CCOCC    0.333333
5    COC   CCCO    0.272727
0    CCO    COC    0.250000
6    COC  CCCCO    0.214286
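For the follow-up about molecule names, one likely cause of the errors is that pd.concat without ignore_index keeps each frame's original row labels, so a lookup such as name_3[n] can match more than one row. In addition, dropping an invalid SMILES leaves the name list longer than the fingerprint list. The sketch below keeps each name together with its SMILES and fingerprint so they cannot drift apart. The column name 'name', the values mol1 to mol6, and the use of Chem.RDKFingerprint with default settings are assumptions, so the similarity values may differ from the table above.

```python
from itertools import combinations

import pandas as pd
from rdkit import Chem
from rdkit import DataStructs

# Hypothetical data mirroring the CSV layout in the question; in practice
# these frames would come from pd.read_csv.
df_1 = pd.DataFrame({'name': ['mol1', 'mol2', 'mol3'],
                     'smiles': ['CCOCN(C)(C)C', 'CCO', 'COC']})
df_2 = pd.DataFrame({'name': ['mol4', 'mol5', 'mol6'],
                     'smiles': ['CCOCC', 'CCCO', 'CCCCO']})

# ignore_index=True gives every row a unique label after concatenation.
df_3 = pd.concat([df_1, df_2], ignore_index=True)

# Keep name, canonical SMILES and fingerprint in one record, so dropping
# an invalid molecule cannot shift names relative to fingerprints.
records = []
for name, smi in zip(df_3['name'], df_3['smiles']):
    mol = Chem.MolFromSmiles(smi) if isinstance(smi, str) else None
    if mol is None:
        print('Invalid SMILES:', smi)
        continue
    records.append((name, Chem.MolToSmiles(mol), Chem.RDKFingerprint(mol)))

# Visit each unordered pair exactly once, carrying the names along.
rows = []
for (n1, s1, fp1), (n2, s2, fp2) in combinations(records, 2):
    rows.append({'ID_1': n1, 'query': s1, 'ID_2': n2, 'target': s2,
                 'Similarity': DataStructs.TanimotoSimilarity(fp1, fp2)})

df_named = pd.DataFrame(rows).sort_values('Similarity', ascending=False)
print(df_named)
```

With more than 10,000 structures the number of pairs grows quadratically, so for large inputs it may be preferable to write rows out in chunks or keep only pairs above a similarity threshold.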