How can I interpret the features obtained from Chem.RDKFingerprint(mol)?
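I did the following to get a fingerprint from a mol file. Converting it with fp.ToBitString() gives me a vector of length 2048. When I count them, the number of 1s is the same as the number of atoms in the molecule. How can we interpret this vector? Any suggestion or link for interpretation would be great.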
from rdkit import Chem

mol = Chem.MolFromSmiles(ms)  # ms holds the SMILES string of the molecule
fp = Chem.RDKFingerprint(mol)
fp.ToBitString()
This is the vector I got:
'0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000010000000000000000'
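As far as I know, RDKFingerprint is a "Daylight-like" substructure fingerprint: a bit vector in which each bit is set by the presence of a particular substructure in the molecule. With the default settings (maxPath, default=7), it considers substructures up to 7 bonds long. Since there is no predefined set of substructures, it is not possible to reserve one bit for each possible pattern, so each substructure is instead used as a key that seeds a pseudo-random number generator ('hashing'). The generator's output is a set of bit positions (nBitsPerHash, default=2), each a number from 0 to fpSize - 1 (fpSize, default=2048), and those positions are set in the fingerprint. Because of the hashing, different substructures can land on the same bit, so a set bit corresponds to one or more bond paths and not to a single atom.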
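The hashing idea can be illustrated with a small, pure-Python toy. This is only a sketch: the function name toy_hashed_fingerprint, the string keys, and the use of zlib.crc32 as the seed are assumptions made for illustration, not RDKit's actual hash or path enumeration.

```python
import random
import zlib


def toy_hashed_fingerprint(keys, fp_size=2048, n_bits_per_hash=2):
    """Toy model of a hashed fingerprint (not RDKit's real algorithm).

    Each substructure key seeds a pseudo-random number generator, which
    then picks n_bits_per_hash positions in the range 0 .. fp_size - 1.
    """
    bits = [0] * fp_size
    bit_info = {}  # bit position -> keys that set it
    for key in keys:
        # crc32 gives a seed that is stable across runs (unlike hash()).
        rng = random.Random(zlib.crc32(key.encode("utf-8")))
        for _ in range(n_bits_per_hash):
            idx = rng.randrange(fp_size)
            bits[idx] = 1
            bit_info.setdefault(idx, []).append(key)
    return bits, bit_info


# Hypothetical "paths" written as plain strings, purely for illustration.
keys = ["C-O", "C-C", "C-N", "C-C-O", "C-C-N"]
bits, bit_info = toy_hashed_fingerprint(keys)
print(sum(bits), "bits set for", len(keys), "keys")
for idx in sorted(bit_info):
    print(idx, bit_info[idx])
```

With a small fp_size the same position is quickly shared by several keys, which is why a set bit cannot be read back as exactly one substructure without extra bookkeeping such as the bitInfo dictionary.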
RDKit has a nice tool for interpreting the set bits:
from rdkit.Chem import Draw
from rdkit import Chem
smiles = 'OC(CN1C=NC=N1)(CN1C=NC=N1)C1=C(F)C=C(F)C=C1'
mol = Chem.MolFromSmiles(smiles)
bit_info = {}
fp = Chem.RDKFingerprint(mol, maxPath=5, bitInfo=bit_info)
print(list(fp.GetOnBits())[:10]) # print the first 10 bits set to 1
# using the bit_info dictionary populated by RDKit prepare a visualisation
Draw.DrawRDKitBit(mol, 60, bit_info)
# draw multiple bits (12)
tpls = [(mol, x, bit_info) for x in bit_info]
Draw.DrawRDKitBits(tpls[:12], molsPerRow=4, legends=[str(x) for x in bit_info][:12])
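If a drawing is not convenient, the same bit_info dictionary can also be read as text. The sketch below assumes that each entry of bit_info maps a bit to a list of bond-index paths (which is how the drawing helpers above use it) and turns each path into a fragment SMILES with Chem.PathToSubmol. The helper name fragments_for_bit is made up for this example.

```python
from rdkit import Chem

smiles = 'OC(CN1C=NC=N1)(CN1C=NC=N1)C1=C(F)C=C(F)C=C1'
mol = Chem.MolFromSmiles(smiles)

bit_info = {}
fp = Chem.RDKFingerprint(mol, maxPath=5, bitInfo=bit_info)


def fragments_for_bit(mol, bit_info, bit_id):
    """Return the distinct fragment SMILES whose bond paths set bit_id."""
    fragments = set()
    for bond_path in bit_info[bit_id]:
        # Each path is a sequence of bond indices into mol.
        submol = Chem.PathToSubmol(mol, list(bond_path))
        fragments.add(Chem.MolToSmiles(submol))
    return sorted(fragments)


# Show the first few set bits, how many paths hit each, and the fragments.
for bit_id in sorted(bit_info)[:5]:
    print(bit_id, len(bit_info[bit_id]), fragments_for_bit(mol, bit_info, bit_id))
```

A bit that lists more than one distinct fragment is a hash collision: several different substructures were mapped to the same position.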
Some recommended reading: