How can I interpret the features obtained from Chem.RDKFingerprint(mol)

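I did the following to obtain a fingerprint from a mol file. Converting it with fp.ToBitString() gives me a vector of length 2048. When I count the 1s, there are as many as there are atoms in the molecule. How can this vector be interpreted? Any suggestions or links that explain it would be appreciated.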

from rdkit import Chem

mol = Chem.MolFromSmiles(ms)  # ms is a SMILES string
fp = Chem.RDKFingerprint(mol)
fp.ToBitString()

This is the vector I get:

'0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000010000000000000000'

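As far as I know, RDKFingerprint is a "Daylight-like" substructure fingerprint: a bit vector in which each bit is set by the presence of a particular substructure in the molecule. With the default setting (maxPath, default=7), substructures up to 7 bonds long are considered. Because there is no predefined set of substructures, it is not possible to reserve one bit for every pattern that might occur, so each substructure is instead used as the seed for a pseudo-random number generator ("hashing"). The generator's output is a set of bit positions (nBitsPerHash, default=2), numbers between 0 and fpSize (default=2048), and the corresponding bits are set in the fingerprint.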
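To make the hashing idea concrete, here is a toy sketch in plain Python. It is not RDKit's actual hash function. The path strings, the helper name toy_fingerprint and the use of SHA-256 as the seed source are all assumptions made for illustration. It only shows the general scheme described above: each substructure seeds a pseudo-random generator, which picks n_bits_per_hash positions in the range 0 to fp_size - 1.

```python
import hashlib
import random


def toy_fingerprint(substructures, fp_size=2048, n_bits_per_hash=2):
    """Toy hashed fingerprint (illustrative only, NOT RDKit's real hashing)."""
    bits = [0] * fp_size
    bit_info = {}  # bit position -> substructures that set it
    for sub in substructures:
        # Derive a reproducible integer seed from the substructure's text key.
        digest = hashlib.sha256(sub.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "big")
        rng = random.Random(seed)
        # Draw n_bits_per_hash positions and switch those bits on.
        for _ in range(n_bits_per_hash):
            pos = rng.randrange(fp_size)
            bits[pos] = 1
            bit_info.setdefault(pos, []).append(sub)
    return bits, bit_info


# Hypothetical path keys standing in for enumerated substructures.
paths = ["C-C", "C-O", "C-C-O", "C-N"]
bits, info = toy_fingerprint(paths)
print(sum(bits), "bits set for", len(paths), "substructures")
```

In this sketch the number of set bits is at most the number of distinct substructures times n_bits_per_hash, and it can be lower when two draws land on the same position. A single bit therefore does not correspond to an atom, and one bit may stand for more than one substructure.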

RDKit has a nice tool for interpreting the bits that are set:

from rdkit.Chem import Draw
from rdkit import Chem

smiles = 'OC(CN1C=NC=N1)(CN1C=NC=N1)C1=C(F)C=C(F)C=C1'
mol = Chem.MolFromSmiles(smiles)

bit_info = {}
fp = Chem.RDKFingerprint(mol, maxPath=5, bitInfo=bit_info)
print(list(fp.GetOnBits())[:10])  # print the first 10 bits set to 1

# using the bit_info dictionary populated by RDKit prepare a visualisation
Draw.DrawRDKitBit(mol, 60, bit_info)

# draw multiple bits (12)
tpls = [(mol, x, bit_info) for x in bit_info]
Draw.DrawRDKitBits(tpls[:12], molsPerRow=4, legends=[str(x) for x in bit_info][:12])

Some recommended reading: