Filter rows if an alphabetical letter from a (SMILES) string is not from a list of elements

Question

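Problem

How do I filter SMILES strings out of a dataframe if any of the alphabetical letters (atoms) in the string, matched case-insensitively, come from the elements H, He, Li, Be, B? This is a truncated list; there are 80 of them.

Background

I have a database containing SMILES strings: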

The simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.

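(More information: Wikipedia link)

The aim is to remove rare elements and organometallics from the database.

I started with a single string to test the code before processing the dataframe. I wrote a loop to test for the characters in the string.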

strings = "[O+]#C[Ni-4](C#[O+])(C#[O+])C#[O+]"
list = ['Ni']
for i in list:
    if i in strings:
        print(i)

How do I iterate over the dataframe and filter it?

Answer 1

For the list/simplified version, do it the other way around: use the list to look for matches in the string.

strings = "[O+]#C[Ni-4](C#[O+])(C#[O+])C#[O+]"
elements = ['Ni', 'Sc']  # named 'elements' to avoid shadowing the built-in list

for symbol in elements:
    if symbol in strings:
        print(symbol)
    else:
        print('nah')

> Ni
> nah

To apply this across a dataframe, use np.where:

import numpy as np
import pandas as pd

df = pd.DataFrame({'smiles': ['sdflk', '[O+]#C[Ni-4](C#[O+])(C#[O+])C#[O+]']})
elements = ['Ni', 'Sc']

# mark rows whose SMILES contains any symbol in the list as 1, else 0
df['element'] = np.where(df.smiles.str.contains('|'.join(elements)), 1, 0)
df[df['element'] == 0]  # keep only the rows that do not contain a listed element
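The same filter can also be written without the intermediate 0/1 column, by using the boolean result of str.contains directly as a mask. The following is only a minimal sketch: the extra 'CCO' row, the names pattern, has_element and filtered, and the use of re.escape are illustrative assumptions rather than part of the original answer.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {'smiles': ['sdflk', '[O+]#C[Ni-4](C#[O+])(C#[O+])C#[O+]', 'CCO']}
)
elements = ['Ni', 'Sc']  # truncated, illustrative list

# Escape each symbol so it is matched literally, then OR them together.
pattern = '|'.join(re.escape(e) for e in elements)

# True where the SMILES string contains any listed symbol.
has_element = df['smiles'].str.contains(pattern)

# Keep only the rows that do NOT contain a listed element.
filtered = df[~has_element].reset_index(drop=True)
print(filtered)
```

Negating the mask with ~ drops the matching rows in one step, and the original df is left unchanged.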

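Note that this runs into problems when the dataframe contains strings such as Sc1, where S and c actually refer to sulfur and carbon in a simple aromatic ring rather than scandium (Sc). So we need a way to identify Sc only when it has no digit attached. A negative lookahead helps here.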

df['Sc'] = df['smiles'].str.contains(r'Sc(?!\d)')  # str.match would only test the start of each string
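The negative lookahead can be extended from Sc to a whole list of symbols. The sketch below uses only the standard re module. The helper names build_element_pattern and contains_listed_element and the sample SMILES strings are hypothetical, and the digit check is a heuristic rather than a full SMILES parser, so it may still misclassify unusual strings.

```python
import re


def build_element_pattern(elements):
    """Compile a regex matching any listed symbol not followed by a digit."""
    alternation = '|'.join(re.escape(e) for e in elements)
    return re.compile(rf'(?:{alternation})(?!\d)')


def contains_listed_element(smiles, pattern):
    """Return True if the pattern is found anywhere in the SMILES string."""
    return pattern.search(smiles) is not None


pattern = build_element_pattern(['Ni', 'Sc'])  # truncated, illustrative list

smiles_list = [
    '[O+]#C[Ni-4](C#[O+])(C#[O+])C#[O+]',  # nickel complex from the question
    'CSc1ccccc1',                          # sulfur bonded to an aromatic ring carbon
    '[Sc+3]',                              # scandium ion
    'CCO',                                 # ethanol
]
flags = [contains_listed_element(s, pattern) for s in smiles_list]
print(flags)
```

On a dataframe, the same helper could be applied row by row, for example with df['smiles'].apply(lambda s: contains_listed_element(s, pattern)), and the resulting boolean column used to drop the flagged rows.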

python

bioinformatics

python-3.x

jupyter-notebook