Splitting a string with delimiters and conditions

Question

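I am trying to split a general chemical reaction string delimited by whitespace, +, and =, where there may be an arbitrary number of spaces. That is the general case, but I also need it to split conditionally on the parenthesis characters () when a + is found inside the ().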

For example:

    reaction= 'C5H6 + O = NC4H5 + CO + H'

should be split into

    splitresult=['C5H6','O','NC4H5','CO','H']

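This case seems simple when using filter(None,re.split('[\s+=]',reaction)). But now comes the conditional splitting. Some reactions will have a (+M), which I would also like to split off, leaving only the M. In this case, there will always be a +M inside the parentheses.

For example: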

    reaction='C5H5 + H (+M)= C5H6 (+M)'
    splitresult=['C5H5','H','M','C5H6','M']

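However, in some cases the parentheses are not delimiters. In these cases there will not be a +M, but something else that does not matter.

For example: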

    reaction='C5H5 + HO2 = C5H5O(2,4) + OH'
    splitresult=['C5H5','HO2','C5H5O(2,4)','OH']

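My best guess is to use negative lookahead and lookbehind to match the +M, but I am not sure how to incorporate that into the regex I used above for the simple case. My intuition is to use something like filter(None,re.split('[(?<=M)\)\((?=\+)=+\s]',reaction)). Any help is much appreciated.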

Answer 1

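Splitting the string with a single regular expression seems overly complicated. It would be a lot easier to handle the special case of (+M) separately:

    import re
    halfway = re.sub(r"\(\+M\)", "M", reaction)
    # list() is needed on Python 3, where filter() returns an iterator
    result = list(filter(None, re.split(r'[\s+=]', halfway)))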
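The two steps above can be wrapped into a small helper and checked against the three examples from the question. This is only a minimal sketch for Python 3; the function name `split_reaction` is made up for illustration and is not part of the original answer.

```python
import re

def split_reaction(reaction):
    """Split a reaction string into species names (hypothetical helper)."""
    # Step 1: turn the literal "(+M)" into a bare "M".
    halfway = re.sub(r"\(\+M\)", "M", reaction)
    # Step 2: split on whitespace, '+' and '=', then drop empty strings.
    return [part for part in re.split(r"[\s+=]", halfway) if part]

print(split_reaction('C5H6 + O = NC4H5 + CO + H'))
print(split_reaction('C5H5 + H (+M)= C5H6 (+M)'))
print(split_reaction('C5H5 + HO2 = C5H5O(2,4) + OH'))
```

One caveat of this approach: if a reaction were written without a space before the parenthesis, such as H(+M), the substitution would glue the two names together as HM. Replacing with " M" (with a leading space) instead of "M" would avoid that.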

Answer 2

Here is the regex you are looking for.

Regex: ((?=\(\+)\()|[\s+=]|((?<=M)\))

Flags used:

g for global search. Or use whichever flags suit your situation.

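Explanation:

((?=\(\+)\() matches a ( only if it is followed by a +. This covers the first part of your (+M) problem.
((?<=M)\)) matches a ) only if it is preceded by an M. This covers the second part of your (+M) problem.
[\s+=] matches all the remaining whitespace, + and =. This covers the last part of your problem.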

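Note: parentheses surrounding digits are left alone thanks to the positive lookahead and positive lookbehind assertions.

Check the Regex101 demo to see it working.

P.S.: Adapt it to your needs, as I am not a Python programmer yet.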
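One possible Python adaptation is sketched below. It assumes the pattern is passed to `re.split`, which inserts the text of any capturing group into the result list, so the outer capturing parentheses are dropped here and only the lookarounds are kept. The names `PATTERN` and `split_with_lookarounds` are illustrative, not from the answer.

```python
import re

# The same three alternatives, without capturing groups:
#   \((?=\+)   a "(" only when followed by "+"
#   [\s+=]     whitespace, "+" or "="
#   (?<=M)\)   a ")" only when preceded by "M"
PATTERN = re.compile(r"\((?=\+)|[\s+=]|(?<=M)\)")

def split_with_lookarounds(reaction):
    """Split a reaction string and drop the empty pieces (hypothetical helper)."""
    return [part for part in PATTERN.split(reaction) if part]

print(split_with_lookarounds('C5H5 + H (+M)= C5H6 (+M)'))
print(split_with_lookarounds('C5H5 + HO2 = C5H5O(2,4) + OH'))
```

With the capturing groups left in, the "(" and ")" characters themselves would show up as items in the list returned by `re.split`, and `filter(None, ...)` would not remove them.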

Answer 3

You can use re.findall() instead:

re.findall(pattern, string, flags=0) Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

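Then:

    import re
    reaction0 = 'C5H6 + O = NC4H5 + CO + H'
    reaction1 = 'C5H5 + H (+M)= C5H6 (+M)'
    reaction2 = 'C5H5 + HO2 = C5H5O(2,4) + OH'
    # A species name, optionally followed by a "(digit,digit)" suffix
    pattern = r'[A-Z0-9]+(?:\([1-9],[1-9]\))?'
    re.findall(pattern, reaction0)  # ['C5H6', 'O', 'NC4H5', 'CO', 'H']
    re.findall(pattern, reaction1)  # ['C5H5', 'H', 'M', 'C5H6', 'M']
    re.findall(pattern, reaction2)  # ['C5H5', 'HO2', 'C5H5O(2,4)', 'OH']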

But if you prefer re.split() and filter(), then:

    import re
    reaction0 = 'C5H6 + O = NC4H5 + CO + H'
    reaction1 = 'C5H5 + H (+M)= C5H6 (+M)'
    reaction2 = 'C5H5 + HO2 = C5H5O(2,4) + OH'
    # Runs of delimiters, except the parentheses of a "(digit,digit)" suffix
    pattern = r'(?<!,[1-9])[\s+=()]+(?![1-9,])'
    # list() is needed on Python 3, where filter() returns an iterator
    list(filter(None, re.split(pattern, reaction0)))
    list(filter(None, re.split(pattern, reaction1)))
    list(filter(None, re.split(pattern, reaction2)))

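The pattern for findall is different from the pattern for split because findall and split are looking for different things; 'the opposite things', indeed.

findall is looking for what you want (keep it).

split is looking for what you don't want (get rid of it).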
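As a quick check of this "opposite things" idea, the sketch below runs both patterns from this answer over the three sample reactions and compares the lists. The variable names (`FIND_PATTERN`, `SPLIT_PATTERN`, `results`) are only illustrative.

```python
import re

FIND_PATTERN = r'[A-Z0-9]+(?:\([1-9],[1-9]\))?'      # what to keep
SPLIT_PATTERN = r'(?<!,[1-9])[\s+=()]+(?![1-9,])'    # what to throw away

reactions = [
    'C5H6 + O = NC4H5 + CO + H',
    'C5H5 + H (+M)= C5H6 (+M)',
    'C5H5 + HO2 = C5H5O(2,4) + OH',
]

results = []
for reaction in reactions:
    kept = re.findall(FIND_PATTERN, reaction)
    # re.split can leave empty strings at the edges, so drop them.
    remaining = [part for part in re.split(SPLIT_PATTERN, reaction) if part]
    results.append((kept, remaining))
    print(kept == remaining, kept)
```

For these three inputs both approaches produce the same lists; that is not guaranteed for reaction strings with other formats.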

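In findall, '[A-Z0-9]+(?:\([1-9],[1-9]\))?' matches any uppercase letter or digit > [A-Z0-9], one or more times > +, followed by a pair of digits with a comma between them, inside parentheses > \([1-9],[1-9]\) (literal parentheses outside a character class must be escaped with a backslash '\'), optionally > ?

\([1-9],[1-9]\) is inside (?: ), followed by ? (which makes it optional). A plain ( ) instead of (?: ) would match the same text, but re.findall would then return the captured group rather than the whole match, so (?: ), a non-capturing group, is the better choice here.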
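To see why the non-capturing group matters with re.findall, here is a small comparison on the third sample reaction. It relies on the documented behaviour quoted earlier: when the pattern contains a group, findall returns the group instead of the whole match. The variable names are illustrative.

```python
import re

reaction = 'C5H5 + HO2 = C5H5O(2,4) + OH'

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'[A-Z0-9]+(?:\([1-9],[1-9]\))?', reaction)

# Capturing group: findall returns only the group, '' where it did not match.
capturing = re.findall(r'[A-Z0-9]+(\([1-9],[1-9]\))?', reaction)

print(non_capturing)  # ['C5H5', 'HO2', 'C5H5O(2,4)', 'OH']
print(capturing)      # ['', '', '(2,4)', '']
```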

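In split, '(?<!,[1-9])[\s+=()]+(?![1-9,])' matches one or more whitespace characters, +, =, ( or ) > [\s+=()]+, but only where the run is not preceded by a comma and a digit > (?<!,[1-9]) and not followed by a digit or a comma > (?![1-9,]), so the parentheses in something like (2,4) are not treated as delimiters.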
