How to split a string with regexp without keeping capture groups?

Question

I want to split text in Python using a regular expression with a backreference.

import re
rexp = re.compile(r"([`]{1,})ABC\1")
rexp.split("blahblah``ABC``blahblah")

I get ['blahblah', '``', 'blahblah'], but I expected ['blahblah', 'blahblah']. How can I split the string without keeping the capture group?

Answer 1

From the re.split() documentation:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
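
The documented behavior can be seen in a small sketch. The input string and the variable names here are illustrative only and are not taken from the question:

```python
import re

text = "a-b-c"

# Capturing group: the matched separators are kept in the result.
with_group = re.split(r"(-)", text)

# Non-capturing group: only the text between separators is returned.
without_group = re.split(r"(?:-)", text)

print(with_group)     # ['a', '-', 'b', '-', 'c']
print(without_group)  # ['a', 'b', 'c']
```

A group that a backreference points to has to be capturing, which is why the non-capturing form alone does not solve the question's case.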

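Since you want to use a backreference, you can't avoid the first capture group, but you can make all the other groups non-capturing and then post-process the split result to get what you want, for example: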

rexp = re.compile(r"([`]{1,})->\s*(?:\S+)\s*\|(?:.+?)<-\1")
rexp.split("blahblah``->Left|Right<-``blahblah")[0::2]  # ['blahblah', 'blahblah']

UPDATE: I just noticed that you changed your pattern in the meantime, but the principle is the same:

rexp = re.compile(r"([`]{1,})ABC\1")  # also, if optimizing, equivalent to: (`+)ABC\1
rexp.split("blahblah``ABC``blahblah")[0::2]  # ['blahblah', 'blahblah']
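
The [0::2] slice works because a pattern with one capture group makes the result alternate between text and group. A minimal sketch of a generalization follows, assuming you want the same behavior for any number of capture groups. The helper name split_dropping_groups is hypothetical; Pattern.groups is the standard attribute holding the number of capturing groups:

```python
import re

def split_dropping_groups(pattern, text):
    """Split text on a compiled pattern and discard the captured group text."""
    # Each match contributes one list entry per capturing group,
    # so the pieces of text sit at every (groups + 1)-th position.
    step = pattern.groups + 1
    return pattern.split(text)[::step]

rexp = re.compile(r"(`+)ABC\1")
result = split_dropping_groups(rexp, "blahblah``ABC``blahblah")
print(result)  # ['blahblah', 'blahblah']
```

With a pattern that has no capture groups the step is 1, so the helper returns the plain split result unchanged.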

Answer 2

You could first replace the split pattern with a unique delimiter and then split on that delimiter:

>>> import re
>>> s="blahblah``ABC``blahblah"
>>> delim="<-split->"
>>> re.split(delim, re.sub(r"([`]+)ABC\1", delim, s))
['blahblah', 'blahblah']

The upside of this approach is that you don't need to make any assumptions about where the split pattern sits in the string.

You can also use Python's faster str.split, since you have converted the regex target into a fixed string:

>>> re.sub(r"([`]+)ABC\1", delim, s).split(delim)
['blahblah', 'blahblah']
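
One thing this approach relies on is that the delimiter does not already occur in the input. A minimal sketch of a guarded version, assuming that failing loudly is acceptable in that case; the helper name split_via_delimiter and the ValueError are illustrative choices, not part of the answer above:

```python
import re

def split_via_delimiter(pattern, text, delim="<-split->"):
    """Replace every match of pattern with delim, then split on delim."""
    if delim in text:
        # The delimiter must be unique, otherwise extra splits would appear.
        raise ValueError("delimiter already occurs in the input text")
    # A function replacement inserts delim literally, so backslashes
    # in a custom delimiter are not treated as escape sequences.
    return pattern.sub(lambda match: delim, text).split(delim)

rexp = re.compile(r"(`+)ABC\1")
result = split_via_delimiter(rexp, "blahblah``ABC``blahblah")
print(result)  # ['blahblah', 'blahblah']
```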

Update

Timings showing that this is as fast as the accepted answer:

import re

def f1(s):
    # Accepted answer: split on the pattern, then drop the captured group.
    rexp = re.compile(r"([`]{1,})ABC\1")
    return rexp.split(s)[0::2]

def f2(s):
    # Replace each match with a fixed delimiter, then regex-split on it.
    delim = "<-split->"
    rexp1 = re.compile(r"([`]+)ABC\1")
    rexp2 = re.compile(delim)
    return rexp2.split(rexp1.sub(delim, s))

def f3(s):
    # Same replacement, but split with the plain str.split method.
    delim = "<-split->"
    rexp = re.compile(r"([`]+)ABC\1")
    return rexp.sub(delim, s).split(delim)

if __name__ == '__main__':
    import timeit
    for case, x in (('small', 1000), ('med', 10000), ('large', 1000000)):
        s = "blahblah``ABC``blahblah" * x
        print("Case {}, {:,} x, All equal: {}".format(case, x, (f1(s) == f2(s) == f3(s))))
        for f in (f1, f2, f3):
            print("   {:^10s}{:.4f} secs".format(f.__name__, timeit.timeit("f(s)", setup="from __main__ import f, s", number=10)))

On my old iMac with Python 3.6, this prints:

Case small, 1,000 x, All equal: True
       f1    0.0049 secs
       f2    0.0048 secs
       f3    0.0045 secs
Case med, 10,000 x, All equal: True
       f1    0.0512 secs
       f2    0.0536 secs
       f3    0.0526 secs
Case large, 1,000,000 x, All equal: True
       f1    5.2092 secs
       f2    5.6808 secs
       f3    5.5388 secs

With PyPy, doing it the way I suggested is faster:

Case small, 1,000 x, All equal: True
       f1    0.0020 secs
       f2    0.0021 secs
       f3    0.0012 secs
Case med, 10,000 x, All equal: True
       f1    0.0325 secs
       f2    0.0288 secs
       f3    0.0217 secs
Case large, 1,000,000 x, All equal: True
       f1    4.4900 secs
       f2    3.0680 secs
       f3    2.1079 secs

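So I'm not sure what you mean when you say that for very large input strings this is a horrible cost. The timings show it is the same or faster, even for huge input strings.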
