Regex splitting of multiple grouped delimiters

Question

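How can I split on grouped delimiter combinations such as 1. or 2)?

For example, given a string such as '1. I like food! 2. She likes 2 baloons.', how would you split it into separate sentences?

As another example, given the input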

'1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'

the output should be

['3D Technical', 'Process animations', 'Explainer videos', 'Product launch videos']

I tried:

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
re.split(r'[1.2.3.,1)2)3)/]+|etc', a)

The output was:

['',
 'D Technical',
 'Process animations',
 ' Explainer videos',
 ' Product launch videos']

Answer 1

Here is one way to get the expected result:

import re

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
# Split on the numbered markers, strip each piece, and drop empty strings
r = [s for s in map(str.strip, re.split(r',? *[0-9]+(?:\)|\.) ?', a)) if s]

print(*r, sep='\n')
3D Technical/Process animations
Explainer videos
Product launch videos

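The delimiter pattern r',? *[0-9]+(?:\)|\.) ?' breaks down as follows:
- ,? an optional comma trailing the previous item
-  * (a space followed by *) zero or more spaces before the number
- [0-9]+ a sequence of at least one digit
- (?:\)|\.) followed by a closing parenthesis or a period. The leading ?: makes this a non-capturing group, so re.split does not include it in the output
-  ? (a space followed by ?) an optional space after the parenthesis or period (you may want to remove the ? or replace it with a +, depending on your actual data)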
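
To illustrate why the non-capturing group matters, here is a minimal sketch that compares the two forms. The sample string and variable names are made up for this illustration; only the ?: differs between the two patterns.

```python
import re

text = '1) apples 2. pears'  # hypothetical sample input

# Non-capturing group: the matched delimiter is discarded entirely.
non_capturing = re.split(r'[0-9]+(?:\)|\.) ?', text)
print(non_capturing)

# Capturing group: re.split interleaves the captured ')' or '.' with the pieces.
capturing = re.split(r'[0-9]+(\)|\.) ?', text)
print(capturing)
```

With the capturing variant, the closing parenthesis and the period show up as separate list items and would have to be filtered out afterwards.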

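The output of re.split is mapped through str.strip to remove leading/trailing spaces. This happens inside a list comprehension that also filters out empty strings (e.g. the one before the first delimiter).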
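
As a rough sketch of what the stripping and filtering do, the snippet below prints the raw re.split result next to the cleaned one. The input string is invented for this illustration and deliberately contains stray spaces.

```python
import re

# Hypothetical input with extra spaces around the items
a = '1)  apples , 2) pears '

raw = re.split(r',? *[0-9]+(?:\)|\.) ?', a)
print(raw)      # includes an empty first item and padded strings

cleaned = [s for s in map(str.strip, raw) if s]
print(cleaned)  # stripped, with the empty item removed
```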

If commas or slashes without numbering are also used as delimiters, you can add them to the pattern:

def splitItems(a):
    # A bare '/' or ',' is a delimiter too, alongside the numbered markers
    pattern = r'/|,|(?:,? *[0-9]+(?:\)|\.) ?)'
    return [s for s in map(str.strip, re.split(pattern, a)) if s]

Output:

a = '3D Technical/Process animations, Explainer videos, Product launch videos'
print(*splitItems(a),sep='\n')

3D Technical
Process animations
Explainer videos
Product launch videos


a = '1. Hello 2. Hi'
print(*splitItems(a),sep='\n')
Hello
Hi

a = "Great, what's up?! , Awesome"
print(*splitItems(a),sep='\n')
Great
what's up?!
Awesome

a = '1. Medicines2. Devices 3.Products'
print(*splitItems(a),sep='\n')
Medicines
Devices
Products

a = 'ABC/DEF/FGH'
print(*splitItems(a),sep='\n')
ABC
DEF
FGH
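
One consequence of the combined pattern is that every comma or slash splits, including those inside a numbered item. The sketch below shows this with an invented input; split_combined is just a renamed copy of the function above so that the snippet is self-contained.

```python
import re

def split_combined(a):
    # Same pattern as above: '/', ',' and numbered markers are all delimiters.
    pattern = r'/|,|(?:,? *[0-9]+(?:\)|\.) ?)'
    return [s for s in map(str.strip, re.split(pattern, a)) if s]

# Hypothetical input: the first numbered item contains a comma
a = '1. Stocks, Bonds and Funds 2. Estate Planning'
parts = split_combined(a)
print(*parts, sep='\n')
```

Here the first item comes back as two pieces ('Stocks' and 'Bonds and Funds'), which may not be what you want when the numbering is the real delimiter.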

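If your delimiters are a list of either-or patterns (meaning that only one pattern consistently applies to any given string), you can try them in order of precedence in a loop and return the first split that produces more than one part: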

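def splitItems(a):
    # Try the patterns in order of precedence: numbered markers, then commas, then slashes
    for pattern in (r'(?:,? *[0-9]+(?:\)|\.) ?)', r',', r'/'):
        result = [*map(str.strip, re.split(pattern, a))]
        # Stop at the first pattern that actually splits the string
        if len(result) > 1:
            break
    return [s for s in result if s]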

Output:

# inputs that use a single kind of delimiter split as above, and this one now works too:

a = '1. Arrangement of Loans for Listed Corporates and their Group Companies, 2. Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their   Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc 3. Estate Planning'
print(*splitItems(a),sep='\n')

Arrangement of Loans for Listed Corporates and their Group Companies
Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their   Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc
Estate Planning
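
A couple of edge cases of the loop may be worth checking, sketched below under the assumption that the function is used exactly as written above (split_by_priority is a renamed copy for this illustration). When no pattern produces more than one part, the loop falls through and the result of the last pattern is returned.

```python
import re

def split_by_priority(a):
    # Renamed copy of the prioritized splitItems above.
    for pattern in (r'(?:,? *[0-9]+(?:\)|\.) ?)', r',', r'/'):
        result = [*map(str.strip, re.split(pattern, a))]
        if len(result) > 1:
            break
    return [s for s in result if s]

no_delimiter = split_by_priority('Estate Planning')     # nothing to split on
single_item = split_by_priority('1. Estate Planning')   # one numbered item
empty_input = split_by_priority('')                     # empty string
print(no_delimiter, single_item, empty_input)
```

In all three cases the caller gets a list back, so the result can be iterated without special handling.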

python

python-re