Split strings on commas, 'and's, 'or's

Question

I want to go from a list written as a natural-language string to a Python list.

Sample inputs:

s1 = 'make the cake, walk the dog, and pick-up poo.'
s2 = 'flour, egg-whites and sand.'

Desired outputs:

split1 = ['make the cake', 'walk the dog', 'pick-up poo']
split2 = ['flour', 'egg-whites', 'sand']

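I want to split the string on commas (and periods), 'and's, and 'or's, while dropping the delimiters and any empty strings. Because use of the Oxford comma is not standardized, I can't just split on commas.

I tried the following: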

import re
[x.strip() for x in re.split('([A-Za-z -]+)', s1) if x not in ['', ',', '.']]

which gives:

['make the cake', 'walk the dog', 'and pick-up poo']

That is close. But for s2 it gives:

['flour', 'egg-whites and sand']

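I could post-process the elements and split each one again on (and|or), but I would really like to tokenize on the whole set of commas, 'and's, and 'or's at once.

I have tried some fancier regex splits with a negative lookahead for words like 'and', but the pattern still doesn't split on that word.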

[x.strip() for x in re.split('([A-Za-z -]+(?!and))', s2) if x not in ['', ',', '.']]
[x.strip() for x in re.split('([A-Za-z -]+(?!\band\b))', s2) if x not in ['', ',', '.']]

Both of these also give

['flour', 'egg-whites and sand']

I know there are lots of edge cases, but I feel I'm close and just missing something small.

Answer 1

You can use

\s*(?:\b(?:and|or)\b|[,.])\s*

See the regex demo. Details:

\s* - zero or more whitespace characters
(?:\b(?:and|or)\b|[,.]) - the whole word and or or, or a comma/period
\s* - zero or more whitespace characters

See a Python demo:

import re
rx = re.compile(r"\s*(?:\b(?:and|or)\b|[,.])\s*")
strings = ["make the cake, walk the dog, and pick-up poo.", "flour, egg-whites and sand."]
for s in strings:
    print( list(filter(None, rx.split(s))) )

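Note that commas and periods are usually not treated as delimiters when they are followed by a digit or enclosed between digits, so you might consider replacing [,.] with [,.](?!\d) or [,.](?!(?<=\d[,.])\d).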
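To make the difference between those two variants concrete, here is a minimal sketch. The constant names, the helper split_items, and the sample strings are made up for illustration; only the two patterns come from the note above.

```python
import re

# Variant 1: a comma/period is not a delimiter when a digit follows it.
RX_NOT_BEFORE_DIGIT = re.compile(r"\s*(?:\b(?:and|or)\b|[,.](?!\d))\s*")

# Variant 2: a comma/period is not a delimiter only when it sits
# between two digits (digit before it AND digit after it).
RX_NOT_BETWEEN_DIGITS = re.compile(
    r"\s*(?:\b(?:and|or)\b|[,.](?!(?<=\d[,.])\d))\s*"
)

def split_items(rx, text):
    """Split text with the compiled pattern rx and drop empty strings."""
    return [part for part in rx.split(text) if part]

# Both variants keep "1,000" and "2.5" intact.
print(split_items(RX_NOT_BEFORE_DIGIT, "1,000 apples, 2.5 pears and milk."))
# ['1,000 apples', '2.5 pears', 'milk']

# They differ when a digit follows the comma but no digit precedes it.
print(split_items(RX_NOT_BEFORE_DIGIT, "flour,2 eggs and milk"))
# ['flour,2 eggs', 'milk']
print(split_items(RX_NOT_BETWEEN_DIGITS, "flour,2 eggs and milk"))
# ['flour', '2 eggs', 'milk']
```

Which variant fits better depends on whether input such as "flour,2 eggs" (no space after the comma) should count as two items.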

Answer 2

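I think you need to handle this in passes:

Apply the punctuation split
Apply the conjunction split

This works for the two test cases you provided.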
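The two passes could be sketched roughly as follows. This is an illustrative sketch under the assumptions stated in the comments, not necessarily the exact code this answer had in mind; the function name split_in_passes is hypothetical.

```python
import re

def split_in_passes(text):
    """Split on punctuation first, then on the words 'and' / 'or'."""
    items = []
    # Pass 1: split on commas and periods.
    for chunk in re.split(r"[,.]", text):
        # Pass 2: split each chunk on the whole words "and" / "or".
        # \b keeps words such as "sand" from being split.
        for part in re.split(r"\b(?:and|or)\b", chunk):
            part = part.strip()
            if part:  # drop empty strings left over from the splits
                items.append(part)
    return items

s1 = 'make the cake, walk the dog, and pick-up poo.'
s2 = 'flour, egg-whites and sand.'
print(split_in_passes(s1))  # ['make the cake', 'walk the dog', 'pick-up poo']
print(split_in_passes(s2))  # ['flour', 'egg-whites', 'sand']
```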
