Python re: split on all spaces and punctuation except the apostrophe

Question

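I want to split a string on all whitespace and punctuation except the apostrophe. Ideally a single quote should still act as a delimiter unless it is being used as an apostrophe. I also want to keep the delimiters. Example string: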
words = """hello my name is 'joe.' what's your's"""

This is the re pattern I have so far: splitted = re.split(r"[^'-\w]",words.lower()). I tried putting a single quote after the ^ character, but it doesn't work.

The output I want looks like this: splitted = [hello,my,name,is,joe,.,what's,your's]

Answer 1

I love regex golf!

import re

words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)

The part in parentheses is a group that matches either an apostrophe surrounded by letters or a single letter.

Edit:

This is a bit more flexible:

re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)

It is getting a bit unreadable at this point, though, and in practice you should probably use Woodford's answer.
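To see where the two patterns actually differ, here is a small comparison sketch. The second input, rock'n'roll, is a made-up string that does not appear in the question; it is chosen only because it has two apostrophes separated by a single letter.

```python
import re

words = """hello my name is 'joe.' what's your's"""

# First pattern: an apostrophe only matches as part of letter-apostrophe-letter,
# so the letter after it is consumed together with the apostrophe.
simple = r"\b(?:\w'\w|\w)+\b"

# Second pattern: the apostrophe matches on its own, provided it sits
# between two word characters (checked with lookarounds, nothing consumed).
flexible = r"\b(?:(?<=\w)'(?=\w)|\w)+\b"

print(re.findall(simple, words))
print(re.findall(flexible, words))

# Hypothetical extra input, not from the question.
tricky = "rock'n'roll"
print(re.findall(simple, tricky))    # ["rock'n", 'roll']
print(re.findall(flexible, tricky))  # ["rock'n'roll"]
```

On the question's string both patterns return the same list of words. Since findall only collects the words, the period after joe is not kept by either pattern.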

Answer 2

It may be simpler to just post-process your list after splitting, rather than trying to account for the quotes up front:

>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,.!?]", words.lower())  # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe", "'", "what's", "your's"]
>>> [word.strip("'") for word in split_words]
['hello', 'my', 'name', 'is', 'joe', '', "what's", "your's"]
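As a minimal sketch, the same split-then-clean idea can be wrapped in a function. The name split_and_strip is hypothetical, and the extra filtering step is an assumption on top of the answer: a token that consisted only of a quote becomes an empty string after stripping, so it is dropped here.

```python
import re


def split_and_strip(text):
    """Split on spaces and a few punctuation marks, then clean up quotes.

    Hypothetical helper, not part of the original answer.
    """
    # Same character class as in the answer; extend it as needed.
    parts = re.split(r"[ ,.!?]", text.lower())
    # strip("'") only removes quotes at the ends, so the apostrophe
    # inside "what's" is left alone.
    stripped = [part.strip("'") for part in parts]
    # Drop tokens that are empty after stripping.
    return [part for part in stripped if part]


words = """hello my name is 'joe.' what's your's"""
print(split_and_strip(words))
```

This approach discards the punctuation it splits on, so it does not return the period as a separate item.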

Answer 3

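One option is to use lookarounds to split at the desired positions, and a capture group for the delimiters you want to keep in the result.

After splitting, you can remove the empty entries from the resulting list.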

\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])

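The pattern matches

\s+ Match 1 or more whitespace characters
| Or
(?<=\s)' Match ' preceded by a whitespace character
| Or
'(?=\s) Match ' followed by a whitespace character
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1 when preceded by a word character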

See a regex demo and a Python demo.
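The breakdown above can also live next to the pattern itself. The following sketch compiles the same pattern with re.VERBOSE, which ignores whitespace outside character classes and allows # comments; the name SPLIT_PATTERN is only an illustrative choice.

```python
import re

# Same four alternatives as the one-line pattern, spread out with comments.
SPLIT_PATTERN = re.compile(
    r"""
      \s+                # one or more whitespace characters
    | (?<=\s)'           # a quote preceded by whitespace (opening quote)
    | '(?=\s)            # a quote followed by whitespace (closing quote)
    | (?<=\w)([,.!?])    # punctuation after a word character, captured so split keeps it
    """,
    re.VERBOSE,
)

words = """hello my name is 'joe.' what's your's"""

# re.split returns None for the capture group when another alternative
# matched, plus empty strings between adjacent delimiters; filter both out.
tokens = [s for s in SPLIT_PATTERN.split(words) if s]
print(tokens)
```

Only the punctuation alternative is wrapped in a capture group, which is why the period survives the split while spaces and quotes do not.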

Example

import re

pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)

Output

['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
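One limitation worth noting: the quote alternatives require whitespace next to the quote, so a quote at the very start or end of the string is left in place. The sketch below shows this and tries a variant that uses negative lookarounds instead. The variant is an assumption, not part of the answer, and the test string is made up.

```python
import re

original = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
# Hypothetical variant: treat a quote as a delimiter whenever it is NOT
# preceded by a word character, or NOT followed by one.
variant = r"\s+|(?<!\w)'|'(?!\w)|(?<=\w)([,.!?])"


def tokenize(pattern, text):
    # Drop the None and empty-string entries that re.split produces.
    return [s for s in re.split(pattern, text) if s]


quoted = "'hi there.'"  # made-up input with quotes at both ends
print(tokenize(original, quoted))  # ["'hi", 'there', '.', "'"]
print(tokenize(variant, quoted))   # ['hi', 'there', '.']

words = """hello my name is 'joe.' what's your's"""
print(tokenize(variant, words))
```

Neither pattern can tell a closing quote from a trailing possessive apostrophe, as in dogs' bones; both would remove it.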
