How to split a sentence including punctuation

Question

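If I have the sentence sentence = 'There is light!' and split it with mysentence = sentence.split(), how can I get 'There, is, light, !' as the output of print(mysentence)? Specifically, I want to split a sentence so that punctuation comes out as separate items, either all punctuation or only a chosen list of punctuation marks. I have some code, but it works on the individual characters within each word rather than on the words themselves.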

out = "".join(c for c in punct1 if c not in ('!','.',':'))
out2 = "".join(c for c in punct2 if c not in ('!','.',':'))
out3 = "".join(c for c in punct3 if c not in ('!','.',':'))

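How can I use this so that it works on the words themselves instead of on each character in a word? For example, the output for "Hello how are you?" should become "Hello, how, are, you, ?". Is there any way to do this?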

Answer 1

You can use the \w+|[^\w\s] regex with re.findall to get those chunks:

\w+|[^\w\s]

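See the regex demo.

Pattern details:

\w+ - 1 or more word characters (letters, digits, or underscores)
| - or
[^\w\s] - 1 character that is neither a word character nor whitespace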

Python demo:

import re
p = re.compile(r'\w+|[^\w\s]')
s = "There is light!"
print(p.findall(s))  # ['There', 'is', 'light', '!']
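
The question also asks about splitting off only a chosen list of punctuation marks. One possible sketch, which is not part of the original answer, builds the character class from the chosen marks. The helper name split_keep_punct and its marks parameter are hypothetical and introduced only for illustration; any mark not in the list stays attached to its neighbouring word.

```python
import re

def split_keep_punct(text, marks="!.:"):
    """Split text on whitespace and emit each chosen mark as its own item.

    Hypothetical helper: `marks` is a string of single punctuation characters.
    """
    if not marks:
        raise ValueError("marks must contain at least one character")
    # re.escape keeps characters such as '-', ']' or '^' literal inside [...]
    cls = re.escape(marks)
    # Either one chosen mark, or a run of characters that are neither
    # whitespace nor a chosen mark.
    pattern = re.compile(rf"[{cls}]|[^\s{cls}]+")
    return pattern.findall(text)

print(split_keep_punct("There is light!"))
# ['There', 'is', 'light', '!']
print(split_keep_punct("Wait, what?!", marks="?!"))
# ['Wait,', 'what', '?', '!']
```

In the second call the comma is not in marks, so it remains part of 'Wait,'.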

Note: if you want to treat the underscore as punctuation, you need to use a pattern like [a-zA-Z0-9]+|[^A-Za-z0-9\s].
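
To make the difference concrete, here is a small comparison of the two patterns on an input that contains an underscore. The sample string and the variable names are made up for illustration.

```python
import re

s = "snake_case works!"

# \w includes the underscore, so 'snake_case' stays in one piece.
with_underscore = re.findall(r"\w+|[^\w\s]", s)
print(with_underscore)   # ['snake_case', 'works', '!']

# The explicit ASCII classes treat '_' like any other punctuation mark.
underscore_split = re.findall(r"[a-zA-Z0-9]+|[^A-Za-z0-9\s]", s)
print(underscore_split)  # ['snake', '_', 'case', 'works', '!']
```

One side effect worth keeping in mind: the explicit ASCII classes also treat non-ASCII letters such as é as punctuation, whereas \w matches them in Python 3 string patterns.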

Update (after comments)

To make sure apostrophes are matched as part of a word, append (?:'\w+)* or (?:'\w+)? to \w+ in the pattern above:

import re
p = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
s = "There is light!? I'm a human"
print(p.findall(s))  # ['There', 'is', 'light', '!', '?', "I'm", 'a', 'human']

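See the updated demo.

(?:'\w+)* matches zero or more occurrences of an apostrophe followed by one or more word characters. If you use ? instead of *, it matches one or zero such occurrences.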
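
The two quantifiers only differ on words with more than one internal apostrophe. A short sketch with a made-up sample string shows this, and also shows that an apostrophe not followed by a word character (such as a quote mark around a word) still comes out as a separate item:

```python
import re

s = "rock'n'roll isn't 'quoted'"

# '*' allows any number of apostrophe + word-character groups after the word.
star = re.findall(r"\w+(?:'\w+)*|[^\w\s]", s)
print(star)
# ["rock'n'roll", "isn't", "'", 'quoted', "'"]

# '?' allows at most one such group, so the second apostrophe splits the word.
optional = re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)
print(optional)
# ["rock'n", "'", 'roll', "isn't", "'", 'quoted', "'"]
```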

Tags: string, split, punctuation, python-3.x