Split a string with a regex by newlines, symbols, and whitespace in Python

Question

I'm new to the regular expression library, and I'm trying to turn a text like this

"""constructor SquareGame new(){
let square=square;
}"""

into a list like this:

['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '}']

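I need to create a list of tokens separated by whitespace, newlines, and these symbols: {}()[].;,+-*/&|<>=~.

I used re.findall('[,;.()={}]+|\S+|\n', text), but it seems to separate the tokens only by whitespace and newlines.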

Answer 1

You can use

re.findall(r'\w+|[^\w \t]', text)

To avoid matching any Unicode horizontal whitespace as well, use

re.findall(r'\w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]', text)

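See the regex demo. Details:

\w+ - one or more word characters
| - or
[^\w \t] - any single non-word character other than a space or a tab (so all vertical whitespace is matched).

You can add more horizontal whitespace characters to exclude to the [^\w \t] character class; see the list in "Match whitespace but not newlines". The regex would then look like \w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000].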
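As a quick illustration of the difference between the two patterns, here is a minimal sketch. The sample string is made up for this example and contains a non-breaking space (U+00A0) between "let" and "x"; it is not taken from the question.

```python
import re

# Hypothetical sample: "let" and "x" are separated by a non-breaking space (U+00A0).
sample = "let\u00A0x = 1;\n"

basic = r"\w+|[^\w \t]"
extended = r"\w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"

# The basic pattern only skips plain spaces and tabs, so U+00A0 becomes a token.
basic_tokens = re.findall(basic, sample)
# The extended pattern skips the listed Unicode horizontal whitespace as well.
extended_tokens = re.findall(extended, sample)

print(basic_tokens)     # ['let', '\xa0', 'x', '=', '1', ';', '\n']
print(extended_tokens)  # ['let', 'x', '=', '1', ';', '\n']
```

If the input is known to contain only ASCII spaces and tabs, the shorter pattern should be enough.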

See the Python demo:

import re
pattern = r"\w+|[^\w \t]"
text = "constructor SquareGame new(){\nlet square=square;\n}"
print(re.findall(pattern, text))
# => ['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
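
For comparison, here is a small sketch of what the pattern from the question appears to do on the same text. The likely cause is that \S+ consumes every run of non-whitespace characters, symbols included, so the symbol alternative only gets a chance when a token starts with a symbol. The pattern is written as a raw string here, which is an adjustment to the original.

```python
import re

text = "constructor SquareGame new(){\nlet square=square;\n}"

# Pattern from the question, written as a raw string.
question_pattern = r"[,;.()={}]+|\S+|\n"

# \S+ swallows "new(){" and "square=square;" whole, because the symbol
# alternative is only tried at the position where a match attempt begins.
question_tokens = re.findall(question_pattern, text)

print(question_tokens)
# ['constructor', 'SquareGame', 'new(){', '\n', 'let', 'square=square;', '\n', '}']
```

Replacing \S+ with \w+ keeps word tokens from running into the symbols that follow them.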

Answer 2

This regex matches only the characters you specified, which I think is a safer approach.

>>> re.findall(r"\w+|[{}()\[\].;,+\-*/&|<>=~\n]", text)
['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
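
One consequence of listing the symbols explicitly is worth checking before relying on it: characters that are neither word characters nor in the list are dropped without any error. A minimal sketch, using a made-up input with double quotes (which are not among the symbols in the question):

```python
import re

# Hypothetical input containing double quotes, which are not in the symbol list.
sample = 'let s = "hi";'

explicit = r"\w+|[{}()\[\].;,+\-*/&|<>=~\n]"  # pattern from this answer
catch_all = r"\w+|[^\w \t]"                    # pattern from Answer 1

# The explicit list skips the quotes; the catch-all keeps them as tokens.
explicit_tokens = re.findall(explicit, sample)
catch_all_tokens = re.findall(catch_all, sample)

print(explicit_tokens)   # ['let', 's', '=', 'hi', ';']
print(catch_all_tokens)  # ['let', 's', '=', '"', 'hi', '"', ';']
```

Which behaviour is preferable depends on whether unexpected characters should be ignored or surfaced as tokens.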
