如何标记正则表达式模式本身？

Question

很多关于如何使用正则表达式标记某些字符串的问题。

不过，我正在寻找标记化正则表达式模式本身的方法，我确定有一些关于该主题的帖子，但我找不到它们。

示例：

^\w$                    -> ['^', '\w', '&']
[3-7]*                  -> ['[3-7]*']
\w+\s\w+                -> ['\w+', '\s', '\w+']
(xyz)*\s[a-zA-Z]+[0-9]? -> ['(xyz)*','\s','[a-zA-Z]+','[0-9]?']

我假设这项工作是在调用某些 regexp 函数时在 python 中完成的。

Answer 1

一个起点：PyPy project has an implementation of Python in (mostly) Python. The re.py in the source distribution calls sre-compile.c:_compile() 完成工作。您也许可以破解它以提供您想要的输出形式。

编辑另外，Javascript XRegExp library parses regexes in an extended syntax and renders them to standard syntax. The parser routine 可能会对您有所帮助。

如何标记正则表达式模式本身？

How to tokenize the regexp pattern ITSELF?

python

regex

tokenize

python-2.7