Keeping special marks when splitting text into tokens using regex

Question

I have this text 'I love this but I have a! question to?' and I am currently using

import re

text = 'I love this but I have a! question to?'
token_pattern = re.compile(r"(?u)\b\w+\b")
token_pattern.findall(text)

With this regex I get

['I','love', 'this', 'but', 'I', 'have', 'a', 'question', 'to']

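I didn't write this regex and I know nothing about regular expressions (I tried to understand them from examples but gave up). Now I need to change this regex so that it keeps question marks and exclamation marks and splits them off as separate tokens, so that it returns this list: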

['I','love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']

Any suggestions on how to do this would be appreciated.

Answer 1

Try this:

token_pattern = re.compile(r"(?u)[^\w ]|\b\w+\b")
token_pattern.findall(text)

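This also matches every character that is neither a word character nor a space, each as its own single-character token. Only the literal space is excluded, so tabs and newlines would come back as tokens too; use [^\w\s] in place of [^\w ] if that matters for your input.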
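
As a quick check, here is a minimal sketch that runs this pattern on the text from the question. The second input, `other`, is a hypothetical string added only to show that punctuation besides `!` and `?` is kept as well.

```python
import re

# Either one character that is neither a word character nor a space,
# or a whole word delimited by word boundaries.
token_pattern = re.compile(r"(?u)[^\w ]|\b\w+\b")

text = "I love this but I have a! question to?"
tokens = token_pattern.findall(text)
print(tokens)
# ['I', 'love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']

# Hypothetical extra input: the comma is kept as a token too.
other = "Wait, really?!"
other_tokens = token_pattern.findall(other)
print(other_tokens)
# ['Wait', ',', 'really', '?', '!']
```

Because the first alternative matches exactly one character, consecutive marks such as `?!` come back as two separate tokens rather than one.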

If you really only need question marks and exclamation marks, you can change the regex to

token_pattern = re.compile(r"(?u)[!?]|\b\w+\b")
token_pattern.findall(text)
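
To see how the narrower pattern behaves, the sketch below runs it on the text from the question and on a hypothetical string, `mixed`, which is an assumption added for illustration. It suggests that only `!` and `?` survive while other punctuation is still dropped.

```python
import re

# Either a single '!' or '?', or a whole word delimited by word boundaries.
token_pattern = re.compile(r"(?u)[!?]|\b\w+\b")

text = "I love this but I have a! question to?"
tokens = token_pattern.findall(text)
print(tokens)
# ['I', 'love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']

# Hypothetical extra input: the comma is not in [!?], so it is skipped.
mixed = "Stop, now!"
mixed_tokens = token_pattern.findall(mixed)
print(mixed_tokens)
# ['Stop', 'now', '!']
```

To keep further marks, add them to the character class, for example `[!?.,]`.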
