Regex: Only want space character before and after match

Question

I'm using a regex tokenizer on a text passage, and I want to extract all words that have only whitespace before and after them. This is my code:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"[0-9a-z][^\s']*[a-z]")

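For example, the sentence "we don't have 500 dollars" ends up as "we don have dollars". I want to drop "don", because it is not followed by whitespace. How can I do this?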

Answer 1

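You can use positive lookahead and lookbehind to achieve this. The lookbehind requires the start of the string or a whitespace character before the word, and the lookahead requires a whitespace character or the end of the string after it.

Code:

import re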

pattern = r"(?:(?<=^)|(?<=\s))([a-zA-Z0-9]+)(?:(?=\s)|(?=$))"
print(re.findall(pattern, "we don't have 500 dollars"))
print(re.findall(pattern, "Your money's no good here, Mr. Torrance"))

Output:

['we', 'have', '500', 'dollars']
['Your', 'no', 'good', 'Torrance']

You can try it here: https://regex101.com/r/IeLC88/3
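
The same idea can also be written with negative lookarounds, which fold the "start of string or whitespace" and "whitespace or end of string" cases into one assertion each. This is only a sketch of an alternative, not the pattern from the link above; the helper name whitespace_delimited_words is made up for illustration.

```python
import re

# Assumed alternative to the pattern above:
#   (?<!\S)  the match is not preceded by a non-whitespace character
#   (?!\S)   the match is not followed by a non-whitespace character
WORD = re.compile(r"(?<!\S)[a-zA-Z0-9]+(?!\S)")


def whitespace_delimited_words(text):
    """Return alphanumeric words bounded by whitespace or the string edges."""
    # Hypothetical helper name, used only for this example.
    return WORD.findall(text)


print(whitespace_delimited_words("we don't have 500 dollars"))
print(whitespace_delimited_words("Your money's no good here, Mr. Torrance"))
```

On the two sample sentences this prints the same lists as shown above. Because the pattern has no capturing group, it should also work as the pattern passed to RegexpTokenizer, though that is not tested here.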

python

regex

tokenize

nltk