How to use regex to replace tokens in text in Python

Question

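Given a list of tokens, I want to replace every occurrence of those tokens in a tokenized text with whitespace.

For example, given ['a', 'is'] and 'this is a test', the result should be 'this test'.

I tried the code from "How can I do multiple substitutions using regex in python?", but the output was 'th test'.

In addition, the list is long (about 1k tokens) and the text file is large, so speed also matters.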

Answer 1

This should solve your problem and be reasonably fast. The token list is converted to a set, so each lookup takes O(1) time on average:

tokens = ['a', 'is']
tokenized_text = 'this is a test'

# Build the set once, outside the loop, so it is not rebuilt for every word.
token_set = set(tokens)

val = ' '.join(word for word in tokenized_text.split(' ')
               if word not in token_set)
print(val)

Prints:

this test
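
If you do want a regex, as the title asks, here is a minimal sketch. The 'th test' result most likely came from the pattern matching 'is' inside 'this', so this version only matches a token when it is surrounded by whitespace or the ends of the string. The function name remove_tokens is just something I made up for illustration, and I have not measured whether it beats the set approach on a 1k-token list.

```python
import re

def remove_tokens(text, tokens):
    """Remove whole-word occurrences of tokens from whitespace-separated text.

    remove_tokens is a hypothetical helper name, not part of any library.
    """
    if not tokens:
        # An empty alternation would match the empty string everywhere.
        return text
    # Escape each token so regex metacharacters are treated literally;
    # put longer tokens first so the alternation prefers the longest one.
    alternation = '|'.join(
        re.escape(t) for t in sorted(tokens, key=len, reverse=True)
    )
    # (?<!\S) and (?!\S) require whitespace or a string boundary on each side,
    # so 'is' does not match inside 'this'. The trailing \s* also swallows the
    # separator that followed the removed token.
    pattern = re.compile(r'(?<!\S)(?:' + alternation + r')(?!\S)\s*')
    return pattern.sub('', text)

print(remove_tokens('this is a test', ['a', 'is']))  # this test
```

Note that a token removed from the very end of the text leaves the preceding space behind, so you may want to call .strip() on the result.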

python

python-re