NLP, ignoring irrelevant words

I developed a simple extractor that pulls a passport number out of words (for example, the input '135 35 0' gives the output 1353500).

But how do I filter out irrelevant words such as 'ok', 'mhm', and so on?

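For example, a human might say 'ok it is 1353500', and the bot would extract meaningless digits from 'ok', 'it', and 'is', which is bad. The question is: how do I ignore those non-numeric words?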

These are basically stop words. To remove them, you need to download the NLTK package that contains all the English stop words.

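import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop-word corpus
stop_words = set(stopwords.words('english'))
# let's say data is a string which has your sentence
data = 'ok it is 1353500'
# Compare whole tokens: substring replacement would also cut letters out of
# other words, and str.replace returns a new string instead of changing data.
data = ' '.join(word for word in data.split() if word.lower() not in stop_words)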
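
A stock stop-word list may not cover conversational fillers such as 'mhm', so here is a minimal sketch of two complementary options, not a definitive implementation. The names `FILLERS`, `strip_fillers`, and `digits_only` are hypothetical, and the filler set is only an illustrative guess. The second function assumes the number arrives as digit characters, as in the question's example.

```python
import re

# Hypothetical filler words; extend this set to match your own transcripts.
FILLERS = {"ok", "mhm", "uh", "um"}

def strip_fillers(text, stop_words=frozenset()):
    """Drop whole tokens that are fillers or stop words (case-insensitive)."""
    blocked = FILLERS | set(stop_words)
    # Tokens are either runs of letters/apostrophes or runs of digits.
    tokens = re.findall(r"[A-Za-z']+|\d+", text)
    return [t for t in tokens if t.lower() not in blocked]

def digits_only(text):
    """Whitelist approach: keep only the digit runs and join them."""
    return "".join(re.findall(r"\d+", text))
```

You could pass `stopwords.words('english')` as `stop_words` to combine the NLTK list with your own fillers. If the extractor only ever needs digits, the whitelist approach tends to be more robust, since it does not depend on listing every word a person might say.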