NLTK: remove invalid words

With the Python NLTK library, you can tokenize a sentence into individual words and punctuation. However, it also keeps tokens that are not valid English or not grammatically correct. How can I remove these tokens so that I am left with only actual, valid English words?

Example:

import nltk

sentence = "The word hello is gramatically correct but henlo is not"
sentence_tokenised = nltk.tokenize.word_tokenize(sentence)
print(sentence_tokenised)

This produces:

['The', 'word', 'hello', 'is', 'gramatically', 'correct', 'but', 'henlo', 'is', 'not']

'henlo' is not an English word. Is there a function that can go through these tokens and remove invalid words like 'henlo'?

Based on the NLTK documentation here:

A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).

So all the tokenizer does is split a string into substrings. If you want to filter out tokens that are not in nltk.corpus.words, you can download the word list once:

import nltk
nltk.download('words')

Then:

import nltk
from nltk.corpus import words

sentence = "The word hello is gramatically correct but henlo is not"
sentence_tokenised = nltk.tokenize.word_tokenize(sentence)

english_words = set(words.words())  # build the lookup set once instead of calling words.words() per token
output = [token for token in sentence_tokenised if token in english_words]

Output:

['The', 'word', 'hello', 'is', 'correct', 'but', 'is', 'not']
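To show the filtering step in isolation, here is a minimal, self-contained sketch of the same set-lookup technique. It uses a tiny hand-made vocabulary as a stand-in (an assumption for illustration; in practice you would build it once with `set(words.words())` after downloading the corpus), and lowercases each token before the lookup so that capitalised tokens such as 'The' are matched against lowercase dictionary entries:

```python
# Stand-in vocabulary; in real code, build this once with set(words.words()).
vocab = {"the", "word", "hello", "is", "correct", "but", "not"}

# Tokens as produced by nltk.tokenize.word_tokenize on the example sentence.
tokens = ['The', 'word', 'hello', 'is', 'gramatically',
          'correct', 'but', 'henlo', 'is', 'not']

# Keep a token only if its lowercase form is in the vocabulary.
output = [t for t in tokens if t.lower() in vocab]
print(output)
# ['The', 'word', 'hello', 'is', 'correct', 'but', 'is', 'not']
```

Note that lowering the case before the lookup is a design choice: the NLTK words corpus stores proper nouns capitalised, so depending on your data you may instead want to check both `t` and `t.lower()` against the set.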