NLTK: remove invalid words

With the Python NLTK library, you can tokenize a sentence into individual words and punctuation. However, it also keeps tokens that are not valid English or not grammatically correct. How can I remove these tokens so that I am left with only actual, valid English words?

Example:

import nltk

sentence = "The word hello is gramatically correct but henlo is not"
sentence_tokenised = nltk.tokenize.word_tokenize(sentence)
print(sentence_tokenised)

This produces:

['The', 'word', 'hello', 'is', 'gramatically', 'correct', 'but', 'henlo', 'is', 'not']

'henlo' is not an English word. Is there a function that can go through these tokens and remove invalid words like 'henlo'?

Based on the NLTK documentation here:

A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).

So all the tokenizer does is split a string into substrings. If you want to filter out tokens that are not in nltk.corpus.words, you can download the word list once:

import nltk
nltk.download('words')

Then:

import nltk
from nltk.corpus import words

sentence = "The word hello is gramatically correct but henlo is not"
sentence_tokenised = nltk.tokenize.word_tokenize(sentence)

english_words = set(words.words())  # build the lookup set once instead of calling words.words() per token
output = [token for token in sentence_tokenised if token in english_words]

Output:

['The', 'word', 'hello', 'is', 'correct', 'but', 'is', 'not']
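To show the filtering step in isolation, here is a minimal, self-contained sketch of the same set-lookup technique. It uses a tiny hand-made vocabulary as a stand-in (an assumption for illustration; in practice you would build it once with `set(words.words())` after downloading the corpus), and lowercases each token before the lookup so that capitalised tokens such as 'The' are matched against lowercase dictionary entries:

```python
# Stand-in vocabulary; in real code, build this once with set(words.words()).
vocab = {"the", "word", "hello", "is", "correct", "but", "not"}

# Tokens as produced by nltk.tokenize.word_tokenize on the example sentence.
tokens = ['The', 'word', 'hello', 'is', 'gramatically',
          'correct', 'but', 'henlo', 'is', 'not']

# Keep a token only if its lowercase form is in the vocabulary.
output = [t for t in tokens if t.lower() in vocab]
print(output)
# ['The', 'word', 'hello', 'is', 'correct', 'but', 'is', 'not']
```

Note that lowering the case before the lookup is a design choice: the NLTK words corpus stores proper nouns capitalised, so depending on your data you may instead want to check both `t` and `t.lower()` against the set.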