NLTK: Remove invalid words
With the Python NLTK library, you can tokenize a sentence into individual words and punctuation.
However, it also tokenizes words that are not valid English. How can I remove those tokens so that I am left only with real, correctly spelled English words?
Example:
import nltk
sentence = "The word hello is gramatically correct but henlo is not"
sentence_tokenised = nltk.tokenize.word_tokenize(sentence)
print(sentence_tokenised)
This produces:
['The', 'word', 'hello', 'is', 'gramatically', 'correct', 'but', 'henlo', 'is', 'not']
'henlo' is not an English word. Is there a function that can parse these tokens and remove invalid words like 'henlo'?
According to the NLTK documentation:
A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).
So all it does is split a string into substrings. If you want to filter out tokens that are not in nltk.corpus.words(), you can download the word list once:
import nltk
nltk.download('words')
Then:
import nltk
from nltk.corpus import words
sentence = "The word hello is gramatically correct but henlo is not"
sentence_tokenised = nltk.tokenize.word_tokenize(sentence)
output = list(filter(lambda x: x in words.words(), sentence_tokenised))
Output:
['The', 'word', 'hello', 'is', 'correct', 'but', 'is', 'not']
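One caveat: the lambda above calls `words.words()` for every token, re-materializing the full corpus list (a couple of hundred thousand entries) on each check. Building a `set` once makes each membership test O(1). A minimal sketch of that idea; note that `str.split` and a tiny hand-made vocabulary stand in for `word_tokenize` and `set(words.words())` here, just so the snippet runs without any NLTK downloads:

```python
# In practice, build the set once from the real corpus:
#   from nltk.corpus import words
#   valid_words = set(words.words())
# Stand-in vocabulary so this sketch runs without downloading the corpus:
valid_words = {"The", "word", "hello", "is", "correct", "but", "not"}

sentence = "The word hello is gramatically correct but henlo is not"
# Stand-in for nltk.tokenize.word_tokenize(sentence):
tokens = sentence.split()

# Set membership is O(1); the word list is no longer rebuilt per token.
output = [t for t in tokens if t in valid_words]
print(output)  # ['The', 'word', 'hello', 'is', 'correct', 'but', 'is', 'not']
```

If you also want capitalized tokens such as sentence-initial words to match their lowercase corpus entries, test `t in valid_words or t.lower() in valid_words` instead.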