How to solve missing words in nltk.corpus.words.words()?

Question

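I am trying to remove non-English words from a text. The problem is that many valid English words are missing from the NLTK words corpus.

My code: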

import pandas as pd
    
lst = ['I have equipped my house with a new [xxx] HP203X climatisation unit']
df = pd.DataFrame(lst, columns=['Sentences'])
    
import nltk 
nltk.download('words')
words = set(nltk.corpus.words.words())
    
df['Sentences'] = df['Sentences'].apply(lambda x: " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in (words)))
df

Input: I have equipped my house with a new [xxx] HP203X climatisation unit
Result: I have my house with a new unit

Expected: I have equipped my house with a new climatisation unit

I don't know how to extend nltk.corpus.words.words() so that words like equipped and climatisation are not removed from the sentence.

Answer 1

You can use

words.update(['climatisation', 'equipped'])

Here, words is a set, so .extend(word_list) does not work: .extend() is a list method, and the set equivalent is .update().
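To see the effect end to end, here is a minimal sketch. So that it runs without downloading the corpus, it makes two assumptions: a tiny hand-written vocabulary stands in for set(nltk.corpus.words.words()), and a regular expression roughly mimics nltk.wordpunct_tokenize. The helper name keep_known_words is hypothetical and not part of NLTK.

```python
import re

# Assumption: a tiny stand-in for set(nltk.corpus.words.words()),
# used only so this sketch runs without the corpus download.
words = {"i", "have", "my", "house", "with", "a", "new", "unit"}

# Words from the question that the filter was dropping.
extra_words = ["climatisation", "equipped"]
words.update(extra_words)  # set.update accepts any iterable of items


def keep_known_words(sentence, vocabulary):
    """Keep only tokens whose lowercase form is in the vocabulary (hypothetical helper)."""
    # Assumption: rough stand-in for nltk.wordpunct_tokenize, splitting into
    # runs of word characters and runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", sentence)
    return " ".join(t for t in tokens if t.lower() in vocabulary)


sentence = "I have equipped my house with a new [xxx] HP203X climatisation unit"
cleaned = keep_known_words(sentence, words)
print(cleaned)
```

With the real corpus, the same words.update(...) call goes right after words = set(nltk.corpus.words.words()) and before the df['Sentences'].apply(...) line.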

nlp

corpus

tokenize

nltk