
Sklearn - How to add custom stopword list from txt file

I have done TF-IDF using Sklearn, but the problem is that I can't use English stop words, because I am working with Malay (a non-English language). I need to import a txt file containing my stop word list.

stopword.txt

saya
cintakan
awak

tfidf.py

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['Saya benci awak',
          'Saya cinta awak',
          'Saya x happy awak',
          'Saya geram awak',
          'Saya taubat awak']
vocabulary = "taubat".split()
vectorizer = TfidfVectorizer(analyzer='word', vocabulary=vocabulary)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names_out(), idf)))

You can load your specific stop word list and pass it as a parameter to TfidfVectorizer. In your example:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['Saya benci awak',
          'Saya cinta awak',
          'Saya x happy awak',
          'Saya geram awak',
          'Saya taubat awak']

# HERE YOU DO YOUR MAGIC: open your file and load the list of STOP WORDS
with open('stopword.txt', encoding='utf-8') as f:
    stop_words = [line.strip() for line in f if line.strip()]

vectorizer = TfidfVectorizer(analyzer='word', stop_words=stop_words)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names_out(), idf)))

Output with stop_words:

{'benci': 2.09861228866811, 'cinta': 2.09861228866811, 'geram': 2.09861228866811, 'happy': 2.09861228866811, 'taubat': 2.09861228866811}

Output without the stop_words parameter:

{'awak': 1.0, 'benci': 2.09861228866811, 'cinta': 2.09861228866811, 'geram': 2.09861228866811, 'happy': 2.09861228866811, 'saya': 1.0, 'taubat': 2.09861228866811}
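
As a side note, these idf values follow scikit-learn's smoothed formula (the default, smooth_idf=True): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A quick check against the five-document corpus above:

import math

n = 5                                   # documents in the corpus
print(math.log((1 + n) / (1 + 1)) + 1)  # df = 1 -> 2.0986... (e.g. 'taubat')
print(math.log((1 + n) / (1 + 5)) + 1)  # df = 5 -> 1.0 (e.g. 'saya', 'awak')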

Warning: I wouldn't use the vocabulary parameter, because it tells the TfidfVectorizer to pay attention only to the words specified in it, and it is usually harder to enumerate every word you want to keep than to list the ones you want to dismiss. So if you remove the vocabulary parameter from your example and add the stop_words parameter with your list, it will work as you expect.
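
A minimal sketch of the difference, reusing two documents from the question's corpus (the get_feature_names_out call assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Saya benci awak', 'Saya taubat awak']

# vocabulary restricts the features to exactly the listed words
restricted = TfidfVectorizer(analyzer='word', vocabulary=['taubat'])
restricted.fit_transform(corpus)
print(restricted.get_feature_names_out())  # ['taubat'] -- every other word is ignored

# stop_words keeps all words except the listed ones
filtered = TfidfVectorizer(analyzer='word', stop_words=['saya', 'awak'])
filtered.fit_transform(corpus)
print(filtered.get_feature_names_out())    # ['benci' 'taubat']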

In Python 3, I recommend the following procedure to get your own stop word list:

  1. Open the relevant file path and read the stop word list stored in the .txt, one word per line (a raw string keeps the backslashes in the Windows path intact):
with open(r'C:\Users\mobarget\Google Drive\ACADEMIA\7_FeministDH for Susan\Stop words Letters_improved.txt', 'r') as file:
    my_stopwords = [word for word in file.read().split('\n') if word]
  2. Refer to your stop words in the vectorizer:
vectorizer = text.CountVectorizer(input='filename', stop_words=my_stopwords, min_df=20)
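
For completeness, here is the same two-step procedure as one self-contained sketch; it assumes the question's stopword.txt (one word per line) sits in the working directory and uses input='content' so it runs on raw strings rather than the files from the answer above:

from sklearn.feature_extraction import text

# Step 1: read the stop word list, one word per line
with open('stopword.txt', encoding='utf-8') as f:
    my_stopwords = [word for word in f.read().split('\n') if word]

# Step 2: hand the list to the vectorizer (input='content' takes raw strings)
vectorizer = text.CountVectorizer(input='content', stop_words=my_stopwords)
X = vectorizer.fit_transform(['Saya benci awak', 'Saya cinta awak'])
print(vectorizer.get_feature_names_out())  # ['benci' 'cinta'] -- stop words removed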