Searching for a word group with TfidfVectorizer

Question

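I am using sklearn to compute TF-IDF for a given list of keywords. It works fine, except that it does not count word groups such as "car manufacturers". How can I fix this? Should I use a different module?

Please find attached the first lines of my code, so you can see which modules I am using. Thanks in advance!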

import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from pathlib import Path


# root dir
root = '/Users/Tom/PycharmProjects/TextMining/'
#
words_to_find = ['vehicle', 'automotive', 'car manufacturers']
# tf_idf file writing
wrote_tf_idf_header = False
tf_idf_file_idx = 0
#
vectorizer_tf_idf = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None, vocabulary=words_to_find)
vectorizer_cnt = CountVectorizer(stop_words=None, vocabulary=words_to_find)

Answer 1

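You need to pass the ngram_range parameter to CountVectorizer to get the result you expect. By default the vectorizer extracts single words only (ngram_range=(1, 1)), so a two-word vocabulary entry such as "car manufacturers" is never matched. The documentation, with examples, is here: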

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
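To make the effect of ngram_range concrete, here is a minimal sketch that compares the tokens produced by the default analyzer with those produced when ngram_range=(1, 2) is set. The sample sentence is made up for illustration and is not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample sentence, invented for this illustration.
text = "car manufacturers build vehicles"

# Default ngram_range=(1, 1): the analyzer emits single words only.
unigram_analyzer = CountVectorizer().build_analyzer()
unigrams = unigram_analyzer(text)

# ngram_range=(1, 2): single words plus pairs of adjacent words.
bigram_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
unigrams_and_bigrams = bigram_analyzer(text)

print(unigrams)
print(unigrams_and_bigrams)
```

Only the second list contains "car manufacturers", which is why the phrase can be matched against the vocabulary once ngram_range includes 2.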

You can solve it like this.

import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from pathlib import Path


# root dir
# root = '/Users/Tom/PycharmProjects/TextMining/'
# a single sample document stands in for the files under root
root = ['car manufacturers vehicle vehicales vehicle automotive car house manufacturers']
#
words_to_find = ['vehicle', 'automotive', 'car manufacturers']
# tf_idf file writing
wrote_tf_idf_header = False
tf_idf_file_idx = 0
# ngram_range=(1, 2) extracts single words and two-word groups
vectorizer_tf_idf = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None, vocabulary=words_to_find, ngram_range=(1, 2))
vectorizer_cnt = CountVectorizer(stop_words=None, vocabulary=words_to_find, ngram_range=(1, 2))
x = vectorizer_cnt.fit_transform(root)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer_cnt.get_feature_names_out().tolist())
print(x.toarray())

Output:

['vehicle', 'automotive', 'car manufacturers']
[[2 1 1]]
