Understanding the top n tf-idf features in TfidfVectorizer
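I am trying to get a better understanding of scikit-learn's TfidfVectorizer. The code below has two documents, doc1 = The car is driven on the road and doc2 = The truck is driven on the highway. Calling fit_transform generates a vectorized matrix of tf-idf weights.
According to the tf-idf value matrix, shouldn't highway, truck, car be the top words rather than highway, truck, driven, given that highway = truck = car = 0.63 and driven = 0.44?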
#testing tfidfvectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(tokenizer= lambda x:x.split(),stop_words = 'english')
response = vectorizer.fit_transform(tn)
feature_array = np.array(vectorizer.get_feature_names()) #list of features
print(feature_array)
print(response.toarray())
sorted_features = np.argsort(response.toarray()).flatten()[:-1] #index of highest valued features
print(sorted_features)
#printing top 3 weighted features
n = 3
top_n = feature_array[sorted_features][:n]
print(top_n)
['car' 'driven' 'highway' 'road' 'truck']
[[0.6316672 0.44943642 0. 0.6316672 0. ]
[0. 0.44943642 0.6316672 0. 0.6316672 ]]
[2 4 1 0 3 0 3 1 2]
['highway' 'truck' 'driven']
As the result shows, the tf-idf matrix does give a higher score to highway, truck, car (and road):
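import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(stop_words='english')
response = vectorizer.fit_transform(tn)
# get_feature_names() was removed in recent scikit-learn releases; older releases may still need it
terms = vectorizer.get_feature_names_out().tolist()
pd.DataFrame(response.toarray(), columns=terms)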
        car    driven   highway      road     truck
0  0.631667  0.449436  0.000000  0.631667  0.000000
1  0.000000  0.449436  0.631667  0.000000  0.631667
What goes wrong is the further step where you flatten the array. To get the top scores across all rows, you could instead do the following:
max_scores = response.toarray().max(0).argsort()
np.array(terms)[max_scores[-4:]]
array(['car', 'highway', 'road', 'truck'], dtype='<U7')
Here the top-scoring entries are the feature names that have a score of 0.63 in the DataFrame.
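To make the point about flattening concrete, here is a minimal sketch that uses only NumPy and the tf-idf values copied from the output printed in the question, so no vectorizer needs to be fitted. It first reproduces what the question's argsort-then-flatten expression selects, then ranks terms separately for each document. The helper name top_n_per_document is hypothetical, and the stable sort is an assumption made only so that ties are broken in column order.

```python
import numpy as np

# Feature names and tf-idf values copied from the output printed in the question.
terms = np.array(['car', 'driven', 'highway', 'road', 'truck'])
tfidf = np.array([
    [0.6316672, 0.44943642, 0.0, 0.6316672, 0.0],
    [0.0, 0.44943642, 0.6316672, 0.0, 0.6316672],
])

# What the question's expression does: argsort sorts each row in ASCENDING
# order, and flatten() concatenates the per-row index lists. The first three
# entries are therefore the lowest-scoring terms of document 0.
flat = np.argsort(tfidf, kind="stable").flatten()[:-1]
picked = terms[flat[:3]]
picked_scores = tfidf[0, flat[:3]]
print(picked, picked_scores)

def top_n_per_document(matrix, names, n):
    """Return the n highest-scoring (term, score) pairs for each row.

    Hypothetical helper for illustration; ties keep column order.
    """
    # Negating the scores turns the ascending sort into a descending one.
    order = np.argsort(-matrix, axis=1, kind="stable")[:, :n]
    return [
        list(zip(names[cols].tolist(), matrix[i, cols].tolist()))
        for i, cols in enumerate(order)
    ]

top_terms = top_n_per_document(tfidf, terms, n=2)
for doc_id, pairs in enumerate(top_terms):
    print(doc_id, pairs)
```

With this matrix the flattened indexing picks highway, truck and driven, two of which have a score of 0 in document 0, while the per-document ranking returns car and road for the first document and highway and truck for the second. For a large vocabulary it may be preferable to work on the sparse matrix directly instead of calling toarray().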