'Pipeline' object has no attribute 'get_feature_names' in scikit-learn
I am basically clustering some of my documents using the MiniBatchKMeans and KMeans algorithms. I am simply following the tutorial on the scikit-learn website, linked below:
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
They use several vectorizing methods, one of which is HashingVectorizer. In the HashingVectorizer example, they build a pipeline with a TfidfTransformer().
# Perform an IDF normalization on the output of HashingVectorizer
hasher = HashingVectorizer(n_features=opts.n_features,
                           stop_words='english', non_negative=True,
                           norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())
Once this is done, the vectorizer I get back no longer has a get_feature_names() method. But since I am using it for clustering, I need get_feature_names() to obtain the "terms":
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
How can I resolve this error?
My full code is shown below:
X_train_vecs, vectorizer = vector_bow.count_tfidf_vectorizer(_contents)
mini_kmeans_batch = MiniBatchKmeansTechnique()
# MiniBatchKmeans without the LSA dimensionality reduction
mini_kmeans_batch.mini_kmeans_technique(number_cluster=8, X_train_vecs=X_train_vecs,
                                        vectorizer=vectorizer, filenames=_filenames,
                                        contents=_contents, is_dimension_reduced=False)
Count vectors transformed with tf-idf:
def count_tfidf_vectorizer(self, contents):
    count_vect = CountVectorizer()
    vectorizer = make_pipeline(count_vect, TfidfTransformer())
    X_train_vecs = vectorizer.fit_transform(contents)
    print("The count of bow : ", X_train_vecs.shape)
    return X_train_vecs, vectorizer
And the mini_batch_kmeans class is as follows:
from time import time
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

class MiniBatchKmeansTechnique():
    def mini_kmeans_technique(self, number_cluster, X_train_vecs, vectorizer,
                              filenames, contents, svd=None, is_dimension_reduced=True):
        km = MiniBatchKMeans(n_clusters=number_cluster, init='k-means++', max_iter=100, n_init=10,
                             init_size=1000, batch_size=1000, verbose=True, random_state=42)
        print("Clustering sparse data with %s" % km)
        t0 = time()
        km.fit(X_train_vecs)
        print("done in %0.3fs" % (time() - t0))
        print()
        cluster_labels = km.labels_.tolist()
        print("List of the cluster names is : ", cluster_labels)
        data = {'filename': filenames, 'contents': contents, 'cluster_label': cluster_labels}
        frame = pd.DataFrame(data=data, index=[cluster_labels],
                             columns=['filename', 'contents', 'cluster_label'])
        print(frame['cluster_label'].value_counts(sort=True, ascending=False))
        print()
        grouped = frame['cluster_label'].groupby(frame['cluster_label'])
        print(grouped.mean())
        print()
        print("Top Terms Per Cluster :")
        if is_dimension_reduced and svd is not None:
            original_space_centroids = svd.inverse_transform(km.cluster_centers_)
            order_centroids = original_space_centroids.argsort()[:, ::-1]
        else:
            order_centroids = km.cluster_centers_.argsort()[:, ::-1]
        terms = vectorizer.get_feature_names()
        for i in range(number_cluster):
            print("Cluster %d:" % i, end=' ')
            for ind in order_centroids[i, :10]:
                print(' %s' % terms[ind], end=',')
            print()
            print("Cluster %d filenames:" % i, end='')
            for file in frame.loc[i]['filename'].values.tolist():
                print(' %s,' % file, end='')
            print()
From the make_pipeline documentation:
This is a shorthand for the Pipeline constructor; it does not require, and
does not permit, naming the estimators. Instead, their names will be set
to the lowercase of their types automatically.
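For example, a minimal sketch of this automatic naming (a throwaway pipeline, purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(), TfidfTransformer())
# each step is keyed by its lowercased class name
print(list(pipe.named_steps.keys()))  # ['countvectorizer', 'tfidftransformer']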
So, in order to access the feature names after fitting to the data, you can:
# Perform an IDF normalization on the output of HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
hasher = HashingVectorizer(n_features=10,
                           stop_words='english', non_negative=True,
                           norm=None, binary=False)
tfidf = TfidfVectorizer()
vectorizer = make_pipeline(hasher, tfidf)
# ...
# fit to the data
# ...
# use the instance's class name, lowercased
terms = vectorizer.named_steps[tfidf.__class__.__name__.lower()].get_feature_names()
# or to be more precise, as used in `_name_estimators`:
# terms = vectorizer.named_steps[type(tfidf).__name__.lower()].get_feature_names()
# btw TfidfTransformer and HashingVectorizer do not have get_feature_names afaik
Hope it helps, good luck!
Edit: After seeing your updated question and the example you are following, @Vivek Kumar is correct - the code terms = vectorizer.get_feature_names() will not run against the pipeline; it only works when:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                             min_df=2, stop_words='english',
                             use_idf=opts.use_idf)
Pipeline does not have a get_feature_names() method, because implementing it for a Pipeline is not straightforward: all of the pipeline steps would have to be taken into account to derive the feature names. See https://github.com/scikit-learn/scikit-learn/issues/6424, https://github.com/scikit-learn/scikit-learn/issues/6425, etc. - there are many related tickets and multiple attempts to fix it.
If your pipeline is simple (a TfidfVectorizer followed by MiniBatchKMeans), then you can get the feature names from the TfidfVectorizer.
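For instance, here is a minimal sketch (the two toy documents are made up) of pulling the feature names out of the CountVectorizer + TfidfTransformer pipeline from your question; the vocabulary lives in the CountVectorizer step, so that is the step to ask:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

docs = ["the cat sat", "the dog barked"]  # toy stand-in for contents
vectorizer = make_pipeline(CountVectorizer(), TfidfTransformer())
X_train_vecs = vectorizer.fit_transform(docs)
# the CountVectorizer step stores the vocabulary, so take the names from it
terms = vectorizer.named_steps['countvectorizer'].get_feature_names()
print(terms)  # ['barked', 'cat', 'dog', 'sat', 'the']

(Note that newer scikit-learn releases renamed get_feature_names() to get_feature_names_out().)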
If you want to use HashingVectorizer, it is more complicated, because HashingVectorizer does not provide feature names by design. HashingVectorizer does not store a vocabulary and uses hashes instead - this means it can be applied in an online setting and that it does not require any RAM - but the tradeoff is that you do not get feature names.
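A quick sketch of what "no vocabulary" means in practice (toy documents, illustration only):

from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat", "the dog barked"]  # toy sample
vec = HashingVectorizer(n_features=16, norm=None)
X = vec.transform(docs)  # no fit needed: the hash function is stateless
print(X.shape)  # (2, 16) - the width is fixed by n_features
print(hasattr(vec, 'vocabulary_'))  # False - no mapping from columns back to words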
Still, it is possible to get feature names from a HashingVectorizer; to do this you need to apply it to a sample of documents, store which hashes correspond to which words, and in this way learn what those hashes mean, i.e. what the feature names are. There may be collisions, so it is impossible to be 100% sure the feature names are correct, but usually this approach works fine. This approach is implemented in the eli5 library; see http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html#debugging-hashingvectorizer for an example. You will have to do something like this, using InvertableHashingVectorizer:
from eli5.sklearn import InvertableHashingVectorizer

ivec = InvertableHashingVectorizer(vec)  # vec is a HashingVectorizer instance
# content_sample is a sample from contents; you can use the
# whole contents array, or just e.g. every 10th element
ivec.fit(content_sample)
hashing_feat_names = ivec.get_feature_names()
You can then use hashing_feat_names as your feature names, because TfidfTransformer does not change the size of the input vectors; it only scales the same features.
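For completeness, a hypothetical sketch of wiring hashing_feat_names into the cluster-printing loop from your question (km, number_cluster and hashing_feat_names are assumed to exist as in the snippets above; exactly how eli5 renders collided hash names may vary):

order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(number_cluster):
    # look up the recovered names of the top-10 centroid dimensions
    top_terms = [str(hashing_feat_names[ind]) for ind in order_centroids[i, :10]]
    print("Cluster %d: %s" % (i, ', '.join(top_terms)))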