Plot a document tfidf 2D graph
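I would like to plot a 2D graph for my list of sentences, with the terms on the x-axis and the TFIDF score (or document id) on the y-axis. I use scikit-learn's fit_transform() to get a scipy matrix, but I don't know how to use that matrix to plot the graph. I am trying to get a plot so I can see how well my sentences are classified using kmeans.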
Here is the output of fit_transform(sentence_list):
(document id, term number) tfidf score
(0, 1023) 0.209291711271
(0, 924) 0.174405532933
(0, 914) 0.174405532933
(0, 821) 0.15579574484
(0, 770) 0.174405532933
(0, 763) 0.159719994016
(0, 689) 0.135518787598
Here is my code:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentence_list = ["Hi how are you", "Good morning" ...]
vectorizer = TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
num_samples, num_features = vectorized.shape
print("num_samples: %d, num_features: %d" % (num_samples, num_features))
num_clusters = 10
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)
print(km.labels_)  # one cluster label per sentence, in the range 0 to num_clusters - 1
Thanks,
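When you use Bag of Words, each of your sentences is represented in a high-dimensional space whose length equals the vocabulary size. If you want to represent this in 2D, you need to reduce the dimension, for example using PCA with two components: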
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
# toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix from todense()
X = pipeline.fit_transform(newsgroups_train.data).toarray()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
# color each document by its true newsgroup
plt.scatter(data2D[:, 0], data2D[:, 1], c=newsgroups_train.target)
plt.show()  # not required if using ipython notebook
For example, you can now compute the cluster centers on this data and plot them:
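from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2).fit(X)
# project the centers with the same PCA that was fitted on the documents
centers2D = pca.transform(kmeans.cluster_centers_)
# plt.hold(True) was removed in matplotlib 3.0; redraw the points so both layers share one figure
plt.scatter(data2D[:, 0], data2D[:, 1], c=newsgroups_train.target)
plt.scatter(centers2D[:, 0], centers2D[:, 1],
            marker='x', s=200, linewidths=3, c='r')
plt.show()  # not required if using ipython notebook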
Just assign the labels to a variable and use it for the color. For example:
km = KMeans().fit(X)
clusters = km.labels_.tolist()
Then pass c=clusters to plt.scatter.
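To make the idea of coloring by cluster label concrete, here is a minimal sketch on a tiny corpus. The sample sentences, the function names cluster_and_project and plot_clusters, and the seed parameter are all hypothetical and only for illustration; they are not from the question. It assumes the corpus is small enough to convert to a dense array for PCA.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_and_project(sentences, n_clusters=2, seed=0):
    """Return (2-D coordinates, k-means labels) for a list of sentences."""
    # TF-IDF matrix: one row per sentence, one column per vocabulary term.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # K-means runs in the full TF-IDF space, not in the 2-D projection.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(tfidf)
    # PCA needs a dense array; acceptable for a small corpus.
    coords = PCA(n_components=2).fit_transform(tfidf.toarray())
    return coords, km.labels_


def plot_clusters(coords, labels):
    """Scatter the 2-D points, one color per cluster label."""
    # Imported here so the numeric part above also runs without a display.
    import matplotlib.pyplot as plt
    plt.scatter(coords[:, 0], coords[:, 1], c=labels)
    plt.show()


# Hypothetical sample sentences, for illustration only.
sentences = [
    "the rocket reached orbit",
    "the satellite left orbit",
    "a rocket carried the satellite",
    "bake the bread in the oven",
    "the oven burned the bread",
    "knead the bread dough",
]
coords, labels = cluster_and_project(sentences)
```

Calling plot_clusters(coords, labels) then draws the figure. Points that share a color were put in the same cluster by k-means, so if colors are mixed throughout the plot, the clusters probably do not separate well in the first two principal components.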
The previous answer had a few problems, so I fixed them and am posting the corrected code here.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
# toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix from todense()
X = pipeline.fit_transform(newsgroups_train.data).toarray()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:, 0], data2D[:, 1], c=newsgroups_train.target)
plt.show()

# K-means clustering: fit in the full TF-IDF space, then project the centers to 2D
kmeans = KMeans(n_clusters=2).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
# plt.hold(True) is not needed: successive scatter calls draw on the same axes
plt.scatter(data2D[:, 0], data2D[:, 1], c=newsgroups_train.target)
plt.scatter(centers2D[:, 0], centers2D[:, 1],
            marker='x', s=200, linewidths=3, c='r')
plt.show()
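One caveat with the code above: converting the TF-IDF matrix to a dense array can use a lot of memory on a larger corpus. As a possible alternative (a swapped-in technique, not what the answers above use), TruncatedSVD accepts the sparse matrix directly. Note that it does not center the data the way PCA does, so the picture will differ somewhat. The docs list below is hypothetical sample data.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents, for illustration only.
docs = [
    "the rocket reached orbit",
    "the satellite left orbit",
    "a rocket carried the satellite",
    "bake the bread in the oven",
    "the oven burned the bread",
    "knead the bread dough",
]

# The TF-IDF matrix stays sparse; no toarray()/todense() call is needed.
X = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD works on sparse input (unlike PCA it does not center the data).
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
data2D = svd.transform(X)

# Cluster in the full TF-IDF space, then project the centers with the same SVD.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers2D = svd.transform(kmeans.cluster_centers_)
```

data2D and centers2D can then be passed to plt.scatter exactly as in the code above, for example with c=kmeans.labels_ for the documents.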