Applying sklearn PCA to the MovieLens dataset

Question

I have the MovieLens dataset and want to apply PCA to it, but sklearn's PCA does not seem to behave correctly.
I have a 718*8913 matrix in which rows represent users and columns represent movies. Here is my Python code:

Loading movie names and movie ratings

import numpy as np
import pandas as pd
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
ratings.drop(['timestamp'], axis=1, inplace=True)
# Replace each movieId with the corresponding movie title
def replace_name(x):
    return movies[movies['movieId']==x].title.values[0]
ratings.movieId = ratings.movieId.map(replace_name)
# Build the user-by-movie rating matrix; movies a user has not rated become 0
M = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
df1 = M.fillna(0)

Standardization

from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(df1)

Applying PCA

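import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA()  # n_components is left unset
result = pca.fit_transform(X_std)
print(result.shape)
# Plot the cumulative explained variance ratio against the number of components
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()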

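I did not set the number of components, so I expected PCA to return a 718*8913 matrix in the new dimensions. Instead, the PCA result has size 718*718, pca.explained_variance_ratio_ has size 718, and its entries sum to 1. How is this possible?
I have 8913 features, yet it returns only 718 components whose explained variance ratios sum to 1. Can anyone explain what is going wrong here?
My plot result: as the plot shows, it contains only 718 components and the cumulative sum reaches 1, but I have 8913 features. Where did they go?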

Testing with a smaller example

I even tried the scikit-learn PCA example that can be found on the PCA documentation page (Here is the Link). I modified the example and increased the number of features:

import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
X = np.array([[-1, -1, 3, 4, -1, -1, 3, 4],
              [-2, -1, 5, -1, -1, 3, 4, 2],
              [-3, -2, 1, -1, -1, 3, 4, 1],
              [1, 1, 4, -1, -1, 3, 4, 2],
              [2, 1, 0, -1, -1, 3, 4, 2],
              [3, 2, 10, -1, -1, 3, 4, 10]])
ipca = PCA(n_components=7)
print(X.shape)
ipca.fit(X)
result = ipca.transform(X)
print(result.shape)

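In this example we have 6 samples and 8 features, and I set n_components to 7, but the result has size 6*6.
I think that when the number of features is greater than the number of samples, the maximum number of components scikit-learn's PCA returns equals the number of samples.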

Answer 1

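Check the PCA documentation. Because you did not pass an n_components argument to PCA(), sklearn uses min(n_samples, n_features) as the value of n_components, which is why the reduced feature set you get has as many columns as you have samples (718).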
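
A minimal sketch of that default, assuming random data with the same shape as the small example in the question (6 samples, 8 features). The data and variable names here are illustrative, not taken from the original post:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 samples, 8 features (more features than samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))

pca = PCA()  # n_components left unset, as in the question
result = pca.fit_transform(X)

n_samples, n_features = X.shape
# n_components_ is the number of components the fitted model actually kept.
print(pca.n_components_ == min(n_samples, n_features))
print(result.shape)           # one column per kept component
print(pca.components_.shape)  # one row per component, one column per feature
```

The transformed data has one column per component, so with fewer samples than features it ends up square, just like the 718*718 result in the question.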

I believe your explained variance ratios sum to 1 because you did not set n_components. From the documentation:

If n_components is not set then all components are stored and the sum of explained variances is equal to 1.0.
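
As for where the remaining features went: PCA centers the data first, and a centered matrix with n_samples rows has rank at most n_samples - 1, so the samples cannot vary in more directions than that no matter how many features there are. A rough sketch of this, again assuming hypothetical random data of shape (6, 8):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 samples, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))

pca = PCA().fit(X)

# With all components kept, the explained variance ratios sum to 1.
total = pca.explained_variance_ratio_.sum()
print(np.isclose(total, 1.0))

# Centering removes one degree of freedom, so the rank is at most n_samples - 1.
X_centered = X - X.mean(axis=0)
rank = np.linalg.matrix_rank(X_centered)
print(rank <= X.shape[0] - 1)

# The last kept component therefore carries essentially no variance.
print(pca.explained_variance_ratio_[-1])
```

By the same argument, 718 components are already enough to capture all of the variance in the 718*8913 matrix, which is why the ratios sum to 1.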
