How to apply PCA on a dataset and print the relevant features

Question

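I have a dataset with 23 rows and 48 columns. I am applying PCA to reduce the number of columns. Using the code sample below, I found that only 23 features are required: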

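# first: fit PCA and plot the cumulative explained variance
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA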
pca = PCA().fit(only_features)
plt.figure(figsize=(15,8))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

# second: project the data onto the principal components
df_pca = pca.fit_transform(X=only_features)
df_pca = pd.DataFrame(df_pca)
print(df_pca.shape)

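However, I would like to know which features are needed. For example: if the original dataset has columns A-Z and is reduced by PCA, I want to know which features were selected.

How can I do that?

Thanks for the help.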

Answer 1

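Sklearn's documentation states that when you do not specify the n_components parameter, the number of components kept is min(n_samples, n_features). So min(23, 48) = 23, which is why you get 23 components. That number is simply the default, not a sign that 23 of the original columns were selected.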
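A quick way to confirm this behaviour is to fit a default PCA on random data with the same shape as in the question. This is only a minimal sketch: the array X is synthetic and stands in for only_features, and the names rng, pca_default and n_kept are made up for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the question's data: 23 samples, 48 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(23, 48))

# No n_components given, so sklearn keeps min(n_samples, n_features) components.
pca_default = PCA().fit(X)
n_kept = pca_default.n_components_

print(n_kept)                         # 23
print(pca_default.components_.shape)  # (23, 48): one row per PC, one column per feature
```

Each row of components_ holds the weights of all 48 original features for one principal component.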

Solution 1: If you use the Sklearn library

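Check the variance explained by each PC with: pca.explained_variance_ratio_
Check how important each original feature is to each PC with: print(abs( pca.components_ ))
Use a custom function to extract more information about the PCs.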
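Note that PCA does not pick a subset of the original columns: every component is a weighted combination of all of them. One common way to relate components back to columns is to look at the largest absolute loading in each row of pca.components_. The following is a rough sketch of such a custom function, assuming the data lives in a pandas DataFrame. The helper name top_features_per_component, the demo frame and its column names a, b, c are all hypothetical and not part of sklearn.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_features_per_component(df, n_components=None):
    """For each PC, report the original column with the largest absolute loading."""
    pca = PCA(n_components=n_components).fit(df.values)
    loadings = np.abs(pca.components_)   # shape: (n_components, n_features)
    best = loadings.argmax(axis=1)       # position of the strongest feature per PC
    return pd.DataFrame({
        "PC": [f"PC{i + 1}" for i in range(len(best))],
        "feature": df.columns[best].to_numpy(),
        "abs_loading": loadings[np.arange(len(best)), best],
        "explained_variance_ratio": pca.explained_variance_ratio_,
    })

# Synthetic demo: three independent columns with very different variances,
# so each PC should line up almost entirely with one column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(scale=10.0, size=50),
    "b": rng.normal(scale=1.0, size=50),
    "c": rng.normal(scale=0.1, size=50),
})
summary = top_features_per_component(demo)
print(summary)
```

On real data the loadings are rarely this clean, and columns measured on different scales are usually standardized first (for example with sklearn's StandardScaler), otherwise the high-variance columns dominate the first components.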

Solution 2: If you use the pca library

# Import and initialize (requires the pca package: pip install pca)
from pca import pca
model = pca()
# Fit transform
out = model.fit_transform(X)

# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])

#     PC      feature
# 0  PC1      f1
# 1  PC2      f2
# 2  PC3      f3
# 3  PC4      f4
# 4  PC5      f5
...

You can also plot the PCs with: model.plot()

pca

python-3.x