PCA features do not match original features

I am trying to use PCA to reduce the dimensionality of my features. I have been able to apply PCA to my training data, but I am struggling to understand why the reduced feature set (X_train_pca) bears no resemblance to the original features (X_train).

print(X_train.shape) # (26215, 727)
pca = PCA(0.5)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
print(X_train_pca.shape) # (26215, 100)

most_important_feature_indices = [np.abs(pca.components_[i]).argmax() for i in range(pca.n_components_)]
most_important_feature_index = most_important_feature_indices[0]

Shouldn't the first feature vector in X_train_pca simply be a subset of the first feature vector in X_train? For example, why doesn't the following evaluate to True?

print(X_train[0][most_important_feature_index] == X_train_pca[0][0]) # False

Furthermore, none of the features in the first feature vector of X_train appear in the first feature vector of X_train_pca:

for i in X_train[0]:
    print(i in X_train_pca[0])
# False
# False
# False
# ...

PCA transforms your high-dimensional feature vectors into low-dimensional feature vectors. It does not simply determine the least important indices in the original space and drop those dimensions.

This is expected behavior, because the PCA algorithm applies a transformation to your data:

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. (https://en.wikipedia.org/wiki/Principal_component_analysis#Dimensionality_reduction)
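You can verify this directly: in scikit-learn, `transform` (without whitening) centers the data with `pca.mean_` and projects it onto the rows of `pca.components_`. The sketch below uses a small random matrix rather than your X_train, but the equivalence holds the same way:

```python
from sklearn.decomposition import PCA
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # stand-in for X_train

pca = PCA(2)
X_new = pca.fit_transform(X)

# Each transformed feature is a linear combination of ALL original
# features (centered, then projected onto the principal axes),
# not a copy of any single original column.
X_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(X_new, X_manual))  # True
```

This is why comparing individual values between X_train and X_train_pca with `==` or `in` always fails: every output value mixes every input column.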

Run the code example below to see the effect of the PCA algorithm on a simple Gaussian dataset.

from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

pca = PCA(2)
X = np.random.multivariate_normal(mean=np.array([0, 0]), cov=np.array([[1, 0.75],[0.75, 1]]), size=(1000,))
X_new = pca.fit_transform(X)

plt.scatter(X[:, 0], X[:, 1], s=5, label='Initial data')
plt.scatter(X_new[:, 0], X_new[:, 1], s=5, label='Transformed data')
plt.legend()
plt.show()
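A quick way to see what the rotation bought you, without plotting, is to look at the sample covariance of the transformed data: the off-diagonal entries are (numerically) zero, because PCA decorrelates the components. A minimal sketch on the same Gaussian dataset:

```python
from sklearn.decomposition import PCA
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.75], [0.75, 1]], size=1000)
X_new = PCA(2).fit_transform(X)

# The transformed components are uncorrelated: the covariance matrix
# is diagonal, with the first component carrying the largest variance.
cov = np.cov(X_new.T)
print(np.round(cov, 2))
```

The first diagonal entry is the variance along the first principal component (the direction of greatest spread in the scatter plot), which is exactly the "greatest variance on the first coordinate" property quoted above.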