Implementation of Principal Component Analysis from Scratch Orients the Data Differently than scikit-learn

Following the guide Implementing PCA in Python, by Sebastian Raschka, I am building the PCA algorithm from scratch for my research purposes. The class is defined as:

import numpy as np

class PCA(object):
    """Dimension Reduction using Principal Component Analysis (PCA)

    It is the process of computing the principal components that explain the
    maximum variation of the dataset using fewer components.

    :type  n_components: int, optional
    :param n_components: Number of components to consider, if not set then
                         `n_components = min(n_samples, n_features)`, where
                         `n_samples` is the number of samples, and
                         `n_features` is the number of features (i.e.,
                         dimension of the dataset).

    Attributes
    ==========
        :type  covariance_: np.ndarray
        :param covariance_: Covariance Matrix

        :type  eig_vals_: np.ndarray
        :param eig_vals_: Calculated Eigenvalues

        :type  eig_vecs_: np.ndarray
        :param eig_vecs_: Calculated Eigenvectors

        :type  explained_variance_: np.ndarray
        :param explained_variance_: Explained Variance of Each Principal Component

        :type  cum_explained_variance_: np.ndarray
        :param cum_explained_variance_: Cumulative Explained Variance
    """

    def __init__(self, n_components : int = None):
        """Default Constructor for Initialization"""

        self.n_components = n_components

    def fit_transform(self, X : np.ndarray):
        """Fit the PCA algorithm into the Dataset"""

        if not self.n_components:
            self.n_components = min(X.shape)

        self.covariance_ = np.cov(X.T)

        # calculate eigenvalues and eigenvectors of the covariance matrix
        self.eig_vals_, self.eig_vecs_ = np.linalg.eig(self.covariance_)

        # explained variance
        _tot_eig_vals = sum(self.eig_vals_)
        self.explained_variance_ = np.array([(i / _tot_eig_vals) * 100 for i in sorted(self.eig_vals_, reverse = True)])
        self.cum_explained_variance_ = np.cumsum(self.explained_variance_)

        # define `W` as `d x k`-dimension
        self.W_ = self.eig_vecs_[:, :self.n_components]

        print(X.shape, self.W_.shape)
        return X.dot(self.W_)

Using the iris-dataset as a test case, I implement PCA and visualize it as follows:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# loading iris data, and normalize
from sklearn.datasets import load_iris
iris = load_iris()

from sklearn.preprocessing import MinMaxScaler
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)

# using the PCA function (defined above)
# to fit_transform the X value
# naming the PCA object as dPCA (d = defined)
dPCA = PCA()
principalComponents = dPCA.fit_transform(X)

# creating a pandas dataframe for the principal components
# and visualize the data using scatter plot
PCAResult = pd.DataFrame(principalComponents, columns = [f"PCA-{i}" for i in range(1, dPCA.n_components + 1)])
PCAResult["target"] = y # possible as original order does not change

sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult, hue = "target", s = 50)
plt.show()

The output is as follows:

Now, I want to verify the output, and for this I used the sklearn library; its output is as follows:

from sklearn.decomposition import PCA # note the same name
sPCA = PCA() # consider all the components

principalComponents_ = sPCA.fit_transform(X)
PCAResult_ = pd.DataFrame(principalComponents_, columns = [f"PCA-{i}" for i in range(1, 5)])
PCAResult_["target"] = y # possible as original order does not change

sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult_, hue = "target", s = 50)
plt.show()

I do not understand why the output is oriented differently and the values differ slightly. I have studied a lot of code [1, 2, 3], all of which have the same issue. My questions:

  1. What is different in sklearn that makes the plot different? I have also tried it with a different dataset - same problem.
  2. Is there a way to fix this?

I am not able to study the sklearn.decomposition.PCA algorithm, as I am not familiar with the OOP concepts of Python.

The output in Sebastian Raschka's blog post also differs slightly. Figure below:

When computing the eigenvectors, you may change their sign and the solution will also be valid.

So any PCA axis can be reversed and the solution remains valid.

However, you may want to impose a positive correlation between a PCA axis and one of the original variables in the dataset, and reverse the axis if needed.
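
As a minimal sketch of one such convention (reusing the W_ attribute of the class in the question; the helper name align_signs is purely illustrative), each eigenvector can be flipped so that its largest-magnitude loading is positive, similar in spirit to what scikit-learn does internally:

import numpy as np

def align_signs(eig_vecs):
    """Flip each eigenvector (column) so its largest-magnitude loading is positive."""
    # row index of the dominant loading in every column
    max_rows = np.argmax(np.abs(eig_vecs), axis=0)
    signs = np.sign(eig_vecs[max_rows, np.arange(eig_vecs.shape[1])])
    return eig_vecs * signs  # broadcasting flips whole columns at once

# hypothetical usage with the class defined in the question:
# principalComponents = X.dot(align_signs(dPCA.W_))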

The difference in values comes from sklearn's PCA being computed with the SVD decomposition. In sklearn there is a function svd_flip used to flip the PCs, which explains why you see this flip.

More details are available on the help page:

It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.

You can read more about the relationship between the SVD and the eigendecomposition here.
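
As a minimal sketch of that relationship (reusing the scaled X from the question): for centered data, the right singular vectors are the eigenvectors of the covariance matrix, and the squared singular values divided by n_samples - 1 are its eigenvalues, both determined only up to sign:

import numpy as np

Xc = X - X.mean(axis=0)                            # centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

eig_vals, eig_vecs = np.linalg.eig(np.cov(Xc.T))
order = np.argsort(eig_vals)[::-1]                 # eig() does not sort the eigenvalues

# eigenvalues of the covariance matrix equal S**2 / (n_samples - 1)
print(np.allclose(eig_vals[order], S ** 2 / (Xc.shape[0] - 1)))

# eigenvectors match the right singular vectors up to a per-component sign
print(np.allclose(np.abs(eig_vecs[:, order]), np.abs(Vt.T)))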

We first run your example dataset:

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA 
from sklearn.datasets import load_iris
from sklearn.utils.extmath import svd_flip
import pandas as pd
import numpy as np
import scipy

iris = load_iris()

X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)

n_components = 4

sPCA = PCA(n_components, svd_solver="full")
sklearnPCs = pd.DataFrame(sPCA.fit_transform(X))

We now perform the SVD on your centered matrix:

U,S,Vt = scipy.linalg.svd(X - X.mean(axis=0))
U = U[:,:n_components]
U, Vt = svd_flip(U, Vt)

svdPCs =  pd.DataFrame(U*S)

The results:

sklearnPCs
            0         1         2         3
0   -0.630703  0.107578 -0.018719 -0.007307
1   -0.622905 -0.104260 -0.049142 -0.032359
2   -0.669520 -0.051417  0.019644 -0.007434
3   -0.654153 -0.102885  0.023219  0.020114
4   -0.648788  0.133488  0.015116  0.011786
..        ...       ...       ...       ...
145  0.551462  0.059841  0.086283 -0.110092
146  0.407146 -0.171821 -0.004102 -0.065241
147  0.447143  0.037560  0.049546 -0.032743
148  0.488208  0.149678  0.239209  0.002864
149  0.312066 -0.031130  0.118672  0.052505


svdPCs 
            0         1         2         3
0   -0.630703  0.107578 -0.018719 -0.007307
1   -0.622905 -0.104260 -0.049142 -0.032359
2   -0.669520 -0.051417  0.019644 -0.007434
3   -0.654153 -0.102885  0.023219  0.020114
4   -0.648788  0.133488  0.015116  0.011786
..        ...       ...       ...       ...
145  0.551462  0.059841  0.086283 -0.110092
146  0.407146 -0.171821 -0.004102 -0.065241
147  0.447143  0.037560  0.049546 -0.032743
148  0.488208  0.149678  0.239209  0.002864
149  0.312066 -0.031130  0.118672  0.052505

You can implement it without the flipping. The values will be the same and your PCA will be valid, as stated in the other answer.
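
For instance, a minimal sketch (reusing X, n_components and sklearnPCs from the snippets above) that skips svd_flip entirely; the scores agree with scikit-learn's up to the sign of each column:

import numpy as np
import pandas as pd
import scipy.linalg

# SVD of the centered matrix, without any sign correction
U_raw, S_raw, Vt_raw = scipy.linalg.svd(X - X.mean(axis=0))
unflippedPCs = pd.DataFrame(U_raw[:, :n_components] * S_raw[:n_components])

# identical to sklearn's scores, except possibly for per-column sign flips
print(np.allclose(np.abs(unflippedPCs), np.abs(sklearnPCs)))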