Implementation of Principal Component Analysis from Scratch Orients the Data Differently than scikit-learn

Following the guide Implementing PCA in Python, by Sebastian Raschka, I am building the PCA algorithm from scratch for my research purposes. The class is defined as:

import numpy as np

class PCA(object):
    """Dimension Reduction using Principal Component Analysis (PCA)

    It is the process of computing the principal components that explain the
    maximum variation of the dataset using fewer components.

    :type  n_components: int, optional
    :param n_components: Number of components to consider, if not set then
                         `n_components = min(n_samples, n_features)`, where
                         `n_samples` is the number of samples, and
                         `n_features` is the number of features (i.e.,
                         dimension of the dataset).

    Attributes
    ==========
        :type  covariance_: np.ndarray
        :param covariance_: Covariance Matrix

        :type  eig_vals_: np.ndarray
        :param eig_vals_: Calculated Eigenvalues

        :type  eig_vecs_: np.ndarray
        :param eig_vecs_: Calculated Eigenvectors

        :type  explained_variance_: np.ndarray
        :param explained_variance_: Explained Variance of Each Principal Component

        :type  cum_explained_variance_: np.ndarray
        :param cum_explained_variance_: Cumulative Explained Variance
    """

    def __init__(self, n_components : int = None):
        """Default Constructor for Initialization"""

        self.n_components = n_components

    def fit_transform(self, X : np.ndarray):
        """Fit the PCA algorithm into the Dataset"""

        if not self.n_components:
            self.n_components = min(X.shape)

        self.covariance_ = np.cov(X.T)

        # calculate eigenvalues and eigenvectors of the covariance matrix
        self.eig_vals_, self.eig_vecs_ = np.linalg.eig(self.covariance_)

        # explained variance
        _tot_eig_vals = sum(self.eig_vals_)
        self.explained_variance_ = np.array([(i / _tot_eig_vals) * 100 for i in sorted(self.eig_vals_, reverse = True)])
        self.cum_explained_variance_ = np.cumsum(self.explained_variance_)

        # define `W` as `d x k`-dimension
        self.W_ = self.eig_vecs_[:, :self.n_components]

        print(X.shape, self.W_.shape)
        return X.dot(self.W_)

Using the iris-dataset as a test case, I implement PCA and visualize it as follows:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# loading iris data, and normalize
from sklearn.datasets import load_iris
iris = load_iris()

from sklearn.preprocessing import MinMaxScaler
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)

# using the PCA function (defined above)
# to fit_transform the X value
# naming the PCA object as dPCA (d = defined)
dPCA = PCA()
principalComponents = dPCA.fit_transform(X)

# creating a pandas dataframe for the principal components
# and visualize the data using scatter plot
PCAResult = pd.DataFrame(principalComponents, columns = [f"PCA-{i}" for i in range(1, dPCA.n_components + 1)])
PCAResult["target"] = y # possible as original order does not change

sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult, hue = "target", s = 50)
plt.show()

The output is as follows:

Now, I want to verify the output, and for this I used the sklearn library; its output is as follows:

from sklearn.decomposition import PCA # note the same name
sPCA = PCA() # consider all the components

principalComponents_ = sPCA.fit_transform(X)
PCAResult_ = pd.DataFrame(principalComponents_, columns = [f"PCA-{i}" for i in range(1, 5)])
PCAResult_["target"] = y # possible as original order does not change

sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult_, hue = "target", s = 50)
plt.show()

I do not understand why the output is oriented differently and the values differ slightly. I have studied a lot of code [1, 2, 3], all of which have the same issue. My questions:

  1. What is different in sklearn that makes the plot different? I have also tried it with a different dataset - same problem.
  2. Is there a way to fix this?

I am not able to study the sklearn.decomposition.PCA algorithm, as I am not familiar with the OOP concepts of Python.

The output in Sebastian Raschka's blog post also differs slightly. Figure below:

When computing the eigenvectors, you may change their sign and the solution will also be valid.

So any PCA axis can be reversed and the solution remains valid.

However, you may want to impose a positive correlation between a PCA axis and one of the original variables in the dataset, and reverse the axis if needed.
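
As a minimal sketch of one such convention (reusing the W_ attribute of the class in the question; the helper name align_signs is purely illustrative), each eigenvector can be flipped so that its largest-magnitude loading is positive, similar in spirit to what scikit-learn does internally:

import numpy as np

def align_signs(eig_vecs):
    """Flip each eigenvector (column) so its largest-magnitude loading is positive."""
    # row index of the dominant loading in every column
    max_rows = np.argmax(np.abs(eig_vecs), axis=0)
    signs = np.sign(eig_vecs[max_rows, np.arange(eig_vecs.shape[1])])
    return eig_vecs * signs  # broadcasting flips whole columns at once

# hypothetical usage with the class defined in the question:
# principalComponents = X.dot(align_signs(dPCA.W_))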

The difference in values comes from sklearn's PCA being computed with the SVD decomposition. In sklearn there is a function svd_flip used to flip the PCs, which explains why you see this flip.

More details are available on the help page:

It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.

You can read more about the relationship between the SVD and the eigendecomposition here.
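
As a minimal sketch of that relationship (reusing the scaled X from the question): for centered data, the right singular vectors are the eigenvectors of the covariance matrix, and the squared singular values divided by n_samples - 1 are its eigenvalues, both determined only up to sign:

import numpy as np

Xc = X - X.mean(axis=0)                            # centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

eig_vals, eig_vecs = np.linalg.eig(np.cov(Xc.T))
order = np.argsort(eig_vals)[::-1]                 # eig() does not sort the eigenvalues

# eigenvalues of the covariance matrix equal S**2 / (n_samples - 1)
print(np.allclose(eig_vals[order], S ** 2 / (Xc.shape[0] - 1)))

# eigenvectors match the right singular vectors up to a per-component sign
print(np.allclose(np.abs(eig_vecs[:, order]), np.abs(Vt.T)))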

We first run your example dataset:

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA 
from sklearn.datasets import load_iris
from sklearn.utils.extmath import svd_flip
import pandas as pd
import numpy as np
import scipy

iris = load_iris()

X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)

n_components = 4

sPCA = PCA(n_components, svd_solver="full")
sklearnPCs = pd.DataFrame(sPCA.fit_transform(X))

We now perform the SVD on your centered matrix:

U,S,Vt = scipy.linalg.svd(X - X.mean(axis=0))
U = U[:,:n_components]
U, Vt = svd_flip(U, Vt)

svdPCs =  pd.DataFrame(U*S)

The results:

sklearnPCs
            0         1         2         3
0   -0.630703  0.107578 -0.018719 -0.007307
1   -0.622905 -0.104260 -0.049142 -0.032359
2   -0.669520 -0.051417  0.019644 -0.007434
3   -0.654153 -0.102885  0.023219  0.020114
4   -0.648788  0.133488  0.015116  0.011786
..        ...       ...       ...       ...
145  0.551462  0.059841  0.086283 -0.110092
146  0.407146 -0.171821 -0.004102 -0.065241
147  0.447143  0.037560  0.049546 -0.032743
148  0.488208  0.149678  0.239209  0.002864
149  0.312066 -0.031130  0.118672  0.052505


svdPCs 
            0         1         2         3
0   -0.630703  0.107578 -0.018719 -0.007307
1   -0.622905 -0.104260 -0.049142 -0.032359
2   -0.669520 -0.051417  0.019644 -0.007434
3   -0.654153 -0.102885  0.023219  0.020114
4   -0.648788  0.133488  0.015116  0.011786
..        ...       ...       ...       ...
145  0.551462  0.059841  0.086283 -0.110092
146  0.407146 -0.171821 -0.004102 -0.065241
147  0.447143  0.037560  0.049546 -0.032743
148  0.488208  0.149678  0.239209  0.002864
149  0.312066 -0.031130  0.118672  0.052505

You can implement it without the flipping. The values will be the same and your PCA will be valid, as stated in the other answer.
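
For instance, a minimal sketch (reusing X, n_components and sklearnPCs from the snippets above) that skips svd_flip entirely; the scores agree with scikit-learn's up to the sign of each column:

import numpy as np
import pandas as pd
import scipy.linalg

# SVD of the centered matrix, without any sign correction
U_raw, S_raw, Vt_raw = scipy.linalg.svd(X - X.mean(axis=0))
unflippedPCs = pd.DataFrame(U_raw[:, :n_components] * S_raw[:n_components])

# identical to sklearn's scores, except possibly for per-column sign flips
print(np.allclose(np.abs(unflippedPCs), np.abs(sklearnPCs)))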