(Incremental)PCA's Eigenvectors are not transposed but should be?

Question

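When we posted a homework assignment about PCA, we told the course participants to pick any way of calculating the eigenvectors they found. They found multiple ways: eig, eigh (our favorite was svd). In a later task we told them to use the PCA from scikit-learn, and to their surprise the results differed far more than we expected.

I toyed around a bit, and we posted an explanation to the participants saying that either solution was correct and that the differences were probably just numerical instabilities in the algorithms. However, I recently picked that file up again during a discussion with a co-worker, and we quickly figured out that there is an interesting subtle change that makes all the results almost equal: transposing the eigenvectors obtained from the SVD (and thus from the PCAs).

A bit of code to illustrate this: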

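import numpy as np

def pca_eig(data):
    """Uses numpy.linalg.eig to calculate the PCA."""
    # X^T X is proportional to the covariance matrix when data is centered
    gram = data.T @ data
    # the eigenvectors are the columns of vec
    val, vec = np.linalg.eig(gram)
    return val, vec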

versus

def pca_svd(data):
    """Uses numpy.linalg.svd to calculate the PCA."""
    u, s, v = np.linalg.svd(data)
    return s ** 2, v

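do not yield the same result. Changing the return of pca_svd to s ** 2, v.T, however, works! This makes perfect sense following the definition on Wikipedia: the SVD of X follows X = UΣW^T, where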

the right singular vectors W of X are equivalent to the eigenvectors of X^TX

So, in order to get the eigenvectors, we need to transpose the output v of np.linalg.svd(...).
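The relationship above can be checked numerically. The following is a minimal sketch, not the code from the gist: it assumes centered random data, uses eigh instead of eig because X^T X is symmetric, and introduces a hypothetical helper align_signs to remove the sign ambiguity of eigenvectors before comparing.

```python
import numpy as np

def pca_eigh(data):
    """Eigen-decomposition of data.T @ data; eigenvectors are the COLUMNS of vec."""
    val, vec = np.linalg.eigh(data.T @ data)
    order = np.argsort(val)[::-1]  # eigh returns ascending eigenvalues; sort descending
    return val[order], vec[:, order]

def pca_svd(data):
    """SVD of data; np.linalg.svd returns W^T, so the vectors are its ROWS."""
    _, s, vt = np.linalg.svd(data, full_matrices=False)
    return s ** 2, vt.T  # transpose so the eigenvectors are columns, as with eigh

def align_signs(vec):
    """Hypothetical helper: flip each column so its largest-magnitude entry is positive."""
    idx = np.argmax(np.abs(vec), axis=0)
    signs = np.sign(vec[idx, np.arange(vec.shape[1])])
    return vec * signs

# Assumed test data: 50 samples, 4 features, centered per feature
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))
data -= data.mean(axis=0)

val_e, vec_e = pca_eigh(data)
val_s, vec_s = pca_svd(data)

values_match = np.allclose(val_e, val_s)
vectors_match = np.allclose(align_signs(vec_e), align_signs(vec_s))
```

With the transpose in pca_svd both checks should hold; returning vt untransposed would compare rows against columns, which is the mismatch described above.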

Unless there is something else going on? Anyway, PCA and IncrementalPCA both show wrong results (or is eig wrong? I mean, transposing that yields the same equality), and looking at the code for PCA reveals that they are doing it the way I did it initially:

U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)

components_ = V

I created a little gist demonstrating the differences (nbviewer), the first with PCA and IncPCA (and also without transposing the SVD), the second with transposed eigenvectors:

Comparison without transposition of SVD/PCAs (normalized data)

Comparison with transposition of SVD/PCAs (normalized data)

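One can clearly see that in the upper image the results are not really great, while the lower image only differs in some signs, so the results are mirrored here and there.

Is this really wrong and a bug in scikit-learn? More likely I am using the math wrong, but what is right? Can you please help me?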

Answer 1

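If you look at the documentation, it is pretty clear from the shape that the eigenvectors are in the rows, not the columns: components_ has shape (n_components, n_features). The point of the sklearn PCA is that you can use the transform method to do the correct transformation.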
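As a rough illustration of this point, the sketch below fits PCA on assumed random data, inspects the shape of components_, and reproduces transform by hand. The data, the choice of n_components=2, and svd_solver="full" are assumptions made for the example, not part of the question.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed example data: 50 samples, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

pca = PCA(n_components=2, svd_solver="full").fit(X)

# One principal axis per ROW: (n_components, n_features)
shape = pca.components_.shape

# transform() centers the data and projects it onto the rows of components_
manual = (X - pca.mean_) @ pca.components_.T
same_projection = np.allclose(manual, pca.transform(X))

# The rows of components_ agree with the rows of W^T from a plain SVD, up to sign
_, _, vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
same_rows = np.allclose(np.abs(vt[:2]), np.abs(pca.components_))
```

Because the axes are stored as rows, comparing components_ with the column-oriented output of eig requires a transpose, while calling transform avoids having to think about the orientation at all.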
