PCA with Python: Eigenvectors are not orthogonal
I am working on Principal Component Analysis (PCA) in Python. To better understand it, I want to implement it myself. To do so, I generate random data from a given covariance matrix, so that I can experiment with different values and see how the principal components behave. The script therefore exists purely to understand and illustrate PCA.
My understanding is that the principal components (the eigenvectors of the covariance matrix) are always mutually orthogonal. The following figure from Wikipedia says the same:
Description of the image from Wikipedia (Source):
PCA of the multivariate Gaussian distribution centered at (1, 3) with a standard deviation of 3 in roughly the (0.878, 0.478) direction and of 1 in the orthogonal direction. The vectors shown are unit eigenvectors of the (symmetric, positive-semidefinite) covariance matrix scaled by the square root of the corresponding eigenvalue. Just as in the one-dimensional case, the square root is taken because the standard deviation is more readily visualized than the variance.
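For context, this orthogonality follows directly from the covariance matrix being symmetric; a short derivation (standard linear algebra, with $C$ a symmetric matrix and $(\lambda_1, v_1)$, $(\lambda_2, v_2)$ eigenpairs with $\lambda_1 \neq \lambda_2$):

$$\lambda_1 v_1^\top v_2 = (C v_1)^\top v_2 = v_1^\top C^\top v_2 = v_1^\top C v_2 = \lambda_2 v_1^\top v_2 \;\Rightarrow\; (\lambda_1 - \lambda_2)\, v_1^\top v_2 = 0 \;\Rightarrow\; v_1^\top v_2 = 0.$$

(If eigenvalues repeat, an orthogonal basis of the eigenspace can still be chosen.)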
So I expect that in my case, too, the eigenvectors derived from the random data are orthogonal to each other when I plot them. But that is not the case: their directions always differ by roughly 60 degrees instead of the 90 degrees I expect. The same happens when I use PCA from the sklearn library. See the figure below, where green are the eigenvectors yielded by sklearn's PCA and red are the eigenvectors from my own code.
My Python script:
from matplotlib import pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
def normalize(data: np.ndarray, mean=None, std=None):
    """
    Normalize a data array with respect to its statistical moments. If mean and/or std are not passed, they are
    calculated beforehand.
    :param data: Data to be normalized
    :param mean: A mean value (optional)
    :param std: A standard deviation (optional)
    :return: normalized data, mean value(s), standard deviation(s)
    """
    if mean is None:
        mean = data.mean(axis=0).reshape(1, -1)
    if std is None:
        std = data.std(axis=0).reshape(1, -1)
    res = (data - mean) / std  # parentheses matter: subtract the mean first, then scale
    return res, mean, std
# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    datacount = 1000
    # create data based on the given covariance matrix
    cov = np.array([[1, 0.85], [0.85, 1]])
    rand1 = np.random.multivariate_normal([1, 3], cov, datacount)
    # normalize, then calculate the covariance matrix and its eigenvectors and eigenvalues
    rand1, mean, std = normalize(rand1)
    cov = np.cov(rand1.T)
    eig_values, eig_vectors = np.linalg.eig(cov)
    # sort eig_values by importance and use this to also sort the vectors
    idx = np.argsort(eig_values, axis=0)[::-1]
    sorted_eig_vectors = eig_vectors[:, idx]
    # plot data
    plt.figure()
    plt.scatter(rand1[:, 0], rand1[:, 1])
    # set both axis limits to the maximum/minimum of the axis scales
    ax = plt.gca()
    xlimits = np.array(ax.get_xlim())
    ylimits = np.array(ax.get_ylim())
    axmax = np.max([np.max(xlimits), np.max(ylimits)])
    axmin = np.min([np.min(xlimits), np.min(ylimits)])
    ax.set_xlim([axmin, axmax])
    ax.set_ylim([axmin, axmax])
    # use PCA from sklearn for comparison
    pca = PCA(n_components=2)
    pca = pca.fit(rand1)
    # Plot the eigenvectors
    # Beware! Eigenvectors are oriented in rows in sklearn PCA and column-oriented in np.linalg.eig()!
    for i in range(2):
        # sklearn: row i of components_ is the i-th component, giving (dx, dy)
        plt.arrow(0, 0, pca.components_[i, 0], pca.components_[i, 1], color="g",
                  head_width=0.05, head_length=0.1)
    for i in range(2):
        # np.linalg.eig: column i of eig_vectors is the i-th eigenvector
        plt.arrow(0, 0, eig_vectors[0, i], eig_vectors[1, i], color="r",
                  head_width=0.05, head_length=0.1)
    # plt.annotate(text='', xy=(1, 1), xytext=(0, 0), arrowprops=dict(arrowstyle='<->'))
    plt.grid()
    plt.figure()
    # Transform data to the new subspace
    eig_scores = np.dot(rand1, sorted_eig_vectors[:, :2]).T
    # plot the principal component scores in the subspace
    plt.scatter(eig_scores[0], eig_scores[1])
    # set both axis limits to the maximum/minimum of the axis scales
    ax = plt.gca()
    xlimits = np.array(ax.get_xlim())
    ylimits = np.array(ax.get_ylim())
    axmax = np.max([np.max(xlimits), np.max(ylimits)])
    axmin = np.min([np.min(xlimits), np.min(ylimits)])
    ax.set_xlim([axmin, axmax])
    ax.set_ylim([axmin, axmax])
    plt.grid()
    plt.show()
    # Are the eigenvectors orthogonal?
    print(np.dot(eig_vectors[:, 0], eig_vectors[:, 1]) == 0)  # yields True
    print(np.dot(pca.components_[0, :], pca.components_[1, :]) == 0)  # yields True
Strangely, the last two lines, where I check whether the eigenvectors of both approaches are orthogonal, always yield True, indicating that the vectors actually are orthogonal.
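As a side note, comparing a floating-point dot product to 0 with == only happens to work here; a tolerance-based check is more robust. A minimal sketch using plain numpy and the same covariance matrix as above:

import numpy as np

cov = np.array([[1.0, 0.85], [0.85, 1.0]])
eig_values, eig_vectors = np.linalg.eig(cov)
# the eigenvectors are orthogonal only up to floating-point error,
# so compare against a tolerance instead of exact zero
print(np.isclose(eig_vectors[:, 0] @ eig_vectors[:, 1], 0.0))  # True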
The transformation of the data into the new subspace also works fine. The result looks like this:
What am I missing? Is my expectation wrong, or is there an error in my Python script?
You have checked that they are orthogonal, and they are; it is only in the plot that they appear not to be.
Are the vectors plotted correctly? They are:
array([[ 0.707934 , -0.70627859],
[ 0.70627859, 0.707934 ]])
Judging from the figure, it looks like they are.
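You can also confirm this numerically; a quick sketch using the vectors above:

import numpy as np

v = np.array([[0.707934, -0.70627859],
              [0.70627859, 0.707934]])
# angle between the two column vectors, in degrees
angle = np.degrees(np.arccos(np.clip(v[:, 0] @ v[:, 1], -1.0, 1.0)))
print(angle)  # ~90.0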
The problem is that you are trying to measure an angle on a display where the two axes have different scales.
Just add plt.axis('equal').
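For instance, a minimal sketch of the fix applied to the arrow plot (reusing the vectors above; the rest of the figure setup is as in the question):

from matplotlib import pyplot as plt
import numpy as np

v = np.array([[0.707934, -0.70627859],
              [0.70627859, 0.707934]])
plt.figure()
for i in range(2):
    # draw each eigenvector as an arrow from the origin
    plt.arrow(0, 0, v[0, i], v[1, i], head_width=0.05, head_length=0.1)
plt.axis('equal')  # equal scaling on both axes, so a 90° angle is rendered as 90°
plt.grid()
plt.show()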