Understanding PCA in machine learning
I am using a subset of the iris dataset to better understand PCA.
This is my code:
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
from sklearn import decomposition
dataset = load_iris()
X = dataset.data[:20,]
pca = decomposition.PCA(n_components=4)
pca.fit(X)
X = pca.transform(X)
print(X)
print()
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.noise_variance_)
print()
print(pca.components_)
print()
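# note: X now holds the PCA scores, so each refit below runs on
# already-transformed (decorrelated) data rather than on the original
# features -- this is why the later components_ matrices in the output
# come out as (signed) identity matrices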
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)
print(X)
print()
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.noise_variance_)
print()
print(pca.components_)
print()
pca = decomposition.PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)
print(X)
print()
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.noise_variance_)
print()
print(pca.components_)
print()
pca = decomposition.PCA(n_components=1)
pca.fit(X)
X = pca.transform(X)
print(X)
print()
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.noise_variance_)
print()
print(pca.components_)
print()
The first 20 samples of the dataset (features F1-F4 plus the class label):
| F1 | F2 | F3 | F4 | Label |
|5.1 |3.5 |1.4 |0.2 | 0 |
|4.9 |3.0 |1.4 |0.2 | 0 |
|4.7 |3.2 |1.3 |0.2 | 0 |
|4.6 |3.1 |1.5 |0.2 | 0 |
|5.0 |3.6 |1.4 |0.2 | 0 |
|5.4 |3.9 |1.7 |0.4 | 0 |
|4.6 |3.4 |1.4 |0.3 | 0 |
|5.0 |3.4 |1.5 |0.2 | 0 |
|4.4 |2.9 |1.4 |0.2 | 0 |
|4.9 |3.1 |1.5 |0.1 | 0 |
|5.4 |3.7 |1.5 |0.2 | 0 |
|4.8 |3.4 |1.6 |0.2 | 0 |
|4.8 |3.0 |1.4 |0.1 | 0 |
|4.3 |3.0 |1.1 |0.1 | 0 |
|5.8 |4.0 |1.2 |0.2 | 0 |
|5.7 |4.4 |1.5 |0.4 | 0 |
|5.4 |3.9 |1.3 |0.4 | 0 |
|5.1 |3.5 |1.4 |0.3 | 0 |
|5.7 |3.8 |1.7 |0.3 | 0 |
|5.1 |3.8 |1.5 |0.3 | 0 |
Output:
[[ -5.35882132e-02 2.13091549e-02 5.63776995e-02 2.38909674e-02]
[ 4.31102885e-01 2.27802156e-01 7.74776903e-02 -8.56077547e-02]
[ 4.46437821e-01 -6.48981661e-02 7.80252213e-02 -2.16463511e-02]
[ 5.70213598e-01 1.37832371e-02 -1.17201913e-01 -2.27730577e-03]
[ -4.99837824e-02 -1.06433448e-01 1.11801355e-02 6.42148516e-02]
[ -5.88493547e-01 1.19234918e-02 -2.42112963e-01 -4.46036896e-02]
[ 3.62588639e-01 -2.42562846e-01 -9.89230051e-02 -3.13366123e-02]
[ 7.83136388e-02 6.27754417e-02 -4.79067754e-02 2.65736478e-02]
[ 8.58395527e-01 -1.49295381e-02 -5.29428852e-02 -4.69710396e-02]
[ 3.65880852e-01 2.20160693e-01 -4.51271386e-03 5.21066893e-02]
[ -4.13586321e-01 1.11767646e-01 2.13883619e-02 5.54246013e-02]
[ 2.13819922e-01 -2.35008745e-02 -1.97388814e-01 6.95802124e-02]
[ 5.14034854e-01 1.87196747e-01 7.30881295e-02 2.14166399e-02]
[ 8.97493973e-01 -2.33177183e-01 1.99567657e-01 3.71580447e-02]
[ -8.81108056e-01 4.91145021e-02 3.63511477e-01 3.42164603e-02]
[ -1.12874867e+00 -2.07254026e-01 -5.20579454e-02 1.83622028e-02]
[ -5.55989247e-01 -1.36936973e-01 1.21657674e-01 -1.11349149e-01]
[ -6.47040031e-02 1.68848098e-04 3.14975704e-02 -6.99733273e-02]
[ -7.24614545e-01 2.84297834e-01 -1.13495890e-01 -1.73834789e-02]
[ -2.77465322e-01 -1.60606696e-01 -1.07228711e-01 2.82043907e-02]]
[ 0.87954353 0.06300167 0.05039505 0.00705974]
[ 0.31612993 0.02264438 0.01811324 0.00253745]
0.0
[[-0.71816179 -0.68211748 -0.08126075 -0.1111579 ]
[ 0.61745716 -0.65996887 0.37215116 -0.21140307]
[ 0.2926969 -0.15927874 -0.90942659 -0.24880129]
[-0.131601 0.27163784 0.16686365 -0.93864295]]
[[ -5.35882132e-02 2.13091549e-02 -5.63776995e-02]
[ 4.31102885e-01 2.27802156e-01 -7.74776903e-02]
[ 4.46437821e-01 -6.48981661e-02 -7.80252213e-02]
[ 5.70213598e-01 1.37832371e-02 1.17201913e-01]
[ -4.99837824e-02 -1.06433448e-01 -1.11801355e-02]
[ -5.88493547e-01 1.19234918e-02 2.42112963e-01]
[ 3.62588639e-01 -2.42562846e-01 9.89230051e-02]
[ 7.83136388e-02 6.27754417e-02 4.79067754e-02]
[ 8.58395527e-01 -1.49295381e-02 5.29428852e-02]
[ 3.65880852e-01 2.20160693e-01 4.51271386e-03]
[ -4.13586321e-01 1.11767646e-01 -2.13883619e-02]
[ 2.13819922e-01 -2.35008745e-02 1.97388814e-01]
[ 5.14034854e-01 1.87196747e-01 -7.30881295e-02]
[ 8.97493973e-01 -2.33177183e-01 -1.99567657e-01]
[ -8.81108056e-01 4.91145021e-02 -3.63511477e-01]
[ -1.12874867e+00 -2.07254026e-01 5.20579454e-02]
[ -5.55989247e-01 -1.36936973e-01 -1.21657674e-01]
[ -6.47040031e-02 1.68848098e-04 -3.14975704e-02]
[ -7.24614545e-01 2.84297834e-01 1.13495890e-01]
[ -2.77465322e-01 -1.60606696e-01 1.07228711e-01]]
[ 0.87954353 0.06300167 0.05039505]
[ 0.31612993 0.02264438 0.01811324]
0.00253744874373
[[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ -0.00000000e+00 1.00000000e+00 -3.33066907e-15 0.00000000e+00]
[ 0.00000000e+00 -3.10862447e-15 -1.00000000e+00 -3.60822483e-16]]
[[ -5.35882132e-02 2.13091549e-02]
[ 4.31102885e-01 2.27802156e-01]
[ 4.46437821e-01 -6.48981661e-02]
[ 5.70213598e-01 1.37832371e-02]
[ -4.99837824e-02 -1.06433448e-01]
[ -5.88493547e-01 1.19234918e-02]
[ 3.62588639e-01 -2.42562846e-01]
[ 7.83136388e-02 6.27754417e-02]
[ 8.58395527e-01 -1.49295381e-02]
[ 3.65880852e-01 2.20160693e-01]
[ -4.13586321e-01 1.11767646e-01]
[ 2.13819922e-01 -2.35008745e-02]
[ 5.14034854e-01 1.87196747e-01]
[ 8.97493973e-01 -2.33177183e-01]
[ -8.81108056e-01 4.91145021e-02]
[ -1.12874867e+00 -2.07254026e-01]
[ -5.55989247e-01 -1.36936973e-01]
[ -6.47040031e-02 1.68848098e-04]
[ -7.24614545e-01 2.84297834e-01]
[ -2.77465322e-01 -1.60606696e-01]]
[ 0.88579703 0.06344961]
[ 0.31612993 0.02264438]
0.0181132415475
[[ 1.00000000e+00 0.00000000e+00 0.00000000e+00]
[ -0.00000000e+00 1.00000000e+00 -5.55111512e-16]]
[[-0.05358821]
[ 0.43110288]
[ 0.44643782]
[ 0.5702136 ]
[-0.04998378]
[-0.58849355]
[ 0.36258864]
[ 0.07831364]
[ 0.85839553]
[ 0.36588085]
[-0.41358632]
[ 0.21381992]
[ 0.51403485]
[ 0.89749397]
[-0.88110806]
[-1.12874867]
[-0.55598925]
[-0.064704 ]
[-0.72461455]
[-0.27746532]]
[ 0.93315793]
[ 0.31612993]
0.0226443764968
[[ 1. 0.]]
In my dataset, F1 has the highest variance. How is this visible in the output of the PCA?
What exactly does "explained variance" mean here? Does it mean how much the original features influenced the variance of the newly computed values?
Why is the noise variance 0 for the first example with 4 components?
What exactly are the components_? Are they the n-dimensional eigenvectors?
F1 has the highest variance. How is this visible in the output of the PCA?
PCA is a feature-transformation technique that rotates the original data dimensions into a new, orthogonal feature space. In the new feature space, the principal components (the orthonormal eigenvectors of the covariance matrix of the mean-centered data) form the dimensions of the space. These components are linear combinations of the original feature dimensions. Consider the code below; since components_ has shape (n_components, n_features), each row holds the coefficients of one component. The first principal component PC1, which captures the largest variance in the data, can therefore be written as the linear combination PC1 = -0.718162*F1 - 0.682117*F2 - 0.081261*F3 - 0.111158*F4. This is also where F1's large variance is visible: F1 has the largest coefficient (in absolute value) in PC1, the direction of maximal variance.
import pandas as pd
pd.DataFrame(pca.components_, index=['PC1', 'PC2', 'PC3', 'PC4'], columns=['F1', 'F2', 'F3', 'F4'])
#            F1        F2        F3        F4
# PC1 -0.718162 -0.682117 -0.081261 -0.111158
# PC2  0.617457 -0.659969  0.372151 -0.211403
# PC3  0.292697 -0.159279 -0.909427 -0.248801
# PC4 -0.131601  0.271638  0.166864 -0.938643
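As a quick check (a self-contained sketch that refits on the same 20-sample subset), each sample's PC1 score is just the dot product of the mean-centered sample with the first row of components_:
import numpy as np
from sklearn.datasets import load_iris
from sklearn import decomposition

X = load_iris().data[:20, ]
pca = decomposition.PCA(n_components=4).fit(X)

# PCA centers the data internally; projecting the centered samples onto
# the first component reproduces the first column of pca.transform(X)
manual_pc1 = (X - pca.mean_) @ pca.components_[0]
print(np.allclose(manual_pc1, pca.transform(X)[:, 0]))  # True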
What exactly does "explained variance" mean here? Does this mean how much the original feature influenced the variance of the newly calculated values?
It is the amount of variance explained by each of the selected components. It is obtained simply by taking the variance of each column of the PCA scores (the columns returned by pca.transform), i.e., the variance of the transformed features, not of the original ones; see the code below:
X = pca.transform(X)
print(np.var(X, axis=0))
#[ 0.31612993 0.02264438 0.01811324 0.00253745]
print(pca.explained_variance_)
#[ 0.31612993 0.02264438 0.01811324 0.00253745]
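Relatedly, explained_variance_ratio_ is each of these variances divided by the total variance of the data; since all 4 components are kept here, the ratios can be recomputed from explained_variance_ itself (a small check, reusing pca from above):
print(pca.explained_variance_ / pca.explained_variance_.sum())
#[ 0.87954353  0.06300167  0.05039505  0.00705974]  -- equals pca.explained_variance_ratio_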
Why is the noise variance 0 for the first example with 4 components?
Because in the first case we do not perform any dimensionality reduction: we merely rotate the feature space into another one and keep all 4 components, excluding none of them, so no information is lost.
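Concretely, for n_components < 4, sklearn sets noise_variance_ to the average variance of the discarded components (the mean of the dropped eigenvalues, following Tipping and Bishop's probabilistic PCA model). A small sketch, assuming X is the original 20x4 subset:
pca_full = decomposition.PCA(n_components=4).fit(X)
pca_3 = decomposition.PCA(n_components=3).fit(X)

# with 3 of 4 components kept, the "noise" is the variance of the single
# discarded component
print(pca_3.noise_variance_)                    # 0.00253745
print(pca_full.explained_variance_[3:].mean())  # 0.00253745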
What exactly are the components_? Are they the n-dimensional eigenvectors?
The components can be thought of as the orthonormal eigenvectors of the covariance matrix of the centered data, although, as the documentation notes, they are computed in a numerically more stable way using singular value decomposition, in which case they are the right singular vectors of the centered data matrix.
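This can be checked numerically: up to sign (the sign of each singular vector is arbitrary), components_ matches the right singular vectors of the centered data. A minimal sketch, again assuming X is the original 20x4 subset and pca the 4-component fit:
Xc = X - X.mean(axis=0)        # PCA operates on mean-centered data
U, S, Vt = np.linalg.svd(Xc)   # rows of Vt are the right singular vectors
print(np.allclose(np.abs(Vt), np.abs(pca.components_)))  # True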