Should I scale Box-Cox data for PCA?

Question

I have transformed my dataset (9 columns) with PowerTransformer to produce standardized, approximately Gaussian features.

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson', standardize=True)

# Fit the transformer on the training set only, then apply it to the test set.
X_train = pt.fit_transform(X_train)
X_test = pt.transform(X_test)

# The original data can be recovered with pt.inverse_transform(X).

So now most features in my dataset are approximately Gaussian, with zero mean and unit variance. I then applied PolynomialFeatures():

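from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Expand the transformed features into all polynomial terms up to degree 4.
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X_train)

LR2 = LinearRegression()
LR2.fit(X_poly, y_train)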

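After adding the polynomial features I have 2380 columns, which may lead to overfitting, so I want to use PCA for dimensionality reduction. However, I read somewhere that PCA requires the data to be "scaled" (which usually means changing the range of the values with something like MinMaxScaler()).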

So should I apply MinMaxScaler() before applying PCA to the Box-Cox transformed (and standardized) dataset?
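
As a quick sanity check on the column count mentioned above: with its default settings, PolynomialFeatures produces C(n + d, d) columns (bias column included) for n input features and degree d. The sketch below evaluates that formula and cross-checks it against scikit-learn on a dummy array; the helper name `n_poly_columns` is made up for illustration. Under this formula, 9 inputs at degree 4 give 715 columns, while 2380 corresponds to 13 inputs, so the figure quoted above may refer to a wider feature matrix than the 9 transformed columns. That is only a guess; the question does not say.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_columns(n_features: int, degree: int) -> int:
    """Column count of PolynomialFeatures with default settings (bias included)."""
    return comb(n_features + degree, degree)


# Cross-check the formula against scikit-learn on a tiny dummy array.
dummy = np.zeros((1, 9))
n_from_sklearn = PolynomialFeatures(degree=4).fit(dummy).n_output_features_

print(n_poly_columns(9, 4), n_from_sklearn)  # 9 inputs, degree 4
print(n_poly_columns(13, 4))                 # 13 inputs, degree 4
```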

Answer 1

Standardization is important in PCA since it is a variance maximizing exercise. It projects your original data onto directions that maximize the variance. The first plot below shows the amount of total variance explained in the different principal components where we have not normalized the data. As you can see, it seems like component one explains most of the variance in the data.

Find more details here.
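
To see why scale matters for a variance-maximizing method, here is a minimal sketch on made-up data (two independent synthetic features, one with a standard deviation about 1000 times the other). It is only an illustration, not the plot from the quoted source. Without standardization the first component should absorb almost all of the variance simply because one feature has larger units; after StandardScaler the two components should share it roughly equally.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales (made-up data).
X = np.column_stack([
    rng.normal(0.0, 1.0, size=500),     # standard deviation around 1
    rng.normal(0.0, 1000.0, size=500),  # standard deviation around 1000
])

# Explained variance ratios on the raw data and on the standardized data.
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_
X_std = StandardScaler().fit_transform(X)
ratio_std = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("raw:         ", ratio_raw)
print("standardized:", ratio_std)
```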

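In your case, you are using a power transform with standardize set to True, which sets the mean to 0 and the standard deviation to 1. Normalization (rescaling each variable to the range 0 to 1) is generally not recommended before PCA, because it does little to address existing skewness or outliers in the data.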
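
A small check of this, using made-up log-normal data as a stand-in for the real 9-column training set: after PowerTransformer with standardize=True, each column of the training data should have mean 0 and standard deviation 1. Applying MinMaxScaler afterwards maps each column to [0, 1], so the resulting column spreads are set by each column's most extreme values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

rng = np.random.default_rng(0)

# Made-up skewed data standing in for the real 9-column training set.
X_train = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 9))

pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_pt = pt.fit_transform(X_train)

print("mean per column:", X_pt.mean(axis=0).round(3))
print("std per column: ", X_pt.std(axis=0).round(3))

# Min-max scaling afterwards maps each column to [0, 1]; the column spreads
# then depend on each column's minimum and maximum.
X_mm = MinMaxScaler().fit_transform(X_pt)
print("std after MinMaxScaler:", X_mm.std(axis=0).round(3))
```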

Check this.

So, if your features are already standardized, I would suggest that you don't need MinMaxScaler.
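
Putting the steps together, one possible sketch of the whole chain as a scikit-learn pipeline, on made-up data. Two things here are assumptions rather than part of the question or this answer: degree=2 is used only to keep the example small, and a StandardScaler is placed after PolynomialFeatures, because polynomial terms of standardized features are not themselves unit-variance (for a standard normal x, x has variance 1 while x^2 has variance 2). No MinMaxScaler is involved.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer, StandardScaler

rng = np.random.default_rng(0)

# Made-up data standing in for the real training set (9 skewed columns).
X_train = rng.lognormal(size=(300, 9))
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=300)

model = make_pipeline(
    PowerTransformer(method="yeo-johnson", standardize=True),
    PolynomialFeatures(degree=2, include_bias=False),  # assumption: small degree for the sketch
    StandardScaler(),        # assumption: re-standardize the polynomial terms
    PCA(n_components=0.95),  # keep enough components to explain 95% of the variance
    LinearRegression(),
)
model.fit(X_train, y_train)

print("components kept:", model.named_steps["pca"].n_components_)
```

Wrapping the steps in a pipeline also means every transformer is fitted on the training data only, and model.predict(X_test) applies the same fitted steps to the test set.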

python

scaling

transformation

pca