Optimal Feature Selection Technique after PCA?

Question

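I am working on a binary classification task with RandomForestClassifier, and I know how much data preprocessing matters for the accuracy score. My dataset has more than 100 features and almost 4,000 instances, and I want to apply dimensionality reduction to avoid overfitting, because the data is very noisy.

For these tasks I usually use classical feature selection methods (filters, wrappers, feature importances), but I recently read about combining Principal Component Analysis (PCA) as a first step with feature selection on the transformed dataset.

My question is: after running PCA on my data, is there a specific feature selection method I should use? In particular, I want to understand whether applying PCA makes some feature selection techniques useless or less effective.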

Answer 1

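Let's start with when you should use PCA.

PCA is most useful when you are not sure which components of your data affect accuracy.

Consider a face recognition task. Can you tell at a glance which pixels matter most?

For example, the Olivetti faces: 40 people, a dark homogeneous background, varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses).

So, if we look at the correlations between pixels: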

from sklearn.datasets import fetch_olivetti_faces
from numpy import corrcoef
from numpy import zeros_like
from numpy import triu_indices_from
from matplotlib.pyplot import figure
from matplotlib.pyplot import get_cmap
from matplotlib.pyplot import plot
from matplotlib.pyplot import colorbar
from matplotlib.pyplot import subplots
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import savefig
import seaborn


olivetti = fetch_olivetti_faces()

X = olivetti.images  # Images, shape (400, 64, 64)
y = olivetti.target  # Labels (person id, 0 to 39)

# Flatten each 64x64 image into a 4096-long pixel vector
X = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))

seaborn.set(font_scale=1.2)
seaborn.set_style("darkgrid")
# corrcoef treats each row as a variable, so this is the 400x400 matrix of
# correlations between images; corrcoef(X, rowvar=False) would give the
# (much larger) 4096x4096 pixel-by-pixel matrix instead.
corr = corrcoef(X)
# Hide the upper triangle, which only mirrors the lower one
mask = zeros_like(corr, dtype=bool)
mask[triu_indices_from(mask)] = True
with seaborn.axes_style("white"):
    f, ax = subplots(figsize=(20, 15))
    ax = seaborn.heatmap(corr,
                         annot=True,
                         mask=mask,
                         vmax=1,
                         vmin=0,
                         square=True,
                         cmap="YlGnBu",
                         annot_kws={"size": 1})

savefig('heatmap.png')

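Looking at the plot above, can you tell me which pixels matter most for classification?

But what if I asked you, "Can you tell me the most important features for chronic kidney disease?"

You could tell me at a glance:

Coming back to the face recognition task, do we really need every pixel for classification?

No, we don't.

Above you can see that only 63 pixels are enough to recognize that the image shows a human face.

Note that 63 pixels are enough to recognize that something is a face, not to recognize whose face it is. You need more pixels to tell faces apart.

So what we do is reduce the dimensionality. You may want to read more about the curse of dimensionality.

OK, so we decide to use PCA because we don't need every pixel of the face image. We are going to reduce the dimensionality.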
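To make the idea of "reducing the dimensionality" concrete, here is a minimal NumPy-only sketch of what PCA does: center the data, take an SVD, and project onto the leading directions. The helper name `pca_project` and the random stand-in data are assumptions for illustration, not part of the original answer.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)  # PCA works on centered data
    # Rows of Vt are the principal directions, sorted by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
# Hypothetical stand-in data: 200 samples, 10 correlated features
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

Z = pca_project(X_demo, n_components=3)
print(Z.shape)

# The projected columns are mutually uncorrelated: off-diagonal covariances vanish
cov = np.cov(Z, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))
print(np.allclose(off_diagonal, 0, atol=1e-8))
```

Because the projected columns are linearly uncorrelated by construction, a filter that drops features based on their pairwise correlation would find nothing to remove after PCA.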

To make this easy to follow visually, I use two dimensions.

from sklearn.decomposition import PCA


def projection(obj, x, x_label, y_label, title, class_num=40, sample_num=10, dpi=300):
    # Project the data with the fitted PCA object and plot the first two components
    x_obj = obj.transform(x)
    idx_range = class_num * sample_num
    fig = figure(figsize=(6, 3), dpi=dpi)
    ax = fig.add_subplot(1, 1, 1)
    c_map = get_cmap(name='jet', lut=class_num)
    # Color each point by its label; y is the global label array defined above
    scatter = ax.scatter(x_obj[:idx_range, 0], x_obj[:idx_range, 1], c=y[:idx_range],
                         s=10, cmap=c_map)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title.format(class_num))
    colorbar(mappable=scatter)


pca_obj = PCA(n_components=2).fit(X)
x_label = "First Principal Component"
y_label = "Second Principal Component"
title = "PCA Projection of {} people"
projection(obj=pca_obj, x=X, x_label=x_label, y_label=y_label, title=title)

As you can see, PCA with 2 components is not enough to tell the people apart.

So how many components do you need?

def display_n_components(obj):
    figure(1, figsize=(6,3), dpi=300)
    plot(obj.explained_variance_, linewidth=2)
    xlabel('Components')
    ylabel('Explained Variance')


pca_obj2 = PCA().fit(X)
display_n_components(pca_obj2)

You need 100 components for good discrimination.
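If reading the number off a plot feels too subjective, one alternative is to look at the cumulative explained variance ratio and pick the smallest number of components that reaches a chosen threshold. The sketch below is only an illustration: it uses scikit-learn's bundled digits dataset as stand-in data and a hypothetical 0.95 threshold, and it works with `explained_variance_ratio_` rather than the `explained_variance_` plotted above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (an assumption for this sketch): digit images with 64 pixels each
X_demo = load_digits().data

pca_full = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95  # hypothetical cut-off, tune it for your data
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed, "components reach a cumulative ratio of", threshold)

# scikit-learn makes the same choice itself when n_components is a float in (0, 1)
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_demo)
print(pca_auto.n_components_)
```

Passing a float to `n_components` saves the manual step, while the explicit cumulative sum is handy when you want to plot the curve or compare several thresholds.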

Now we need to split the data into training and test sets.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# X was already flattened to (n_samples, 4096) above, so no further reshape is needed

# Fit PCA on the training split only, so the test images do not leak into the projection
pca = PCA(n_components=100).fit(X_train)
X_pca_tr = pca.transform(X_train)
X_pca_te = pca.transform(X_test)

forest1 = RandomForestClassifier(random_state=42)
forest1.fit(X_pca_tr, y_train)
y_pred = forest1.predict(X_pca_te)
print("\nAccuracy:{:,.2f}%".format(accuracy_score(y_true=y_test, y_pred=y_pred) * 100))

The accuracy is:

You may be wondering whether PCA improved the accuracy.

The answer is yes.

Without PCA:
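A comparison like this can be sketched by scoring the same classifier with and without the PCA step under cross-validation. The code below is a minimal illustration under stated assumptions: it uses scikit-learn's bundled digits dataset as stand-in data, and the values `n_components=20`, `n_estimators=50`, and `cv=3` are hypothetical choices. The numbers it prints say nothing about the Olivetti result discussed above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Stand-in data (an assumption for this sketch); swap in your own X and y
X_demo, y_demo = load_digits(return_X_y=True)

# Baseline: random forest on the raw features
baseline = RandomForestClassifier(n_estimators=50, random_state=42)

# Same forest preceded by PCA; the pipeline refits PCA inside each fold,
# so held-out samples never influence the projection
with_pca = make_pipeline(
    PCA(n_components=20, random_state=42),  # 20 is a hypothetical choice
    RandomForestClassifier(n_estimators=50, random_state=42),
)

scores_raw = cross_val_score(baseline, X_demo, y_demo, cv=3)
scores_pca = cross_val_score(with_pca, X_demo, y_demo, cv=3)
print("Without PCA: {:.2f}%".format(scores_raw.mean() * 100))
print("With PCA:    {:.2f}%".format(scores_pca.mean() * 100))
```

Wrapping PCA and the classifier in one pipeline keeps the two variants comparable, since both are evaluated on exactly the same folds.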

