Removing the redundant feature from a classification dataset ( make_classification )

Question

In the make_classification method,

X, y = make_classification(n_samples=10, n_features=8, n_informative=7, n_redundant=1, n_repeated=0, n_classes=2, random_state=6)

Docstring about n_redundant: The number of redundant features. These features are generated as random linear combinations of the informative features.

Docstring about n_repeated: The number of duplicated features, drawn randomly from the informative

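The n_repeated features are easy to pick out because they are highly correlated with the informative features.
The docstrings for the repeated and the redundant features indicate that both are derived from the informative features.

My question is: how can the redundant features be removed/highlighted, and what are their characteristics?

Attached is the correlation heatmap between all the features. Which feature in the image is the redundant one?

Please help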

Answer 1

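To check how many independent columns there are, use np.linalg.matrix_rank(X)
To find the indices of the linearly independent columns of matrix X (the pivot columns), use sympy.Matrix(X).rref()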
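To make the idea concrete, here is a minimal toy sketch. The matrix A is made up for illustration and is not taken from the dataset: its third column is the first column plus twice the second, so it adds no new information and the rank stays at 2.

```python
import numpy as np

# Hypothetical toy matrix: column 2 = column 0 + 2 * column 1
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 3.0],
    [2.0, 0.0, 2.0],
])

rank_full = np.linalg.matrix_rank(A)             # number of independent columns
rank_reduced = np.linalg.matrix_rank(A[:, :2])   # rank after dropping the combined column

print(rank_full, A.shape[1])   # the rank is smaller than the column count
print(rank_reduced)            # dropping the dependent column does not lower the rank
```

A rank lower than the number of columns is the signal that at least one column is a linear combination of the others.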

Demo

Generate the dataset and check the number of independent columns (the matrix rank):

import numpy as np
from sklearn.datasets import make_classification
from sympy import Matrix

X, _ = make_classification(
    n_samples=10, n_features=8, n_redundant=2, random_state=6
)
np.linalg.matrix_rank(X, tol=1e-3)
# 6

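Find the indices of the linearly independent columns:

# rref() returns the reduced matrix and the tuple of pivot column indices
_, inds = Matrix(X).rref(iszerofunc=lambda x: abs(x) < 1e-3)
inds
# (0, 1, 2, 3, 6, 7)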

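Remove the dependent columns and check that the matrix rank equals the number of remaining columns:

# keep only the linearly independent columns
X_independent = X[:, list(inds)]
assert np.linalg.matrix_rank(X_independent, tol=1e-3) == X_independent.shape[1]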
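
As a sanity check on what "redundant" means here, the sketch below rebuilds the dropped columns from the kept ones with least squares. It is my own addition and uses only NumPy: the greedy rank test is a swapped-in alternative to rref(), and the names kept, dropped and max_residual are hypothetical. It assumes the default shift=0.0 and scale=1.0, under which the redundant columns should be linear combinations of other columns up to floating-point error.

```python
import numpy as np
from sklearn.datasets import make_classification

X, _ = make_classification(
    n_samples=10, n_features=8, n_redundant=2, random_state=6
)
tol = 1e-3  # same tolerance as in the demo above

# Greedy pass: keep a column only if it raises the rank of the kept set
kept = []
for j in range(X.shape[1]):
    if np.linalg.matrix_rank(X[:, kept + [j]], tol=tol) > len(kept):
        kept.append(j)
dropped = [j for j in range(X.shape[1]) if j not in kept]

# Express each dropped column as a linear combination of the kept columns
coef, *_ = np.linalg.lstsq(X[:, kept], X[:, dropped], rcond=None)
max_residual = np.abs(X[:, kept] @ coef - X[:, dropped]).max()

print("kept:", kept, "dropped:", dropped)
print("largest reconstruction error:", max_residual)
```

A reconstruction error close to zero means the dropped columns carry no information beyond the kept ones. Note that a rank-based selection cannot tell which column of a dependent group was originally generated as informative and which as redundant. It only returns one maximal independent subset, and which subset you get depends on the column order.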

