Is there a way to use mutual information as part of a pipeline in scikit-learn?

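I'm creating a model using scikit-learn. The pipeline that seems to work best is:

  1. mutual_info_classif with a threshold, i.e. include only the fields whose mutual information score is above a given threshold.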
  2. PCA
  3. LogisticRegression

I would like to use sklearn's Pipeline object to do all of them, but I'm not sure how to get the mutual information step in. For the second and third steps, I do:

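from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pca = PCA(random_state=100)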
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('dim_red', pca),
        ('pred', lr)
    ]
)

But I can't see a way to include the first step. I know I could write my own class to do this, and I will if I have to, but is there a way to do this within sklearn?

You can implement your own estimator by subclassing BaseEstimator. You can then pass it as the estimator to a SelectFromModel instance, which can be used in a Pipeline:

from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]


class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        # SelectFromModel reads feature_importances_ from the fitted estimator.
        self.feature_importances_ = mutual_info_classif(X, y, discrete_features=self.discrete_features,
                                                        n_neighbors=self.n_neighbors,
                                                        copy=self.copy, random_state=self.random_state)
        # Return self, as scikit-learn estimators conventionally do.
        return self


feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline(
    [
        ('feat_sel', feat_sel),
        ('pca', pca),
        ('pred', lr)
    ]
)

print(pipe)
Pipeline(steps=[('feat_sel',
                 SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
                ('pca', PCA(random_state=100)),
                ('pred', LogisticRegression(random_state=200))])
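
To get the thresholding the question asks for, one option is the threshold parameter of SelectFromModel, which keeps the features whose importance is at or above the given value. The following is a minimal sketch, not a definitive implementation: the synthetic data from make_classification, the names X_demo and y_demo, and the cutoff 0.05 are assumptions made purely for illustration, and the estimator is a trimmed copy of the class above so that the snippet runs on its own.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, mutual_info_classif


class MutualInfoEstimator(BaseEstimator):
    """Trimmed copy of the class above: exposes MI scores as feature_importances_."""

    def __init__(self, n_neighbors=3, random_state=None):
        self.n_neighbors = n_neighbors
        self.random_state = random_state

    def fit(self, X, y):
        self.feature_importances_ = mutual_info_classif(
            X, y, n_neighbors=self.n_neighbors, random_state=self.random_state
        )
        return self


# Synthetic data, assumed here only so the example is runnable.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# threshold=0.05 is an arbitrary example value, not a recommendation.
selector = SelectFromModel(MutualInfoEstimator(random_state=0), threshold=0.05)
X_selected = selector.fit_transform(X_demo, y_demo)

scores = selector.estimator_.feature_importances_  # MI score per feature
mask = selector.get_support()                      # True where score >= threshold
print(np.round(scores, 3))
print(mask)
```

If no threshold is passed, SelectFromModel falls back to the mean of the importances for an estimator like this one, so an explicit value is needed to reproduce a fixed cutoff.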

Note that the new estimator should of course expose the parameters you want to tune during optimization. Here I have exposed all of them.
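
As a rough illustration of why exposing the parameters matters, the sketch below tunes both the selection threshold and n_neighbors with GridSearchCV, using the usual step__param naming. The data, the grid values, and cv=3 are assumptions chosen for the example. The grid uses the string thresholds "mean" and "median" because they always keep at least one feature, whereas a numeric cutoff that is too high could leave PCA with no columns.

```python
from sklearn.base import BaseEstimator
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class MutualInfoEstimator(BaseEstimator):
    """Trimmed copy of the class above: exposes MI scores as feature_importances_."""

    def __init__(self, n_neighbors=3, random_state=None):
        self.n_neighbors = n_neighbors
        self.random_state = random_state

    def fit(self, X, y):
        self.feature_importances_ = mutual_info_classif(
            X, y, n_neighbors=self.n_neighbors, random_state=self.random_state
        )
        return self


# Synthetic data, assumed here only so the example is runnable.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

pipe = Pipeline(
    [
        ("feat_sel", SelectFromModel(MutualInfoEstimator(random_state=0))),
        ("pca", PCA(random_state=100)),
        ("pred", LogisticRegression(random_state=200)),
    ]
)

# Parameters of the wrapped estimator are reached through "estimator__".
param_grid = {
    "feat_sel__threshold": ["mean", "median"],
    "feat_sel__estimator__n_neighbors": [3, 5],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```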

Yes, I don't think there is any other way to do this. At least I don't know of one!

How about SelectKBest or SelectPercentile:

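from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline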

mi_best = SelectKBest(score_func=mutual_info_classif, k=10)
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('select', mi_best),
        ('dim_red', pca),
        ('pred', lr),
    ]
)
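
One thing to keep in mind with this approach: SelectKBest and SelectPercentile keep a fixed number or share of the top-scoring features rather than applying a score threshold, and they call score_func with only X and y. A possible way to still pass arguments such as random_state to mutual_info_classif is functools.partial. The sketch below assumes synthetic data and arbitrary values for k and percentile.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectKBest,
    SelectPercentile,
    mutual_info_classif,
)

# Synthetic data, assumed here only so the example is runnable.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Bind random_state so the mutual information estimate is reproducible.
mi_score = partial(mutual_info_classif, random_state=0)

# k=4 and percentile=30 are arbitrary example values.
top_k = SelectKBest(score_func=mi_score, k=4).fit(X_demo, y_demo)
top_pct = SelectPercentile(score_func=mi_score, percentile=30).fit(X_demo, y_demo)

print(top_k.get_support(indices=True))    # indices of the 4 best features
print(top_pct.get_support(indices=True))  # indices in the top 30 percent
```

Either selector can then take the place of mi_best as the first step of the pipeline.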