Is there a way to use mutual information as part of a pipeline in scikit-learn?
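I am creating a model with scikit-learn. The pipeline that seems to work best is:
- mutual_info_classif with a threshold, i.e. only include the fields whose mutual information score is above a given threshold.
- PCA
- LogisticRegression
I would like to do all of this with sklearn's Pipeline object, but I am not sure how to get the mutual information step in. For the second and third steps I have: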
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('dim_red', pca),
        ('pred', lr)
    ]
)
But I cannot see a way to include the first step. I know I can write my own class to do this, and I will if I have to, but is there a way to do it within sklearn?
You can implement your own estimator by subclassing BaseEstimator. You can then pass it as the estimator to a SelectFromModel instance, which can be used in a Pipeline:
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
X = [[ 0.87, -1.34,  0.31],
     [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55],
     [ 1.92,  1.48,  0.65]]
y = [0, 1, 0, 1]
class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        # SelectFromModel reads feature_importances_ from the fitted estimator
        self.feature_importances_ = mutual_info_classif(X, y, discrete_features=self.discrete_features,
                                                        n_neighbors=self.n_neighbors,
                                                        copy=self.copy, random_state=self.random_state)
        return self  # scikit-learn convention: fit returns the estimator
feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('feat_sel', feat_sel),
        ('pca', pca),
        ('pred', lr)
    ]
)
print(pipe)
Pipeline(steps=[('feat_sel',
                 SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
                ('pca', PCA(random_state=100)),
                ('pred', LogisticRegression(random_state=200))])
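Note that the new estimator should of course expose the parameters you want to tune during optimization. Here I simply exposed all of them.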
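The question asks for a fixed score cutoff, and SelectFromModel has a threshold parameter that can serve that purpose. The following is only a minimal sketch of how that might look: the synthetic data, the cutoff of 0.1, and the trimmed-down MutualInfoEstimator (fewer constructor arguments than the class above) are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MutualInfoEstimator(BaseEstimator):
    """Trimmed-down version of the estimator above (illustrative only)."""

    def __init__(self, n_neighbors=3, random_state=None):
        self.n_neighbors = n_neighbors
        self.random_state = random_state

    def fit(self, X, y):
        self.feature_importances_ = mutual_info_classif(
            X, y, n_neighbors=self.n_neighbors, random_state=self.random_state
        )
        return self


# Synthetic data (an assumption): columns 0-1 depend on y, columns 2-5 are noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = np.hstack([3.0 * y[:, None] + rng.normal(size=(300, 2)),
               rng.normal(size=(300, 4))])

pipe = Pipeline([
    # Keep only features whose mutual information score is >= 0.1.
    ("feat_sel", SelectFromModel(MutualInfoEstimator(random_state=0), threshold=0.1)),
    ("pca", PCA(random_state=100)),
    ("pred", LogisticRegression(random_state=200)),
])
pipe.fit(X, y)

# Boolean mask over the input columns showing which ones passed the cutoff.
mask = pipe.named_steps["feat_sel"].get_support()
print(mask)
```

With the step names used here, the cutoff can be tuned in a parameter search as feat_sel__threshold, and the estimator's own settings as, for example, feat_sel__estimator__n_neighbors.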
Yes, I don't think there is any other way to do this. At least not that I know of!
How about SelectKBest or SelectPercentile:
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep the 10 features with the highest mutual information scores
mi_best = SelectKBest(score_func=mutual_info_classif, k=10)
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('select', mi_best),
        ('dim_red', pca),
        ('pred', lr),
    ]
)
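One practical detail with this approach: mutual_info_classif adds a small amount of random noise to continuous features, so scores can vary between runs unless random_state is fixed. SelectKBest calls score_func(X, y) with no extra arguments, so one way to pin the seed is functools.partial. The sketch below assumes synthetic data and arbitrary values for k and percentile; it is an illustration, not the answerer's code.

```python
from functools import partial

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, SelectPercentile, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption): columns 0-1 depend on y, columns 2-5 are noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = np.hstack([3.0 * y[:, None] + rng.normal(size=(300, 2)),
               rng.normal(size=(300, 4))])

# Bind random_state so the scores are reproducible across fits.
score_func = partial(mutual_info_classif, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=score_func, k=2)),
    ("dim_red", PCA(random_state=100)),
    ("pred", LogisticRegression(random_state=200)),
])
pipe.fit(X, y)
kept = pipe.named_steps["select"].get_support(indices=True)
print(kept)  # indices of the columns that were kept

# Swap in SelectPercentile to keep a share of the features instead of a fixed count.
pipe.set_params(select=SelectPercentile(score_func=score_func, percentile=50))
pipe.fit(X, y)
mask_pct = pipe.named_steps["select"].get_support()
print(mask_pct)
```

Neither selector applies an absolute score threshold: SelectKBest keeps a fixed number of features and SelectPercentile a fixed share, so k or percentile has to be tuned in place of a cutoff.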