Sklearn Pipeline: pass a parameter to a custom Transformer?
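I have a custom Transformer in my sklearn Pipeline, and I would like to know how to pass a parameter to it.
In the code below, you can see that I use a dictionary "weight" inside my Transformer. I don't want to define this dictionary in the Transformer itself; I would rather pass it in from the Pipeline, so that I can include it in a grid search. Is it possible to pass the dictionary as a parameter to my Transformer?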
# My custom Transformer
class TextExtractor(BaseEstimator, TransformerMixin):
    """Concat the 'title', 'body' and 'code' from the results of
    Stack Overflow query
    Keys are 'title', 'body' and 'code'.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # here is the parameter I want to pass to my transformer
        weight = {'title': 10, 'body': 1, 'code': 1}
        x['text'] = (weight['title'] * x['Title']
                     + weight['body'] * x['Body']
                     + weight['code'] * x['Code'])
        return x['text']

param_grid = {
    'min_df': [10],
    'max_df': [0.01],
    'max_features': [200],
    'clf': [sgd],
    # here is the parameter I want to pass to my transformer
    'weight': [{'title': 10, 'body': 1, 'code': 1},
               {'title': 1, 'body': 1, 'code': 1}],
}

for g in ParameterGrid(param_grid):
    classifier_pipe = Pipeline(
        steps=[
            ('textextractor', TextExtractor()),  # is it possible to pass my parameter?
            ('vectorizer', TfidfVectorizer(max_df=g['max_df'],
                                           min_df=g['min_df'],
                                           max_features=g['max_features'])),
            ('clf', g['clf']),
        ],
    )
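To do this, you only need to add an __init__() method at the start of the class definition. In this step, you define the class TextExtractor as taking a parameter, which you call weight.
Here is how it can be done. (For reproducibility, I added quite a few lines of code beforehand; since you didn't specify any data, I made up some fake data. I also assumed that what you are trying to do is multiply strings by the weights?)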
# import all the necessary packages
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid, GridSearchCV
from sklearn.linear_model import SGDClassifier
import pandas as pd
import numpy as np
#Sample data
X = pd.DataFrame({"Title" : ["T1","T2","T3","T4","T5"], "Body": ["B1","B2","B3","B4","B5"], "Code": ["C1","C2","C3","C4","C5"]})
y = np.array([0,0,1,1,1])
#Define the SGDClassifier
sgd = SGDClassifier()
Below, I only added the __init__ step:
# My custom Transformer
class TextExtractor(BaseEstimator, TransformerMixin):
    """Concat the 'title', 'body' and 'code' from the results of
    Stack Overflow query
    Keys are 'title', 'body' and 'code'.
    """
    def __init__(self, weight={'title': 10, 'body': 1, 'code': 1}):
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x['text'] = (self.weight['title'] * x['Title']
                     + self.weight['body'] * x['Body']
                     + self.weight['code'] * x['Code'])
        return x['text']
Note that I gave the parameter a default value in case you don't specify one; that is up to you. You can then call your transformer like this:
textextractor = TextExtractor(weight = {'title' : 5, 'body': 2, 'code' : 1})
textextractor.transform(X)
This should return:
0 T1T1T1T1T1B1B1C1
1 T2T2T2T2T2B2B2C2
2 T3T3T3T3T3B3B3C3
3 T4T4T4T4T4B4B4C4
4 T5T5T5T5T5B5B5C5
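One caveat about the constructor shown earlier: a dictionary used as a default argument is a single object shared by every instance created without an explicit weight, and transform() also writes a 'text' column into the caller's DataFrame. The sketch below is a possible variant, not part of the original answer. It assumes a None default that is resolved inside transform(), a hypothetical module-level DEFAULT_WEIGHT constant, and `.str.repeat()` in place of integer-times-string multiplication. It also lists the mixin before BaseEstimator, which is the order I understand scikit-learn's developer guide to recommend.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical constant, mirroring the default weights used in the answer.
DEFAULT_WEIGHT = {'title': 10, 'body': 1, 'code': 1}


class TextExtractor(TransformerMixin, BaseEstimator):
    """Variant with a None default that leaves the input DataFrame untouched."""

    def __init__(self, weight=None):
        # Store the argument exactly as given so get_params/set_params/clone work.
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Resolve the default here rather than in __init__.
        w = DEFAULT_WEIGHT if self.weight is None else self.weight
        # Build and return a new Series instead of adding a column to x.
        return (x['Title'].str.repeat(w['title'])
                + x['Body'].str.repeat(w['body'])
                + x['Code'].str.repeat(w['code']))


X = pd.DataFrame({"Title": ["T1", "T2"], "Body": ["B1", "B2"], "Code": ["C1", "C2"]})
text = TextExtractor(weight={'title': 2, 'body': 1, 'code': 1}).transform(X)
print(text.tolist())    # ['T1T1B1C1', 'T2T2B2C2']
print(list(X.columns))  # ['Title', 'Body', 'Code'] (no 'text' column added)
```

Whether this is worth the extra lines depends on how the transformer is used; with a grid search that always supplies weight explicitly, the shared default is rarely touched.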
You can then define your parameter grid:
param_grid = {
    'vectorizer__min_df': [0.1],
    'vectorizer__max_df': [0.9],
    'vectorizer__max_features': [200],
    # here is the parameter I want to pass to my transformer
    'textextractor__weight': [{'title': 10, 'body': 1, 'code': 1},
                              {'title': 1, 'body': 1, 'code': 1}],
}
Finally, do:
for g in ParameterGrid(param_grid):
    classifier_pipe = Pipeline(
        steps=[
            ('textextractor', TextExtractor(weight=g['textextractor__weight'])),
            ('vectorizer', TfidfVectorizer(max_df=g['vectorizer__max_df'],
                                           min_df=g['vectorizer__min_df'],
                                           max_features=g['vectorizer__max_features'])),
            ('clf', sgd),
        ]
    )
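Because the grid keys follow the `<step name>__<parameter>` convention, each combination can also be applied to a pipeline with set_params() instead of being unpacked by hand. The following is a minimal sketch of that variation, not the answer's own code. It assumes a None default for weight, `.str.repeat()` for the string repetition, and a fixed random_state, and it only fits each pipeline on the made-up sample data without evaluating it.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline


class TextExtractor(TransformerMixin, BaseEstimator):
    def __init__(self, weight=None):
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Assumed fallback when no weight is given.
        w = self.weight if self.weight is not None else {'title': 10, 'body': 1, 'code': 1}
        return (x['Title'].str.repeat(w['title'])
                + x['Body'].str.repeat(w['body'])
                + x['Code'].str.repeat(w['code']))


# Same made-up sample data as in the answer.
X = pd.DataFrame({"Title": ["T1", "T2", "T3", "T4", "T5"],
                  "Body": ["B1", "B2", "B3", "B4", "B5"],
                  "Code": ["C1", "C2", "C3", "C4", "C5"]})
y = np.array([0, 0, 1, 1, 1])

param_grid = {
    'vectorizer__min_df': [0.1],
    'vectorizer__max_df': [0.9],
    'vectorizer__max_features': [200],
    'textextractor__weight': [{'title': 10, 'body': 1, 'code': 1},
                              {'title': 1, 'body': 1, 'code': 1}],
}

base_pipe = Pipeline(steps=[
    ('textextractor', TextExtractor()),
    ('vectorizer', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])

fitted = []
for g in ParameterGrid(param_grid):
    # clone() gives a fresh, unfitted copy; set_params() routes each
    # 'step__param' key to the matching step.
    pipe = clone(base_pipe).set_params(**g)
    pipe.fit(X, y)
    fitted.append((g['textextractor__weight'], pipe))

for weight, pipe in fitted:
    print(weight, '->', pipe.named_steps['textextractor'].weight)
```

Calling `base_pipe.get_params().keys()` lists every name that set_params() accepts, which is a quick way to check a grid key for typos.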
Beyond that, you probably want to run an actual grid search, which would require you to write:
pipe = Pipeline(
    steps=[
        ('textextractor', TextExtractor()),
        ('vectorizer', TfidfVectorizer()),
        ('clf', sgd),
    ]
)
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)
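Once the search has been fitted, the tried weights and their scores can be read back from the GridSearchCV object. The sketch below shows one way to do that under a few assumptions that are mine rather than the answer's: a six-row made-up dataset (so that each class has at least as many members as there are folds), a None default for weight, and a fixed random_state. The scores on such toy data carry no meaning; only the mechanics are of interest.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class TextExtractor(TransformerMixin, BaseEstimator):
    def __init__(self, weight=None):
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Assumed fallback when no weight is given.
        w = self.weight if self.weight is not None else {'title': 10, 'body': 1, 'code': 1}
        return (x['Title'].str.repeat(w['title'])
                + x['Body'].str.repeat(w['body'])
                + x['Code'].str.repeat(w['code']))


# Made-up data: six rows, three per class, so cv=3 can stratify cleanly.
X = pd.DataFrame({"Title": ["T1", "T2", "T3", "T4", "T5", "T6"],
                  "Body": ["B1", "B2", "B3", "B4", "B5", "B6"],
                  "Code": ["C1", "C2", "C3", "C4", "C5", "C6"]})
y = np.array([0, 0, 0, 1, 1, 1])

param_grid = {
    'vectorizer__min_df': [0.1],
    'vectorizer__max_df': [0.9],
    'vectorizer__max_features': [200],
    'textextractor__weight': [{'title': 10, 'body': 1, 'code': 1},
                              {'title': 1, 'body': 1, 'code': 1}],
}

pipe = Pipeline(steps=[
    ('textextractor', TextExtractor()),
    ('vectorizer', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)

# Best combination found, including the transformer's weight dictionary.
print(grid.best_params_)

# One line per combination: the weight that was tried and its mean CV score.
for params, score in zip(grid.cv_results_['params'],
                         grid.cv_results_['mean_test_score']):
    print(params['textextractor__weight'], score)
```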