Is there a way to transform the features X based on the true labels in y?
I have checked other questions covering this topic, such as this and this, as well as some great blog posts (blog1, blog2 and blog3; kudos to their respective authors), but without success.
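What I want to do is transform the rows of X whose values are below a specific threshold, but only those rows that correspond to certain classes in the target y (y != 9). The threshold is calculated from the other class (y == 9). However, I am having trouble understanding how to implement this properly.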
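To make the intended transformation concrete, here is a minimal sketch outside of any pipeline. The toy data, the 0.5 quantile, and the variable names are assumptions chosen for illustration; they are not taken from the real data set.

```python
import pandas as pd

# Toy data (hypothetical): four rows of class 9, two rows of other classes
X = pd.DataFrame({'feat1': [10, 20, 30, 40, 5, 35],
                  'feat2': [1, 2, 3, 4, 10, 2]})
y = pd.Series([9, 9, 9, 9, 1, 2])

# Per-column thresholds computed from the class-9 rows only
thresholds = X.loc[y == 9].quantile(q=0.5)  # feat1 -> 25.0, feat2 -> 2.5

# Zero out values below the threshold, but only in rows where y != 9
result = X.copy()
for col, threshold in thresholds.items():
    to_zero = (y != 9) & (X[col] < threshold)
    result.loc[to_zero, col] = 0

print(result['feat1'].tolist())  # [10, 20, 30, 40, 0, 35]
print(result['feat2'].tolist())  # [1, 2, 3, 4, 10, 0]
```

The class-9 rows are never modified; they only supply the thresholds.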
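Because I want to do parameter tuning and cross-validation on this, I will have to do the transformation inside a pipeline. My custom transformer class looks as follows. Note that I have not included TransformerMixin, because I believe I need to take y into account in the fit_transform() function.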
class CustomTransformer(BaseEstimator):

    def __init__(self, percentile=.90):
        self.percentile = percentile

    def fit(self, X, y):
        # Calculate thresholds for each column
        thresholds = X.loc[y == 9, :].quantile(q=self.percentile, interpolation='linear').to_dict()
        # Store them for later use
        self.thresholds = thresholds
        return self

    def transform(self, X, y):
        # Create a copy of X
        X_ = X.copy(deep=True)
        # Replace values lower than the threshold for each column
        for p in self.thresholds:
            X_.loc[y != 9, p] = X_.loc[y != 9, p].apply(lambda x: 0 if x < self.thresholds[p] else x)
        return X_

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)
This is then fed into a pipeline and a subsequent GridSearchCV. I provide a working example below.
imports...
# Create some example data to work with
random.seed(12)
target = [randint(1, 8) for _ in range(60)] + [9]*40
shuffle(target)
example = pd.DataFrame({'feat1': sample(range(50, 200), 100),
                        'feat2': sample(range(10, 160), 100),
                        'target': target})
example_x = example[['feat1', 'feat2']]
example_y = example['target']
# Create a final nested pipeline where the data pre-processing steps and the final estimator are included
pipeline = Pipeline(steps=[('CustomTransformer', CustomTransformer(percentile=.90)),
                           ('estimator', RandomForestClassifier())])
# Parameter tuning with GridSearchCV
p_grid = {'estimator__n_estimators': [50, 100, 200]}
gs = GridSearchCV(pipeline, p_grid, cv=10, n_jobs=-1, verbose=3)
gs.fit(example_x, example_y)
The above code gives the following error.
/opt/anaconda3/envs/Python37/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
TypeError: transform() missing 1 required positional argument: 'y'
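For reference, the error can be reproduced in isolation. The sketch below assumes a hypothetical pass-through transformer named NeedsY and a tiny NumPy data set; it illustrates that Pipeline passes y to fit_transform() of an intermediate step during fitting, but calls transform() with X only when predicting or scoring.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class NeedsY(BaseEstimator):
    """Hypothetical pass-through step whose transform() requires y."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y):
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline(steps=[('needs_y', NeedsY()), ('clf', LogisticRegression())])
pipe.fit(X, y)  # succeeds: the pipeline calls fit_transform(X, y) here

try:
    pipe.predict(X)  # fails: the pipeline calls transform(X) without y
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

GridSearchCV scores every fitted candidate on its held-out fold, which goes through the same prediction path, so the exception surfaces from the worker processes.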
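I have also tried other approaches, such as storing the corresponding class indices during fit() and then using those indices during transform(). However, because the train and test indices differ during cross-validation, this raises an index error when replacing the values in transform().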
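The exact code of that attempt is not shown, so the following is only a hypothetical illustration of how an index-based approach can break. It assumes the row labels of the non-9 training rows are stored and later looked up on a test fold that contains none of them.

```python
import pandas as pd

X = pd.DataFrame({'feat1': [10, 20, 30, 40, 5, 35]})
y = pd.Series([9, 9, 1, 9, 2, 9])

# Hypothetical folds: rows 0-2 for training, rows 3-5 for testing
X_train, y_train = X.iloc[:3], y.iloc[:3]
X_test = X.iloc[3:]

# "fit": remember which training rows do not belong to class 9 (row label 2)
stored_index = X_train.index[(y_train != 9).to_numpy()]

# "transform" on the test fold: the stored label is not part of its index
try:
    X_test.loc[stored_index, 'feat1']
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)  # True
```

Labels learned on one fold describe that fold's rows only, so they cannot identify which rows of another fold belong to a given class.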
So, is there a neat way to solve this?
This is what I was referring to in the comments:
class CustomTransformer(BaseEstimator):

    def __init__(self, percentile=.90):
        self.percentile = percentile

    def fit(self, X, y):
        # Calculate thresholds for each column
        # We have appended y as last column in X, so remove that
        X_ = X.iloc[:,:-1].copy(deep=True)
        thresholds = X_.loc[y == 9, :].quantile(q=self.percentile, interpolation='linear').to_dict()
        # Store them for later use
        self.thresholds = thresholds
        return self

    def transform(self, X):
        # Create a copy of actual X, except the targets which are appended
        # We have appended y as last column in X, so remove that
        X_ = X.iloc[:,:-1].copy(deep=True)
        # Use that here to get y
        y = X.iloc[:, -1].copy(deep=True)
        # Replace values lower than the threshold for each column
        for p in self.thresholds:
            X_.loc[y != 9, p] = X_.loc[y != 9, p].apply(lambda x: 0 if x < self.thresholds[p] else x)
        return X_

    def fit_transform(self, X, y):
        return self.fit(X, y).transform(X)
Then change your X and y:
# We are appending the target into X
example_x = example[['feat1', 'feat2', 'target']]
example_y = example['target']
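Putting the pieces together, an end-to-end run could look like the sketch below. It restates the transformer in condensed, vectorized form, and it assumes a smaller grid, cv=3, and fixed random seeds purely to keep the example quick and repeatable; none of those settings come from the original post.

```python
import random

import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class CustomTransformer(BaseEstimator):
    """Condensed restatement: the target is expected as the last column of X."""

    def __init__(self, percentile=.90):
        self.percentile = percentile

    def fit(self, X, y):
        features = X.iloc[:, :-1]
        # Per-column thresholds computed from the class-9 rows only
        self.thresholds = features.loc[y == 9, :].quantile(q=self.percentile).to_dict()
        return self

    def transform(self, X):
        X_ = X.iloc[:, :-1].copy(deep=True)  # features without the target
        y = X.iloc[:, -1]                    # target carried inside X
        for col, threshold in self.thresholds.items():
            below = (y != 9) & (X_[col] < threshold)
            X_.loc[below, col] = 0
        return X_

    def fit_transform(self, X, y):
        return self.fit(X, y).transform(X)


# Same shape of example data as in the question
random.seed(12)
target = [random.randint(1, 8) for _ in range(60)] + [9] * 40
random.shuffle(target)
example = pd.DataFrame({'feat1': random.sample(range(50, 200), 100),
                        'feat2': random.sample(range(10, 160), 100),
                        'target': target})
example_x = example[['feat1', 'feat2', 'target']]  # target appended to X
example_y = example['target']

pipeline = Pipeline(steps=[('CustomTransformer', CustomTransformer(percentile=.90)),
                           ('estimator', RandomForestClassifier(random_state=0))])

# Assumed reduced settings for a fast run
p_grid = {'estimator__n_estimators': [10, 20]}
gs = GridSearchCV(pipeline, p_grid, cv=3)
gs.fit(example_x, example_y)

transformed = gs.best_estimator_.named_steps['CustomTransformer'].transform(example_x)
print(gs.best_params_)
print(list(transformed.columns))  # ['feat1', 'feat2']
```

Because transform() reads the labels from the last column of X, that column also has to be present whenever the pipeline predicts or scores. Cross-validation can supply it, but truly unlabeled data cannot, so this is a limitation to keep in mind.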