scikit-learn: impute the mean of a feature within groups of a nominal value in another feature
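I want to impute the mean of a feature, but compute that mean only from the other examples that share the same category/nominal value in another column. Can this be done with scikit-learn's Imputer class? That would make it easier to add to a pipeline.
For example:
Using the Titanic dataset from Kaggle: source
How would I go about imputing the mean fare per pclass? The thinking behind it is that ticket costs differ a lot between people in different classes.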
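As a point of comparison outside a scikit-learn pipeline, the per-class mean can be sketched with pandas. This assumes the data sits in a DataFrame with columns named `pclass` and `fare`, as in the question. pandas is not mentioned in the question, and the rows below are made up for illustration, not real Titanic records.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; these are not real Titanic records.
df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "fare": [80.0, np.nan, 100.0, 8.0, 10.0, np.nan],
})

# Mean fare of each pclass, aligned row by row with the original frame.
# Missing fares are skipped when each group mean is computed.
class_mean = df.groupby("pclass")["fare"].transform("mean")

# Fill only the missing fares, each with the mean of its own class.
df["fare"] = df["fare"].fillna(class_mean)
print(df)
```

This fills the gaps in one pass and keeps the row order, but it is not a transformer, so it would have to run before the pipeline rather than inside it.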
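Update: After some discussion with a few people, the phrase I should have used is "imputing the mean within class".
I've looked at Vivek's comment below and will build a generic pipeline function when I get the time to do what I want :) I have a good idea of how to do it and will post it as an answer once it's finished.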
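So below is a pretty simple approach to my problem; it only handles means. A more robust implementation would probably build on the Imputer class from scikit-learn, which would mean it could also do mode, median, etc., and it would handle sparse/dense matrices better.
This is based on Vivek Kumar's comment on the original question, which suggested splitting the data into stacks, imputing each stack that way, and then reassembling them.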
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class WithinClassMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_col_index, class_col_index=None, missing_values=np.nan):
        self.missing_values = missing_values
        self.replace_col_index = replace_col_index
        self.y = None
        self.class_col_index = class_col_index

    def fit(self, X, y=None):
        self.y = y
        return self

    def _is_missing(self, column):
        # NaN never equals itself, so `column == np.nan` is always False
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            return np.isnan(column)
        return column == self.missing_values

    def transform(self, X):
        # Assumes X is entirely numeric
        X = np.asarray(X, dtype=float)
        if self.class_col_index is None:
            # If we're using the dependent variable
            if self.y is None or len(self.y) != len(X):
                return X
            labels = np.asarray(self.y)
        else:
            # If we're using nominal values within a binarised feature (i.e. the classes
            # are unique values within a nominal column - e.g. sex)
            labels = X[:, self.class_col_index]
        missing = self._is_missing(X[:, self.replace_col_index])
        stacks = []
        for aclass in np.unique(labels):
            with_missing = X[(labels == aclass) & missing]
            without_missing = X[(labels == aclass) & ~missing]
            if len(without_missing) > 0:
                # Calculate mean from examples without missing values
                mean = np.mean(without_missing[:, self.replace_col_index])
                # Broadcast mean to all missing values
                with_missing[:, self.replace_col_index] = mean
            stacks.append(np.concatenate((with_missing, without_missing)))
        if len(stacks) > 0:
            # Reassemble our stacks of values; rows come back grouped by class,
            # not in their original order
            X = np.concatenate(stacks)
        return X
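The answer notes that a more robust version could also cover the median and similar statistics. One possible direction, sketched here under stated assumptions and not taken from the original answer, is to learn one statistic per class in `fit` and apply it in `transform`. The class name `GroupStatImputer`, its `strategy` parameter, and the `statistics_` attribute are hypothetical names chosen for this sketch. It assumes a fully numeric array, NaN as the missing marker, and the class labels stored in a column of `X`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class GroupStatImputer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: learn one statistic per class during fit()."""

    def __init__(self, replace_col_index, class_col_index, strategy="mean"):
        self.replace_col_index = replace_col_index
        self.class_col_index = class_col_index
        self.strategy = strategy  # "mean" or "median" in this sketch

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        stat = {"mean": np.mean, "median": np.median}[self.strategy]
        values = X[:, self.replace_col_index]
        labels = X[:, self.class_col_index]
        # One statistic per class, computed from the observed (non-NaN) values.
        self.statistics_ = {}
        for label in np.unique(labels):
            observed = values[(labels == label) & ~np.isnan(values)]
            if observed.size:
                self.statistics_[label] = stat(observed)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy, so the caller's array is untouched
        values = X[:, self.replace_col_index]  # a view into the copy
        labels = X[:, self.class_col_index]
        missing = np.isnan(values)
        # Write each learned statistic into the missing cells of its class.
        for label, value in self.statistics_.items():
            values[missing & (labels == label)] = value
        return X


# Made-up data: column 0 is the class, column 1 is the value to impute.
X = np.array([
    [1, 80.0],
    [1, np.nan],
    [1, 100.0],
    [3, 8.0],
    [3, np.nan],
    [3, 10.0],
])
filled = GroupStatImputer(replace_col_index=1, class_col_index=0).fit_transform(X)
print(filled)
```

Because the statistics are stored at fit time, the same values can be applied to unseen data, and the rows keep their original order. Classes that appear only at transform time are left with their missing values in this sketch.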