Creating new Pandas Dataframe column within pipeline process

Question

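I'm sorry, but I'm not sure how to explain this without giving my example, so:

For the Titanic dataset on Kaggle, some people add a new column named 'isChild' that is derived from the Age column: if the age is under 13 the passenger is a child, otherwise he/she is an adult. From there they go on to preprocess, build, and tune their models just fine.

If I were to build the same model and deploy it somewhere anyone can fill in a form on a front end using the raw inputs of the Dataframe, the model would not work, because 'isChild' is calculated in the preprocessing part.

I know people use Pipeline and make_pipeline to create the process, but my problem is that people always add generic steps to the Pipeline, such as PCA or imputing missing values. How do I add a step that adds this new column and then run it through the whole model?

If you could guide me or link me to helpful information, or answer this question, I would appreciate it.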

Answer 1

So the answer to my question is that I first had to create a class, like this:

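class DataframeFunctionTransformer():
    # Wraps a function that takes a Dataframe and returns a Dataframe.
    def __init__(self, func):
        self.func = func

    def transform(self, input_df, **transform_params):
        # Apply the wrapped function to the incoming Dataframe.
        return self.func(input_df)

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn from the data, so just return self.
        return self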
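As a side note, the same idea can be sketched on top of scikit-learn's own base classes, which supply `fit_transform` and `get_params` for free. This is only an illustration under stated assumptions: the class name `DataFrameFunctionStep`, the toy function `add_double`, and the `value` column are made up for the example and are not part of the original answer.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameFunctionStep(BaseEstimator, TransformerMixin):
    """Hypothetical variant of the class above, built on sklearn's base classes."""

    def __init__(self, func):
        self.func = func

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        # Hand the function a copy so the caller's Dataframe is left untouched.
        return self.func(X.copy())


def add_double(df):
    # Toy function, for illustration only.
    df["double"] = df["value"] * 2
    return df


step = DataFrameFunctionStep(add_double)
raw = pd.DataFrame({"value": [1, 2, 3]})
out = step.fit_transform(raw)  # fit_transform comes from TransformerMixin
```

Inheriting from `BaseEstimator` also lets tools such as `GridSearchCV` inspect and clone the step, which a plain class does not support.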

Then, once this class was created, I could write my own function that adds a new column (the isChild column) to the Titanic Dataframe:

def ischild(dataset):
    # 'Yes' if Age is under 13, otherwise 'No'.
    dataset['Child'] = dataset['Age'].apply(lambda x: 'Yes' if x<13 else 'No')

    return dataset
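One thing to be aware of: `ischild` writes the new column into the Dataframe it receives, so the caller's data changes too. Below is a minimal sketch of a variant that returns a new Dataframe instead. The name `ischild_copy` and the sample ages are assumptions made for illustration. Note that a missing Age compares as false against 13, so it ends up labelled 'No', which matches what the `apply` version above does.

```python
import numpy as np
import pandas as pd


def ischild_copy(dataset):
    # Hypothetical variant of ischild: assign() returns a new Dataframe
    # instead of writing into the one that was passed in.
    is_child = dataset["Age"] < 13  # a missing Age compares as False
    return dataset.assign(Child=np.where(is_child, "Yes", "No"))


# Made-up ages, including a missing value and the boundary case 13.
raw = pd.DataFrame({"Age": [4.0, 35.0, np.nan, 12.0, 13.0]})
result = ischild_copy(raw)
```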

Now, when creating a pipeline with sklearn, I can use my new function like this:

from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ("ChildColumn", DataframeFunctionTransformer(ischild))
])
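To connect this back to the deployment concern in the question, here is a rough end-to-end sketch in which the column is created inside the pipeline, so prediction works on raw input that has only Age. It swaps in scikit-learn's built-in `FunctionTransformer` in place of the custom class, and adds a `OneHotEncoder` and a `LogisticRegression` purely as stand-ins for whatever preprocessing and model are actually used. The training data is made up for illustration and is not the real Titanic dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


def add_child_column(df):
    # Same rule as ischild, applied to a copy of the input.
    df = df.copy()
    df["Child"] = df["Age"].apply(lambda x: "Yes" if x < 13 else "No")
    return df


# Toy data made up for illustration; not the real Titanic dataset.
X_train = pd.DataFrame({"Age": [4, 8, 11, 30, 45, 60, 25, 2]})
y_train = [1, 1, 1, 0, 0, 0, 0, 1]

# Encode the derived text column; pass Age through unchanged.
encode = ColumnTransformer(
    [("child", OneHotEncoder(handle_unknown="ignore"), ["Child"])],
    remainder="passthrough",
)

model = Pipeline([
    ("ChildColumn", FunctionTransformer(add_child_column)),
    ("encode", encode),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)

# Raw input, as a front-end form might supply it: only Age, no Child column.
new_passengers = pd.DataFrame({"Age": [5, 40]})
predictions = model.predict(new_passengers)
```

Because the column is derived in the first step, the fitted `model` object is the only thing that needs to be saved and deployed.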

Thanks.

python

preprocessor

pipeline

python-3.x

pandas