Avoiding Unnecessary Class Declarations

Question

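I am working on an ML project and decided to use classes to organize my code. However, I am not sure my approach is optimal. I would appreciate it if you could share best practices and how you would approach a similar challenge:

Let's focus on the preprocessing module, where I created the Preprocessor class.

This class has 3 methods for data manipulation, each taking a dataframe as input and adding a feature. The output of each method can be the input of another.

I also have a 4th wrapper method that takes these 3 methods, chains them, and creates the final output: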

def wrapper(self):
    output = self.method_1(self.df)
    output = self.method_2(output)
    output = self.method_3(output)
    return output

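When I want to use the class, I create an instance with df and then call the wrapper function on it. This feels unnatural and makes me think there is a better way of doing this.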

from A_class import A_class  # assumes the class lives in a module of the same name
instance = A_class(df)
output = instance.wrapper()

Answer 1

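Classes are very useful if you need to keep track of or modify the internal state of an object. But they are not magical things that make your code organised just by existing. If all you have is a preprocessing pipeline that takes some data and runs it through methods in a straight line, regular functions will often be less cumbersome.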
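To make the distinction concrete, here is a minimal sketch of a case where state does justify a class, next to a stateless step that does not need one. The names MeanCenterer and add_one are hypothetical and not from the question, and plain lists stand in for dataframe columns so the example has no dependencies.

```python
from statistics import mean


class MeanCenterer:
    """Hypothetical example: remembers a mean learned from training data."""

    def __init__(self):
        self.mean_ = None  # internal state, filled in by fit()

    def fit(self, values):
        # Learn and store the mean so it can be reused on other data later.
        self.mean_ = mean(values)
        return self

    def transform(self, values):
        if self.mean_ is None:
            raise ValueError("call fit() before transform()")
        return [v - self.mean_ for v in values]


def add_one(values):
    """Stateless step: nothing to remember, so a plain function is enough."""
    return [v + 1 for v in values]


centerer = MeanCenterer().fit([1, 2, 3])
centered_new = centerer.transform([4, 5])  # reuses the mean learned from [1, 2, 3]
shifted = add_one([1, 2, 3])
```

The class earns its keep only because transform depends on something fit stored earlier. The three methods in the question pass data straight through, which is the situation add_one represents.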

With the context you gave, I would probably do something like this:

pipelines.py

def preprocess_data_xyz(data):
    """
    Takes a dataframe of nature XYZ and returns it after 
    running it through the necessary preprocessing steps.
    """
    step_1 = func_1(data)
    step_2 = func_2(step_1)
    step_3 = func_3(step_2)
    return step_3

def func_1(data):
    """Does X to data."""
    pass

# etc ...

analysis.py

import pandas as pd
from pipelines import preprocess_data_xyz

data_xyz = pd.DataFrame( ... )
preprocessed_data_xyz = preprocess_data_xyz(data=data_xyz)

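Choosing better variable and function names is also a major component of organising your code - you should replace func_1 with a name that describes what it does to the data (something like add_numerical_column, parse_datetime_column, etc.). The same goes for the data_xyz variable.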
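As a rough illustration of the naming advice, the same pipeline shape might look like this with descriptive names. Everything here is hypothetical: the order data, the column names, and the functions parse_order_date and add_order_weekday are invented for the example, and a list of dicts stands in for a dataframe so the sketch runs without pandas.

```python
from datetime import datetime


def parse_order_date(rows):
    """Return new rows with 'order_date' parsed from ISO strings to date objects."""
    return [
        {**row, "order_date": datetime.strptime(row["order_date"], "%Y-%m-%d").date()}
        for row in rows
    ]


def add_order_weekday(rows):
    """Return new rows with an 'order_weekday' feature (Monday == 0)."""
    return [{**row, "order_weekday": row["order_date"].weekday()} for row in rows]


def preprocess_orders(rows):
    """Run raw order rows through the preprocessing steps in order."""
    parsed = parse_order_date(rows)
    return add_order_weekday(parsed)


raw_orders = [{"order_id": 1, "order_date": "2024-01-01"}]
preprocessed_orders = preprocess_orders(raw_orders)
```

Each step returns new rows rather than modifying its input, so raw_orders stays unchanged and every function name says what the step does to the data.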

python

oop

class