Using a LabelEncoder in sklearn's Pipeline gives: fit_transform takes 2 positional arguments but 3 were given
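I've been trying to run some ML code, but I keep getting stuck at the fitting stage after building my pipeline. I've looked around on various forums without much luck. I found some people saying that you can't use LabelEncoder within a pipeline, but I'm not sure how true that is. If anyone has any insight on the matter, I'd be happy to hear it.
I keep getting this error:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
So I'm not sure whether the problem is with me or with Python. Here is my code: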
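# Imports inferred from the names used below; TargetEncoder and
# CatBoostEncoder are assumed to come from the category_encoders package.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from category_encoders import TargetEncoder, CatBoostEncoder

data = pd.read_csv("ks-projects-201801.csv",
                   index_col="ID",
                   parse_dates=["deadline","launched"],
                   infer_datetime_format=True)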
var = list(data)
data = data.drop(labels=[1014746686,1245461087, 1384087152, 1480763647, 330942060, 462917959, 69489148])
missing = [i for i in var if data[i].isnull().any()]
data = data.dropna(subset=missing,axis=0)
le = LabelEncoder()
oe = OrdinalEncoder()
oh = OneHotEncoder()
y = [i for i in var if i=="state"]
y = data[var.pop(8)]
p,p.index = pd.Series(le.fit_transform(y)),y.index
q = pd.read_csv("y.csv",index_col="ID")["0"]
label_y = le.fit_transform(y)
x = data[var]
obj_feat = x.select_dtypes(include="object")
dat_feat = x.select_dtypes(include="datetime64[ns]")
dat_feat = dat_feat.assign(dmonth=dat_feat.deadline.dt.month.astype("int64"),
                           dyear=dat_feat.deadline.dt.year.astype("int64"),
                           lmonth=dat_feat.launched.dt.month.astype("int64"),
                           lyear=dat_feat.launched.dt.year.astype("int64"))
dat_feat = dat_feat.drop(labels=["deadline","launched"],axis=1)
num_feat = x.select_dtypes(include=["int64","float64"])
u = dict(zip(list(obj_feat),[len(obj_feat[i].unique()) for i in obj_feat]))
le_obj = [i for i in u if u[i]<10]
oh_obj = [i for i in u if u[i]<20 and u[i]>10]
te_obj = [i for i in u if u[i]>20 and u[i]<25]
cb_obj = [i for i in u if u[i]>100]
# Pipeline time
#Impute and encode
strat = ["constant","most_frequent","mean","median"]
sc = StandardScaler()
oh_unk = "ignore"
encoders = [LabelEncoder(),
            OneHotEncoder(handle_unknown=oh_unk),
            TargetEncoder(),
            CatBoostEncoder()]
#num_trans = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[2])),
num_trans = Pipeline(steps=[("sc",sc)])
#obj_imp = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[1]))])
oh_enc = Pipeline(steps=[("oh_enc",encoders[1])])
te_enc = Pipeline(steps=[("te_enc",encoders[2])])
cb_enc = Pipeline(steps=[("cb_enc",encoders[0])])
trans = ColumnTransformer(transformers=[
    ("num",num_trans,list(num_feat)+list(dat_feat)),
    #("obj",obj_imp,list(obj_feat)),
    ("onehot",oh_enc,oh_obj),
    ("target",te_enc,te_obj),
    ("catboost",cb_enc,cb_obj)
])
models = [RandomForestClassifier(random_state=0),
          KNeighborsClassifier(),
          DecisionTreeClassifier(random_state=0)]
model = models[2]
print("Check 4")
# Chaining it all together
run = Pipeline(steps=[("Transformation",trans),("Model",model)])
x = pd.concat([obj_feat,dat_feat,num_feat],axis=1)
print("Check 5")
run.fit(x,p)
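It runs fine up to run.fit, which throws the error. I'd love to hear any advice anyone might have, and any possible ways to resolve this problem would be greatly appreciated! Thank you.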
The problem is the same as the one identified in this answer, but with a LabelEncoder in your case. LabelEncoder's fit_transform method takes only y in addition to self:
def fit_transform(self, y):
"""Fit label encoder and return encoded labels
...
Pipeline, by contrast, expects all of its transformers to accept three positional arguments: fit_transform(self, X, y).
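The mismatch can be reproduced in isolation. The sketch below calls LabelEncoder.fit_transform the way a Pipeline calls its transformers, passing both X and y. The arrays are made-up toy values, not data from the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for a feature column and a target (assumed values).
X = np.array(["a", "b", "a"])
y = np.array([0, 1, 0])

message = None
try:
    # Pipeline passes X and y to every transformer; LabelEncoder accepts only y.
    LabelEncoder().fit_transform(X, y)
except TypeError as exc:
    message = str(exc)

print(message)
```

Counting self, that call hands fit_transform three positional arguments where it accepts two, which is what the traceback in the question reports.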
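You could write a custom transformer as in the answer mentioned above; however, LabelEncoder should not be used as a feature transformer. An extensive explanation of why can be found in LabelEncoder for categorical features?. In your code it enters the pipeline through cb_enc, which is built from encoders[0]. So I'd recommend not using a LabelEncoder and, if the number of features gets too high, using one of the other Bayesian encoders instead, such as the TargetEncoder that you already have in your list of encoders.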
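As a minimal sketch of that advice, the example below keeps LabelEncoder for the target only and encodes a categorical feature with a transformer that accepts (X, y). It uses scikit-learn's OrdinalEncoder rather than the TargetEncoder suggested above, purely so the example needs nothing beyond scikit-learn and pandas. The toy DataFrame and its column names are hypothetical stand-ins for the Kickstarter data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the real question reads ks-projects-201801.csv.
toy = pd.DataFrame({
    "currency": ["USD", "GBP", "USD", "EUR", "GBP", "USD"],
    "goal": [1000.0, 500.0, 2500.0, 800.0, 1200.0, 300.0],
    "state": ["failed", "successful", "failed",
              "successful", "failed", "successful"],
})
X = toy[["currency", "goal"]]

# LabelEncoder is applied to the target only, outside the pipeline.
y = LabelEncoder().fit_transform(toy["state"])

# Feature encoders go inside the ColumnTransformer; all accept (X, y).
trans = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["goal"]),
    ("ordinal",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["currency"]),
])

run = Pipeline(steps=[("Transformation", trans),
                      ("Model", DecisionTreeClassifier(random_state=0))])
run.fit(X, y)
predictions = run.predict(X)
```

In the question's code, the equivalent change would be to build cb_enc from encoders[3] (the CatBoostEncoder) or another feature encoder instead of encoders[0].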