Bad MSE while using Pipes

I'm trying to predict some prices from a dataset I scraped. I've never used Python for this (I usually work in the tidyverse), but this time I wanted to explore pipelines. Here is the code snippet:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/norhther/idealista/main/idealistaBCN.csv")
df.drop("info", axis = 1, inplace = True)
df["floor"].fillna(1, inplace=True)
df.drop("neigh", axis = 1, inplace = True)
df.dropna(inplace = True)
df = df[df["habs"] < 11]
X = df.drop("price", axis = 1)
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
ct = ColumnTransformer(
    [("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
     ("onehot", OneHotEncoder(), ["type"])],
    remainder="passthrough")

pipe = Pipeline(steps = [("Transformer", ct),
                          ("svr", SVR())])

param_grid = {
  "svr__kernel" : ['linear', 'poly', 'rbf', 'sigmoid'],
  "svr__degree" : range(3,6),
  "svr__gamma" : ['scale', 'auto'],
  "svr__coef0" : np.linspace(0.01, 1, 2)
}

search = GridSearchCV(pipe, param_grid,  scoring = ['neg_mean_squared_error'], refit='neg_mean_squared_error')

search.fit(X_train, y_train)
print(search.best_score_)

pipe = Pipeline(steps = [("Transformer", ct),
                         ("svr", SVR(coef0 = search.best_params_["svr__coef0"],
                                     degree = search.best_params_["svr__degree"],
                                     kernel = search.best_params_["svr__kernel"]))])

from sklearn.metrics import mean_squared_error

pipe.fit(X_train, y_train)
preds = pipe.predict(X_train)
mean_squared_error(preds, y_train)

Here search.best_score_ is -443829697806.1671 and the MSE is 608953977916.3896. I think I messed something up, maybe in the transformer, but I'm not entirely sure. That looks like an absurdly large MSE to me. I did a very similar approach with tidymodels and got much better results, so I'd like to know whether the problem is in the transformer or whether the model is just this bad.

The reason is that you did not include C in the parameters; you need to cover a whole range of C values for the fit to adapt. If we fit with the default C=1, we can see where the problem lies:

import matplotlib.pyplot as plt

# fit the transformer on the training data and train an SVR with the default C=1
o = pipe.named_steps["Transformer"].fit_transform(X_train)
mdl = SVR(C=1)
mdl.fit(o, y_train)
plt.scatter(mdl.predict(o), y_train)  # predictions vs. actual prices
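
The skew in the target also shows up directly in its quantiles (a quick check, reusing y_train from the split above):

# a few quantiles of the target; the top of the distribution sits far above the median
print(y_train.quantile([0.25, 0.5, 0.75, 0.99, 1.0]))
print(y_train.mean(), y_train.median())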

Some prices are around 10 times the mean (1e7 versus a median of ~5e5). If you use MSE or R^2, those metrics will be dominated by these extreme values. So we need the fit to follow the data more closely, which is controlled by C (you can read more about it here). Let's try a range:

ct = ColumnTransformer(
    [("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
     ("onehot", OneHotEncoder(), ["type"])],
    remainder="passthrough")

pipe = Pipeline(steps = [("Transformer", ct),
                          ("svr", SVR())])

# restricted to the 'rbf' kernel here ('linear', 'poly', 'sigmoid' dropped to keep the grid small)
param_grid = {
  "svr__kernel" : ['rbf'],
  "svr__gamma" : ['auto'],
  "svr__coef0" : [1,2],
   "svr__C" : [1e-03,1e-01,1e1,1e3,1e5,1e7]
}

search = GridSearchCV(pipe, param_grid, scoring = ['neg_mean_squared_error'],
                      refit = 'neg_mean_squared_error')

search.fit(X_train, y_train)
print(search.best_score_)
-132061065775.25969
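
It is also worth printing which parameter combination the search picked; the manual SVR further down reuses those values (C=1e7, coef0=1, gamma='auto'). A small sketch, reusing the fitted search above:

# winning parameter combination, and mean CV score per C to see how strongly the score depends on C
print(search.best_params_)
print(pd.DataFrame(search.cv_results_)[["param_svr__C", "mean_test_neg_mean_squared_error"]])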

Your y values are large, and the MSE will be on the scale of the variance of your y values, so if we check:

y_train.var()
545423126823.4545

132061065775.25969 / y_train.var()
0.24212590057261346
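
Since MSE divided by the variance of y is essentially 1 - R^2, the same ratio can be cross-checked with a cross-validated R^2 for the chosen parameters (a small sketch, reusing pipe and search from above; roughly 1 - 0.24, give or take CV noise):

from sklearn.model_selection import cross_val_score

# plug the best parameters into the pipeline and score it with R^2
pipe.set_params(**search.best_params_)
r2_scores = cross_val_score(pipe, X_train, y_train, scoring="r2")
print(r2_scores.mean())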

Nice, you got the MSE down to roughly 25% of the variance. We can check this on the test data; I think we got lucky in this case and the C value turns out to be very good:

from sklearn.metrics import mean_squared_error

o = pipe.named_steps["Transformer"].fit_transform(X_train)
mdl = SVR(C=10000000.0, coef0=1, gamma='auto')
mdl.fit(o,y_train)

o_test = pipe.named_steps["Transformer"].transform(X_test)  # transform only; the transformer was already fitted on X_train

pred = mdl.predict(o_test)
print( mean_squared_error(pred,y_test) , mean_squared_error(pred,y_test)/y_test.var())
plt.scatter(mdl.predict(o_test),y_test)
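
As a side note, because refit='neg_mean_squared_error' was set, the grid search already refit the whole pipeline on X_train, so the same test-set evaluation can be done without transforming the data by hand (a minimal sketch):

best_pipe = search.best_estimator_   # pipeline refit on the full training set
pred = best_pipe.predict(X_test)     # the ColumnTransformer is applied automatically
print(mean_squared_error(y_test, pred), mean_squared_error(y_test, pred) / y_test.var())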