Statsmodels out-of-sample prediction of new data where features have been transformed

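I'm curious why I can't reproduce the values that the model predicts.

Consider the following model. I'm trying to understand the relationship between insurance charges, age, and whether the customer smokes.

Note that the age variable has been preprocessed (mean-centered).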

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
# charges regressed on mean-centered age, smoker status, and their interaction
model1 = smf.ols('charges~I(age - np.mean(age)) * smoker', data=insurance)
fit1 = model1.fit()
params = fit1.params
# get params
b0, b2, b1, b3 = params['Intercept'], params['smoker[T.yes]'], params['I(age - np.mean(age))'], params['I(age - np.mean(age)):smoker[T.yes]']
x1 = (insurance['age'] - np.mean(insurance['age']))
# two lines with diff intercept and slopes
y_hat_non = b0 + b1 * x1
y_hat_smok = (b0 + b2) + (b1 + b3) * x1

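Now, when I generate new data and apply the predict method, I get different values from the ones I compute by hand. Taking index 0 and index 2 as examples, I would expect my manual values to be close to the output below, but they are far off.

Am I missing something about the feature transformation that was done when the model was fitted?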

new_data = pd.DataFrame({'age': {0: 19, 1: 41, 2: 43}, 
                        'smoker': {0: 'yes', 1: 'no', 2: 'no'}})

idx_0 = (b0+b2) + (b1+b3) * 19
# 38061.1
idx_2 = b0 + b1 * 43
# 19878.4

fit1.predict(new_data)
0    27581.276650
1    10168.273779
2    10702.771604


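I think you want to center the age variable. I(age - np.mean(age)) works when fitting, but when you call predict the expression is evaluated again on the prediction data frame, so age is centered by the mean of the new data rather than the mean of the training data.

Also, when you multiply by the coefficients, you have to multiply by the centered value (i.e. age - mean(age)), not by the raw value.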
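To make the mismatch concrete, here is a minimal sketch that centers the three new ages both ways. The training mean of 39.207025 is not stated in the question; it is inferred from the age_c values printed further down in this answer (19 - (-20.207025)), so treat it as an assumption.

```python
import numpy as np

# Ages in the new data frame from the question.
new_age = np.array([19.0, 41.0, 43.0])

# Assumed training mean of age, inferred from the age_c output in this
# answer (19 - (-20.207025)); it is not given directly in the post.
train_mean = 39.207025

# What the formula does at predict time: np.mean is recomputed on the new data.
centered_by_new_mean = new_age - np.mean(new_age)

# What the fitted coefficients expect: centering by the training mean.
centered_by_train_mean = new_age - train_mean

print(centered_by_new_mean)
print(centered_by_train_mean)
```

Neither set of values equals the raw ages 19 and 43 used in the manual calculation, which is why the manual numbers, the predict output, and the intended predictions all disagree.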

It doesn't hurt to create a separate column that holds the centered age:

import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
from sklearn.preprocessing import StandardScaler

sc = StandardScaler(with_std=False)

insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
insurance['age_c'] = sc.fit_transform(insurance[['age']])

model1 = smf.ols('charges~age_c * smoker', data=insurance)
fit1 = model1.fit()
params = fit1.params
# get params
b0, b2, b1, b3 = params['Intercept'], params['smoker[T.yes]'], params['age_c'], params['age_c:smoker[T.yes]']

You can then predict by applying the same fitted scaler to the age column of the new data:

new_data = pd.DataFrame({'age': {0: 19, 1: 41, 2: 43}, 
                        'smoker': {0: 'yes', 1: 'no', 2: 'no'}})

new_data['age_c'] = sc.transform(new_data[['age']])

new_data

   age smoker      age_c
0   19    yes -20.207025
1   41     no   1.792975
2   43     no   3.792975

Check:

idx_0 = (b0+b2) + (b1+b3) * -20.207025
# 26093.64269247414
idx_2 = b0 + b1 * 3.792975
# 9400.282805032146

fit1.predict(new_data)
Out[13]: 
0    26093.642567
1     8865.784870
2     9400.282695
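
If you would rather keep the centering inside the formula, one option is the stateful center() transform from patsy (the formula engine statsmodels has traditionally used), which memorizes the training mean and reuses it at predict time. The sketch below is an illustration, not the original poster's setup: it uses synthetic data invented here so that it runs offline. The column names mirror the insurance data, but the coefficients and noise are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the insurance data (values are invented).
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "age": rng.integers(18, 65, size=n).astype(float),
    "smoker": rng.choice(["yes", "no"], size=n),
})
train["charges"] = (
    2000.0
    + 250.0 * train["age"]
    + 20000.0 * (train["smoker"] == "yes")
    + rng.normal(0.0, 500.0, size=n)
)

# center(age) is stateful: the mean is learned once, from the training data.
fit = smf.ols("charges ~ center(age) * smoker", data=train).fit()

new_data = pd.DataFrame({"age": [19.0, 41.0, 43.0],
                         "smoker": ["yes", "no", "no"]})
pred = fit.predict(new_data)

# Manual check: center the new ages by the TRAINING mean, then apply the
# coefficients for intercept, smoker, age, and the interaction.
p = fit.params
age_c = new_data["age"] - train["age"].mean()
is_smoker = (new_data["smoker"] == "yes").astype(float)
manual = (
    p["Intercept"]
    + p["smoker[T.yes]"] * is_smoker
    + p["center(age)"] * age_c
    + p["center(age):smoker[T.yes]"] * is_smoker * age_c
)

print(pd.DataFrame({"predict": pred, "manual": manual}))
```

With this formula the predict output and the manual calculation should agree, because both center the new ages by the mean learned from the training data.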