Stats Models out of sample prediction of new data where features have been transformed

Question

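I'm curious why I can't reproduce the values the model predicts.

Consider the following model. I'm trying to understand the relationship between insurance charges, age, and whether the customer smokes.

Note that the age variable has been preprocessed (mean centered).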

import numpy as np  # needed: the formula below calls np.mean
import pandas as pd
import statsmodels.formula.api as smf

insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
model1 = smf.ols('charges~I(age - np.mean(age)) * smoker', data=insurance)
fit1 = model1.fit()
params = fit1.params
# get params: b0 intercept, b1 age slope, b2 smoker offset, b3 age:smoker interaction
b0, b2, b1, b3 = params['Intercept'], params['smoker[T.yes]'], params['I(age - np.mean(age))'], params['I(age - np.mean(age)):smoker[T.yes]']
x1 = insurance['age'] - np.mean(insurance['age'])
# two lines with different intercepts and slopes
y_hat_non = b0 + b1 * x1
y_hat_smok = (b0 + b2) + (b1 + b3) * x1

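Now, when I generate new data and apply the predict method, I get different values than when I compute the predictions by hand. Taking index 0 and index 2 as examples, I would expect my manual values to be close to the output below, but they are far off.

Am I missing something about the feature transformation that is done when the model is fitted?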

new_data = pd.DataFrame({'age': {0: 19, 1: 41, 2: 43}, 
                        'smoker': {0: 'yes', 1: 'no', 2: 'no'}})

idx_0 = (b0+b2) + (b1+b3) * 19
# 38061.1
idx_2 = b0 + b1 * 43
# 19878.4

fit1.predict(new_data)
0    27581.276650
1    10168.273779
2    10702.771604

Answer 1

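I think you want to center the age variable. I(age - np.mean(age)) works when fitting, but when you predict, the mean is re-evaluated on your prediction data frame, so age is centered on the mean of the new data rather than on the mean of the training data.

Also, when you multiply by the coefficients, you have to multiply by the centered value (i.e. age - mean(age)), not the raw value.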
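The re-evaluation described above can be checked on a small example. The following is a minimal sketch that uses synthetic data as a stand-in for the insurance file (the data, the coefficients used to generate it, and the names `train`, `new`, `manual_new_mean` and `manual_train_mean` are all assumptions made for illustration), and it drops the smoker term to keep things short:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the insurance data (an assumption, not the real file).
rng = np.random.default_rng(0)
train = pd.DataFrame({"age": rng.integers(18, 65, size=200).astype(float)})
train["charges"] = 1000 + 250 * train["age"] + rng.normal(0, 500, size=200)

# Same centering expression as in the question.
fit = smf.ols("charges ~ I(age - np.mean(age))", data=train).fit()
b0 = fit.params["Intercept"]
b1 = fit.params["I(age - np.mean(age))"]

new = pd.DataFrame({"age": [19.0, 41.0, 43.0]})
pred = fit.predict(new)

# Centering on the mean of the NEW data reproduces what predict() does ...
manual_new_mean = b0 + b1 * (new["age"] - new["age"].mean())
# ... while centering on the TRAINING mean (what was intended) gives other values.
manual_train_mean = b0 + b1 * (new["age"] - train["age"].mean())

print(np.allclose(pred, manual_new_mean))
print(np.allclose(pred, manual_train_mean))
```

If the explanation is right, the first comparison should hold and the second should not, which is the same mismatch as in the question.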

It doesn't hurt to create a separate column with the centered age:

import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
from sklearn.preprocessing import StandardScaler

sc = StandardScaler(with_std=False)

insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
insurance['age_c'] = sc.fit_transform(insurance[['age']])

model1 = smf.ols('charges~age_c * smoker', data=insurance)
fit1 = model1.fit()
params = fit1.params
# get params
b0, b2, b1, b3 = params['Intercept'], params['smoker[T.yes]'], params['age_c'], params['age_c:smoker[T.yes]']

You can then predict by applying the previously fitted scaler to the age column of the new data:

new_data = pd.DataFrame({'age': {0: 19, 1: 41, 2: 43}, 
                        'smoker': {0: 'yes', 1: 'no', 2: 'no'}})

new_data['age_c'] = sc.transform(new_data[['age']])

new_data

   age smoker      age_c
0   19    yes -20.207025
1   41     no   1.792975
2   43     no   3.792975

Check:

idx_0 = (b0+b2) + (b1+b3) * -20.207025
# 26093.64269247414
idx_2 = b0 + b1 * 3.792975
# 9400.282805032146

fit1.predict(new_data)
Out[13]: 
0    26093.642567
1     8865.784870
2     9400.282695
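As a possible alternative to a separate scaler, patsy (the formula engine statsmodels has traditionally used) provides a stateful transform `center()` that remembers the training mean and reuses it at prediction time. The sketch below assumes a patsy-backed statsmodels install and again uses synthetic data in place of the insurance file; the data-generating numbers and the names `train`, `fit_c` and `manual` are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the insurance data (an assumption, not the real file).
rng = np.random.default_rng(1)
n = 300
train = pd.DataFrame({
    "age": rng.integers(18, 65, size=n).astype(float),
    "smoker": rng.choice(["yes", "no"], size=n),
})
is_smoker = (train["smoker"] == "yes").astype(float)
train["charges"] = (2000 + 250 * train["age"] + 20000 * is_smoker
                    + rng.normal(0, 500, size=n))

# center() is a stateful transform: the mean is memorized when the model is fitted.
fit_c = smf.ols("charges ~ center(age) * smoker", data=train).fit()
p = fit_c.params
b0, b2 = p["Intercept"], p["smoker[T.yes]"]
b1, b3 = p["center(age)"], p["center(age):smoker[T.yes]"]

new_data = pd.DataFrame({"age": [19.0, 41.0, 43.0],
                         "smoker": ["yes", "no", "no"]})
pred = fit_c.predict(new_data)  # raw ages go in; centering uses the training mean

# Manual check: center on the TRAINING mean, then apply the two-line model.
x = new_data["age"] - train["age"].mean()
smokes = (new_data["smoker"] == "yes").astype(float)
manual = (b0 + b2 * smokes) + (b1 + b3 * smokes) * x

print(np.allclose(pred, manual))
```

With this approach the new data frame only needs the raw `age` column, so there is no extra `age_c` column to keep in sync.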


python

statistics

regression

statsmodels