Multiple linear regression: coefficients don't match

Question

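I have this small dataset and I want to run a multiple linear regression on it.

First, I dropped the deliveries column because it is highly correlated with miles. gasprice should arguably be dropped as well, but I kept it so that I could run a multiple linear regression rather than a simple one. Finally, I removed the outliers and did the following: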

Dataset

import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt

import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model

%matplotlib inline

X = dfafter.drop(columns=['hours'])  # features only; keep the target 'hours' out of X
Y = dfafter[['hours']]

# Split X and Y into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

# create a Linear Regression model object
regression_model = LinearRegression()

# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train) 


# find the intercept and the coefficients of the multiple linear regression
intercept = regression_model.intercept_[0]
coefficient = regression_model.coef_[0][0]

print("The intercept for our model is {}".format(intercept))
print('-'*100)

# loop through the (column name, coefficient) pairs and print them
for name, coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(name, coef))
# Coeffs here don't match the ones that will appear later

#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)

# create a OLS model
model = sm.OLS(Y, X2)

# fit the data
est = model.fit()



# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)

# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)

# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)

# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))


print(est.summary())
#????????? something is wrong



X = df[['miles', 'gasprice']]
y = df['hours']

regr = linear_model.LinearRegression()
regr.fit(X, y)

print(regr.coef_)

That is the end of the code. Each time I print them, I get different coefficients. What am I doing wrong, and is any of them correct?

Answer 1

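I see you have tried 3 different things here, so let me summarize:

1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only 80% of the data is used (but the split should be the same every time you run it, since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you pass X2 and Y, which are not split into train and test)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in no. 2.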

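I tried to reproduce this with the iris dataset, and I got identical results for case #2 and case #3 (they are trained on exactly the same data) and slightly different coefficients for case #1.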
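A rough sketch of that kind of reproduction is below. The choice of target (petal width predicted from the other three iris measurements) is my own assumption for illustration, not your data. To keep the example small, case #2 is solved with np.linalg.lstsq on a design matrix with an explicit constant column, which is the same least-squares problem that sm.OLS(Y, sm.add_constant(X)) solves, rather than calling statsmodels directly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative setup (an assumption, not the asker's data):
# predict petal width from the other three iris measurements.
iris = load_iris()
X = iris.data[:, :3]
y = iris.data[:, 3]

# Case 1: sklearn fitted on an 80% training split only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
coef_split = LinearRegression().fit(X_train, y_train).coef_

# Case 2: ordinary least squares on the full data with an explicit
# constant column (stand-in for statsmodels OLS with add_constant).
X_const = np.column_stack([np.ones(len(X)), X])
beta_full, *_ = np.linalg.lstsq(X_const, y, rcond=None)
coef_ols = beta_full[1:]  # drop the intercept term

# Case 3: sklearn fitted on the full data.
coef_full = LinearRegression().fit(X, y).coef_

print("case 1 (80% split):    ", coef_split)
print("case 2 (OLS, full):    ", coef_ols)
print("case 3 (sklearn, full):", coef_full)
```

Cases #2 and #3 should agree up to floating-point noise, while case #1 drifts a little because it was fitted on different rows.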

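To judge whether any of them is "correct", you need to evaluate the model on unseen data and look at the adjusted R^2 score and similar metrics (so you need a holdout (test) set). If you want to improve the model further, you can try to better understand how the features interact in the linear model. Statsmodels has a neat "R-like" formula syntax for specifying your model: https://www.statsmodels.org/dev/example_formulas.html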
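As a minimal sketch of the holdout evaluation, again on iris with a target I picked purely for illustration: fit on the training rows, then score on the test rows. sklearn has no built-in adjusted R^2 as far as I know, so it is computed by hand here from the usual formula, with n the number of test rows and p the number of features.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Illustrative setup (an assumption, not the asker's data).
iris = load_iris()
X = iris.data[:, :3]
y = iris.data[:, 3]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

# Score on the held-out rows, which the model never saw during fitting.
r2_test = r2_score(y_test, model.predict(X_test))

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n, p = X_test.shape
adj_r2_test = 1 - (1 - r2_test) * (n - 1) / (n - p - 1)

print(f"test R^2:          {r2_test:.3f}")
print(f"test adjusted R^2: {adj_r2_test:.3f}")
```

Note that the MSE, MAE and RMSE in the question are computed from y_train and predictions on X_train, so they describe the fit on the training data rather than performance on unseen data.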


python

numpy

scipy

linear-regression

sklearn-pandas