OLS Regression Results R-squared value vs. r2_score from scikit-learn
Following an online tutorial, I built a model with OLS (from statsmodels). The OLS summary reports a surprisingly high R² of 0.909, but when I evaluate the model with scikit-learn's r2_score function, I only get 0.68.
Can anyone explain the difference?
The dataset is from here: https://www.kaggle.com/harlfoxem/housesalesprediction
My code is below:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import statsmodels.api as sm

# Load the data and drop columns not used as features
df = pd.read_csv('kc_house_data.csv')
df = df.drop(['id', 'date'], axis=1)

# price is the first remaining column; everything else is a feature
X = df.iloc[:, 1:]
y = df.iloc[:, 0]
X = sm.add_constant(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Fit OLS on the training set and print the summary
reg_OLS = sm.OLS(endog=y_train, exog=X_train).fit()
print(reg_OLS.summary())

# Evaluate on the held-out test set
y_pred = reg_OLS.predict(X_test)
print(r2_score(y_test, y_pred))
Output of the OLS regression summary:
OLS Regression Results
Dep. Variable: price R-squared (uncentered): 0.909
Model: OLS Adj. R-squared (uncentered): 0.909
Method: Least Squares F-statistic: 8491.
Date: Thu, 17 Mar 2022 Prob (F-statistic): 0.00
Time: 23:36:48 Log-Likelihood: -1.9598e+05
No. Observations: 14408 AIC: 3.920e+05
Df Residuals: 14391 BIC: 3.921e+05
Df Model: 17
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
bedrooms -2.898e+04 2217.526 -13.068 0.000 -3.33e+04 -2.46e+04
bathrooms 3.621e+04 3845.397 9.416 0.000 2.87e+04 4.37e+04
sqft_living 100.2140 2.690 37.257 0.000 94.942 105.486
sqft_lot 0.2609 0.057 4.563 0.000 0.149 0.373
floors 1.201e+04 4187.373 2.867 0.004 3798.844 2.02e+04
waterfront 6.237e+05 2.01e+04 31.012 0.000 5.84e+05 6.63e+05
view 5.237e+04 2566.027 20.410 0.000 4.73e+04 5.74e+04
condition 2.844e+04 2774.191 10.250 0.000 2.3e+04 3.39e+04
grade 9.613e+04 2558.509 37.571 0.000 9.11e+04 1.01e+05
sqft_above 63.2770 2.638 23.985 0.000 58.106 68.448
sqft_basement 36.9370 3.127 11.813 0.000 30.808 43.066
yr_built -2529.7423 80.708 -31.344 0.000 -2687.940 -2371.544
yr_renovated 13.0704 4.307 3.035 0.002 4.629 21.512
zipcode -510.0820 21.216 -24.043 0.000 -551.667 -468.497
lat 6.084e+05 1.27e+04 47.725 0.000 5.83e+05 6.33e+05
long -2.076e+05 1.55e+04 -13.371 0.000 -2.38e+05 -1.77e+05
sqft_living15 33.6926 3.996 8.432 0.000 25.860 41.525
sqft_lot15 -0.4850 0.092 -5.275 0.000 -0.665 -0.305
Omnibus: 9620.580 Durbin-Watson: 1.997
Prob(Omnibus): 0.000 Jarque-Bera (JB): 363824.963
Skew: 2.694 Prob(JB): 0.00
Kurtosis: 27.021 Cond. No. 1.32e+17
Output from r2_score:
0.6855578295481021
Your R² = 0.909 comes from the OLS fit on the training data, while the r2_score of 0.68 is computed on the held-out test data, so the two numbers are measuring different things.
Try predicting on the training data and calling r2_score on the training targets and those predictions. Note also that your summary reports an *uncentered* R-squared (computed without subtracting the mean of y), which is generally higher than the centered R² that r2_score computes, so even the in-sample values may not match exactly.