Why am I getting low score for Linear Regression via sklearn but high R-squared value from statsmodels?

I am working on a linear regression problem. The analysis with statsmodels yields an R-squared of 0.907, which is very high. So the score for the same model computed with sklearn should be comparably high, but I am only getting 0.6478154705337766, which seems low.

Am I missing something? In the statsmodels summary, the p-values of all variables are below 0.05. I did not check other statistics such as the coefficients, because I have heard from many people that it is not necessary to check them. The detailed regression output is given below.

Problem statement and the related dataset: https://datahack.analyticsvidhya.com/contest/black-friday/

Sklearn score: 0.6478154705337766

Statsmodels summary:

OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.907
Model:                            OLS   Adj. R-squared (uncentered):              0.907
Method:                 Least Squares   F-statistic:                          6.458e+04
Date:                Mon, 21 Oct 2019   Prob (F-statistic):                        0.00
Time:                        18:57:44   Log-Likelihood:                     -5.2226e+06
No. Observations:              550068   AIC:                                  1.045e+07
Df Residuals:                  549985   BIC:                                  1.045e+07
Df Model:                          83
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1           -59.0946      9.426     -6.269      0.000     -77.569     -40.620
x2           401.0189     10.441     38.409      0.000     380.555     421.483
x3              1e+04     23.599    423.786      0.000    9954.518       1e+04
x4          1.035e+04     21.740    475.990      0.000    1.03e+04    1.04e+04
x5           1.04e+04     23.309    446.356      0.000    1.04e+04    1.04e+04
x6          1.041e+04     26.858    387.693      0.000    1.04e+04    1.05e+04
x7          1.065e+04     27.715    384.315      0.000    1.06e+04    1.07e+04
x8          1.041e+04     32.580    319.469      0.000    1.03e+04    1.05e+04
x9           614.8732     19.178     32.061      0.000     577.285     652.462
x10          710.7823     23.135     30.723      0.000     665.438     756.126
x11          865.8851     27.138     31.906      0.000     812.695     919.076
x12          849.9004     18.358     46.296      0.000     813.919     885.881
x13          596.9014     31.632     18.870      0.000     534.904     658.899
x14          762.7278     25.809     29.553      0.000     712.143     813.312
x15          638.7214     18.085     35.319      0.000     603.276     674.166
x16          450.8858     82.928      5.437      0.000     288.349     613.423
x17          831.6309     43.033     19.325      0.000     747.287     915.975
x18         9266.9203     32.520    284.958      0.000    9203.182    9330.659
x19          548.8524     32.358     16.962      0.000     485.432     612.273
x20          819.7812     21.937     37.370      0.000     776.786     862.776
x21          575.2436     41.598     13.829      0.000     493.713     656.775
x22          780.1032     22.922     34.032      0.000     735.176     825.030
x23          854.8429     31.605     27.048      0.000     792.898     916.788
x24          603.5181     23.772     25.388      0.000     556.926     650.111
x25          635.8521     20.312     31.305      0.000     596.042     675.662
x26          455.0734     41.495     10.967      0.000     373.745     536.402
x27         1241.9456     36.844     33.708      0.000    1169.732    1314.160
x28          491.6905     21.378     23.000      0.000     449.791     533.590
x29          599.4075     10.701     56.014      0.000     578.434     620.381
x30         1024.8516     11.618     88.210      0.000    1002.080    1047.623
x31          282.3561     11.849     23.830      0.000     259.133     305.579
x32          218.2959     12.181     17.921      0.000     194.421     242.171
x33          194.9270     12.699     15.350      0.000     170.037     219.817
x34        -1038.1290     29.412    -35.296      0.000   -1095.776    -980.482
x35        -1429.4546     40.730    -35.096      0.000   -1509.284   -1349.625
x36        -1.021e+04     36.784   -277.658      0.000   -1.03e+04   -1.01e+04
x37        -5982.2095     15.651   -382.220      0.000   -6012.885   -5951.534
x38         3004.0730     28.298    106.159      0.000    2948.610    3059.536
x39         4535.2965     54.872     82.652      0.000    4427.749    4642.844
x40        -4645.1924     16.698   -278.195      0.000   -4677.919   -4612.466
x41         3110.6592    160.033     19.438      0.000    2797.000    3424.318
x42         7195.3346     48.059    149.718      0.000    7101.140    7289.529
x43        -7488.9490     24.289   -308.323      0.000   -7536.555   -7441.343
x44        -1.068e+04     53.542   -199.516      0.000   -1.08e+04   -1.06e+04
x45         -1.19e+04     45.546   -261.177      0.000    -1.2e+04   -1.18e+04
x46         1175.4639     83.574     14.065      0.000    1011.662    1339.266
x47         2354.8546     42.888     54.907      0.000    2270.795    2438.914
x48         2935.1657     35.917     81.721      0.000    2864.769    3005.562
x49        -1895.0141    134.688    -14.070      0.000   -2158.999   -1631.029
x50        -9003.5945     59.618   -151.022      0.000   -9120.444   -8886.745
x51        -1.194e+04     81.812   -145.944      0.000   -1.21e+04   -1.18e+04
x52        -1.158e+04     65.553   -176.632      0.000   -1.17e+04   -1.15e+04
x53         1489.1716     24.670     60.364      0.000    1440.819    1537.524
x54         2238.5714     93.608     23.914      0.000    2055.102    2422.041
x55         -732.7678     41.730    -17.560      0.000    -814.558    -650.978
x56          480.2321     29.776     16.128      0.000     421.872     538.592
x57         1076.8803     30.482     35.328      0.000    1017.136    1136.624
x58         1023.1860    128.939      7.935      0.000     770.470    1275.902
x59          987.0863     17.776     55.530      0.000     952.246    1021.926
x60          307.3852     45.456      6.762      0.000     218.293     396.478
x61         1979.9974     67.180     29.473      0.000    1848.327    2111.667
x62          441.5194     29.476     14.979      0.000     383.746     499.292
x63          203.3906     34.692      5.863      0.000     135.396     271.386
x64          250.2751     16.466     15.200      0.000     218.003     282.547
x65          653.6979     20.591     31.747      0.000     613.340     694.055
x66          893.8433     18.950     47.168      0.000     856.702     930.985
x67         1052.2746     29.336     35.870      0.000     994.777    1109.772
x68         1211.0301     61.789     19.599      0.000    1089.925    1332.135
x69          626.3778    131.545      4.762      0.000     368.553     884.202
x70        -3303.6544     99.019    -33.364      0.000   -3497.728   -3109.581
x71          678.0397     31.709     21.383      0.000     615.891     740.188
x72          449.4691     50.429      8.913      0.000     350.631     548.308
x73         1881.4959     33.873     55.546      0.000    1815.106    1947.886
x74          488.1976     34.729     14.057      0.000     420.130     556.266
x75         -818.2759     94.178     -8.689      0.000   -1002.861    -633.690
x76         -476.0159     78.144     -6.091      0.000    -629.176    -322.855
x77          369.1793     37.992      9.717      0.000     294.716     443.642
x78         -610.9179     49.224    -12.411      0.000    -707.395    -514.441
x79          217.0498     26.327      8.244      0.000     165.450     268.650
x80         -144.8580     24.612     -5.886      0.000    -193.097     -96.619
x81          475.4497     21.298     22.323      0.000     433.705     517.194
x82         1404.9458     27.294     51.474      0.000    1351.450    1458.442
x83          329.1859     49.154      6.697      0.000     232.846     425.526
==============================================================================
Omnibus:                    27530.062   Durbin-Watson:                   1.533
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            81968.349
Skew:                          -0.223   Prob(JB):                         0.00
Kurtosis:                       4.838   Cond. No.                         48.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Please let me know if you need any additional information. I have not shared the exact code for sklearn and statsmodels because I think it might overcomplicate the question; I am happy to share it if necessary.

Linear regression is, in its basic form, the same in statsmodels and scikit-learn. However, the implementations differ, which can produce different results in edge cases, and scikit-learn in general has more support for larger models. For example, statsmodels currently uses sparse matrices in only very few parts.
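Since you did not post the fitting code, here is a hedged guess at what happened, as a minimal sketch on synthetic data (all names and numbers below are made up, not from the Black Friday dataset): your summary says "R-squared (uncentered)", which statsmodels reports when the design matrix contains no constant column, while scikit-learn's `LinearRegression` fits an intercept by default and its `score` method returns the ordinary centered R². The two numbers are then not measuring the same quantity:

```python
# Minimal sketch, synthetic data only: an ordinal feature coded 1..4 plus a
# continuous feature, and deliberately NO constant column in X.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.integers(1, 5, size=n).astype(float)   # e.g. a label-encoded category
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 50 + 2 * x1 + 3 * x2 + 3 * rng.normal(size=n)

# scikit-learn fits an intercept by default; .score() is the centered R^2.
print(LinearRegression().fit(X, y).score(X, y))      # roughly 0.61 here

# statsmodels finds no constant in X, so it fits through the origin and the
# summary reports "R-squared (uncentered)" = 1 - SSR / sum(y^2), which is
# much larger whenever y has a big mean.
print(sm.OLS(y, X).fit().rsquared)                   # roughly 0.86 here

# Adding the constant explicitly makes the two libraries agree.
print(sm.OLS(y, sm.add_constant(X)).fit().rsquared)  # roughly 0.61 again
```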

The most important differences are in the surrounding infrastructure and in the use cases that are directly supported.

Statsmodels largely follows the traditional model, where we want to know how well a given model fits the data, which variables "explain" or affect the outcome, and what the size of the effect is. Scikit-learn follows the machine-learning tradition, where the main supported task is choosing the "best" model for prediction.

Consequently, the emphasis of statsmodels' supporting features is on analysing the training data, which includes hypothesis tests and goodness-of-fit measures, while the emphasis of scikit-learn's supporting infrastructure is on model selection for out-of-sample prediction, and therefore on cross-validation on "test data".
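To make the two workflows concrete, here is a short sketch (again on synthetic data; all names are illustrative): statsmodels hands you in-sample fit statistics and hypothesis tests, while scikit-learn's tooling makes out-of-sample evaluation via cross-validation the default way of scoring a model:

```python
# Same synthetic data, two workflows: in-sample diagnostics (statsmodels)
# versus out-of-sample R^2 estimated by cross-validation (scikit-learn).
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=500)

# statsmodels: goodness of fit and per-coefficient tests on the training data.
res = sm.OLS(y, sm.add_constant(X)).fit()
print(res.rsquared)   # in-sample R^2
print(res.pvalues)    # hypothesis test for each coefficient

# scikit-learn: R^2 on held-out folds, i.e. an estimate of predictive skill.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(cv_r2.mean())
```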

Side note: your question would be a better fit for https://stats.stackexchange.com