Why am I getting low score for Linear Regression via sklearn but high R-squared value from statsmodels?
I am working on a linear regression problem. The analysis with statsmodels gives an R-squared of 0.907, which is very high. The score of the corresponding model computed with sklearn should therefore be similarly large, but I only get 0.6478154705337766, which is rather low.
Am I missing something? In the statsmodels summary, the p-values of all variables are less than 0.05. I did not check other statistics such as the coefficients, because I have heard from many people that this is not necessary. The detailed regression problem is given below.
Problem statement and related dataset:
https://datahack.analyticsvidhya.com/contest/black-friday/
Sklearn score: 0.6478154705337766
Statsmodels summary:
OLS Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.907
Model: OLS Adj. R-squared (uncentered): 0.907
Method: Least Squares F-statistic: 6.458e+04
Date: Mon, 21 Oct 2019 Prob (F-statistic): 0.00
Time: 18:57:44 Log-Likelihood: -5.2226e+06
No. Observations: 550068 AIC: 1.045e+07
Df Residuals: 549985 BIC: 1.045e+07
Df Model: 83
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 -59.0946 9.426 -6.269 0.000 -77.569 -40.620
x2 401.0189 10.441 38.409 0.000 380.555 421.483
x3 1e+04 23.599 423.786 0.000 9954.518 1e+04
x4 1.035e+04 21.740 475.990 0.000 1.03e+04 1.04e+04
x5 1.04e+04 23.309 446.356 0.000 1.04e+04 1.04e+04
x6 1.041e+04 26.858 387.693 0.000 1.04e+04 1.05e+04
x7 1.065e+04 27.715 384.315 0.000 1.06e+04 1.07e+04
x8 1.041e+04 32.580 319.469 0.000 1.03e+04 1.05e+04
x9 614.8732 19.178 32.061 0.000 577.285 652.462
x10 710.7823 23.135 30.723 0.000 665.438 756.126
x11 865.8851 27.138 31.906 0.000 812.695 919.076
x12 849.9004 18.358 46.296 0.000 813.919 885.881
x13 596.9014 31.632 18.870 0.000 534.904 658.899
x14 762.7278 25.809 29.553 0.000 712.143 813.312
x15 638.7214 18.085 35.319 0.000 603.276 674.166
x16 450.8858 82.928 5.437 0.000 288.349 613.423
x17 831.6309 43.033 19.325 0.000 747.287 915.975
x18 9266.9203 32.520 284.958 0.000 9203.182 9330.659
x19 548.8524 32.358 16.962 0.000 485.432 612.273
x20 819.7812 21.937 37.370 0.000 776.786 862.776
x21 575.2436 41.598 13.829 0.000 493.713 656.775
x22 780.1032 22.922 34.032 0.000 735.176 825.030
x23 854.8429 31.605 27.048 0.000 792.898 916.788
x24 603.5181 23.772 25.388 0.000 556.926 650.111
x25 635.8521 20.312 31.305 0.000 596.042 675.662
x26 455.0734 41.495 10.967 0.000 373.745 536.402
x27 1241.9456 36.844 33.708 0.000 1169.732 1314.160
x28 491.6905 21.378 23.000 0.000 449.791 533.590
x29 599.4075 10.701 56.014 0.000 578.434 620.381
x30 1024.8516 11.618 88.210 0.000 1002.080 1047.623
x31 282.3561 11.849 23.830 0.000 259.133 305.579
x32 218.2959 12.181 17.921 0.000 194.421 242.171
x33 194.9270 12.699 15.350 0.000 170.037 219.817
x34 -1038.1290 29.412 -35.296 0.000 -1095.776 -980.482
x35 -1429.4546 40.730 -35.096 0.000 -1509.284 -1349.625
x36 -1.021e+04 36.784 -277.658 0.000 -1.03e+04 -1.01e+04
x37 -5982.2095 15.651 -382.220 0.000 -6012.885 -5951.534
x38 3004.0730 28.298 106.159 0.000 2948.610 3059.536
x39 4535.2965 54.872 82.652 0.000 4427.749 4642.844
x40 -4645.1924 16.698 -278.195 0.000 -4677.919 -4612.466
x41 3110.6592 160.033 19.438 0.000 2797.000 3424.318
x42 7195.3346 48.059 149.718 0.000 7101.140 7289.529
x43 -7488.9490 24.289 -308.323 0.000 -7536.555 -7441.343
x44 -1.068e+04 53.542 -199.516 0.000 -1.08e+04 -1.06e+04
x45 -1.19e+04 45.546 -261.177 0.000 -1.2e+04 -1.18e+04
x46 1175.4639 83.574 14.065 0.000 1011.662 1339.266
x47 2354.8546 42.888 54.907 0.000 2270.795 2438.914
x48 2935.1657 35.917 81.721 0.000 2864.769 3005.562
x49 -1895.0141 134.688 -14.070 0.000 -2158.999 -1631.029
x50 -9003.5945 59.618 -151.022 0.000 -9120.444 -8886.745
x51 -1.194e+04 81.812 -145.944 0.000 -1.21e+04 -1.18e+04
x52 -1.158e+04 65.553 -176.632 0.000 -1.17e+04 -1.15e+04
x53 1489.1716 24.670 60.364 0.000 1440.819 1537.524
x54 2238.5714 93.608 23.914 0.000 2055.102 2422.041
x55 -732.7678 41.730 -17.560 0.000 -814.558 -650.978
x56 480.2321 29.776 16.128 0.000 421.872 538.592
x57 1076.8803 30.482 35.328 0.000 1017.136 1136.624
x58 1023.1860 128.939 7.935 0.000 770.470 1275.902
x59 987.0863 17.776 55.530 0.000 952.246 1021.926
x60 307.3852 45.456 6.762 0.000 218.293 396.478
x61 1979.9974 67.180 29.473 0.000 1848.327 2111.667
x62 441.5194 29.476 14.979 0.000 383.746 499.292
x63 203.3906 34.692 5.863 0.000 135.396 271.386
x64 250.2751 16.466 15.200 0.000 218.003 282.547
x65 653.6979 20.591 31.747 0.000 613.340 694.055
x66 893.8433 18.950 47.168 0.000 856.702 930.985
x67 1052.2746 29.336 35.870 0.000 994.777 1109.772
x68 1211.0301 61.789 19.599 0.000 1089.925 1332.135
x69 626.3778 131.545 4.762 0.000 368.553 884.202
x70 -3303.6544 99.019 -33.364 0.000 -3497.728 -3109.581
x71 678.0397 31.709 21.383 0.000 615.891 740.188
x72 449.4691 50.429 8.913 0.000 350.631 548.308
x73 1881.4959 33.873 55.546 0.000 1815.106 1947.886
x74 488.1976 34.729 14.057 0.000 420.130 556.266
x75 -818.2759 94.178 -8.689 0.000 -1002.861 -633.690
x76 -476.0159 78.144 -6.091 0.000 -629.176 -322.855
x77 369.1793 37.992 9.717 0.000 294.716 443.642
x78 -610.9179 49.224 -12.411 0.000 -707.395 -514.441
x79 217.0498 26.327 8.244 0.000 165.450 268.650
x80 -144.8580 24.612 -5.886 0.000 -193.097 -96.619
x81 475.4497 21.298 22.323 0.000 433.705 517.194
x82 1404.9458 27.294 51.474 0.000 1351.450 1458.442
x83 329.1859 49.154 6.697 0.000 232.846 425.526
==============================================================================
Omnibus: 27530.062 Durbin-Watson: 1.533
Prob(Omnibus): 0.000 Jarque-Bera (JB): 81968.349
Skew: -0.223 Prob(JB): 0.00
Kurtosis: 4.838 Cond. No. 48.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Please let me know if you need any additional information. I have not shared the exact code for sklearn and statsmodels because I think it might complicate the problem statement; I am willing to share it if necessary.
The basic form of linear regression is the same in statsmodels and scikit-learn. However, the implementations differ, which can produce different results in edge cases, and scikit-learn generally has more support for larger models. For example, statsmodels currently uses sparse matrices in only a few places.
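As a minimal sketch of that equivalence (using randomly generated X and y as stand-ins for your data, since you have not shared your code), the two libraries recover the same coefficients once the intercept is handled consistently. Note that statsmodels' OLS does not add an intercept unless you call add_constant; your summary reports the R-squared as "(uncentered)", which is what statsmodels shows for a model without a constant, and uncentered R-squared is not directly comparable to the centered R-squared that sklearn's score returns. That is one likely source of the gap you are seeing.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Stand-in data; substitute your own design matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(size=1000)

# scikit-learn fits an intercept by default.
sk = LinearRegression().fit(X, y)

# statsmodels does NOT add an intercept unless you do so explicitly.
res = sm.OLS(y, sm.add_constant(X)).fit()

# With the intercept handled the same way, the estimates agree.
print(np.allclose(sk.coef_, res.params[1:]))      # True
print(np.allclose(sk.intercept_, res.params[0]))  # True

# Without add_constant, statsmodels reports the *uncentered* R-squared
# (as the "(uncentered)" label in the summary above indicates), which is
# usually much larger than the centered R-squared from sklearn's .score().
print(sk.score(X, y), res.rsquared)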
The most important difference is in the surrounding infrastructure and the use cases that are directly supported.
Statsmodels mostly follows the traditional statistics approach, where we want to know how well a given model fits the data, which variables "explain" or affect the outcome, and what the size of the effect is. Scikit-learn follows the machine-learning tradition, where the main supported task is choosing the "best" model for prediction.
Consequently, the emphasis of statsmodels' supporting features is on analysing the training data, which includes hypothesis tests and goodness-of-fit measures, while the emphasis of scikit-learn's supporting infrastructure is on model selection for out-of-sample prediction, and therefore on cross-validation on "test data".
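To make that difference concrete, here is a hedged sketch of the two workflows on the same placeholder data: statsmodels judges the fit on the data used for estimation, while the idiomatic scikit-learn route evaluates on held-out data. If your sklearn score of 0.6478 was computed on a test split, it measures out-of-sample predictive accuracy rather than in-sample fit, and the two numbers need not agree.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data; substitute your own design matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(size=1000)

# statsmodels: in-sample goodness of fit and inference on the training data.
print(sm.OLS(y, sm.add_constant(X)).fit().rsquared)

# scikit-learn: out-of-sample evaluation, e.g. 5-fold cross-validated R-squared...
print(cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean())

# ...or a single train/test split; .score() on the held-out fold is
# typically lower than the in-sample R-squared on the training data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))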
Side note: your question would be a better fit for https://stats.stackexchange.com