Perform multiple linear regression for groups based on column unique values

I need to perform multiple linear regression on 4 different groups taken from the column df['status'], whose unique values df['status'].unique() are (1, 4, 7, 9). After the regression I need to store the results in a new column df['reg_results'].

Sample data:

Out[71]: 
    ID  status    y_Values     a   b      c      d
0    1       1  150.510000  0.26  23  0.151  1.215
1    2       1  153.110000  0.86  14  0.156  1.651
2    3       1  189.320000  0.46  51  0.151  2.154
3    4       9  145.650000  0.46  62  0.157  3.145
4    5       4  189.650000  0.91  11  0.123  2.104
5    6       4  144.230000  0.69  16  0.178  3.515
6    7       4  198.020000  0.62  18  0.891  1.561
7    8       9  178.090000  0.91  22  0.156  9.155

The columns needed for the regression are X = ['a', 'b', 'c', 'd'] and y = ['y_Values'].

I have found several solutions that run the regression on whole columns or multiple columns, for example:

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv(r'E:\...\data.csv')
lm = smf.ols(formula='y_Values ~ a + b + c + d', data=data).fit()
print(lm.params)

Result:

Intercept   -403.803691
a              0.170452
b             40.866943
c             14.839920
d              1.618234
dtype: float64

However, I want to do the same for each set of rows where df['status'] == (1, 4, 7, 9), and store the results in a new column.

I know how to do this in R, but I can't figure out how to add the df['status'] argument to the analysis:

lapply(c(1,4,7,9), function(k){

  data <- shape[status == k, c("ID", "a", "b", "c", "d", "y_Values")]
  reg <- lm(y_Values ~ a + 0 + b + c + d, data = data)
  reg2 <- step(reg, direction = "backward")
})

One way to do this is as follows. If you want to run the regression on the entire data frame:

import statsmodels.api as sm

X = df[['a', 'b', 'c', 'd']]
Y = df['y_Values']

model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)

This returns:

                     OLS Regression Results                                
=======================================================================================
Dep. Variable:               y_Values   R-squared (uncentered):                   0.973
Model:                            OLS   Adj. R-squared (uncentered):              0.946
Method:                 Least Squares   F-statistic:                              35.97
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                     0.00216
Time:                        13:12:10   Log-Likelihood:                         -37.992
No. Observations:                   8   AIC:                                      83.98
Df Residuals:                       4   BIC:                                      84.30
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a            167.3835     45.459      3.682      0.021      41.170     293.597
b              1.6286      0.621      2.622      0.059      -0.096       3.353
c             83.8313     55.572      1.509      0.206     -70.461     238.123
d             -2.7363      6.841     -0.400      0.710     -21.729      16.256
==============================================================================
Omnibus:                        1.673   Durbin-Watson:                   2.460
Prob(Omnibus):                  0.433   Jarque-Bera (JB):                0.446
Skew:                           0.574   Prob(JB):                        0.800
Kurtosis:                       2.860   Cond. No.                         146.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

You can choose which values to extract.
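For example, continuing from the model fitted above, most of the numbers shown in the summary are available directly as attributes of the fitted results object, so no parsing is needed if you only want a few of them. A minimal sketch:

# Values taken straight from the fitted results object.
print(model.params)      # coefficients for a, b, c, d
print(model.rsquared)    # R² (uncentered here, because the model has no constant)
print(model.pvalues)     # p-value per coefficient
print(model.conf_int())  # 95% confidence intervals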

To do it for each individual status:

statuses = list(set(df['status']))
for status in statuses:
    print(status)
    df_redux = df[df['status'] == status]
    print(df_redux)
    X = df_redux[['a', 'b', 'c', 'd']]   # regressors for this status group
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)

    print_model = model.summary()
    print(print_model)
   

This gives:

1
   ID  status  y_Values     a   b      c      d
0   1       1    150.51  0.26  23  0.151  1.215
1   2       1    153.11  0.86  14  0.156  1.651
2   3       1    189.32  0.46  51  0.151  2.154
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 77.832
No. Observations:                   3   AIC:                            -149.7
Df Residuals:                       0   BIC:                            -152.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a           -248.8778        inf         -0        nan         nan         nan
b             -4.9837        inf         -0        nan         nan         nan
c            229.5383        inf          0        nan         nan         nan
d            242.9489        inf          0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.443
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.281
Skew:                           0.016   Prob(JB):                        0.869
Kurtosis:                       1.500   Cond. No.                         554.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
4
   ID  status  y_Values     a   b      c      d
4   5       4    189.65  0.91  11  0.123  2.104
5   6       4    144.23  0.69  16  0.178  3.515
6   7       4    198.02  0.62  18  0.891  1.561
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 82.381
No. Observations:                   3   AIC:                            -158.8
Df Residuals:                       0   BIC:                            -161.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a            183.1273        inf          0        nan         nan         nan
b              8.9478        inf          0        nan         nan         nan
c            -25.7862        inf         -0        nan         nan         nan
d            -34.3392        inf         -0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.154
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.284
Skew:                           0.072   Prob(JB):                        0.868
Kurtosis:                       1.500   Cond. No.                         67.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
9
   ID  status  y_Values     a   b      c      d
3   4       9    145.65  0.46  62  0.157  3.145
7   8       9    178.09  0.91  22  0.156  9.155
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 58.629
No. Observations:                   2   AIC:                            -113.3
Df Residuals:                       0   BIC:                            -115.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a              1.4521        inf          0        nan         nan         nan
b              1.5473        inf          0        nan         nan         nan
c              0.1974        inf          0        nan         nan         nan
d             15.5869        inf          0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.800
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.333
Skew:                           0.000   Prob(JB):                        0.846
Kurtosis:                       1.000   Cond. No.                         8.72
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.

Of course, given the size of the subsets, the regression results are not great. I assume you have a larger data frame.

To extract a specific piece of information (such as R²), just add print(model.rsquared).
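As an alternative to the explicit loop, the per-status fits can also be collected with groupby. A minimal sketch, where fit_group is just an illustrative helper name:

import statsmodels.api as sm

def fit_group(g):
    # Fit OLS on one status group and return the coefficients plus R².
    model = sm.OLS(g['y_Values'], g[['a', 'b', 'c', 'd']]).fit()
    out = model.params.copy()          # Series indexed by a, b, c, d
    out['rsquared'] = model.rsquared   # uncentered R², since there is no constant
    return out

per_status = df.groupby('status').apply(fit_group)
print(per_status)

This gives one row per status with the coefficients and R² as columns.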

Update:

A more complete way to extract the information is to add:

stats_1 = pd.read_html(model.summary().tables[0].as_html(),header=0,index_col=0)[0]
stats_2 = pd.read_html(model.summary().tables[1].as_html(),header=0,index_col=0)[0]

which returns two data frames:

stats_1

Dep. Variable:          y_Values       R-squared (uncentered):     0.973
0             Model:               OLS  Adj. R-squared (uncentered):   0.94600
1            Method:     Least Squares                  F-statistic:  35.97000
2              Date:  Wed, 27 Oct 2021           Prob (F-statistic):   0.00216
3              Time:          13:52:04               Log-Likelihood: -37.99200
4  No. Observations:                 8                          AIC:  83.98000
5      Df Residuals:                 4                          BIC:  84.30000
6          Df Model:                 4                           NaN       NaN
7   Covariance Type:         nonrobust                           NaN       NaN

stats_2

index      coef  std err      t  P>|t|  [0.025   0.975]
0     a  167.3835   45.459  3.682  0.021  41.170  293.597
1     b    1.6286    0.621  2.622  0.059  -0.096    3.353
2     c   83.8313   55.572  1.509  0.206 -70.461  238.123
3     d   -2.7363    6.841 -0.400  0.710 -21.729   16.256

You can now select the columns you need, for example:

stats_2['coef']

index      coef
0     a  167.3835
1     b    1.6286
2     c   83.8313
3     d   -2.7363

So your loop should look like this:

df_coef = []
statuses = list(set(df['status']))
for status in statuses:

    df_redux = df[df['status'] == status]
    print(df_redux)
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    stats_1 = pd.read_html(model.summary().tables[0].as_html(), header=0, index_col=0)[0]
    stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]
    if len(stats_2) != 0:
        stats_2['status'] = status   # tag the coefficients with their group
        df_coef.append(stats_2)

all_coef = pd.concat(df_coef)
coef_table = all_coef[['status', 'coef']]   # keep the original df intact
print(coef_table)

This gives:

status      coef
a       1 -248.8778
b       1   -4.9837
c       1  229.5383
d       1  242.9489
a       4  183.1273
b       4    8.9478
c       4  -25.7862
d       4  -34.3392
a       9    1.4521
b       9    1.5473
c       9    0.1974
d       9   15.5869

Then append it to your original df by merging on status.
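A sketch of that merge, built from the all_coef table above; the names coef_long, coef_wide and 'term' are just for illustration:

# Pivot the long coefficient table to one row per status, then merge it
# back onto the original frame on 'status'.
coef_long = all_coef[['status', 'coef']].reset_index()
coef_long = coef_long.rename(columns={coef_long.columns[0]: 'term'})  # a, b, c, d
coef_wide = (coef_long.pivot(index='status', columns='term', values='coef')
                      .add_prefix('coef_')
                      .reset_index())
df_merged = df.merge(coef_wide, on='status', how='left')
print(df_merged)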

Update 2

Thanks for the solution for getting all the coefficients, but what I meant by merging/concatenating the predicted values is this: when I print out the predictions, I get these four tables containing the row ID and the predicted value. What I need is to merge those four tables (stored in a single variable, predictions) into one DataFrame with the columns ID and results.

After that I can merge the new data frame into the original data frame on the 'ID' column.

....
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(predictions)

0       401.094849
1       420.949054
2       407.918627
4       363.367876
8       255.865852
           ...    
1556    430.050556
1558    292.949037
1559    306.011285
1560    412.041196
1561    360.829533

Length: 958, dtype: float64
5       366.159418
12      204.606629
18      400.767161
20      401.544449
21      267.192577
           ...    
1530    384.151730
1533    275.356699
1539    376.165539
1543    334.024327
1547    272.197374
Length: 205, dtype: float64

I tried converting the predictions variable to a list or a dictionary, but couldn't figure out how to concatenate all four tables. It's probably a simple solution, but I can't find it.

Update 3

Does this work for you?

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("df.csv", sep=";")
df_coef = []
statuses = list(set(df['status']))
for status in statuses:
    df_redux = df[df['status'] == status]
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X).to_frame('predictions')   # Series -> one-column DataFrame
    gf = pd.concat([predictions, df_redux], axis=1)           # align on the row index
    df_coef.append(gf)

all_coef = pd.concat(df_coef)

This produces:

predictions  ID  status  y_Values     a   b      c      d
0       150.51   1       1    150.51  0.26  23  0.151  1.215
1       153.11   2       1    153.11  0.86  14  0.156  1.651
2       189.32   3       1    189.32  0.46  51  0.151  2.154
4       189.65   5       4    189.65  0.91  11  0.123  2.104
5       144.23   6       4    144.23  0.69  16  0.178  3.515
6       198.02   7       4    198.02  0.62  18  0.891  1.561
3       145.65   4       9    145.65  0.46  62  0.157  3.145
7       178.09   8       9    178.09  0.91  22  0.156  9.155

Note that in this example, y_Values and predictions coincide because there is so little data per group (the fits are exact).
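To get back to the original goal of a df['reg_results'] column, the combined frame can be reduced to the row ID plus the prediction and merged onto the original data on 'ID'. A minimal sketch (df_merged is just an illustrative name):

# Keep only the ID and the per-group prediction, rename it to the target
# column, and merge it back onto the original frame on 'ID'.
results = all_coef[['ID', 'predictions']].rename(columns={'predictions': 'reg_results'})
df_merged = df.merge(results, on='ID', how='left')
print(df_merged)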