R's relevel() and factor variables in linear regression in pandas
Data:
a,b,c,d
1,5,9,red
2,6,10,blue
3,7,11,green
4,8,12,red
3,4,3,orange
3,4,3,blue
3,4,3,red
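In R, if I want to build a linear regression model that takes categorical data into account (I believe these are called factor variables in R), I can simply do: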
df$d = relevel(df$d, 'green')
After that, when building the model, R adds an indicator column for each remaining color, for example:
dblue
0
1
0
0
0
1
0
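There is no column for green, because if all the other color columns are 0, the color must be green (this is our reference level). Now, create a regression model: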
mod = lm(a ~ b + c + d, data=df)
summary(mod)
Call:
lm(formula = a ~ b + c + d, data = rel)
Residuals:
1 2 3 4 5 6 7
4.708e-16 -7.061e-16 2.219e-31 2.354e-16 -1.233e-31 7.061e-16 -7.061e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.600e+00 3.622e-15 -4.418e+14 1.44e-15 ***
b 1.600e+00 9.403e-16 1.702e+15 3.74e-16 ***
c -6.000e-01 3.766e-16 -1.593e+15 4.00e-16 ***
dblue 8.829e-16 1.823e-15 4.840e-01 0.713
dorange 1.589e-15 2.294e-15 6.930e-01 0.614
dred 2.295e-15 1.631e-15 1.407e+00 0.393
I am trying to achieve the same thing in Python with pandas. So far, this is all I have come up with:
import pandas as pd

d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'], dtype='category')}
df = pd.DataFrame(d)
df['d'] = pd.Categorical(df['d'], ordered=False)
# add one indicator column per color, skipping the reference level 'green'
for r in df['d'].cat.categories:
    if r != 'green':
        df['d%s' % r] = df['d'] == r
df = df.drop('d', axis=1)
It works and produces the same result, but I was wondering whether pandas has a built-in method for this.
You can use pd.get_dummies:
import pandas as pd
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
dtype='category')}
df = pd.DataFrame(d)
dummies = pd.get_dummies(df['d'])
df = pd.concat([df, dummies], axis=1)
df = df.drop(['d', 'green'], axis=1)
print(df)
yields
a b c blue orange red
0 1 5 9 0 0 1
1 2 6 10 1 0 0
2 3 7 11 0 0 0
3 4 8 12 0 0 1
4 3 4 3 0 1 0
5 3 4 3 1 0 0
6 3 4 3 0 0 1
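A note for newer setups: in recent pandas versions (2.0 and later, as far as I know) get_dummies returns boolean columns by default, so the printed frame shows True/False instead of 0/1, and a formula interface may treat such columns differently. The sketch below is one way to get the integer output shown above; the prefix and prefix_sep arguments are only there to mimic R's dblue/dorange/dred names, and the variable name encoded is my own.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 3, 3],
    'b': [5, 6, 7, 8, 4, 4, 4],
    'c': [9, 10, 11, 12, 3, 3, 3],
    'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
})

# dtype=int forces 0/1 indicator columns regardless of the pandas default;
# prefix='d' with an empty separator produces R-style names such as 'dblue'
dummies = pd.get_dummies(df['d'], prefix='d', prefix_sep='', dtype=int)

# drop the original column and the reference level 'green'
encoded = pd.concat([df.drop(columns='d'), dummies.drop(columns='dgreen')], axis=1)
print(encoded)
```

The row for green (index 2) ends up with zeros in all three indicator columns, which is exactly what makes it the reference level.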
Using statsmodels,
import statsmodels.formula.api as smf
model = smf.ols('a ~ b + c + blue + orange + red', df).fit()
print(model.summary())
yields
OLS Regression Results
==============================================================================
Dep. Variable: a R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.149e+25
Date: Sun, 22 Mar 2015 Prob (F-statistic): 1.64e-13
Time: 05:57:33 Log-Likelihood: 200.74
No. Observations: 7 AIC: -389.5
Df Residuals: 1 BIC: -389.8
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -1.6000 6.11e-13 -2.62e+12 0.000 -1.600 -1.600
b 1.6000 1.59e-13 1.01e+13 0.000 1.600 1.600
c -0.6000 6.36e-14 -9.44e+12 0.000 -0.600 -0.600
blue 1.11e-16 3.08e-13 0.000 1.000 -3.91e-12 3.91e-12
orange 7.994e-15 3.87e-13 0.021 0.987 -4.91e-12 4.93e-12
red 4.829e-15 2.75e-13 0.018 0.989 -3.49e-12 3.5e-12
==============================================================================
Omnibus: nan Durbin-Watson: 0.203
Prob(Omnibus): nan Jarque-Bera (JB): 0.752
Skew: 0.200 Prob(JB): 0.687
Kurtosis: 1.445 Cond. No. 85.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alternatively, you can use a patsy formula to specify the dummy contrast:
import pandas as pd
import statsmodels.formula.api as smf
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red']}
df = pd.DataFrame(d)
model = smf.ols('a ~ b + c + C(d, Treatment(reference="green"))', df).fit()
print(model.summary())
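If you would rather stay closer to R's relevel(), a rough equivalent can be sketched on the pandas side by moving the reference level to the front of the category order. The helper name relevel below is hypothetical (pandas has no function of that name); it only wraps astype('category') and cat.reorder_categories. My understanding is that patsy's default treatment coding uses the first level as the reference, so a plain 'a ~ b + c + d' formula should then behave like the explicit Treatment(reference="green"), but treat that as an assumption to verify.

```python
import pandas as pd

def relevel(series, ref):
    """Hypothetical helper: return a categorical copy of `series`
    whose first category level is `ref` (a rough analogue of R's relevel)."""
    cat = series.astype('category')
    # keep the remaining levels in their existing order
    others = [level for level in cat.cat.categories if level != ref]
    # raises ValueError if `ref` is not one of the existing levels
    return cat.cat.reorder_categories([ref] + others)

colors = pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'])
releveled = relevel(colors, 'green')
print(list(releveled.cat.categories))
```

The values themselves are unchanged; only the order of the levels differs, which is what determines the reference level.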
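This can also be simplified as follows. Note that drop_first=True drops the first category level, which is blue here because the levels sort alphabetically, so green is not the reference level unless the categories are reordered first: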
import pandas as pd
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
dtype='category')}
df = pd.DataFrame(d)
df = pd.get_dummies(df,prefix='color',drop_first=True)
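Because drop_first removes whichever category level comes first, one way to make green the dropped (reference) level is to declare the category order explicitly before encoding. This is a minimal sketch under that assumption; the levels list and the variable name encoded are mine, and dtype=int is only there to keep 0/1 columns on newer pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 3, 3],
    'b': [5, 6, 7, 8, 4, 4, 4],
    'c': [9, 10, 11, 12, 3, 3, 3],
    'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
})

# put the desired reference level first so that drop_first removes it
levels = ['green', 'blue', 'orange', 'red']
df['d'] = pd.Categorical(df['d'], categories=levels)

# only the categorical column 'd' is encoded; numeric columns pass through
encoded = pd.get_dummies(df, prefix='color', drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With this ordering the result has color_blue, color_orange and color_red, and the green row is all zeros in those columns.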