Retrieve statistics from R lm regression into pandas with rpy2
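Inspired by the linear models example in the documentation, I would like to print a nice summary after running an lm command.
When I run (see the last line of the example)
print(base.summary(stats.lm('foo ~ bar')))
I get a full function listing, which begins as follows: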
Call:
(function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
{
ret.x <- x
ret.y <- y
cl <- match.call()
mf <- match.call(expand.dots = FALSE)
The desired R output is at the bottom:
Coefficients:
    Estimate Std. Error t value Pr(>|t|)
foo   5.0320     0.2202   22.85 9.55e-15 ***
bar   4.6610     0.2202   21.16 3.62e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared: 0.9818, Adjusted R-squared: 0.9798
F-statistic: 485.1 on 2 and 18 DF, p-value: < 2.2e-16
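This is a bit of a problem as it stands, but it becomes unworkable when the data passed to lm is a pandas.DataFrame, because base.summary seems to want to print all of the data.
Is there any way to get the nicely formatted R output in a pd.DataFrame without all the extra gubbins?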
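As a crude stopgap, the printed text can simply be cut down to the part that matters. The sketch below assumes the printed summary is already available as a Python string; `summary_text` is a hypothetical stand-in abbreviated from the listing above, and `trim_summary` is an illustrative helper, not part of rpy2. It keeps everything from the last line that starts with "Coefficients:" onward.

```python
def trim_summary(summary_text, marker="Coefficients:"):
    """Return the tail of an R summary printout, starting at `marker`.

    The last matching line is used, since the wanted block sits at the
    bottom of the printout. If no line starts with `marker`, the text
    is returned unchanged.
    """
    lines = summary_text.splitlines()
    # Walk backwards so that the final occurrence of the marker wins.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip().startswith(marker):
            return "\n".join(lines[i:])
    return summary_text


# Hypothetical captured output, abbreviated from the listing above.
summary_text = """Call:
(function (formula, data, subset, weights, na.action, method = "qr",
{
cl <- match.call()
Coefficients:
Estimate Std. Error t value Pr(>|t|)
foo 5.0320 0.2202 22.85 9.55e-15 ***
Residual standard error: 0.6964 on 18 degrees of freedom"""

print(trim_summary(summary_text))
```

This only tidies the printed text; it does not give numbers that can be worked with in pandas.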
For posterity, here is a really nice way to get the numbers from lm back into a pd.DataFrame (thanks to @Metrics for the tip about broom):
import numpy as np
import pandas as pd

def _run_regression(data, y_name):
    """
    Run a linear regression, in R, using `data` with dependent variable
    `y_name` and independent variables all other columns of `data`.
    """
    from rpy2.robjects.packages import importr
    stats = importr('stats')
    broom = importr('broom')
    # '.' on the right-hand side means "every column of data except y_name"
    lm = broom.tidy(stats.lm('%s ~ . ' % y_name, data=data))
    return _extract_R_df(lm).set_index('term')

def _extract_R_df(df):
    """
    Extract the R DataFrame `df` as a pd.DataFrame. This slightly
    longer method is necessary because `np.asarray(df)` drops the
    exponent on very small numbers!
    """
    # Convert column by column; df.rx(name) is a one-column R data frame
    return pd.DataFrame({name: np.asarray(df.rx(name))[0] for name in df.names})
This produces a DataFrame similar to the following:
                 estimate   p.value     statistic     std.error
term
(Intercept) -3.709995e-16  0.000056 -4.712554e+00  7.872579e-17
x_is         8.000000e-01  0.000000  1.067919e+16  7.491204e-17
v_is         2.000000e-01  0.000000  2.107838e+15  9.488394e-17
d_ij        -2.000000e-01  0.000000 -2.970482e+14  6.732913e-16
d1           1.000000e-01  0.000000  4.045155e+14  2.472093e-16
d2           3.000000e-01  0.000000  5.320521e+14  5.638545e-16
d3           7.000000e-01  0.000000  1.779338e+15  3.934048e-16
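The formula '%s ~ . ' relies on R's `.` shorthand for "all other columns of data". If only some columns should enter the regression, the formula string can be spelled out instead. A minimal sketch, assuming the column names are valid R names; `build_formula` and the dependent variable name `y` are hypothetical, used here only for illustration:

```python
def build_formula(y_name, x_names):
    """Build an R formula string such as 'y ~ x1 + x2'."""
    if not x_names:
        # 'y ~ 1' is R's notation for an intercept-only model
        return "%s ~ 1" % y_name
    return "%s ~ %s" % (y_name, " + ".join(x_names))


# Column names taken from the table above, plus a hypothetical 'y'.
columns = ["y", "x_is", "v_is", "d_ij"]
formula = build_formula("y", [c for c in columns if c != "y"])
print(formula)  # y ~ x_is + v_is + d_ij
```

The resulting string could be passed to stats.lm in place of '%s ~ . ' % y_name.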