Difficulty running a logistic regression in R
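I'm having some difficulty running a logistic regression in R using glm. There are two ways to pass a binary response variable to glm to perform a logistic regression. You can pass the data in a serial format (one row per observation, with the response coded 0 or 1 and the independent variables taking whatever values you have), or you can pass it as a table with at least three columns: the first giving the number of trials, the second the number of successes, and the third the independent variable.
When I use glm with the latter format (e.g. a data frame with three columns), I get the expected output, but when I feed in the data in the former (serial) format, I don't get the expected answer.
Here is an example: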
prices <- c(89.99, 99.99, 149.99)
non_purchases <- c(11907, 2024, 5046)
purchases <- c(1369, 215, 31)
trials <- cbind(non_purchases, purchases)
model <- glm(trials ~ prices, family=binomial(link="logit"))
> summary(model)
Call:
glm(formula = trials ~ prices, family = binomial)
Deviance Residuals:
1 2 3
1.332 -4.440 1.553
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.923863 0.241677 -7.96 1.71e-15 ***
prices 0.044995 0.002593 17.35 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 715.832 on 2 degrees of freedom
Residual deviance: 23.897 on 1 degrees of freedom
AIC: 49.228
Number of Fisher Scoring iterations: 4
In this case I get the expected values, but with the serial data:
> head(atable)
ordered sale_price
1 0 149.99
2 0 149.99
3 0 149.99
4 0 149.99
5 0 149.99
6 0 149.99
> summary(atable)
ordered sale_price
Min. :0.00000 Min. : 89.99
1st Qu.:0.00000 1st Qu.: 89.99
Median :0.00000 Median : 89.99
Mean :0.07843 Mean :105.87
3rd Qu.:0.00000 3rd Qu.: 99.99
Max. :1.00000 Max. :149.99
> conv_model <- glm(ordered ~ sale_price, family=binomial(link="logit"), data=atable)
> summary(conv_model)
Call:
glm(formula = ordered ~ sale_price, family = binomial(link = "logit"),
data = atable)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.4743 -0.4743 -0.4743 -0.1209 3.1376
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.549136 0.095341 5.76 8.43e-09 ***
sale_price -0.019949 0.001002 -19.90 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11322 on 20591 degrees of freedom
Residual deviance: 10623 on 20590 degrees of freedom
AIC: 10627
Number of Fisher Scoring iterations: 7
And just to show that it is the same data:
> table(atable$ordered, atable$sale_price)
89.99 99.99 149.99
0 11907 2024 5046
1 1369 215 31
The output I get is completely different, and I'm thoroughly confused. Can anyone help? I assume I'm getting something simple wrong.
I think your problem is that you are switching the definition of "success".
From ?glm:
For binomial and quasibinomial families the response can also be specified as... a two-column matrix with the columns giving the numbers of successes and failures.
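One way to check that column order is with an intercept-only fit on the counts from the question. This is only a sketch: the names fit0, p_hat, and non_purchase_rate are introduced here for illustration. If the first column is what glm treats as "success", the back-transformed intercept should equal the pooled rate of whatever is in that first column.

```r
# Counts taken from the question.
non_purchases <- c(11907, 2024, 5046)
purchases     <- c(1369, 215, 31)

# Intercept-only model with non_purchases in the FIRST column.
fit0 <- glm(cbind(non_purchases, purchases) ~ 1, family = binomial)

# Back-transform the intercept from the logit scale to a probability.
p_hat <- unname(plogis(coef(fit0)))

# Pooled non-purchase rate across all three price points.
non_purchase_rate <- sum(non_purchases) / sum(non_purchases + purchases)

# If the two numbers agree, glm is modelling the first column as "success".
print(c(p_hat = p_hat, non_purchase_rate = non_purchase_rate))
```

Here the fitted probability tracks the non-purchase rate rather than the purchase rate.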
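So the first column is the "successes". In your code you use cbind(non_purchases, purchases), which makes non_purchases the "success" column, but in your table a non-purchase is coded as 0, meaning failure. The two models therefore describe opposite outcomes. Using the code below, we get the same results from both formats: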
prices <- c(89.99, 99.99, 149.99)
non_purchases <- c(11907, 2024, 5046)
purchases <- c(1369, 215, 31)
trials <- cbind(non_purchases, purchases)
dd = data.frame(
price = c(rep(prices, non_purchases), rep(prices, purchases)),
purchase = c(rep(0, sum(non_purchases)), rep(1, sum(purchases)))
)
coef(glm(purchase ~ price, data = dd, family = "binomial"))
# (Intercept) price
# 1.92386320 -0.04499477
coef(glm(cbind(purchases, non_purchases) ~ prices, family = "binomial"))
# (Intercept) prices
# 1.92386320 -0.04499477
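As a further check, the sketch below compares the two column orders directly. It also tries the third response form documented in ?glm: a proportion of successes, with the number of trials supplied as weights. The object names (fit_buy, fit_nobuy, fit_prop, n_trials, flip_ok, prop_ok) are made up for this illustration. Only the counts and prices come from the question.

```r
prices        <- c(89.99, 99.99, 149.99)
non_purchases <- c(11907, 2024, 5046)
purchases     <- c(1369, 215, 31)

# Purchases as "success" (first column), matching the 0/1 coding of the serial data.
fit_buy <- glm(cbind(purchases, non_purchases) ~ prices, family = binomial)

# Non-purchases as "success", as in the question's first model.
fit_nobuy <- glm(cbind(non_purchases, purchases) ~ prices, family = binomial)

# With the logit link, swapping the columns should negate every coefficient.
flip_ok <- isTRUE(all.equal(coef(fit_buy), -coef(fit_nobuy), tolerance = 1e-6))

# Third form: observed proportion as the response, number of trials as weights.
n_trials <- purchases + non_purchases
fit_prop <- glm(purchases / n_trials ~ prices, weights = n_trials, family = binomial)
prop_ok  <- isTRUE(all.equal(coef(fit_buy), coef(fit_prop), tolerance = 1e-6))

print(c(flip_ok = flip_ok, prop_ok = prop_ok))
```

If both checks come back TRUE, the aggregated and serial formats agree once "success" means the same thing in each.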