Error when running Poisson regression with a binary outcome

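I am trying to run a Poisson regression to predict a common binary outcome.

This is my first time using dput; if I have used it incorrectly, please let me know so I can fix it.

Sample data: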

df <- structure(list(id = 1:30, sex = structure(c(1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L), .Label = c("Female", "Male"
), class = "factor"), migStat = structure(c(1L, 2L, 1L, 1L, 1L, 
1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("Australian-born", 
"Migrant"), class = "factor"), mhAreaBi = structure(c(1L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("Metropolitan", 
"Regional"), class = "factor"), empStatBi = structure(c(2L, 2L, 
1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Student / employed", 
"Unemployed"), class = "factor"), pensBenBi = structure(c(1L, 
2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 
1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), .Label = c("No benefit", 
"In receipt of pension benefit"), class = "factor"), maritStatBi = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("Married (including de facto)", 
"Not married"), class = "factor"), cto = structure(c(1L, 2L, 
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L), .Label = c("No", 
"Yes"), class = "factor")), .Names = c("id", "sex", "migStat", 
"mhAreaBi", "empStatBi", "pensBenBi", "maritStatBi", "cto"), row.names = c(NA, 
-30L), class = "data.frame")

When I run the regression with glm in R, I get an error:

fit <- glm(cto ~ sex + migStat + mhAreaBi + empStatBi + pensBenBi + maritStatBi, df, family = poisson)

Error in if (any(y < 0)) stop("negative values not allowed for the 'Poisson' family") : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(y, 0) : ‘<’ not meaningful for factors

The same error is briefly explained in this thread:

Because the "<" operator is not defined for factors the result that is passed to if is of length 0. Setting the factor variable on the RHS and using the integer values on the LHS succeeds.
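The mechanism can be reproduced in a few lines. This is only a minimal sketch: the vector `y` is a made-up stand-in for the `cto` column, and the `tryCatch` wrapper is there just to capture the message. As far as I can tell, comparing a factor with `<` returns `NA` for every element (along with the warning shown above), so `any()` returns `NA`, which is what the "missing value where TRUE/FALSE needed" message refers to.

```r
# Hypothetical stand-in for the factor outcome `cto`
y <- factor(c("No", "Yes", "Yes"))

# `<` is not meaningful for factors: R warns and returns NA for each element
cmp <- suppressWarnings(y < 0)
print(cmp)

# any() over all-NA values is NA, and `if (NA)` raises an error
res <- tryCatch(
  if (any(cmp)) "negative" else "ok",
  error = function(e) conditionMessage(e)
)
print(res)
```

The check inside the Poisson family has the same shape, `if (any(y < 0))`, so a factor response fails before any fitting takes place.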

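When I convert the outcome to an integer, no error occurs; however, this:

  1. seems to defeat the purpose of predicting a binary outcome (unless a numeric variable with range 0-1 is treated the same as a factor variable with two levels); and
  2. seems unnecessary (at least according to this post, which uses geeglm from geepack to predict a binary outcome [unfortunately, I get the same error when adapting the code to my own dataset])

Questions:

Can I get a further explanation of the error?

If I convert the outcome to an integer in the range 0-1, will glm treat it as a binary variable? If not, is there a more appropriate way to run a regression on a common binary outcome?

I think the best option is: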

df$cto_binary <- as.numeric(df$cto == "Yes")
fit <- glm(cto_binary ~ sex + migStat + mhAreaBi + empStatBi + pensBenBi + maritStatBi, 
           df, family = poisson)

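This way, your code states explicitly which value counts as 1/success in the binary outcome, and you won't be tripped up by things like factor level ordering. Note that in R, as.numeric(c(FALSE, TRUE)) gives c(0, 1), so you always know what you will get from a logical comparison.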
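To illustrate the level-ordering point, here is a small sketch using a made-up four-element factor (not the question's data). Converting a factor directly returns its internal level codes, which start at 1 and depend on the level order, whereas the logical comparison always yields 0/1.

```r
# Hypothetical example factor with the same levels as `cto`
cto <- factor(c("No", "Yes", "Yes", "No"), levels = c("No", "Yes"))

# Direct conversion returns the internal level codes, not 0/1
codes <- as.numeric(cto)
print(codes)

# The logical comparison gives an explicit 0/1 coding
cto_binary <- as.numeric(cto == "Yes")
print(cto_binary)

# Reordering the levels changes the codes but not the 0/1 variable
cto_rev <- factor(cto, levels = c("Yes", "No"))
codes_rev <- as.numeric(cto_rev)
binary_rev <- as.numeric(cto_rev == "Yes")
print(codes_rev)
print(binary_rev)
```

A Poisson fit would accept the level codes without complaint, since they are non-negative integers, so using the explicit comparison avoids silently modeling the wrong coding.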