Why does this specific sequence in the independent variable cause a bug in R GLM?

Question

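GLM is showing coefficients for both "Yes" and "No", which is wrong. The GLM function normally dummy-codes a binary factor automatically, so that only one level gets a coefficient.

So in this case it should report a coefficient for "Yes", and "No" should have none because it is the reference level.

I haven't run into this with any other independent variables coded the same way; this particular sequence of Yes, No and NA seems to be the problem. Why does it do this?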

library(MASS)  # provides glm.nb()

# Generate the specific sequence of Yes and No
c <- rep("No", 5)
d <- c("Yes", "No", "Yes", "No", "NA", "Yes")

# Concatenate into a data frame and generate the dependent variable f
df <- data.frame(e = c(c, d),
                 f = sample(0:4, 11, replace = TRUE))

# Convert e to a factor
df$e <- as.factor(df$e)

# Fit a negative binomial GLM and inspect the coefficients
nbd_attend <- glm.nb(f ~ e, data = df)
summary(nbd_attend)

Answer 1

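You have put "NA" into your data as a character string, not as the special missing value NA. If you replace the first line below with the second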

d <- c("Yes", "No", "Yes", "No", "NA", "Yes")  # bad
d <- c("Yes", "No", "Yes", "No", NA, "Yes")    # good

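then it works.

Basically, you created a factor with three levels, and "NA" comes first alphabetically, so it becomes the reference level. That is why both "No" and "Yes" get a coefficient.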
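If the strings are already in your data (for example, after building the vector elsewhere), the same fix can be applied after the fact. A minimal sketch, assuming the column is still a plain character vector; the data and the name `e` here are made up for illustration:

```r
# Hypothetical example data: "NA" here is a string, not a missing value
e <- c("No", "Yes", "NA", "Yes", "No")

# Replace the literal string "NA" with a true missing value
e[e == "NA"] <- NA

# Build the factor afterwards, so "NA" never becomes a level
e <- factor(e)

levels(e)      # "No" "Yes"
sum(is.na(e))  # 1
```

Doing the replacement before calling `factor()` matters: once "NA" is a level, setting those entries to missing still leaves an unused "NA" level behind unless it is dropped, for instance with `droplevels()`.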

levels(df$e)
# [1] "NA"  "No"  "Yes"
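To see how the extra level changes the dummy coding, you can look at the design matrix directly. This is only an illustrative sketch using base R's `model.matrix()` with the default treatment contrasts; the names `d_bad`, `d_good`, `mm_bad` and `mm_good` are made up for the example:

```r
d_bad  <- c("Yes", "No", "Yes", "No", "NA", "Yes")  # "NA" is a string
d_good <- c("Yes", "No", "Yes", "No", NA, "Yes")    # NA is a missing value

# Three levels: "NA" is the reference, so "No" and "Yes" each get a column
mm_bad <- model.matrix(~ e, data = data.frame(e = factor(d_bad)))
colnames(mm_bad)   # "(Intercept)" "eNo" "eYes"

# Two levels: "No" is the reference, so only "Yes" gets a column
mm_good <- model.matrix(~ e, data = data.frame(e = factor(d_good)))
colnames(mm_good)  # "(Intercept)" "eYes"
```

Each non-intercept column corresponds to one coefficient in the fitted model, which matches the two coefficients seen in the question and the single one expected after the fix.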

r

glm