Calculate AUC curve for responses in form cbind(Count_1, Count_0)

Question

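I trained a binomial model using glm(Xtrain, ytrain, formula='cbind(Response, n - Response) ~ features', family='binomial'), where ytrain is a response matrix with one column of "yes" counts and one column of "no" counts.

My test responses are a response matrix of the same form. However, the predict() function returns probabilities, one for each row of the training data. I now want to generate an AUC curve using the ROCR or AUC package, but my predictions and observations are in different formats. Does anyone know how to do this?

OK, adding an example. Forgive that it is meaningless/rank deficient/small; I just want to illustrate my situation.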

plants <- c('Cactus', 'Tree', 'Cactus', 'Tree', 'Flower', 'Tree', 'Tree')
sun <- c('Full', 'Half', 'Half', 'Full', 'Full', 'Half', 'Full')
water <- c('N', 'Y', 'Y', 'N', 'Y', 'N', 'N')
died <- c(10, 10, 8, 2, 15, 20, 12)
didntdie <- c(2, 10, 8, 20, 10, 10, 10)
df <- data.frame(died, didntdie, plants, sun, water)
dftrain <- head(df, 5)
dftest <- tail(df, 2)
model <- glm("cbind(died, didntdie) ~ plants + sun + water", data=dftrain, family="binomial")

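At this point, predict(model, dftest) returns the log-odds (which give the probability of death) for the last two sets of features in my data frame. Now I want to calculate the AUC curve. My observations are in dftest[c('died','didntdie')]. My predictions are essentially probabilities. AUC, ROCR, etc. expect both the predictions and the observations to be lists of Bernoulli responses. I can't find documentation on how to use this response matrix instead. Any help is appreciated.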

Answer 1

For starters, you can expand the data frame to synthesize binary outcomes, with the counts passed through the weights= argument of glm().

obs <- died + didntdie
df <- df[rep(1:length(obs), each= 2),] # one row for died and one for didn't
df$survived <- rep(c(0L,1L), times=length(obs)) # create binary outcome for survival
df$weight <- c(rbind(died, didntdie)) # assign weights
df

#     died didntdie plants  sun water survived weight
# 1     10        2 Cactus Full     N        0     10
# 1.1   10        2 Cactus Full     N        1      2
# 2     10       10   Tree Half     Y        0     10
# 2.1   10       10   Tree Half     Y        1     10
# 3      8        8 Cactus Half     Y        0      8
# 3.1    8        8 Cactus Half     Y        1      8
# 4      2       20   Tree Full     N        0      2
# 4.1    2       20   Tree Full     N        1     20
# 5     15       10 Flower Full     Y        0     15
# 5.1   15       10 Flower Full     Y        1     10
# 6     20       10   Tree Half     N        0     20
# 6.1   20       10   Tree Half     N        1     10
# 7     12       10   Tree Full     N        0     12
# 7.1   12       10   Tree Full     N        1     10

model <- glm(survived ~ plants + sun + water, data=df, family="binomial", weights = weight)
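As a sanity check on this reshaping, the following is a minimal sketch (not part of the original answer) that fits both the grouped cbind() form and the weighted binary form on the question's example data and compares coefficients. The names m_grouped, m_weighted, long and coef_match are illustrative only. Because the grouped model predicts death and the binary model predicts survival, the coefficients should agree up to sign.

```r
# Rebuild the example data from the question
df <- data.frame(
  died     = c(10, 10, 8, 2, 15, 20, 12),
  didntdie = c(2, 10, 8, 20, 10, 10, 10),
  plants   = c("Cactus", "Tree", "Cactus", "Tree", "Flower", "Tree", "Tree"),
  sun      = c("Full", "Half", "Half", "Full", "Full", "Half", "Full"),
  water    = c("N", "Y", "Y", "N", "Y", "N", "N")
)

# Grouped fit: the first cbind() column counts as "success", so this models P(died)
m_grouped <- glm(cbind(died, didntdie) ~ plants + sun + water,
                 data = df, family = binomial)

# Long format: two rows per group (died, then survived), weighted by the count
long <- df[rep(seq_len(nrow(df)), each = 2), ]
long$survived <- rep(c(0L, 1L), times = nrow(df))
long$weight   <- c(rbind(df$died, df$didntdie))

# Weighted binary fit: this models P(survived)
m_weighted <- glm(survived ~ plants + sun + water,
                  data = long, family = binomial, weights = weight)

# Same model up to sign, since one predicts death and the other survival
coef_match <- isTRUE(all.equal(coef(m_grouped), -coef(m_weighted),
                               tolerance = 1e-4))
print(coef_match)
```

If the comparison fails on your own data, the usual suspect is the row order of the weights not matching the order of the 0/1 outcome column.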

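If you want to do a train/test split, you need another expansion, this time replicating each row according to its weight. Otherwise your test set is not random, or at least not randomized at the level of the individual plant, which may invalidate your results (depending on the conclusions you want to draw).

So you would do something like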

df <- df[rep(1:nrow(df), times = df$weight),]
model <- glm(survived ~ plants + sun + water, data=df, family="binomial") 
# note the model does not change

library(pROC)
# pROC::auc() takes the observed response first, then the predictor
auc(df$survived, model$fitted.values)

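Note that this is the in-sample AUC. You should use a random holdout (or, better, cross-validation) to estimate the out-of-sample AUC. Splitting on the first N rows of a data.frame is not a good idea unless the row order has already been randomized.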
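A random holdout along these lines could look like the sketch below. It is an illustration under stated assumptions rather than the answer's own code: the 70/30 split, the seed, and the helper auc_rank are all hypothetical choices, and the AUC is computed in base R from the Mann-Whitney rank-sum identity so the example runs without extra packages.

```r
# Rebuild the example data from the question
df <- data.frame(
  died     = c(10, 10, 8, 2, 15, 20, 12),
  didntdie = c(2, 10, 8, 20, 10, 10, 10),
  plants   = c("Cactus", "Tree", "Cactus", "Tree", "Flower", "Tree", "Tree"),
  sun      = c("Full", "Half", "Half", "Full", "Full", "Half", "Full"),
  water    = c("N", "Y", "Y", "N", "Y", "N", "N")
)

# One row per individual plant: survived = 0 for "died", 1 for "didntdie"
n_each <- c(rbind(df$died, df$didntdie))
ind <- df[rep(rep(seq_len(nrow(df)), each = 2), times = n_each),
          c("plants", "sun", "water")]
ind$survived <- rep(rep(c(0L, 1L), times = nrow(df)), times = n_each)

# Hypothetical helper: AUC via the rank-sum identity (ties receive mid-ranks)
auc_rank <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Random 70/30 split at the individual-plant level (assumed proportions)
set.seed(1)
test_idx <- sample(nrow(ind), size = round(0.3 * nrow(ind)))
train <- ind[-test_idx, ]
test  <- ind[test_idx, ]

# Fit on the training rows, score the held-out rows as probabilities
fit <- glm(survived ~ plants + sun + water, data = train, family = binomial)
p_test <- predict(fit, newdata = test, type = "response")

holdout_auc <- auc_rank(test$survived, p_test)
print(holdout_auc)
```

With a data set this small the holdout estimate will vary a lot from seed to seed, which is one reason cross-validation is the better option here.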

r

glm

cross-validation

auc