GLM regression prediction: understanding which factor level is success

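I built a binomial glm model. The model predicts the outcome between two possible classes: AD or Control. The response is a factor with levels {AD, Control}. I use this model to predict and get a probability for each sample, but it is not clear to me whether a probability above 0.5 indicates AD or Control.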

Here is my dataset:

> head(example)
          cleaned_mayo$Diagnosis pca_results$x[, 1]
1052_TCX                      AD          0.9613241
1104_TCX                      AD         -0.9327390
742_TCX                       AD          1.6908874
1945_TCX                 Control          0.6819104
134_TCX                       AD          0.5184748
11386_TCX                Control          0.4669661

Here is my code for fitting the model and making predictions:

# Randomize rows of top performer
example<- example[sample(nrow(example)),]

# Subset data for training and testing
N_train<- round(nrow(example)*0.75)
train<- example[1:N_train,]
test<- example[(N_train+1):nrow(example),]
colnames(train)[1:2]<- c("Diagnosis", "Eigen_gene")
colnames(test)[1:2]<- c("Diagnosis", "Eigen_gene")

# Build model and predict   
model_IFGyel<- glm(Diagnosis ~ Eigen_gene, data = train, family = binomial())
pred<- predict(model_IFGyel, newdata= test, type= "response")

# Convert predictions to accuracy metric
pred[which(pred<0.5)]<- "AD"
pred[which(pred!="AD")]<- "Control"
test$Diagnosis<- as.character(test$Diagnosis)
example_acc<- sum(test$Diagnosis==pred, na.rm = T)/nrow(test)

Any help clarifying what these predicted probabilities mean would be greatly appreciated.

From ?glm, we note:

Details:

A typical predictor has the form ‘response ~ terms’ where ‘response’ is the (numeric) response vector and ‘terms’ is a series of terms which specifies a linear predictor for ‘response’. For ‘binomial’ and ‘quasibinomial’ families the response can also be specified as a ‘factor’ (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures.

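The relevant part is the parenthetical about factor responses: the first level denotes failure and all others success. Assuming you didn't specify the levels yourself (i.e. R's default assignment happened, which sorts the levels alphabetically), AD is the failure and Control is the success. The coefficients and the predicted values are therefore in terms of the probability that an observation belongs to the Control class, so a predicted probability above 0.5 points to Control.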
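One way to check this is to inspect the level order and compare the fitted probabilities between the two classes. The sketch below uses simulated data (the data frame `d` and all of its values are made up for illustration, not the dataset from the question), built so that Control is more likely when `Eigen_gene` is large.

```r
# Simulated stand-in for the real data (values are made up for illustration)
set.seed(1)
n <- 200
Eigen_gene <- rnorm(n)
# Control is more likely when Eigen_gene is large
p_control <- plogis(2 * Eigen_gene)
Diagnosis <- factor(ifelse(runif(n) < p_control, "Control", "AD"))
d <- data.frame(Diagnosis, Eigen_gene)

# Default levels are sorted alphabetically, so "AD" comes first (failure)
levels(d$Diagnosis)

fit <- glm(Diagnosis ~ Eigen_gene, data = d, family = binomial())
p <- predict(fit, type = "response")

# If the model describes P(Control), the mean fitted probability
# should be higher among the Control rows than among the AD rows
tapply(p, d$Diagnosis, mean)
```

On your own data, `levels(train$Diagnosis)` tells you directly which level glm treats as failure (the first one).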

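If you want to change that, use factor(...., levels = c('Control', 'AD')) so that AD becomes the success level, or simply take 1 - prob(Control), i.e. 1 - the predicted value, to get the probability in terms of AD.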
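Both options can be sketched side by side. This is a minimal example on simulated data (the data frame `d` and the column `Diagnosis_ad` are hypothetical names introduced here), and the two approaches should give the same probabilities for AD up to numerical tolerance.

```r
# Simulated stand-in for the real data (values are made up for illustration)
set.seed(2)
n <- 200
Eigen_gene <- rnorm(n)
Diagnosis <- factor(ifelse(runif(n) < plogis(2 * Eigen_gene), "Control", "AD"))
d <- data.frame(Diagnosis, Eigen_gene)

# Option 1: keep the default levels (AD = failure) and flip the probability
fit_control <- glm(Diagnosis ~ Eigen_gene, data = d, family = binomial())
p_ad_flipped <- 1 - predict(fit_control, type = "response")

# Option 2: put Control first so that AD becomes the "success" level
d$Diagnosis_ad <- factor(d$Diagnosis, levels = c("Control", "AD"))
fit_ad <- glm(Diagnosis_ad ~ Eigen_gene, data = d, family = binomial())
p_ad_direct <- predict(fit_ad, type = "response")

# Compare the two sets of AD probabilities
all.equal(unname(p_ad_flipped), unname(p_ad_direct), tolerance = 1e-6)

# Turn probabilities into labels without overwriting the numeric vector
pred_class <- ifelse(p_ad_direct > 0.5, "AD", "Control")
mean(pred_class == as.character(d$Diagnosis))
```

Keeping the class labels in a separate vector such as `pred_class` avoids mixing numbers and strings in `pred`, which makes the thresholding step easier to read.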