GLM regression prediction: understanding which factor level is success
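I built a binomial glm model. The model predicts one of two possible classes: AD or Control. The response variable is a factor with levels {AD, Control}. I use this model to predict a probability for each sample, but it is not clear to me whether a probability above 0.5 indicates AD or Control.
Here is my dataset: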
> head(example)
cleaned_mayo$Diagnosis pca_results$x[, 1]
1052_TCX AD 0.9613241
1104_TCX AD -0.9327390
742_TCX AD 1.6908874
1945_TCX Control 0.6819104
134_TCX AD 0.5184748
11386_TCX Control 0.4669661
Here is my code for fitting the model and making predictions:
# Randomize rows of top performer
example<- example[sample(nrow(example)),]
# Subset data for training and testing
N_train<- round(nrow(example)*0.75)
train<- example[1:N_train,]
test<- example[(N_train+1):nrow(example),]
colnames(train)[1:2]<- c("Diagnosis", "Eigen_gene")
colnames(test)[1:2]<- c("Diagnosis", "Eigen_gene")
# Build model and predict
model_IFGyel<- glm(Diagnosis ~ Eigen_gene, data = train, family = binomial())
pred<- predict(model_IFGyel, newdata= test, type= "response")
# Convert predicted probabilities to class labels, then compute accuracy
pred<- ifelse(pred < 0.5, "AD", "Control")
test$Diagnosis<- as.character(test$Diagnosis)
example_acc<- sum(test$Diagnosis==pred, na.rm = TRUE)/nrow(test)
Any help clarifying what these predicted probabilities mean would be much appreciated.
From ?glm we note:
Details:
A typical predictor has the form ‘response ~ terms’ where
‘response’ is the (numeric) response vector and ‘terms’ is a
series of terms which specifies a linear predictor for ‘response’.
For ‘binomial’ and ‘quasibinomial’ families the response can also
be specified as a ‘factor’ (when the first level denotes failure
and all others success) or as a two-column matrix with the columns
giving the numbers of successes and failures.
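The key part is the parenthetical: for a factor response, the first level denotes failure and all others success. Assuming you have not specified the levels yourself (i.e. R's default alphabetical ordering applies), AD will be failure and Control will be success. The coefficients and the predicted values are therefore expressed in terms of the probability that an observation is in the Control class, so a predicted probability above 0.5 points to Control.
If you want to change that, use factor(...., levels = c('Control', 'AD')), or just compute 1 - prob(Control) (that is, 1 - predicted value) to get the probability in terms of AD.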
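As a minimal sketch of the two options above, the following R snippet uses simulated data rather than the asker's dataset. The data frame `dat`, the seed, and the effect size are made up purely for illustration. It checks which level glm() treats as success, then shows that reordering the levels and taking 1 - predicted value give the same AD probabilities.

```r
# Simulated stand-in for the question's data (all values are made up).
set.seed(1)
n <- 200
eigen_gene <- rnorm(n)
# In this toy setup, a higher Eigen_gene makes "Control" more likely.
p_control <- plogis(1.5 * eigen_gene)
diagnosis <- factor(ifelse(runif(n) < p_control, "Control", "AD"))
dat <- data.frame(Diagnosis = diagnosis, Eigen_gene = eigen_gene)

# Default factor levels are sorted alphabetically: "AD" first, "Control" second.
levels(dat$Diagnosis)

# glm() treats the first level as failure, so this models P(Diagnosis == "Control").
fit_control <- glm(Diagnosis ~ Eigen_gene, data = dat, family = binomial())
p_ctrl <- predict(fit_control, type = "response")

# Option 1: reorder the levels so that "AD" becomes the success level.
dat_ad <- dat
dat_ad$Diagnosis <- factor(dat_ad$Diagnosis, levels = c("Control", "AD"))
fit_ad <- glm(Diagnosis ~ Eigen_gene, data = dat_ad, family = binomial())
p_ad <- predict(fit_ad, type = "response")

# Option 2: keep the original fit and take the complement.
p_ad_complement <- 1 - p_ctrl

# Both routes should agree up to numerical tolerance.
all.equal(unname(p_ad), unname(p_ad_complement), tolerance = 1e-6)

# Turn probabilities into labels without hard-coding which level is "success":
# the second level is the one the fitted probability refers to.
lv <- levels(dat$Diagnosis)
pred_class <- ifelse(p_ctrl > 0.5, lv[2], lv[1])
accuracy <- mean(pred_class == as.character(dat$Diagnosis))
```

Reading the labels from levels() instead of typing "AD" and "Control" by hand keeps the thresholding step correct even if the level order is changed later.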