How to apply a diagnostic prediction model to new data

Question

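With some help, I ran LASSO regression on bootstrapped and multiply imputed datasets to build a diagnostic model that distinguishes disease A from disease B using a large number of predictors.

In the end, I obtained the following table of selected variables (all categorical, with yes/no values) and their coefficients: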

Predictor	mean regression coefficient
Intercept	10.141
var1	1.671
Var2	-1.971
Var3	-5.266
Var4	-2.244
Var5	5.266

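My question is: how can I use the table above to predict whether a new patient (one not used to build the model) has disease A or disease B?

I came up with the following:

Intercept + (1.671 (var1) x 0 or 1) - (1.971 (var2) x 0 or 1) - (5.266 (var3) x 0 or 1) ..... + (5.266 (var5) x 0 or 1) = X

Probability of having disease A (coded as 1 in the dataset) = e^X / (1 + e^X)

But is this approach correct?

I hope someone can help me with this!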

Answer 1

Yes, since what you describe is logistic regression, your steps are correct. These are the steps for computing predictions from your model:

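a) Multiply each coefficient by its x variable, making sure to include the intercept (if applicable) with a value of 1

b) Sum the results of a); this sum is the log odds

c) Exponentiate the log odds to get the odds

d) Compute the final probability as odds / (1 + odds)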
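
As a worked example of steps a) to d), here is a minimal Python sketch that scores a single patient with the coefficients from the question's table. The patient's values and the function name `predict_probability` are hypothetical, chosen only for illustration.

```python
import math

# Mean coefficients from the question's table
intercept = 10.141
coefficients = {
    "var1": 1.671,
    "var2": -1.971,
    "var3": -5.266,
    "var4": -2.244,
    "var5": 5.266,
}

def predict_probability(patient):
    """Return P(disease A) for a dict of 0/1 predictor values (hypothetical helper)."""
    # Steps a) and b): multiply each coefficient by its value and sum; this is the log odds
    log_odds = intercept + sum(coefficients[name] * patient[name] for name in coefficients)
    # Steps c) and d): exponentiate to get the odds, then convert to a probability
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# Hypothetical new patient (values are made up for illustration)
new_patient = {"var1": 1, "var2": 0, "var3": 1, "var4": 1, "var5": 0}
probability = predict_probability(new_patient)
print(f"P(disease A) = {probability:.3f}")
```

For this patient the log odds are 10.141 + 1.671 - 5.266 - 2.244 = 4.302, which the last two steps convert into a probability between 0 and 1.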

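You didn't mention a specific language, but here is some pseudocode in Python using pandas/numpy, assuming a DataFrame x_variables (one row per patient, including an intercept column of 1s) and a pandas Series coefficients indexed by the same column names.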

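import numpy as np

# put predictors in rows so each row can be multiplied by its coefficient
transpose_predictors = x_variables.transpose()
scores = transpose_predictors.mul(coefficients, axis = 0)
# sum over predictors: one log odds value per patient
log_odds = scores.sum(axis = 0, skipna = True)
# exponentiate to get the odds, then convert to a probability
odds = np.exp(log_odds)
final_scores = odds / (1 + odds)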
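
For a version that runs end to end, the same calculation can be written as a dot product. This is only a sketch: the three patients are made up, and the `Intercept` column of 1s is an assumption about how the new data would be prepared so that the intercept is included.

```python
import numpy as np
import pandas as pd

# Mean coefficients from the question's table, indexed by predictor name
coefficients = pd.Series(
    {"Intercept": 10.141, "var1": 1.671, "var2": -1.971,
     "var3": -5.266, "var4": -2.244, "var5": 5.266}
)

# Hypothetical new patients (made-up 0/1 values), one row per patient
x_variables = pd.DataFrame(
    {"var1": [1, 0, 1], "var2": [0, 1, 1], "var3": [0, 1, 1],
     "var4": [0, 1, 1], "var5": [1, 0, 0]},
    index=["patient_1", "patient_2", "patient_3"],
)
x_variables.insert(0, "Intercept", 1)  # constant column so the intercept is added

# Put the columns in the same order as the coefficients, then take the dot product
log_odds = x_variables[coefficients.index].dot(coefficients)

# Logistic function written as 1 / (1 + e^-X); same value as e^X / (1 + e^X)
prob_a = 1 / (1 + np.exp(-log_odds))

print(pd.DataFrame({"log_odds": log_odds, "prob_A": prob_a}))
```

Selecting the columns by `coefficients.index` guards against a mismatch between the column order in the new data and the order of the coefficients. Turning the probabilities into an A or B label additionally requires a cutoff, which the question does not specify.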

Edit: the same code in R, where coefficients is a vector of coefficient values.

# score by multiplying each column by its coefficient (element-wise, not a matrix product)
scores <- t(t(x_variables) * coefficients)

# sum the scores by row to get the log odds, then exponentiate to get the odds
odds <- exp(rowSums(scores, na.rm = TRUE))
final_scores <- odds / (1 + odds)

r

lasso-regression

coefficients