Comparing test data and prediction outcome

Question

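I am trying to run a logistic regression on a dataset. I have successfully split my data into training and test sets. The regression model also works fine, but when I apply it to my test set I get only 393 predictions, even though the test set has 480 rows. How can I compare the two and find the mismatch, or otherwise find out what is going wrong?

My data has no NAs.

I am trying to create a confusion matrix.

Here is my code: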

n=nrow(wine_log)
shuffled=wine_log[sample(n),]

train_indices=1:round(0.7*n)
test_indices=(round(0.7*n)+1):n

#Making a new dataset
train=shuffled[train_indices,]
test=shuffled[test_indices,]

wmodel = glm(final_take~., family = binomial, data=train)
summary(wmodel)

result1 = predict(wmodel, newdata = test, type = 'response')
result1 = ifelse(result > 0.5, 1, 0)  # Can someone also explain how removing this will affect the outcome?
result1

> table(result1)
result1
  0   1 
255 138 
> table(test$final_take)

 Bad Good 
 418   62 

structure(list(fixed_acid = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 
7.9, 7.3, 7.8, 7.5), vol_acid = c(0.7, 0.88, 0.76, 0.28, 0.7, 
0.66, 0.6, 0.65, 0.58, 0.5), c_acid = c(0, 0, 0.04, 0.56, 0, 
0, 0.06, 0, 0.02, 0.36), res_sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 
1.8, 1.6, 1.2, 2, 6.1), chlorides = c(0.076, 0.098, 0.092, 0.075, 
0.076, 0.075, 0.069, 0.065, 0.073, 0.071), free_siox = c(11, 
25, 15, 17, 11, 13, 15, 15, 9, 17), total_diox = c(34, 67, 54, 
60, 34, 40, 59, 21, 18, 102), density = c(0.9978, 0.9968, 0.997, 
0.998, 0.9978, 0.9978, 0.9964, 0.9946, 0.9968, 0.9978), pH = c(3.51, 
3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35), sulphates = c(0.56, 
0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8), alcohol = c(9.4, 
9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5), final_take = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("Bad", "Good"
), class = "factor")), row.names = c(NA, -10L), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"),

Answer 1

Your line of code:

result1 = ifelse(result > 0.5, 1, 0)

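should reference result1 inside the ifelse() call. My guess is that result is another object in your environment, one that does not have 480 elements.

So you should use this instead: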

result1 = ifelse(result1 > 0.5, 1, 0)
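
To confirm the diagnosis, it may help to compare the number of predictions with the number of test rows. Below is a minimal sketch on simulated data standing in for the wine data; the names toy, toy_train, toy_test, fit and prob are hypothetical and not from the question.

```r
set.seed(1)

# Simulated stand-in for the wine data (hypothetical, for illustration only)
toy <- data.frame(x = rnorm(100))
toy$y <- factor(ifelse(toy$x + rnorm(100) > 0, "Good", "Bad"))

toy_train <- toy[1:70, ]
toy_test  <- toy[71:100, ]

fit  <- glm(y ~ x, family = binomial, data = toy_train)
prob <- predict(fit, newdata = toy_test, type = "response")

# With no NAs in the test set, there should be one prediction per test row
length(prob) == nrow(toy_test)
```

If the same check on your own objects gives FALSE for result but TRUE for result1, the stale result object is the likely culprit.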

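You also asked what this line of code does. It is essentially the threshold you apply to the predictions from your glm model. If the model's predicted value is greater than 0.50, you convert that prediction to "1". If it is less than or equal to 0.50, you convert it to "0". It is a way of turning probabilities into TRUE/FALSE or 1/0.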
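
As a rough sketch of how the thresholding feeds into a confusion matrix, here is a small example. The probabilities and labels are made up for illustration, and the names prob, actual, pred and conf_mat are hypothetical. It assumes the response is a factor with levels "Bad" and "Good", in which case glm models the probability of the second level ("Good").

```r
# Hypothetical predicted probabilities and true labels, for illustration only
prob   <- c(0.10, 0.80, 0.50, 0.65, 0.30, 0.95)
actual <- factor(c("Bad", "Good", "Bad", "Bad", "Good", "Good"),
                 levels = c("Bad", "Good"))

# Apply the 0.5 threshold; a probability of exactly 0.50 falls into "Bad"
pred <- factor(ifelse(prob > 0.5, "Good", "Bad"),
               levels = c("Bad", "Good"))

# Confusion matrix: rows are predictions, columns are true labels
conf_mat <- table(predicted = pred, actual = actual)
conf_mat
```

Without the thresholding step, the predictions stay as probabilities, so table() would count each distinct probability value rather than two classes, and the result could not be lined up against Bad/Good.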
