Comparing test data and prediction outcome
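I am trying to run a logistic regression on a dataset. I have successfully split the data into training and test sets. The regression model also fits fine, but when I apply it to my test set I get only 393 observations, even though the test set has 480 rows. How can I compare the two and find the mismatch, or otherwise work out what is going wrong?
My data has no NAs.
I am trying to create a confusion matrix.
Here is my code: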
n=nrow(wine_log)
shuffled=wine_log[sample(n),]
train_indices=1:round(0.7*n)
test_indices=(round(0.7*n)+1):n
#Making a new dataset
train=shuffled[train_indices,]
test=shuffled[test_indices,]
wmodel = glm(final_take~., family = binomial, data=train)
summary(wmodel)
result1 = predict(wmodel, newdata = test, type = 'response')
result1 = ifelse(result > 0.5, 1, 0)  # Can someone also explain how removing this line would affect the outcome?
result1
> table(result1)
result1
0 1
255 138
> table(test$final_take)
Bad Good
418 62
structure(list(fixed_acid = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4,
7.9, 7.3, 7.8, 7.5), vol_acid = c(0.7, 0.88, 0.76, 0.28, 0.7,
0.66, 0.6, 0.65, 0.58, 0.5), c_acid = c(0, 0, 0.04, 0.56, 0,
0, 0.06, 0, 0.02, 0.36), res_sugar = c(1.9, 2.6, 2.3, 1.9, 1.9,
1.8, 1.6, 1.2, 2, 6.1), chlorides = c(0.076, 0.098, 0.092, 0.075,
0.076, 0.075, 0.069, 0.065, 0.073, 0.071), free_siox = c(11,
25, 15, 17, 11, 13, 15, 15, 9, 17), total_diox = c(34, 67, 54,
60, 34, 40, 59, 21, 18, 102), density = c(0.9978, 0.9968, 0.997,
0.998, 0.9978, 0.9978, 0.9964, 0.9946, 0.9968, 0.9978), pH = c(3.51,
3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35), sulphates = c(0.56,
0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8), alcohol = c(9.4,
9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5), final_take = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("Bad", "Good"
), class = "factor")), row.names = c(NA, -10L), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"),
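Your line of code:
result1 = ifelse(result > 0.5, 1, 0)
should refer to result1 inside the ifelse statement. My guess is that result is some other object in your environment, and that it does not have 480 rows.
So you should use this instead: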
result1 = ifelse(result1 > 0.5, 1, 0)
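A quick way to check this diagnosis is to compare lengths, since ifelse returns a vector as long as the object it tests. The sketch below is only an illustration: probs and result are made-up stand-ins (random numbers, not the wine data) for the predict() output and for a stale object of a different length. Note that the counts in the question, 255 + 138, add up to 393, which is what you would see if result had 393 elements.

```r
set.seed(1)

# Hypothetical stand-ins, not the asker's data:
# `probs` plays the role of predict(wmodel, newdata = test, type = "response")
# `result` plays the role of an unrelated object left over in the environment
probs  <- runif(480)
result <- runif(393)

# ifelse() returns one value per element of the object it tests
wrong <- ifelse(result > 0.5, 1, 0)  # tests the stale object
right <- ifelse(probs > 0.5, 1, 0)   # tests the actual predictions

length(wrong)  # 393
length(right)  # 480

# A cheap guard to put right after thresholding
stopifnot(length(right) == length(probs))
```

In the original code the equivalent check would be comparing length(result1) with nrow(test).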
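You also asked what this line of code does. It is essentially the threshold you apply to the predictions from your glm model. If the model's predicted value is greater than 0.50, you convert that prediction to "1". If it is less than or equal to 0.50, you convert it to "0". This is a way of turning probabilities into TRUE/FALSE or 1/0.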
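To tie the threshold back to the confusion matrix, here is a minimal sketch on simulated data. The variables x, y and dat are hypothetical and only mimic a two-level outcome like final_take. It relies on one documented behaviour of glm: with a factor response and family = binomial, the first level ("Bad" here) is treated as failure, so type = "response" gives the probability of the second level ("Good"). Without the thresholding step the predictions stay as continuous probabilities, and table() on them would give roughly one cell per distinct value rather than two classes.

```r
set.seed(42)

# Simulated data (an assumption for illustration, not the wine dataset)
n <- 200
x <- rnorm(n)
y <- factor(ifelse(x + rnorm(n) > 0.5, "Good", "Bad"),
            levels = c("Bad", "Good"))
dat <- data.frame(x = x, y = y)

train <- dat[1:140, ]
test  <- dat[141:200, ]

model <- glm(y ~ x, family = binomial, data = train)

# Predicted probability of the second factor level ("Good")
probs <- predict(model, newdata = test, type = "response")

# Apply the 0.5 threshold and label with the same levels as the outcome
pred <- factor(ifelse(probs > 0.5, "Good", "Bad"),
               levels = c("Bad", "Good"))

# Confusion matrix: rows are predictions, columns are observed classes
conf_mat <- table(predicted = pred, actual = test$y)
conf_mat
```

Labelling the predictions with the same factor levels as the outcome keeps the table 2 x 2 even if one class is never predicted, and it avoids having to remember whether 1 means "Good" or "Bad".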