na.fail.default error when finding the best k using cross-validation
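I am working with the Breast Cancer Wisconsin (Diagnostic) dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).
I am using cross-validation with kNN to find the best value of k.
I read the csv file into wbcd, and when I run the code below I get the following error: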
fit <- train(diagnosis ~ ., method = "knn", tuneGrid = expand.grid(k = 1:50), trControl= trControl, metric = "Accuracy", data = wbcd)
plot(fit)
Error in na.fail.default(list(diagnosis = c("M", "M", "M", "M", "M", "M", :
  missing values in object
I don't see any missing values in the diagnosis field of the dataset. Any idea what is causing this?
I noticed that the last column looks odd, so to reproduce the error:
library(caret)
wbcd = read.csv("datasets_180_408_data.csv",stringsAsFactors=FALSE)
fit <- train(diagnosis ~ ., method = "knn", tuneGrid = expand.grid(k = 1:50),
trControl= trainControl(method="cv",number=10), metric = "Accuracy", data = wbcd[,-1])
Error in na.fail.default(list(diagnosis = c("M", "M", "M", "M", "M", "M", :
missing values in object
If you look at the summary:
summary(wbcd)
[...]
 concavity_worst  concave.points_worst symmetry_worst   fractal_dimension_worst
 Min.   :0.0000   Min.   :0.00000      Min.   :0.1565   Min.   :0.05504
 1st Qu.:0.1145   1st Qu.:0.06493      1st Qu.:0.2504   1st Qu.:0.07146
 Median :0.2267   Median :0.09993      Median :0.2822   Median :0.08004
 Mean   :0.2722   Mean   :0.11461      Mean   :0.2901   Mean   :0.08395
 3rd Qu.:0.3829   3rd Qu.:0.16140      3rd Qu.:0.3179   3rd Qu.:0.09208
 Max.   :1.2520   Max.   :0.29100      Max.   :0.6638   Max.   :0.20750
    X
 Mode:logical
 NA's:569
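A quicker check than reading through the whole summary is to count the missing values per column. The sketch below uses a small made-up data frame (toy) standing in for wbcd; its column names and values are assumptions for illustration only, not the real data.

```r
# Toy stand-in for wbcd (hypothetical values): the last column X is entirely NA,
# like the empty trailing column that shows up in the summary of the real file.
toy <- data.frame(
  id          = c(101, 102, 103, 104),
  diagnosis   = c("M", "B", "M", "B"),
  radius_mean = c(17.9, 11.4, 20.6, 12.5),
  X           = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Names of the columns in which every value is missing
all_na_cols <- names(toy)[na_counts == nrow(toy)]
print(all_na_cols)
```

On the real data, the same two lines applied to wbcd should flag X as the only column with missing values, assuming the file matches the summary shown above.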
If you remove the last column, and also take care not to fit on the id column (hence wbcd[,-1]), it works fine:
wbcd$X = NULL
fit <- train(diagnosis ~ ., method = "knn",
tuneGrid = expand.grid(k = 1:50),
trControl= trainControl(method="cv",number=10),
metric = "Accuracy", data = wbcd[,-1])
fit
k-Nearest Neighbors
569 samples
30 predictor
2 classes: 'B', 'M'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 512, 513, 511, 512, 513, 512, ...
Resampling results across tuning parameters:
  k   Accuracy   Kappa
   1  0.9156231  0.8174624
   2  0.9085407  0.8013572
   3  0.9263039  0.8415912
   4  0.9263342  0.8415714
   5  0.9314752  0.8520796
   6  0.9279665  0.8451175
   7  0.9297511  0.8489385
   8  0.9296582  0.8476492
[...]
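As for why the message mentions diagnosis even though that field is complete: the formula interface of train() uses na.action = na.fail by default, and na.fail checks the whole model frame built from diagnosis ~ ., not just the response. The error text only prints the beginning of that frame, which happens to be the diagnosis column. A minimal base-R sketch with a made-up data frame (toy is hypothetical, not the real data) shows the same behaviour without needing caret:

```r
# Toy stand-in for wbcd[,-1] (hypothetical values): diagnosis is complete, X is all NA.
toy <- data.frame(
  diagnosis   = c("M", "B", "M", "B"),
  radius_mean = c(17.9, 11.4, 20.6, 12.5),
  X           = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Build the model frame for diagnosis ~ . with na.fail as the NA handler.
# The all-NA column X is part of the frame, so na.fail stops with an error.
res <- tryCatch(
  model.frame(diagnosis ~ ., data = toy, na.action = na.fail),
  error = function(e) conditionMessage(e)
)
print(res)

# Drop every column that is entirely NA, then build the frame again.
clean <- toy[, colSums(is.na(toy)) < nrow(toy), drop = FALSE]
mf <- model.frame(diagnosis ~ ., data = clean, na.action = na.fail)
print(dim(mf))
```

Dropping all-NA columns this way is a generic alternative to wbcd$X = NULL when the name of the offending column is not known in advance. Once the fit succeeds, fit$bestTune holds the k that caret selected, and plot(fit) shows accuracy against k.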