Failure plotting ROC curve using pROC
I have a dataset organized like this:
> head(crypto_data)
time btc_price btc_change btc_change_label eth_price block_size difficulty estimated_btc_sent estimated_transaction_volume_usd
1 2017-09-02 21:54:00 4537.834 -0.06630663 buy 330.727 142521291 8.88e+11 2.04e+13 923315360
2 2017-09-02 22:29:00 4577.605 -0.05629429 buy 337.804 136524566 8.88e+11 2.03e+13 918188067
3 2017-09-02 23:04:00 4566.360 -0.05971624 buy 336.938 134845546 8.88e+11 2.01e+13 910440916
4 2017-09-02 23:39:00 4590.031 -0.05624237 buy 342.929 133910638 8.88e+11 1.99e+13 901565930
5 2017-09-03 00:14:00 4676.193 -0.03585697 hold 354.171 130678099 8.88e+11 2.01e+13 922422228
6 2017-09-03 00:49:00 4699.936 -0.03358492 hold 352.299 127557140 8.88e+11 1.99e+13 910457430
hash_rate miners_revenue_btc miners_revenue_usd minutes_between_blocks n_blocks_mined n_blocks_total n_btc_mined n_tx nextretarget
1 7417412092 2395 10839520 8.00 168 483207 2.10e+11 241558 483839
2 7152504517 2317 10482320 8.33 162 483208 2.03e+11 236661 483839
3 7240807042 2342 10596900 8.22 164 483216 2.05e+11 238682 483839
4 7284958305 2352 10642439 8.14 165 483220 2.06e+11 237159 483839
5 7152504517 2316 10611798 8.38 162 483223 2.03e+11 237464 483839
6 7064201992 2288 10481960 8.41 160 483226 2.00e+11 234472 483839
total_btc_sent total_fees_btc totalbtc trade_volume_btc trade_volume_usd
1 1.62e+14 29597881711 1.65e+15 102451.92 463497285
2 1.60e+14 29202300823 1.65e+15 102451.92 463497285
3 1.60e+14 29234981721 1.65e+15 102451.92 463497285
4 1.58e+14 28991577368 1.65e+15 102451.92 463497285
5 1.58e+14 29179041967 1.65e+15 96216.78 440710136
6 1.57e+14 28844391629 1.65e+15 96216.78 440710136
> str(crypto_data)
'data.frame': 895 obs. of 23 variables:
$ time : POSIXct, format: "2017-09-02 21:54:00" "2017-09-02 22:29:00" "2017-09-02 23:04:00" "2017-09-02 23:39:00" ...
$ btc_price : num 4538 4578 4566 4590 4676 ...
$ btc_change : num -0.0663 -0.0563 -0.0597 -0.0562 -0.0359 ...
$ btc_change_label : Factor w/ 3 levels "buy","hold","sell": 1 1 1 1 2 2 2 2 2 2 ...
$ eth_price : num 331 338 337 343 354 ...
$ block_size : num 1.43e+08 1.37e+08 1.35e+08 1.34e+08 1.31e+08 ...
$ difficulty : num 8.88e+11 8.88e+11 8.88e+11 8.88e+11 8.88e+11 ...
$ estimated_btc_sent : num 2.04e+13 2.03e+13 2.01e+13 1.99e+13 2.01e+13 ...
$ estimated_transaction_volume_usd: num 9.23e+08 9.18e+08 9.10e+08 9.02e+08 9.22e+08 ...
$ hash_rate : num 7.42e+09 7.15e+09 7.24e+09 7.28e+09 7.15e+09 ...
$ miners_revenue_btc : num 2395 2317 2342 2352 2316 ...
$ miners_revenue_usd : num 10839520 10482320 10596900 10642439 10611798 ...
$ minutes_between_blocks : num 8 8.33 8.22 8.14 8.38 8.41 8.26 8.33 8.5 8.69 ...
$ n_blocks_mined : num 168 162 164 165 162 160 157 161 159 156 ...
$ n_blocks_total : num 483207 483208 483216 483220 483223 ...
$ n_btc_mined : num 2.10e+11 2.03e+11 2.05e+11 2.06e+11 2.03e+11 ...
$ n_tx : num 241558 236661 238682 237159 237464 ...
$ nextretarget : num 483839 483839 483839 483839 483839 ...
$ total_btc_sent : num 1.62e+14 1.60e+14 1.60e+14 1.58e+14 1.58e+14 ...
$ total_fees_btc : num 2.96e+10 2.92e+10 2.92e+10 2.90e+10 2.92e+10 ...
$ totalbtc : num 1.65e+15 1.65e+15 1.65e+15 1.65e+15 1.65e+15 ...
$ trade_volume_btc : num 102452 102452 102452 102452 96217 ...
$ trade_volume_usd : num 4.63e+08 4.63e+08 4.63e+08 4.63e+08 4.41e+08 ...
Then I run an SVM and try to plot the ROC curve:
crypto_linear_svm <- svm(btc_change_label ~ ., data = crypto_trainingDS, method = "C-classification", kernel = "linear")
crypto_linear_svm_pred <- predict(crypto_linear_svm, crypto_testDS[,-3])
linear_crypto_conf_mat <- table(pred = crypto_linear_svm_pred, true = crypto_testDS[,3])
linear_svm_crypto_roc <- plot(multiclass.roc(crypto_testDS$btc_change_label, crypto_linear_svm_pred, direction="<"),
col="yellow", lwd=3, main="Linear Kernel SVM results, Cryptocurrency Data")
However, the last line gives the following error:
Error in roc.default(response, predictor, levels = X, percent = percent, : Predictor must be numeric or ordered.
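What am I doing wrong, and how can I fix it? I have two different datasets with different structure and organization: the one shown is multiclass, and the other is binary (yes or no). I ran SVMs on both, but when I try to plot the ROC curve I get the same error.
EDIT
Here is the output of the predictions: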
> crypto_linear_svm_pred
3 4 5 6 7 8 14 16 17 19 21 26 29 32 34 36 38 39 45 47 49 53 54 57 59 60 61 63 65
buy buy hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold buy buy buy buy buy buy
67 69 71 74 78 86 89 91 92 95 96 97 98 105 111 113 115 116 122 123 124 127 132 135 140 141 156 160 161
buy buy hold hold hold hold hold hold hold hold hold hold sell sell buy buy buy buy buy buy buy buy buy hold hold hold hold buy hold
164 166 170 173 174 175 179 184 188 190 196 208 210 212 214 217 218 219 224 225 227 229 238 240 245 249 259 263 267
hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold buy
273 274 281 282 284 306 307 311 313 315 320 323 324 328 330 332 333 334 336 340 342 343 346 347 349 353 358 361 365
hold hold buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy buy
374 380 381 382 383 390 392 393 396 399 403 406 407 408 410 435 440 441 444 445 449 453 457 459 460 464 467 468 473
sell sell sell sell sell sell sell sell sell sell sell sell sell hold hold buy buy buy hold hold hold hold hold hold hold hold hold hold hold
483 489 490 492 499 503 511 520 521 530 534 536 538 546 548 553 555 557 558 559 567 571 573 579 581 583 584 586 587
hold hold hold hold hold hold hold hold hold hold hold hold buy buy buy buy buy buy buy buy buy buy buy buy buy hold hold hold hold
593 595 597 602 603 608 609 614 616 618 628 630 636 639 642 643 645 646 647 648 649 655 660 661 665 668 669 674 675
hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold
676 680 685 687 688 695 698 703 704 713 715 719 720 722 725 729 737 738 740 744 745 746 752 757 760 762 764 768 771
hold hold hold hold hold hold sell sell sell sell hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold
776 778 781 783 784 790 792 805 811 813 814 815 821 822 824 828 829 833 836 837 838 839 843 846 847 848 852 859 861
hold sell hold hold sell sell sell sell hold sell hold sell hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold hold
862 865 869 873 879 881 886 895
hold hold hold hold hold hold hold sell
Levels: buy hold sell
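multiclass.roc needs a numeric or ordered predictor, but predict() on your SVM returns an unordered factor of class labels, hence the error. Here is an example with the iris data that converts the predictions with as.numeric: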
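library(e1071)
library(pROC)  # provides multiclass.roc, plot.roc, lines.roc
data(iris)
svm_model = svm(Species~., data = iris)
# predict() returns a factor of class labels, not probabilities
prob_svm = predict(svm_model, iris)
# as.numeric() turns the factor into its integer level codes
m.roc = multiclass.roc(iris$Species, as.numeric(prob_svm))
rs <- m.roc[['rocs']]
plot.roc(rs[[1]], lty=4)
# overlay the remaining pairwise curves
sapply(2:length(rs),function(i) lines.roc(rs[[i]],col=i, lty=i))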
This approach computes three ROC curves (setosa vs. versicolor, setosa vs. virginica, and versicolor vs. virginica) and averages their AUCs.
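To make the averaging concrete, here is a minimal sketch that pulls the AUC out of each pairwise curve and compares their mean with the overall value reported by pROC. The names pred_class and pair_aucs are only illustrative; the rest mirrors the iris example above.

```r
library(e1071)
library(pROC)

data(iris)
svm_model <- svm(Species ~ ., data = iris)
pred_class <- predict(svm_model, iris)  # factor of predicted labels

# Same call as in the example: level codes stand in for a numeric score
m.roc <- multiclass.roc(iris$Species, as.numeric(pred_class))

# One roc object per pair of classes: 3 classes give choose(3, 2) = 3 curves
pair_aucs <- sapply(m.roc$rocs, function(r) as.numeric(auc(r)))

length(pair_aucs)       # number of pairwise curves
mean(pair_aucs)         # average of the pairwise AUCs
as.numeric(auc(m.roc))  # multiclass AUC reported by pROC, for comparison
```

Looking at the individual values is worthwhile because a single poorly separated pair can hide behind a good average.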
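It has several flaws. Converting the predicted class to a number is one of them. A better approach would be to use the predicted probabilities, but pROC did not support that for multiclass.roc when I tried. As Calimo pointed out, ROC analysis is designed for binary classifiers and should be used with care when there are more than 2 classes.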
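One way to use probabilities anyway is to fall back on ordinary binary roc() calls, one per class against the rest. This is a different technique from the pairwise multiclass.roc call above, not a drop-in equivalent, and the names prob_mat and ovr_rocs are only illustrative. The sketch assumes the SVM is fitted with probability = TRUE so that e1071 can return class probabilities. The same pattern with a single roc() call should cover the binary (yes or no) dataset.

```r
library(e1071)
library(pROC)

data(iris)
# probability = TRUE makes e1071 fit the extra model needed for class probabilities
svm_model <- svm(Species ~ ., data = iris, probability = TRUE)
pred <- predict(svm_model, iris, probability = TRUE)
prob_mat <- attr(pred, "probabilities")  # one column per class, named by level

classes <- levels(iris$Species)
# One binary ROC per class: "is this class" (1) versus "any other class" (0)
ovr_rocs <- lapply(classes, function(cls) {
  roc(response = as.integer(iris$Species == cls),
      predictor = prob_mat[, cls],
      levels = c(0, 1), direction = "<")
})
names(ovr_rocs) <- classes

plot(ovr_rocs[[1]], col = 1, main = "One-vs-rest ROC, SVM on iris")
for (i in 2:length(ovr_rocs)) lines(ovr_rocs[[i]], col = i)
legend("bottomright", legend = classes, col = seq_along(classes), lwd = 2)

# AUC of each one-vs-rest curve
sapply(ovr_rocs, function(r) as.numeric(auc(r)))
```

Columns of prob_mat are selected by name rather than by position, since their order is not guaranteed to match levels(iris$Species).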
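I used predictions on the training data only as an example; this should not be done when evaluating a classifier, because it overestimates the model's accuracy.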
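A minimal sketch of the same evaluation on held-out data follows. The 70/30 split, the seed, and the names train, test and test_pred are arbitrary choices for illustration, not part of the original example.

```r
library(e1071)
library(pROC)

data(iris)
set.seed(42)  # arbitrary seed, only so the split is reproducible
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Fit on the training rows only, then predict rows the model has not seen
svm_model <- svm(Species ~ ., data = train)
test_pred <- predict(svm_model, test)

# Confusion matrix and multiclass AUC on the held-out rows
table(pred = test_pred, true = test$Species)
m.roc <- multiclass.roc(test$Species, as.numeric(test_pred))
auc(m.roc)
```

In the question's setup this corresponds to fitting on crypto_trainingDS and scoring crypto_testDS, which the original code already does.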