cross_val_predict is not completing. No error message

Question

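I am trying to use a k-nearest neighbors classifier (KNeighborsClassifier) on the MNIST example dataset.

When I call cross_val_predict, the script just keeps running, no matter how long I leave it.

Is there something I am missing or doing wrong?

Any feedback is welcome.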

from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml('mnist_784', version=1)  # Download the MNIST dataset from OpenML

X, y = mnist["data"], mnist["target"]

X = X.astype(np.uint8)  # Pixel values are in 0..255, so uint8 is enough
y = y.astype(np.uint8)  # The labels are loaded as strings; cast them to integers

X.shape, y.shape

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]  # Split into training and test sets

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)

f1_score(y_train, y_train_knn_pred, average="macro")

Answer 1

Use n_jobs=-1. From the documentation for this parameter:

The number of CPUs to use to do the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors

from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml('mnist_784', version=1)  # Download the MNIST dataset from OpenML

X, y = mnist["data"], mnist["target"]

X = X.astype(np.uint8)  # Pixel values are in 0..255, so uint8 is enough
y = y.astype(np.uint8)  # The labels are loaded as strings; cast them to integers

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]  # Split into training and test sets

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_jobs=-1) # HERE
knn_clf.fit(X_train, y_train) # this took seconds on my macbook pro

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3, n_jobs=-1) # AND HERE

f1_score(y_train, y_train_knn_pred, average="macro")

Answer 2

I think the confusion comes from the fact that, for the KNN algorithm, the fit call is much faster than prediction. From another SO post:

KNN is also called as lazy algorithm because during fitting it does nothing but saves the input data, specifically there is no learning at all.

During predict is the actual distance calculation happens for each test datapoint. Hence, you could understand that when using cross_val_predict, KNN has to predict on the validation data points, which makes the computation time higher!
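To make the fit/predict asymmetry concrete, here is a minimal timing sketch. It assumes a small random array as a stand-in for MNIST (same 784 features, far fewer rows), so the absolute numbers are not representative of the real dataset; the names X_demo, y_demo, fit_seconds and predict_seconds are made up for this illustration.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for MNIST (an assumption of this sketch):
# 2,000 samples with 784 features and 10 random class labels.
rng = np.random.default_rng(0)
X_demo = rng.random((2000, 784))
y_demo = rng.integers(0, 10, size=2000)

knn = KNeighborsClassifier()

start = time.perf_counter()
knn.fit(X_demo, y_demo)  # essentially stores the training data
fit_seconds = time.perf_counter() - start

start = time.perf_counter()
pred = knn.predict(X_demo[:500])  # distances to the stored samples are computed here
predict_seconds = time.perf_counter() - start

print(f"fit: {fit_seconds:.4f}s, predict: {predict_seconds:.4f}s")
```

With cv=3 on the 60,000 MNIST training rows, each fold has to predict 20,000 points against 40,000 stored ones, which is where the running time goes.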

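So, given the size of your input data, a lot of computational power is needed. Using multiple CPUs or reducing the dimensionality might help.

If you want to use multiple CPU cores, you can pass the parameter "n_jobs" to both cross_val_predict and KNeighborsClassifier to set the number of cores to use. Set it to -1 to use all available cores.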
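The dimensionality-reduction suggestion could be sketched roughly as follows. This is only an illustration under some assumptions: it uses the small load_digits dataset bundled with scikit-learn instead of MNIST so that it runs quickly, PCA is just one possible way to reduce the number of features, and n_components=20 is an arbitrary value chosen for the example, not a tuned one.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Small bundled digits dataset (8x8 images, 64 features), used here instead of MNIST.
X_small, y_small = load_digits(return_X_y=True)

# Project onto fewer components before KNN, so each distance
# computation at predict time touches fewer features.
# n_components=20 is an arbitrary illustrative choice.
pca_knn = make_pipeline(PCA(n_components=20, random_state=0), KNeighborsClassifier())

# The pipeline is refit on each training fold, so PCA never sees the held-out fold.
y_pred = cross_val_predict(pca_knn, X_small, y_small, cv=3)
score = f1_score(y_small, y_pred, average="macro")
print(score)
```

On MNIST the same pipeline could be passed to cross_val_predict together with n_jobs=-1; how many components to keep is a trade-off between speed and accuracy that would need to be checked on the actual data.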

Tags: python, numpy, knn, scikit-learn