For loop keeps getting Jupyter stuck

Question

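I'm new to data analysis with Python. In class, we built a for loop that tries different numbers of trees in a random forest and finds the most accurate one. Our teacher assigned us to do this with a for loop, but the code I wrote makes the notebook hang without producing any output. Can anyone tell me what is wrong with my code? To be precise, I did eventually get an answer, but only after a very long time. How can I do the same thing more efficiently without changing the range?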

accuracy_scores = []
for i in range(50,500,10):
    model_random = RandomForestClassifier(n_estimators=i,criterion="entropy",max_features=10,min_samples_leaf=50)
    model_random.fit(X_train,Y_train)
    Y_pred=model_random.predict(X_test)
    accuracy=round(accuracy_score(Y_pred, Y_test)*100,2)
    accuracy_scores.append(accuracy)
print(max(accuracy_scores))
y=accuracy_scores.index(max(accuracy_scores))*10+50
print(y)

Answer 1

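The for loop is correct. I assume your RandomForestClassifier simply takes a while to compute.

My suggestion is to add some print/logger statements to see where the time goes and which step takes the longest.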

from time import perf_counter

accuracy_scores = []
for i in range(50, 500, 10):
    model_random = RandomForestClassifier(n_estimators=i, criterion="entropy", max_features=10, min_samples_leaf=50)
    # Time the fit call: training is the step that is expected to be slow.
    start = perf_counter()
    model_random.fit(X_train, Y_train)
    calc_time = perf_counter() - start
    print(f"RandomForestClassifier fit with {i} trees took {calc_time:.2f} s")
    Y_pred = model_random.predict(X_test)
    accuracy = round(accuracy_score(Y_test, Y_pred) * 100, 2)
    accuracy_scores.append(accuracy)

print(max(accuracy_scores))
# Map the index of the best score back to the corresponding number of trees.
y = accuracy_scores.index(max(accuracy_scores)) * 10 + 50
print(y)

Answer 2

Use GridSearchCV() instead of a for loop.

You can apply the example shown here to your code and see whether performance improves compared with the for loop.
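As a rough illustration of this suggestion, here is a minimal sketch of a grid search over n_estimators. The synthetic data from make_classification, the shortened candidate list, cv=3, and random_state=0 are all assumptions made so the sketch runs quickly; they are not from the question. Note also that GridSearchCV scores each candidate by cross-validation on the training data, which is not the same as the single test-set score used in the original loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); replace with your own X_train, Y_train.
X_train, Y_train = make_classification(n_samples=300, n_features=20, random_state=0)

# Shortened grid so the sketch finishes quickly (assumption);
# use list(range(50, 500, 10)) to cover the range from the question.
candidates = list(range(50, 100, 10))

forest = RandomForestClassifier(
    criterion="entropy",
    max_features=10,
    min_samples_leaf=50,
    n_jobs=-1,        # build the trees of each forest in parallel
    random_state=0,   # assumption: fixed seed for repeatable results
)

search = GridSearchCV(
    estimator=forest,
    param_grid={"n_estimators": candidates},
    scoring="accuracy",
    cv=3,             # 3-fold cross-validation on the training data
)
search.fit(X_train, Y_train)

best_n = search.best_params_["n_estimators"]
best_score = search.best_score_
print(f"Best n_estimators: {best_n}, mean CV accuracy: {best_score * 100:.2f}%")
```

Whether this ends up faster than the plain loop depends on the data and the number of folds, since every candidate is fitted once per fold.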

Answer 3

It is not stuck, it is just processing. I have modified your code to show the main points:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import timeit

M = 5000
features = 10000
train_prop = 0.75
M_train = int(M*train_prop)
M_test = M - M_train

X_train = np.random.rand(M_train, features)
X_test = np.random.rand(M_test, features)
Y_train = (np.random.rand(M_train)>0.5).astype(int)
Y_test = (np.random.rand(M_test)>0.5).astype(int)

accuracy_scores = []
for i in range(50,500,10):
    t1 = timeit.default_timer()
    print(f"Estimators {i}")
    model_random = RandomForestClassifier(n_estimators=i,criterion="entropy",max_features=10,min_samples_leaf=50, n_jobs=None)
    model_random.fit(X_train,Y_train)
    Y_pred=model_random.predict(X_test)
    accuracy=round(accuracy_score(Y_pred, Y_test)*100,2)
    accuracy_scores.append(accuracy)
    delta_t = timeit.default_timer() - t1
    print(f"Processing time [seconds] {delta_t}")
print(max(accuracy_scores))
y=accuracy_scores.index(max(accuracy_scores))*10+50
print(y)

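The processing time depends on the number of samples you are working with (the M variable). If you increase or decrease that variable, you will see the processing time change accordingly.

Note that the number of features (the features variable) has little effect on the processing time. This is probably because max_features=10 limits each split to 10 candidate features, no matter how many columns the data has.

As you can see, I added the optional parameter n_jobs=None, which means only one processor is used. If you change it to n_jobs=-1, all available processors are used and the processing time goes down.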
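To make the n_jobs point concrete, here is a small sketch that times one fit with n_jobs=None and one with n_jobs=-1. The data sizes, the time_fit helper, and random_state=0 are hypothetical choices for illustration only. The actual speedup depends on how many cores the machine has and how large the data is, so no particular numbers should be expected.

```python
import timeit

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small random data set (assumption); sizes chosen only to keep the run short.
rng = np.random.default_rng(0)
X = rng.random((2000, 50))
y = (rng.random(2000) > 0.5).astype(int)


def time_fit(n_jobs, n_estimators=200):
    """Return the seconds needed to fit one forest with the given n_jobs."""
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        criterion="entropy",
        max_features=10,
        min_samples_leaf=50,
        n_jobs=n_jobs,
        random_state=0,
    )
    start = timeit.default_timer()
    model.fit(X, y)
    return timeit.default_timer() - start


# Compare a single-process fit with a fit that uses all available processors.
timings = {n_jobs: time_fit(n_jobs) for n_jobs in (None, -1)}
for n_jobs, seconds in timings.items():
    print(f"n_jobs={n_jobs}: {seconds:.2f} s")
```

In the original loop, the same change amounts to passing n_jobs=-1 to RandomForestClassifier while leaving range(50,500,10) untouched.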

python

for-loop

random-forest