How can I iterate over a list of models in Python with scikit-learn?
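I built a function that displays some evaluation metrics for a single model, and now I want to apply it to a pool of models that I have estimated.
The inputs of the old function were:
OldFunction(code: str, x, X_train: np.array, X_test: np.array, X:pd.DataFrame)
where:
code is a string used to build the column name of the DataFrame
x is the name of the model
X_train and X_test are the np.arrays produced by the data split
X is the DataFrame holding the whole data set
To estimate the metrics for a set of models, I tried to modify my function by adding a loop to it and putting the models in a list.
But it did not work.
The problem arises because I cannot iterate over the list of models, so what options do I have? Do you have any ideas?
I leave the new function below.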
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.model_selection import cross_val_score

def displaymetrics(code: list, models: list, X_train: np.array, X_test: np.array, X: pd.DataFrame):
    for i in models:
        y_score = models[i].fit(X_train, y_train).decision_function(X_test)
        fpr, tpr, _ = roc_curve(y_test, y_score)
        roc_auc = auc(fpr, tpr)
        # Traditional Scores
        y_pred = pd.DataFrame(model[i].predict(X_train)).reset_index(drop=True)
        Recall_Train,Precision_Train, Accuracy_Train = recall_score(y_train, y_pred), precision_score(y_train, y_pred), accuracy_score(y_train, y_pred)
        y_pred = pd.DataFrame(model[i].predict(X_test)).reset_index(drop=True)
        Recall_Test = recall_score(y_test, y_pred)
        Precision_Test = precision_score(y_test, y_pred)
        Accuracy_Test = accuracy_score(y_test, y_pred)
        #Cross Validation
        cv_au = cross_val_score(models[i], X_test, y_test, cv=30, scoring='roc_auc')
        cv_f1 = cross_val_score(models[i], X_test, y_test, cv=30, scoring='f1')
        cv_pr = cross_val_score(models[i], X_test, y_test, cv=30, scoring='precision')
        cv_re = cross_val_score(models[i], X_test, y_test, cv=30, scoring='recall')
        cv_ac = cross_val_score(models[i], X_test, y_test, cv=30, scoring='accuracy')
        cv_ba = cross_val_score(models[i], X_test, y_test, cv=30, scoring='balanced_accuracy')
        cv_au_m, cv_au_std = cv_au.mean() , cv_au.std()
        cv_f1_m, cv_f1_std = cv_f1.mean() , cv_f1.std()
        cv_pr_m, cv_pr_std = cv_pr.mean() , cv_pr.std()
        cv_re_m, cv_re_std= cv_re.mean() , cv_re.std()
        cv_ac_m, cv_ac_std = cv_ac.mean() , cv_ac.std()
        cv_ba_m, cv_ba_std= cv_ba.mean() , cv_ba.std()
        cv_au, cv_f1, cv_pr = (cv_au_m, cv_au_std), (cv_f1_m, cv_f1_std), (cv_pr_m, cv_pr_std)
        cv_re, cv_ac, cv_ba = (cv_re_m, cv_re_std), (cv_ac_m, cv_ac_std), (cv_ba_m, cv_ba_std)
        tuples = [cv_au, cv_f1, cv_pr, cv_re, cv_ac, cv_ba]
        tuplas = [0]*len(tuples)
        for i in range(len(tuples)):
            tuplas[i] = [round(x,4) for x in tuples[i]]
        results = pd.DataFrame()
        results['Metrics'] = ['roc_auc', 'Accuracy_Train', 'Precision_Train', 'Recall_Train', 'Accuracy_Test',
                              'Precision_Test','Recall_Test', 'cv_roc-auc (mean, std)', 'cv_f1score(mean, std)',
                              'cv_precision (mean, std)', 'cv_recall (mean, std)', 'cv_accuracy (mean, std)',
                              'cv_bal_accuracy (mean, std)']
        results.set_index(['Metrics'], inplace=True)
        results['Model_'+code[i]] = [roc_auc, Accuracy_Train, Precision_Train, Recall_Train, Accuracy_Test,
                                     Precision_Test, Recall_Test, tuplas[0], tuplas[1], tuplas[2], tuplas[3],
                                     tuplas[4], tuplas[5]]
    return results
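The output should be a DataFrame in which each column represents a model and each row a metric.
You should probably mention whether you get an error or whether the output is simply incorrect.
I will assume you get an error.
Are you sure you pass the models as a list when calling displaymetrics?
For example:
models = [model1, model2, ...]
displaymetrics(code, models, X_train, X_test, X)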
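Also, your code has a bug:
you call models[i].fit(...), but i is itself a model, not an index, so models[i] raises a TypeError. You should simply call i.fit(...), or better, rename i, since that name conventionally denotes an index rather than an element. (If you want to iterate over the indices of the list, use for i in range(0, len(models)): ... instead.)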
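To make the difference between iterating over elements and iterating over indices concrete, here is a minimal sketch in plain Python. DummyModel is a hypothetical stand-in for a scikit-learn estimator (it is not part of the question), and the labels in code are made up for illustration.

```python
class DummyModel:
    """Hypothetical stand-in for a scikit-learn estimator (illustration only)."""

    def __init__(self, name):
        self.name = name

    def fit(self, X, y):
        # A real estimator would learn from X and y; this stub just returns self.
        return self


models = [DummyModel("a"), DummyModel("b")]
code = ["A", "B"]  # assumed labels, one per model

# `for i in models` binds i to each model object, so models[i] fails.
try:
    for i in models:
        models[i]
except TypeError as err:
    error_message = str(err)

# Option 1: iterate over the objects themselves.
fitted = [model.fit(None, None).name for model in models]

# Option 2: iterate over indices when the position is also needed.
by_index = [models[i].name for i in range(len(models))]

# Option 3: pair each model with its label and skip manual indexing.
labels = ["Model_" + c + ":" + m.name for c, m in zip(code, models)]
```

Option 3 is probably the most convenient here, because the function needs both the model and its entry in code on every iteration.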
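Note: you should not import pandas and numpy again for every model iteration. I also suggest putting all the imports (including the sklearn modules) at the top of your code.
So, I think your code should look like this: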
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.model_selection import cross_val_score

def displaymetrics(code: list, models: list, X_train: np.array, X_test: np.array, X: pd.DataFrame):
    for model in models:  # or for i in range(0, len(models)):
        y_score = model.fit(X_train, y_train).decision_function(X_test)
        # or y_score = models[i].fit(X_train, y_train).decision_function(X_test)
        fpr, tpr, _ = roc_curve(y_test, y_score)
        # etc etc
Try editing your question to show us how you call displaymetrics and with which arguments.
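For reference, one way the whole function could look is sketched below. This is only a sketch under several assumptions that are not in the question: y_train and y_test are passed in explicitly (the posted function reads them from the enclosing scope), the name display_metrics and the cv parameter are my own, cv defaults to 5 rather than 30 to keep the run short, and the data comes from make_classification. The columns are collected in a dictionary and turned into a DataFrame once at the end, so earlier models are not overwritten.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, precision_score, recall_score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC


def display_metrics(codes, models, X_train, y_train, X_test, y_test, cv=5):
    """Return a DataFrame with one column per model and one row per metric."""
    columns = {}
    for code, model in zip(codes, models):
        model.fit(X_train, y_train)
        fpr, tpr, _ = roc_curve(y_test, model.decision_function(X_test))
        column = {"roc_auc": auc(fpr, tpr)}

        # Plain scores on the training and test splits.
        for split, X_part, y_part in (("Train", X_train, y_train), ("Test", X_test, y_test)):
            y_pred = model.predict(X_part)
            column["Accuracy_" + split] = accuracy_score(y_part, y_pred)
            column["Precision_" + split] = precision_score(y_part, y_pred)
            column["Recall_" + split] = recall_score(y_part, y_pred)

        # Cross-validated scores on the test split, as in the question.
        for scoring in ("roc_auc", "f1", "precision", "recall", "accuracy", "balanced_accuracy"):
            scores = cross_val_score(model, X_test, y_test, cv=cv, scoring=scoring)
            column["cv_" + scoring + " (mean, std)"] = (round(scores.mean(), 4), round(scores.std(), 4))

        columns["Model_" + code] = column
    return pd.DataFrame(columns)


# Synthetic binary data, used only to make the sketch runnable.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

models = [LogisticRegression(max_iter=1000), SVC(kernel="linear")]
results = display_metrics(["Logreg", "LinearSVM"], models, X_train, y_train, X_test, y_test)
print(results)
```

Both estimators used here expose decision_function; a model without it would need a different way to obtain scores for roc_curve.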
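You should use a dictionary instead of a list, as in the following example:
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

dict_classifiers = {
    "Logreg": LogisticRegression(solver='lbfgs'),
    "NN": KNeighborsClassifier(),
    "LinearSVM": SVC(probability=True, kernel='linear'),  # class_weight='balanced'
    "GBC": GradientBoostingClassifier(),
    "DT": tree.DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "NB": GaussianNB(),
}
Then use it, for example:
# dict.items() replaces dict.iteritems(), which exists only in Python 2
for model, model_instantiation in dict_classifiers.items():
    # Not every classifier above has decision_function (KNeighborsClassifier, for one, does not)
    y_score = model_instantiation.fit(X_train, y_train).decision_function(X_test)
    ...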
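Since some scikit-learn classifiers have no decision_function, a loop over a mixed dictionary needs a fallback. The sketch below is one possible approach, not the only one: continuous_scores is a hypothetical helper of my own, the data is synthetic, and the fallback to the positive-class column of predict_proba assumes a binary problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def continuous_scores(model, X):
    """Return scores usable for ROC analysis from a fitted binary classifier."""
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    # Fall back to the predicted probability of the positive class.
    return model.predict_proba(X)[:, 1]


# Synthetic binary data, used only to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dict_classifiers = {
    "Logreg": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "NB": GaussianNB(),
}

roc_auc_by_model = {}
for name, model in dict_classifiers.items():
    model.fit(X_train, y_train)
    roc_auc_by_model[name] = roc_auc_score(y_test, continuous_scores(model, X_test))

print(roc_auc_by_model)
```

The dictionary keys double as column labels, which removes the need for a separate code list that has to stay aligned with the models.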
Hope this helps. Let me know how it goes!