Newbie: How to evaluate a model to improve classification accuracy
My data

If some of my models produce results like the ones below when run, how can I improve the model's accuracy?
```python
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
```

Output:

```
Accuracy: 0.6780893042575286
```
Random forest classifier: Accuracy: 0.6780893042575286
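Accuracy alone hides *which* class the tree gets wrong. A self-contained sketch (synthetic data from `make_classification` stands in for the real split) showing how to read the confusion matrix that the snippet above already computes but never inspects:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly imbalanced data standing in for the real X / y
X, y = make_classification(n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

clf = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
# Rows = true class, columns = predicted class:
# cm[0, 0] true negatives,  cm[0, 1] false positives,
# cm[1, 0] false negatives, cm[1, 1] true positives
print(cm)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

If one off-diagonal cell dominates, the model is failing mostly on one class, which points at class imbalance or feature issues rather than the algorithm choice.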
There are several ways to go about this:

Look at the data. Is it in good shape for the algorithm? What about NaNs, covariance, and so on? Is it normalized, and are the categorical features encoded well? That topic goes too deep for a forum post.
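A minimal sketch of those data checks, using a toy DataFrame (the column names are invented):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for your real data
df = pd.DataFrame({"age": [25, 32, np.nan, 41],
                   "income": [40_000, 55_000, 61_000, np.nan]})

# 1. How many NaNs per column?
print(df.isna().sum())

# 2. Fill NaNs (here: column median) -- most sklearn estimators reject NaN
df_filled = df.fillna(df.median())

# 3. Standardize features so scale-sensitive models (LogReg, SVM, KNN) behave well
X_scaled = StandardScaler().fit_transform(df_filled)
print(X_scaled.mean(axis=0).round(6))  # each column now has mean ~0
```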
Look at the problem and at the different algorithms suited to it. Perhaps:

- Logistic regression
- SVM
- XGBoost
- .....
- Try tuning hyperparameters with RandomizedSearchCV or GridSearchCV

This is all fairly high-level advice.
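A minimal RandomizedSearchCV sketch for the DecisionTreeClassifier from the question (the parameter ranges are illustrative, not tuned recommendations, and synthetic data stands in for the real X_train / y_train):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=0)

param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": randint(2, 20),        # sampled uniformly per trial
    "min_samples_leaf": randint(1, 20),
}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=25,            # number of random combinations to try
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

RandomizedSearchCV samples a fixed number of combinations, so it scales better than an exhaustive grid when the search space is large.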
For model selection, you can use a function like the one below to find a good model for the problem.
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.metrics import classification_report


def multi_model(X_train, y_train, X_test, y_test):
    """Compare several model architectures; return their CV scores as a DataFrame."""
    dfs = []
    models = [
        ('LogReg', LogisticRegression()),
        ('RF', RandomForestClassifier()),
        ('KNN', KNeighborsClassifier()),
        ('SVM', SVC()),
        ('GNB', GaussianNB()),
        ('XGB', XGBClassifier(eval_metric="error"))
    ]
    results = []
    names = []
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc']
    target_names = ['App_Status_1', 'App_Status_2']
    for name, model in models:
        # 5-fold cross-validation on the training split
        kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=90210)
        cv_results = model_selection.cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring)
        # Also report per-class metrics on the held-out test split
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(name)
        print(classification_report(y_test, y_pred, target_names=target_names))
        results.append(cv_results)
        names.append(name)
        this_df = pd.DataFrame(cv_results)
        this_df['model'] = name
        dfs.append(this_df)
    final = pd.concat(dfs, ignore_index=True)
    return final
```
Once you have chosen a model, you can do what is called hyperparameter tuning, which will improve its performance further.

If you want to improve the model beyond that, you can apply techniques such as data augmentation and revisit the data-cleaning stage.

If there is still no improvement after that, you can try collecting more data or re-examining the problem statement.
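A minimal GridSearchCV sketch for that tuning step, assuming RandomForestClassifier came out ahead in the comparison (the grid values are illustrative, and synthetic data stands in for the real training split):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="f1_weighted",  # pick the metric that matters, not just accuracy
    n_jobs=-1,              # parallelize across the 2 * 3 * 2 = 12 combinations
)
grid.fit(X_train, y_train)
print(grid.best_params_)
best_model = grid.best_estimator_  # refit on all of X_train with the best params
```

Unlike RandomizedSearchCV, this tries every combination in the grid, so keep the grid small or the fit gets expensive fast.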