Using GridSearchCV best_params_ gives poor results
I'm trying to tune the hyperparameters of a KNN on a fairly small dataset (Kaggle Leaf, about 990 rows):
def knnTuning(self, x_train, t_train):
    params = {
        'n_neighbors': [1, 2, 3, 4, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'leaf_size': [5, 10, 15, 20]
    }
    grid = GridSearchCV(KNeighborsClassifier(), params)
    grid.fit(x_train, t_train)
    print(grid.best_params_)
    print(grid.best_score_)
    return knn.KNN(neighbors=grid.best_params_["n_neighbors"],
                   weight=grid.best_params_["weights"],
                   leafSize=grid.best_params_["leaf_size"])
Prints:
{'leaf_size': 5, 'n_neighbors': 1, 'weights': 'uniform'}
0.9119999999999999
and return this classifier:
class KNN:
    def __init__(self, neighbors=1, weight='uniform', leafSize=10):
        self.clf = KNeighborsClassifier(n_neighbors=neighbors,
                                        weights=weight, leaf_size=leafSize)

    def train(self, X, t):
        self.clf.fit(X, t)

    def predict(self, x):
        return self.clf.predict(x)

    def global_accuracy(self, X, t):
        predicted = self.predict(X)
        accuracy = (predicted == t).mean()
        return accuracy
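For reference, a minimal usage sketch of this wrapper on synthetic data (the toy X and t below are illustrative assumptions, not the Leaf data):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: 100 points, binary label determined by the first feature (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
t = (X[:, 0] > 0).astype(int)

model = KNN(neighbors=3)
model.train(X[:80], t[:80])
print(model.global_accuracy(X[80:], t[80:]))  # accuracy on the held-out 20 points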
I run this several times, using 700 rows for training and 200 for validation, chosen by a random permutation.
I then get a global accuracy anywhere from 0.01 (often) to 0.4 (rarely).
I know I'm not comparing two identical metrics, but I still can't understand the huge gap between the results.
I'm not sure how you trained your model or how the preprocessing was done. The Leaf dataset has about 100 labels (species), so you have to take care when splitting into train and test sets that the samples are evenly allocated across species; an uneven split is one likely reason for the odd accuracies.
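To see why, here is a minimal sketch. The 99-species-by-10-samples layout matches the Leaf training set; the rest is illustrative:

import numpy as np

# 99 species with 10 samples each, as in the Leaf training CSV
y = np.repeat(np.arange(99), 10)

# A plain random 700-row training split, as in the question
rng = np.random.default_rng(0)
train = rng.permutation(len(y))[:700]

# Training rows per species: typically a few species keep only 3-4
# of their 10 rows while others keep all 10.
counts = np.bincount(y[train], minlength=99)
print(counts.min(), counts.max())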
Also, you need to scale the features:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("https://raw.githubusercontent.com/WenjinTao/Leaf-Classification--Kaggle/master/train.csv")

le = LabelEncoder()
scaler = StandardScaler()

# Scale the features and encode the species names as integers
X = scaler.fit_transform(df.drop(['id', 'species'], axis=1))
y = le.fit_transform(df['species'])

# One stratified 70/30 split so every species appears in both sets
# (note X is a numpy array after fit_transform, so it is indexed directly)
strat = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0).split(X, y)
x_train, y_train, x_test, y_test = [(X[train], y[train], X[test], y[test])
                                    for train, test in strat][0]
Moving on to the training, I would be careful about including n_neighbors = 1 (a single nearest neighbor easily memorizes noise), so I leave it out of the grid:
params = {
    'n_neighbors': [2, 3, 4],
    'weights': ['uniform', 'distance'],
    'leaf_size': [5, 10, 15, 20]
}

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), params, cv=sss)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
{'leaf_size': 5, 'n_neighbors': 2, 'weights': 'distance'}
0.9676258992805755
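Note that best_score_ is the mean accuracy over the 10 stratified CV splits above; if you want to see the spread rather than a single number, you can inspect cv_results_ (a sketch):

# Per-configuration CV summary for the best-ranked parameter set
res = pd.DataFrame(grid.cv_results_)
best = res['rank_test_score'] == 1
print(res.loc[best, ['params', 'mean_test_score', 'std_test_score']])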
Then you can check against your test set:
pred = grid.predict(x_test)
(y_test == pred).mean()
0.9831649831649831
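One caveat about the snippet above: the scaler is fit on the full data before splitting, which leaks test-set statistics into training. A sketch of the usual fix with sklearn.pipeline.Pipeline (same estimators; scaling is then re-fit on the training folds only, and grid.fit would be called on the unscaled training features):

from sklearn.pipeline import Pipeline

pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

params = {
    'knn__n_neighbors': [2, 3, 4],        # grid keys are prefixed with the step name
    'knn__weights': ['uniform', 'distance'],
    'knn__leaf_size': [5, 10, 15, 20]
}

grid = GridSearchCV(pipe, params, cv=sss)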