Scikit-Learn: Error on K-NN cross-validation (the least populated class ...)
I am trying to use K-NN on a sample of leaf features.
The data has 990 rows and 194 columns.
The second column holds the name of the tree each leaf comes from; it will serve as the label.
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn import neighbors, metrics

# 'data' is the leaf dataset loaded beforehand (990 rows, 194 columns)
X = data.iloc[:, 2:194]   # feature columns
y = data.iloc[:, 1]       # tree species, used as the label

labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)

# separate train and test data
X_train, X_test, y_train, y_test = \
    model_selection.train_test_split(X, y, test_size=0.3)

std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)

param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]}
score = 'accuracy'
clf = model_selection.GridSearchCV(neighbors.KNeighborsClassifier(),
                                   param_grid, cv=5, scoring=score)

# here I get the following warning
clf.fit(X_train_std, y_train)
C:\Users\chrys\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:581: Warning: The least populated class in y has only 4 members, which is too few. The minimum number of groups for any class cannot be less than n_splits=5.
  % (min_groups, self.n_splits)), Warning)
I know that this happens when "cv" > 4, "cv" being the number of parts the data is divided into during cross-validation.
I don't understand it, because my sample should be big enough to be divided even by 10.
Here is a link to the sample: leaf-sample
Thanks in advance for your help.
my sample should be big enough to be divided even by 10
That would only be the case if you trained on the whole dataset. Because you have (correctly) split your data into a training set and a test set, it is very likely that at least one class ends up with fewer instances than the number of cross-validation splits. In particular, your dataset has 99 classes with only 10 instances each, so after an unstratified 70/30 split a class can easily be left with only 4 or 5 training samples, which is below n_splits=5.
You can check the labels in your training set with:

# count how many training samples each class has
count = {k: 0 for k in set(y_train)}
for yy in y_train:
    count[yy] += 1
sorted(count.items(), key=lambda i: i[1])
When I run your code, I get four classes with fewer than the 5 members required for the cross-validation:
[(23, 4),
(39, 4),
(68, 4),
(85, 4),
(17, 5),
...
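If you want to keep cv=5, one possible remedy (not part of the original answer, just a sketch) is to make the train/test split stratified, so every class keeps roughly the same share of its 10 samples in the training set; with a 70/30 stratified split each class keeps about 7 training samples, which is enough for 5 folds. A minimal sketch, assuming X and y are the arrays built in the question:

from sklearn import model_selection, neighbors, preprocessing

# stratify=y keeps the class proportions identical in train and test,
# so no class drops below n_splits=5 in the training set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)

clf = model_selection.GridSearchCV(
    neighbors.KNeighborsClassifier(),
    {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]},
    cv=5, scoring='accuracy')
clf.fit(X_train_std, y_train)

Alternatively, cv could simply be lowered to the smallest per-class count reported by the check above (e.g. cv=4 or cv=3).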