Scikit-Learn: Error on k-NN cross-validation (the least populated class ...)

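I am trying to use k-NN on a sample of leaf features. The data has 990 rows and 194 columns. The second column holds the name of the tree each leaf comes from, and it will be the label.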

from sklearn import model_selection 
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn import neighbors, metrics

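# data is a pandas DataFrame holding the leaf sample (loading not shown)
X = data.iloc[:, 2:194]  # feature columns
y = data.iloc[:, 1]      # second column: tree name, used as the label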
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)

# separate train and test data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3)

std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)

param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]}
score = 'accuracy'
clf = model_selection.GridSearchCV(neighbors.KNeighborsClassifier(),
                                   param_grid, cv=5, scoring=score)

# this call produces the warning shown below
clf.fit(X_train_std, y_train)

C:\Users\chrys\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:581: Warning: The least populated class in y has only 4 members, which is too few. The minimum number of groups for any class cannot be less than n_splits=5. % (min_groups, self.n_splits)), Warning)

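I understand that it occurs when "cv" > 4, "cv" being the number of parts the data is divided into during cross-validation. What I don't understand is why, because my sample should be big enough to be divided by even 10.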

Here is a link to the sample: leaf-sample

Thanks in advance for your help.

"my sample should be big enough to be divided by even 10"

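That would only be the case if you used the whole dataset for training. Since you have (rightly) split your data into a training set and a test set, there is a good chance that at least one class ends up with fewer instances than the number of cross-validation splits. This is especially likely for your dataset, which has about 100 classes with only 10 instances each: with test_size=0.3, a class keeps about 7 training instances on average, and an unstratified random split can easily leave some classes with fewer than 5.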
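The effect described above can be reproduced without the leaf data. The following is a minimal sketch on synthetic labels, assuming 99 classes with 10 instances each (inferred from the 990 rows; the real class layout may differ). The seed in random_state is arbitrary, so the exact counts will vary from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the leaf labels: 99 classes x 10 instances = 990 rows.
# (Assumption for illustration; not the real dataset.)
y = np.repeat(np.arange(99), 10)
X = np.zeros((len(y), 1))  # features do not matter for counting labels

# Same kind of split as in the question: 30% test, no stratification.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Number of training instances left in each class (0 if a class vanished).
train_counts = np.bincount(y_train, minlength=99)

print("average per class:", train_counts.mean())
print("smallest class:", train_counts.min())
print("classes with fewer than 5:", int((train_counts < 5).sum()))
```

Any class whose training count is below the cv value passed to GridSearchCV is what triggers the warning.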

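You can check the labels of your training set with the following:

from collections import Counter

# count the training labels (y_train, not y, which still holds every row)
count = Counter(y_train)
sorted(count.items(), key=lambda i: i[1])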

When I ran your code, four classes were left with fewer than 5 instances for cross-validation:

[(23, 4),
 (39, 4),
 (68, 4),
 (85, 4),
 (17, 5),
 ...
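One possible remedy, offered here as a sketch rather than as the only fix, is to pass stratify=y to train_test_split so that every class keeps the same share of its instances in the training set. The example below again uses synthetic data (99 classes with 10 instances each and 8 made-up feature columns are assumptions for illustration), and a shorter n_neighbors grid to keep it quick.

```python
import numpy as np
from sklearn import model_selection, neighbors, preprocessing

# Synthetic stand-in for the leaf sample (assumption, not the real data).
rng = np.random.default_rng(0)
y = np.repeat(np.arange(99), 10)
X = rng.normal(size=(len(y), 8)) + y[:, None]  # toy class-dependent features

# stratify=y keeps the class proportions equal in the train and test sets.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

train_counts = np.bincount(y_train, minlength=99)
print("smallest class in training set:", train_counts.min())

# Scale with statistics from the training set only, as in the question.
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)

param_grid = {'n_neighbors': [3, 5, 7]}
clf = model_selection.GridSearchCV(neighbors.KNeighborsClassifier(),
                                   param_grid, cv=5, scoring='accuracy')
clf.fit(X_train_std, y_train)
print(clf.best_params_)
```

With 10 instances per class and test_size=0.3, a stratified split should leave about 7 training instances in every class, which is enough for cv=5. Another option is to keep the split as it is and lower cv to no more than the size of the smallest training class (4 for the counts shown above).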