Sklearn StratifiedKFold: ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead
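I am using sklearn's stratified k-fold split. When I try to split with multi-class targets, I get the error shown below. When I split with binary targets, it works without problems.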
import numpy as np
import keras
from sklearn.model_selection import StratifiedKFold
num_classes = len(np.unique(y_train))
y_train_categorical = keras.utils.to_categorical(y_train, num_classes)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
# splitting data into different folds
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical)):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
keras.utils.to_categorical produces a one-hot encoded class vector, i.e. the multilabel-indicator mentioned in the error message. StratifiedKFold is not designed to work with such input; from the split method docs:
split(X, y, groups=None)
[...]
y : array-like, shape (n_samples,)
The target variable for supervised learning problems. Stratification is done based on the y labels.
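That is, your y must be a 1-D array of class labels.
Essentially, you just need to reverse the order of operations: split first (using your initial y_train), and convert with to_categorical afterwards.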
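A minimal sketch of that reordering, assuming integer labels in y_train. To avoid a Keras dependency, the one-hot step uses np.eye as a stand-in for keras.utils.to_categorical; the toy data and the fold_shapes list are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 12 samples, 3 classes, 4 samples per class.
x_train = np.arange(24, dtype=float).reshape(12, 2)
y_train = np.array([0, 1, 2] * 4)
num_classes = len(np.unique(y_train))

kf = StratifiedKFold(n_splits=4, shuffle=True, random_state=999)

fold_shapes = []
# Stratify on the 1-D integer labels, not on the one-hot matrix.
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    # One-hot encode only after the split; np.eye stands in for
    # keras.utils.to_categorical here.
    y_train_kf = np.eye(num_classes)[y_train[train_index]]
    y_val_kf = np.eye(num_classes)[y_train[val_index]]
    fold_shapes.append((y_train_kf.shape, y_val_kf.shape))
```

Each fold then carries one-hot targets for training, while the stratification itself only ever sees the 1-D labels.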
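In my case, x was a 2-D matrix and y was also a 2-D matrix, i.e. a genuine multi-class multi-output case. I simply passed a dummy np.zeros(shape=(n,1)) as y, and x as usual. Note that with a constant dummy target the folds are no longer stratified by the real y. Full code example: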
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [3, 7], [9, 4]])
# y = np.array([0, 0, 1, 1, 0, 1]) # <<< works
y = X # does not work if passed into `.split`
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=36851234)
for train_index, test_index in rskf.split(X, np.zeros(shape=(X.shape[0], 1))):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Call split() like this:
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical.argmax(1))):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
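To check that stratifying on the argmax really gives balanced folds, here is a small sketch with a hand-built one-hot matrix. The data is hypothetical, and np.eye is used as a stand-in for the output of to_categorical. Only the recovered 1-D labels go into split(), while the one-hot rows are what gets indexed per fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 3 classes, 5 samples each.
labels = np.repeat([0, 1, 2], 5)
x_train = np.random.RandomState(0).rand(len(labels), 4)
y_train_categorical = np.eye(3)[labels]  # stand-in for to_categorical output

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)

val_class_counts = []
for train_index, val_index in kf.split(x_train, y_train_categorical.argmax(1)):
    # Index the one-hot matrix itself so the model still gets one-hot targets.
    y_val_kf = y_train_categorical[val_index]
    # Column sums of a one-hot matrix are the per-class sample counts.
    val_class_counts.append(y_val_kf.sum(axis=0).astype(int).tolist())

print(val_class_counts)
```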
I ran into the same problem and found that you can check the type of the target with this utility function:
from sklearn.utils.multiclass import type_of_target
type_of_target(y)
'multilabel-indicator'
From its docstring:
- 'binary': y contains <= 2 discrete values and is 1d or a column vector.
- 'multiclass': y contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector.
- 'multiclass-multioutput': y is a 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1.
- 'multilabel-indicator': y is a label indicator matrix, an array of two dimensions with at least two columns, and at most 2 unique values.
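The categories above can be tried out on a few small arrays. This is only an illustrative sketch: the arrays and the names in the examples dict are made up, and the one-hot matrix is built with np.eye rather than with Keras.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, one per category discussed above.
examples = {
    "binary labels": np.array([0, 1, 1, 0]),
    "multiclass labels": np.array([0, 1, 2, 1]),
    "one-hot matrix": np.eye(3)[[0, 1, 2, 1]],
    "2-D integer targets": np.array([[1, 2], [3, 1], [2, 2]]),
    "continuous targets": np.array([0.1, 0.6, 2.5]),
}
for name, y in examples.items():
    print(f"{name}: {type_of_target(y)}")
```

Only the first two kinds are accepted by StratifiedKFold, which is what the error message in the question states.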
With LabelEncoder you can convert your classes into a 1-D array of integers (assuming your target labels are in a 1-D array of categoricals/objects):
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)
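A short sketch of how the encoded labels could then feed StratifiedKFold. The string labels, X, and the fold count are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, 4 samples per class.
target_labels = np.array(["cat", "dog", "bird"] * 4)
X = np.arange(len(target_labels) * 2).reshape(-1, 2)

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)  # 1-D integer codes

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_index, val_index in skf.split(X, y):
    # inverse_transform maps the integer codes back to the original strings.
    print(sorted(label_encoder.inverse_transform(y[val_index]).tolist()))
```

LabelEncoder assigns codes in sorted order of the class names, so the mapping can be read back from label_encoder.classes_.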
Adding to what @desertnaut said: to convert your one-hot encoding back to a 1-D array, all you need to do is:
class_labels = np.argmax(y_train, axis=1)
This converts back to the initial representation of your classes.
If your target variable is continuous, use plain KFold cross-validation instead of StratifiedKFold.
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
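For completeness, a minimal sketch of the full loop with a continuous target. The random regression data and the val_sizes list are assumptions made for the example only.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils.multiclass import type_of_target

# Hypothetical regression data: 10 samples with a real-valued target.
rng = np.random.RandomState(42)
X = rng.rand(10, 3)
y = rng.rand(10)
print(type_of_target(y))  # reported as a continuous target

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
val_sizes = []
for train_index, val_index in kfold.split(X):  # KFold does not need y
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    val_sizes.append(len(val_index))
```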