How to use stratify for a single column

I am new to data science, so I may not be sure how to phrase my question. I have tried to state my problem as simply as possible. Here is part of my code.

print(data)

Output:

array([[0, 0, 0, ..., 255, 255, 255],
       [255, 255, 255, ..., 0, 0, 0],
       [255, 255, 255, ..., 255, 255, 255],
       ...,
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255]], dtype=object)

print(result)

Output:

['Arrowhead' 'Arrowhead' 'Arrowhead' ... 'Vessel' 'Vessel' 'Vessel']

Converting the labels to numbers:

LE = LabelEncoder()
target = LE.fit_transform(result)

print(target) 

Output:

[ 0  0  0 ... 38 38 38]

Splitting:

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42, stratify=target)

I get this error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

To get rid of this error I had to remove stratify, which is fine for now:

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

To build the CNN, I have to do this:

lb = preprocessing.LabelBinarizer()

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.fit_transform(y_test)

print(y_train_categorical.shape)
print(y_test_categorical.shape)

Output:

(1945, 38)
(487, 34)

Here is the problem. I need the second dimension of both arrays to match (y_train_categorical.shape[1] and y_test_categorical.shape[1]), because I use this model:

model = Sequential()

model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(100,100,1)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(38, activation='softmax'))

For model.fit():

# works fine
model.fit(X_train, y_train_categorical, 
          batch_size=32, epochs=5, verbose=1)

But when evaluating on the test set,

loss, accuracy = model.evaluate(X_test, y_test_categorical, verbose=0)
print('Loss: ', loss,'\nAcc: ', accuracy)

I get this error:

ValueError: Error when checking target: expected dense_2 to have shape (38,) but got array with shape (34,)

How can I make y_train_categorical.shape[1] and y_test_categorical.shape[1] the same, or is there a simple way to resolve this last error (when evaluating the model on the test set)?

In general, irrespective of the error and of the methodology, this:

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.fit_transform(y_test)

is wrong: we never fit our preprocessing steps on the test set; we reuse the transformation that was fitted on the training set, i.e.:

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.transform(y_test) # transform only

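This may well resolve your error too, provided that all the labels in your test set also appear in your training set, which should be the case in a well-posed predictive ML problem (otherwise the problem itself is ill-defined).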
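As a minimal sketch of the difference, the toy labels below (made up for illustration, not the asker's data) show how refitting on the test labels changes the number of columns, while reusing the fitted binarizer keeps it equal to the training encoding:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy integer labels standing in for y_train / y_test (illustrative only).
y_train = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y_test = np.array([0, 2, 3, 2])  # class 1 happens to be absent from the test set

lb = LabelBinarizer()
y_train_categorical = lb.fit_transform(y_train)  # learns the 4 training classes
y_test_categorical = lb.transform(y_test)        # reuses those same 4 classes

# The mistake from the question: a fresh fit on the test labels sees only 3 classes.
refit = LabelBinarizer().fit_transform(y_test)

print(y_train_categorical.shape)  # (8, 4)
print(y_test_categorical.shape)   # (4, 4)
print(refit.shape)                # (4, 3)
```

With transform only, the test encoding always has one column per training class, which is what the final Dense layer expects.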

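If lb.transform(y_test) gives an error stating that it encountered labels not previously seen (and encoded), it means exactly that your test set contains new, unseen labels; that is the real issue you have to fix here, not some coding error.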
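One way to check for this directly, without relying on an error being raised, is to compare the two label sets. This is only a sketch on made-up labels; y_train and y_test here are placeholders for your own arrays:

```python
import numpy as np

# Placeholder labels (illustrative only).
y_train = np.array([0, 0, 1, 1, 2])
y_test = np.array([1, 2, 5])  # label 5 never appears in training

# Sorted unique values that are in y_test but not in y_train.
unseen = np.setdiff1d(y_test, y_train)
print(unseen)  # [5]
```

An empty result means every test label was seen during training.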

Solution to the error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

The error says that one class in your target variable occurs only once. To explain this, consider the following example:

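from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

random_list = ['a','a','a','b','b','c','d','d','e','e','e']
LE = LabelEncoder()
target = LE.fit_transform(random_list)
print(target)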

This gives:

array([0, 0, 0, 1, 1, 2, 3, 3, 4, 4, 4])

Now, if I try to run train_test_split on it, it raises an error.

train_test_split(target, test_size=0.2, stratify=target)
#ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

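This is because 'c' occurs only once, which makes it ambiguous whether to put it in the training set or the test set when stratifying. So for this to work, every class needs to occur more than once.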
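A possible way to find such classes before splitting is to count the labels. The sketch below uses the encoded toy labels from this example with placeholder features, and it simply drops the singleton class, which is only one option (collecting more samples for that class is another). The test_size of 0.4 is an assumption chosen so that the test set has room for one sample of each remaining class.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Encoded toy labels from the example above; the features are placeholders.
target = np.array([0, 0, 0, 1, 1, 2, 3, 3, 4, 4, 4])
data = np.arange(len(target)).reshape(-1, 1)

# Classes with fewer than 2 members cannot be stratified.
counts = Counter(target.tolist())
rare = [label for label, n in counts.items() if n < 2]
print(rare)  # [2]

# Drop the rare classes, then split with stratification.
keep = ~np.isin(target, rare)
data_kept, target_kept = data[keep], target[keep]

X_train, X_test, y_train, y_test = train_test_split(
    data_kept, target_kept, test_size=0.4, random_state=42, stratify=target_kept
)
print(sorted(y_test.tolist()))  # one sample from each remaining class
```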

An additional error with the example above

Even if I remove 'c' from the list above, the split still does not work. We run into another error.

random_list = ['a','a','a','b','b','d','d','e','e','e']
LE = LabelEncoder()
target = LE.fit_transform(random_list) #produces array([0, 0, 0, 1, 1, 2, 2, 3, 3, 3])
train_test_split(target, test_size=0.2, stratify=target)
#ValueError: The test_size = 2 should be greater or equal to the number of classes = 4

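For stratification to succeed, every class has to appear in both the training set and the test set. If there are not enough data points to make such an allocation, the error above is raised. With a test set of size 2, at most 2 classes can be stratified.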
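To make the constraint concrete, here is a small sketch on the same 10 encoded labels. It first reproduces the failure with test_size=0.2 and then passes an integer test_size equal to the number of classes, which is the smallest test set that can hold one sample per class. The random_state value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Encoded labels for ['a','a','a','b','b','d','d','e','e','e'].
target = np.array([0, 0, 0, 1, 1, 2, 2, 3, 3, 3])
n_classes = len(np.unique(target))  # 4

# test_size=0.2 means 2 test samples, fewer than the 4 classes, so this fails.
failed = False
try:
    train_test_split(target, test_size=0.2, stratify=target)
except ValueError as err:
    failed = True
    print(err)

# An integer test_size is an absolute number of samples; use one per class.
train, test = train_test_split(
    target, test_size=n_classes, stratify=target, random_state=0
)
print(sorted(test.tolist()))
```

The same check applies to the training side: it also needs at least as many samples as there are classes.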