Why does my neural network predict the incorrect class label for test images belonging to one class, despite having a high validation accuracy?

I am training a classifier using the Inception v4 model on 3 classes (A, B and C), each with roughly 900 images in the training dataset and 80 images in the validation dataset. I ran my training code for 200 epochs with a batch size of 8. I got an average validation accuracy of over 99% with a very low loss:

Epoch 199/200
303/303 [==============================] - 53s 174ms/step - loss: 0.0026 - accuracy: 0.9996 - val_loss: 5.1226e-04 - val_accuracy: 1.0000
Epoch 200/200
303/303 [==============================] - 53s 176ms/step - loss: 0.0019 - accuracy: 1.0000 - val_loss: 0.1079 - val_accuracy: 0.9750

When I run my test code on the images in validation-set directory A, it predicts 80% of the images as class A, 20% as class C, and none at all as class B. The same goes for class C (80% as C, 20% as A). On directory B, ALL images are predicted as class A or class C. In all three test cases, the test program classifies not a single image as class B, despite the high validation accuracy, and despite using the exact same directories that were used for validation during training (the latter also leads me to believe this is not primarily caused by overfitting).

Here is the output of the test program on directory B:

25/25 [==============================] - 8s 186ms/step - loss: 0.0212 - accuracy: 0.9963
['loss', 'accuracy']
[0.02124088630080223, 0.9963099360466003]
Testing images located in val/B/
[[6.2504888e-01 8.8258091e-08 3.7495103e-01]]
A:62.5%
[[8.8602149e-01 1.3459101e-05 1.1396510e-01]]
A:88.6%
[[4.7189465e-01 4.0863368e-05 5.2806443e-01]]
C:52.81%
[[1.0370950e-01 2.7608112e-07 8.9629024e-01]]
C:89.63%
[[7.1212035e-01 3.3269991e-06 2.8787634e-01]]
A:71.21%

And so on.
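
As a sanity check on the index-to-name mapping, note that flow_from_directory assigns class indices in alphanumeric order of the subfolder names; the class_names list passed to my test code is assumed to follow the same ordering. The mapping can be printed from any generator (a quick check, not part of my original code; val_gen refers to the generator in the code below):

print(val_gen.class_indices)  # e.g. {'A': 0, 'B': 1, 'C': 2}
class_names = sorted(val_gen.class_indices, key=val_gen.class_indices.get)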

I even tried dividing the img = np.expand_dims(test_image, axis=0) line by 255, as suggested in another answer. It worked in that case, but not here.

Here is my training code:

# Assumed imports (not part of my original snippet); inception_stem,
# inception_A/B/C and reduction_A/B are Inception-v4 building blocks
# defined elsewhere in my code.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, Dropout, Flatten, AveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from collections import Counter
from time import time

def create_inception_v4(nb_classes, load_weights, checkpoint_path):

    init = Input((299, 299, 3))

    x = inception_stem(init)

    # 4 x Inception A
    for i in range(4):
        x = inception_A(x)

    # Reduction A
    x = reduction_A(x)

    # 7 x Inception B
    for i in range(7):
        x = inception_B(x)

    # Reduction B
    x = reduction_B(x)

    # 3 x Inception C
    for i in range(3):
        x = inception_C(x)

    # Average Pooling
    x = AveragePooling2D((8, 8))(x)

    # Dropout - Use 0.2, as mentioned in official paper. 
    x = Dropout(0.2)(x)
    x = Flatten()(x)

    # Output
    out = Dense(nb_classes, activation='softmax')(x)

    model = Model(init, out, name='Inception-v4')

    if load_weights:
        weights = checkpoint_path
        model.load_weights(weights, by_name=True)
        print("Model weights loaded.")
 
    return model





# train_dir, val_dir and num_epochs are globals in my script
def train(args, check, checkpoint_path, network_name="inceptionv4"):
    n_gpus = int(args['gpus'])

    sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))

    datagen=ImageDataGenerator(rescale=1/255,
                rotation_range=40,
                width_shift_range=0.1,
                height_shift_range=0.1,
                shear_range=0.1,
                zoom_range=0.1,
                horizontal_flip=True,
                fill_mode='nearest',
                samplewise_std_normalization=True)

    val_datagen = ImageDataGenerator(rescale=1/255)

    batch_size = int(args["batch_size"])

    train_generator = datagen.flow_from_directory(train_dir, target_size=(299,299), class_mode="categorical", batch_size=batch_size)
    # NB: this uses the augmenting/normalizing `datagen`, not `val_datagen`
    val_gen = datagen.flow_from_directory(val_dir, target_size=(299,299), class_mode="categorical", batch_size=batch_size)

    mc = keras.callbacks.ModelCheckpoint(f"{network_name}_checkpoints/{network_name}.h5", save_weights_only=True, save_best_only=True)

    tensorboard = TensorBoard(log_dir="{}/{}".format(args["log_dir"], time()))

    validation_steps = 10   # note: only 10 * batch_size images are validated per epoch


    model = create_inception_v4(int(args["num_classes"]),check,checkpoint_path)
    model.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.SGD(learning_rate=float(args['learning_rate']), decay=1e-6, momentum=0.9, nesterov=True), metrics=["accuracy"])   

    # Weight each class inversely to its image count to compensate for class imbalance
    counter = Counter(train_generator.classes)
    max_val = float(max(counter.values()))
    class_weights = {class_id: max_val / num_images for class_id, num_images in counter.items()}
   
    hist = model.fit(train_generator,epochs=num_epochs,verbose=True,validation_data=val_gen,validation_steps=validation_steps,callbacks=[mc, tensorboard], class_weight=class_weights)
    model.save(f"checkpoints/{network_name}_{num_epochs}epochs.h5")

Here is my test code:

def test_model(test_dir, num_epochs, class_names, network_name="inceptionv4"):

    model=load_model(f'checkpoints/{network_name}_{num_epochs}epochs.h5')


    datagen=ImageDataGenerator(rescale=1/255,
                rotation_range=40,
                width_shift_range=0.1,
                height_shift_range=0.1,
                shear_range=0.1,
                zoom_range=0.1,
                horizontal_flip=True,
                fill_mode='nearest',         
                samplewise_std_normalization=True)

    val_datagen = ImageDataGenerator(rescale=1/255)
    val_dir = "val/"
    # NB: again evaluating with the augmenting/normalizing `datagen`, not `val_datagen`
    val_gen = datagen.flow_from_directory(val_dir, target_size=(299,299), class_mode="categorical")


    test_accuracy=model.evaluate(val_gen,steps=25)
    print(model.metrics_names)
    print(test_accuracy)

    
    img_width, img_height = 299, 299

    print(f"Testing images located in {test_dir}")
    counter = 0
    results_dict = {}
    start_time = time.time()
    
    for filename_img in os.listdir(test_dir):
        counter += 1
        filename = os.path.join(test_dir,filename_img)
        img = image.load_img(filename, target_size=(img_width, img_height))
        test_image = image.img_to_array(img)
        img = np.expand_dims(test_image, axis=0)/255
        classes = model.predict(img, batch_size=10)

        print(classes)
        predicted_class = class_names[np.argmax(classes)]

        if predicted_class not in results_dict.keys():
            results_dict[predicted_class] = 1
        else:
            results_dict[predicted_class] += 1

        print(f"{predicted_class}:{round(np.amax(classes)*100,2)}%")
        if counter % 100 == 0:
            print(f"{counter} files processed!")

    time_taken = time.time() - start_time
    time_taken = round(time_taken,2)
    print(f"{counter} images processed in {time_taken} seconds, at a rate of {round(counter/time_taken,2)} images per second.")
    
    for predicted_class in results_dict.keys():
        print(f"{predicted_class} = {results_dict[predicted_class]} predictions")

What am I doing wrong?

Edit 1: I tried to account for the imbalanced classes by adding the class_weight parameter, as shown in the edited code above. It still fails to predict class B. I even tried using val_datagen instead of datagen, which led to even worse results.

Edit 2: Now I copied my entire dataset folder elsewhere, deleted class B, and kept only classes A and C. I trained the model, once again got very high training accuracy, and now my test program only ever predicts class C, never class A. I have a feeling I am making a really silly mistake in my test.py code.

This was a really frustrating bug to track down. I realized that I was getting high validation accuracy from model.evaluate() on an entire directory, but not from model.predict() on individual images. This is because the image augmentation and normalization used during training were also applied by the validation generator, but not to the individual images fed directly to the model.
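
A minimal sketch that makes the mismatch visible (the file name is a placeholder): the validation generator applies rescaling and samplewise std normalization, while my predict path only rescaled.

batch_x, _ = next(val_gen)   # what model.evaluate() saw
print(batch_x[0].std())      # ~1.0, thanks to samplewise_std_normalization

raw = image.img_to_array(image.load_img("val/B/example.jpg", target_size=(299, 299))) / 255
print(raw.std())             # what model.predict() saw: only rescaled, typically ~0.2-0.3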

Specifically, I realized that samplewise_std_normalization was not being applied to my test images. So, inspired by another answer, I used the generator's standardize function: test_image = datagen.standardize(test_image). Now my model works perfectly. The full test.py code is shown below:

# Assumed imports (not part of my original snippet)
import os
import time
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def test_model(test_dir, num_epochs, class_names, network_name="inceptionv4"):

    model = load_model(f'checkpoints/{network_name}_{num_epochs}epochs.h5')


    datagen=ImageDataGenerator(rescale=1/255,
                rotation_range=40,
                width_shift_range=0.1,
                height_shift_range=0.1,
                shear_range=0.1,
                zoom_range=0.1,
                horizontal_flip=True,
                fill_mode='nearest',         
                samplewise_std_normalization=True)

    val_datagen = ImageDataGenerator(rescale=1/255)
    val_dir = "val/"
    val_gen = datagen.flow_from_directory(val_dir,target_size=(299,299),class_mode="categorical")


    test_accuracy=model.evaluate(val_gen,steps=25)
    print(model.metrics_names)
    print(test_accuracy)

    
    img_width, img_height = 299, 299

    print(f"Testing images located in {test_dir}")
    counter = 0
    results_dict = {}
    start_time = time.time()
    
    for filename_img in os.listdir(test_dir):
        counter += 1
        filename = os.path.join(test_dir,filename_img)
        img = image.load_img(filename, target_size=(img_width, img_height))
        test_image = image.img_to_array(img)
        test_image = np.expand_dims(test_image, axis=0)  
        # Don't divide by 255, this is taken care of by the standardize function
        test_image = datagen.standardize(test_image)
        classes = model.predict(test_image, batch_size=10)

        print(classes)
        predicted_class = class_names[np.argmax(classes)]

        if predicted_class not in results_dict.keys():
            results_dict[predicted_class] = 1
        else:
            results_dict[predicted_class] += 1

        print(f"{predicted_class}:{round(np.amax(classes)*100,2)}%")
        if counter % 100 == 0:
            print(f"{counter} files processed!")

    time_taken = time.time() - start_time
    time_taken = round(time_taken,2)
    print(f"{counter} images processed in {time_taken} seconds, at a rate of {round(counter/time_taken,2)} images per second.")
    
    for predicted_class in results_dict.keys():
        print(f"{predicted_class} = {results_dict[predicted_class]} predictions")