TensorFlow BinaryCrossentropy loss quickly reaches NaN
TL;DR - The model's loss quickly reaches NaN when retrained on new data. None of the "standard" fixes have worked.
Hi,
Recently, I (successfully) trained a CNN/dense-layered model that classifies spectrograms (image representations of audio). I wanted to try training the model again with new data, making sure it had the correct dimensions and so on.
However, for some reason the BinaryCrossentropy loss steadily decreases to around 1.000 and then suddenly becomes NaN within the first epoch. I have tried lowering the learning rate to 1e-8, using ReLU throughout with a sigmoid on only the final layer, but nothing seems to help. The problem occurs even when I simplify the network to dense layers only. I normalized my data manually, and I'm fairly confident I did it correctly, so all of my values should lie in [0, 1]. There could be a hole in that assumption, but I think it's unlikely.
I've attached my model architecture code here:
from tensorflow.keras import layers, models, regularizers

input_shape = (125, 128, 1)
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu', kernel_regularizer=regularizers.l2(0.001), input_shape=input_shape),
    layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(16, (3, 3), activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPooling2D((2, 2), strides=(1, 1), padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(16, (3, 3), activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(16, (2, 2), activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPooling2D((2, 2), strides=(1, 1), padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(16, (2, 2), activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'),
    layers.BatchNormalization(),
    layers.Conv2D(16, (2, 2), activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPooling2D((2, 2), strides=(1, 1), padding='same'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Flatten(),
    layers.Dense(512, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(256, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])
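For reference, this is roughly how I compile and fit the model (a sketch only; train_ds and val_ds stand in for my actual datasets, and the compile step here is where I also experimented with much smaller learning rates such as 1e-8):

import tensorflow as tf

# Sketch of the compile/fit step; train_ds / val_ds are placeholders for my real datasets.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)  # also tried values down to 1e-8
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
history = model.fit(train_ds, validation_data=val_ds, epochs=20)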
Interestingly, though, I tried fine-tuning a VGG16 model with this new data and it worked! (No NaN loss issues.) I've attached that code here as well, but I really don't know where (or whether) it differs in a way that could cause the problem:
from tensorflow import keras
from tensorflow.keras import regularizers

base_model = keras.applications.VGG16(
    weights="imagenet",
    input_shape=(125, 128, 3),
    include_top=False,
)

# Freeze the base_model
base_model.trainable = False

# Create new model on top
inputs = keras.Input(shape=(125, 128, 3))
x = inputs
x = base_model(x, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dense(256, activation='relu', kernel_regularizer=regularizers.l2(0.001))(x)
x = keras.layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.001))(x)
x = keras.layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.001))(x)
x = keras.layers.Dropout(0.5)(x)  # Regularize with dropout
outputs = keras.layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs, outputs)
model.summary()
I think I've read through all of the "textbook" solutions, but I still can't seem to find the root of the problem. Any help would be greatly appreciated.
Remove all of the unnecessary kernel_regularizers, BatchNormalization, and dropout layers from the convolution layers. Keep the kernel_regularizers and Dropout only on the Dense layers of the model definition, and change the number of kernels in the Conv2D layers.
Then try training your model again with the following code:
import tensorflow as tf
from tensorflow.keras import Sequential, regularizers
from tensorflow.keras.layers import Rescaling, Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Rescaling(1./255, input_shape=(img_h, img_w, 3)),
    Conv2D(16, (3, 3), activation='relu'),  #, kernel_regularizer=regularizers.l2(0.001)),
    MaxPooling2D((3, 3), strides=(2, 2), padding='same'),
    #BatchNormalization(),
    Conv2D(16, (3, 3), activation='relu'),  #, kernel_regularizer=regularizers.l2(0.001)),
    MaxPooling2D((2, 2), strides=(1, 1), padding='same'),
    #BatchNormalization(),
    Conv2D(32, (3, 3), activation='relu'),  #, kernel_regularizer=regularizers.l2(0.001)),
    MaxPooling2D((3, 3), strides=(2, 2), padding='same'),
    #BatchNormalization(),
    Conv2D(32, (2, 2), activation='relu'),  #, kernel_regularizer=regularizers.l2(0.001)),
    MaxPooling2D((2, 2), strides=(1, 1), padding='same'),
    #BatchNormalization(),
    Conv2D(64, (2, 2), activation='relu'),  #, kernel_regularizer=regularizers.l2(0.001)),
    MaxPooling2D((3, 3), strides=(2, 2), padding='same'),
    #BatchNormalization(),
    Conv2D(64, (2, 2), activation='relu'),  #, kernel_regularizer=regularizers.l2(0.001)),
    MaxPooling2D((2, 2), strides=(1, 1), padding='same'),
    #BatchNormalization(),
    #Dropout(0.3),
    Flatten(),
    Dense(512, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    Dense(256, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(...)
Output:
Epoch 1/20
63/63 [==============================] - 9s 97ms/step - loss: 1.0032 - accuracy: 0.5035 - val_loss: 0.8219 - val_accuracy: 0.6160
Epoch 2/20
63/63 [==============================] - 6s 88ms/step - loss: 0.7575 - accuracy: 0.5755 - val_loss: 0.7256 - val_accuracy: 0.6120
Epoch 3/20
63/63 [==============================] - 6s 88ms/step - loss: 0.7181 - accuracy: 0.5805 - val_loss: 0.6917 - val_accuracy: 0.6360
Epoch 4/20
63/63 [==============================] - 6s 88ms/step - loss: 0.6749 - accuracy: 0.6190 - val_loss: 0.6671 - val_accuracy: 0.6300
Epoch 5/20
63/63 [==============================] - 6s 95ms/step - loss: 0.6571 - accuracy: 0.6500 - val_loss: 0.6850 - val_accuracy: 0.5980
Epoch 6/20
63/63 [==============================] - 5s 80ms/step - loss: 0.6319 - accuracy: 0.6720 - val_loss: 0.6243 - val_accuracy: 0.6730
Epoch 7/20
63/63 [==============================] - 6s 90ms/step - loss: 0.5923 - accuracy: 0.6935 - val_loss: 0.6144 - val_accuracy: 0.7120
Epoch 8/20
63/63 [==============================] - 6s 89ms/step - loss: 0.5643 - accuracy: 0.7205 - val_loss: 0.6136 - val_accuracy: 0.6700
Epoch 9/20
63/63 [==============================] - 6s 93ms/step - loss: 0.5552 - accuracy: 0.7380 - val_loss: 0.5669 - val_accuracy: 0.7080
Epoch 10/20
63/63 [==============================] - 4s 58ms/step - loss: 0.5423 - accuracy: 0.7400 - val_loss: 0.5819 - val_accuracy: 0.7120
Epoch 11/20
63/63 [==============================] - 4s 57ms/step - loss: 0.4905 - accuracy: 0.7745 - val_loss: 0.6146 - val_accuracy: 0.7020
Epoch 12/20
63/63 [==============================] - 4s 57ms/step - loss: 0.4808 - accuracy: 0.7900 - val_loss: 0.6318 - val_accuracy: 0.7070
Epoch 13/20
63/63 [==============================] - 4s 60ms/step - loss: 0.4602 - accuracy: 0.7990 - val_loss: 0.5707 - val_accuracy: 0.7160
Epoch 14/20
63/63 [==============================] - 4s 61ms/step - loss: 0.4291 - accuracy: 0.8190 - val_loss: 0.6392 - val_accuracy: 0.6910
Epoch 15/20
63/63 [==============================] - 5s 69ms/step - loss: 0.4003 - accuracy: 0.8355 - val_loss: 0.7048 - val_accuracy: 0.7110
Epoch 16/20
63/63 [==============================] - 4s 58ms/step - loss: 0.3658 - accuracy: 0.8430 - val_loss: 0.8027 - val_accuracy: 0.7180
Epoch 17/20
63/63 [==============================] - 4s 58ms/step - loss: 0.3069 - accuracy: 0.8750 - val_loss: 0.9428 - val_accuracy: 0.6970
Epoch 18/20
63/63 [==============================] - 4s 59ms/step - loss: 0.2601 - accuracy: 0.9005 - val_loss: 0.9420 - val_accuracy: 0.7170
Epoch 19/20
63/63 [==============================] - 4s 60ms/step - loss: 0.2061 - accuracy: 0.9230 - val_loss: 0.9134 - val_accuracy: 0.7290
Epoch 20/20
63/63 [==============================] - 4s 62ms/step - loss: 0.1770 - accuracy: 0.9330 - val_loss: 1.0805 - val_accuracy: 0.6930
It turned out to be a problem with some of my input data (a divide-by-zero error during normalization). Sorry for the trouble, and thanks for the help.
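For anyone who runs into the same thing: a quick finiteness check on the normalized arrays catches this immediately, and guarding the min-max normalization with a small epsilon avoids the divide-by-zero in the first place. A minimal sketch (X is a placeholder for a batch of spectrograms, not my actual preprocessing code):

import numpy as np

def normalize_per_sample(X, eps=1e-8):
    # Per-sample min-max normalization into [0, 1], guarded against a zero range
    # (a constant-valued spectrogram would otherwise cause a divide-by-zero -> NaN).
    X = X.astype(np.float32)
    reduce_axes = tuple(range(1, X.ndim))
    mins = X.min(axis=reduce_axes, keepdims=True)
    maxs = X.max(axis=reduce_axes, keepdims=True)
    return (X - mins) / np.maximum(maxs - mins, eps)

X_norm = normalize_per_sample(X)
assert np.isfinite(X_norm).all(), "NaN/Inf found in the normalized inputs"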