Is it normal to get an ETA of 6:43:26 hours to complete the first epoch?

I have created the following VGG16-based CNN and want to train it for 50 epochs, but it shows almost 7 hours (ETA: 6:43:26) to complete the first epoch. Can anyone tell me whether this is normal for 209,222 training images and 40,000 validation images (the DeepFashion dataset), or whether something is wrong with my steps_per_epoch? I am training this model on an HPC with 16 workers.

    train_gen = ImageDataGenerator(rescale=1./255)
    val_gen = ImageDataGenerator(rescale=1./255)

    train_batches = train_gen.flow_from_directory(train_path,
                                                  target_size=(img_r, img_c),
                                                  batch_size=batch_size,
                                                  class_mode='categorical',
                                                  shuffle=True)

    val_batches = val_gen.flow_from_directory(validation_path,
                                              target_size=(img_r, img_c),
                                              batch_size=batch_size_val,
                                              class_mode='categorical',
                                              shuffle=False)

    return train_batches, val_batches



def fit_model(model, train_batches, val_batches):

    print("started model training")
    history = model.fit(train_batches,
                        steps_per_epoch=209222/32,
                        epochs=50,
                        validation_data=val_batches,
                        validation_steps=40000/32,
                        verbose=1,
                        use_multiprocessing=True,
                        workers=16)
    return history

Here is the model part:

def create_model(input_shape, output_classes):
    logging.debug('input_shape {}'.format(input_shape))
    logging.debug('input_shape {}'.format(type(input_shape)))
    
    #optimizer_mod = keras.optimizers.SGD(lr=0.001, momentum=momentum, decay=decay, nesterov=False)
    
    vgg16 = VGG16(weights='imagenet',include_top=False)
  
    for layer in vgg16.layers[:15]:
        layer.trainable = False
    
    x= vgg16.get_layer('block4_conv3').input
    x = vgg16.get_layer('block4_conv3')(x)
  
    if True:
        x = Reshape([28*28,512])(x)
        att = MultiHeadsAttModel(l=28*28, d=512 , dv=64, dout=512, nv = 8 )
        x = att([x,x,x])
        x = Reshape([28,28,512])(x)   
        x = BatchNormalization()(x)
        
    #x = vgg16.get_layer('block5_conv1')(x)
    #x = vgg16.get_layer('block5_conv2')(x)
    #x = vgg16.get_layer('block5_conv3')(x)
    #x = vgg16.get_layer('block5_pool')(x)
    
    x = Flatten()(x)
    x = Dense(256, activation="relu")(x)
    x = Dropout(0.5)(x)
    outputs = Dense(output_classes, activation='softmax')(x)
    
    
    model = tf.keras.Model(inputs=vgg16.input, outputs=outputs)
    
    top3_acc = functools.partial(keras.metrics.top_k_categorical_accuracy, k=3)
    top3_acc.__name__ = 'top3_acc' 
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    
    model.compile(
                  optimizer=opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy',top3_acc]) 

    return model

If you are using VGG, you should rescale the pixel values to be between -1 and +1, because that is how it was trained. So use

```
rescale=1/127.5-1
```
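Note that `ImageDataGenerator`'s `rescale` argument only multiplies the input, so the `-1` shift has to be applied separately. One way to get the -1 to +1 range (a minimal sketch, not the answer's exact code) is to pass a `preprocessing_function`:

```
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def scale_to_plus_minus_one(x):
    # map pixel values from [0, 255] to [-1, +1]
    return x / 127.5 - 1.0

train_gen = ImageDataGenerator(preprocessing_function=scale_to_plus_minus_one)
val_gen = ImageDataGenerator(preprocessing_function=scale_to_plus_minus_one)
```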
That will not solve your long first-epoch problem, however. For steps_per_epoch and validation_steps use

```
steps_per_epoch = 209222//32 + 1
validation_steps = 40000//32 + 1
```
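Alternatively (just a sketch, reusing the generator variables from the question), a `flow_from_directory` iterator already knows how many batches it yields, so the step counts can be taken from the generators instead of being hard-coded:

```
# len() of a flow_from_directory iterator equals ceil(samples / batch_size)
steps_per_epoch = len(train_batches)
validation_steps = len(val_batches)

history = model.fit(train_batches,
                    steps_per_epoch=steps_per_epoch,
                    epochs=50,
                    validation_data=val_batches,
                    validation_steps=validation_steps,
                    verbose=1)
```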

I suspect that will not solve the problem either. Each training epoch will require 6539 steps and each validation pass will require 1251 steps, which is really rather large.

The processing time will also depend heavily on the image size; what values did you use? In addition, the VGG model has on the order of 40 million trainable parameters, so it is computationally intensive to begin with. I would recommend using the MobileNet model instead, which has on the order of 4 million parameters and is about as accurate.
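For illustration only (this is a sketch of the MobileNet suggestion, not code from the question; it omits the attention block and reuses `output_classes` from the question's `create_model`), the backbone swap could look like this:

```
import tensorflow as tf
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout

base = MobileNet(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained backbone to start with

x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
outputs = Dense(output_classes, activation='softmax')(x)

model = tf.keras.Model(inputs=base.input, outputs=outputs)
```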
As noted by Edwin Cheong above, you need to check whether your GPU is actually being used. I suspect it is not.
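A quick way to verify that (a minimal sketch; run it in the same environment you train in) is to ask TensorFlow which devices it can see:

```
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))
# an empty list means training is falling back to the CPU
```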