Vertex AI - how to monitor training progress?

Question

Is there a way to monitor the console output of model training progress while a Vertex AI training job is running?

Background

Suppose we have the following TensorFlow/Keras model training code:

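import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# train_dataset, normed_train_data and train_labels are assumed to be
# prepared earlier in the script (not shown here).
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])

optimizer = tf.keras.optimizers.RMSprop(0.001)
model.compile(
    loss='mse',
    optimizer=optimizer,
    metrics=['mae', 'mse']
)

EPOCHS = 1000
# Stop once val_loss has not improved for 10 consecutive epochs.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

early_history = model.fit(
    normed_train_data, train_labels,
    epochs=EPOCHS, validation_split=0.2,
    callbacks=[early_stop]
)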

When we run the training from the command line, we can see the progress in the console:

Epoch 1/1000
OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-3
OMP: Info #156: KMP_AFFINITY: 4 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 2 threads/core (2 total cores)
OMP: Info #213: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 0 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1 
OMP: Info #249: KMP_AFFINITY: pid 1 tid 17 thread 0 bound to OS proc set 0
OMP: Info #249: KMP_AFFINITY: pid 1 tid 17 thread 1 bound to OS proc set 1
OMP: Info #249: KMP_AFFINITY: pid 1 tid 28 thread 2 bound to OS proc set 2
OMP: Info #249: KMP_AFFINITY: pid 1 tid 29 thread 3 bound to OS proc set 3
OMP: Info #249: KMP_AFFINITY: pid 1 tid 30 thread 4 bound to OS proc set 0
OMP: Info #249: KMP_AFFINITY: pid 1 tid 18 thread 5 bound to OS proc set 1
OMP: Info #249: KMP_AFFINITY: pid 1 tid 31 thread 6 bound to OS proc set 2
OMP: Info #249: KMP_AFFINITY: pid 1 tid 32 thread 7 bound to OS proc set 3
OMP: Info #249: KMP_AFFINITY: pid 1 tid 33 thread 8 bound to OS proc set 0
8/8 [==============================] - 2s 31ms/step - loss: 579.6393 - mae: 22.7661 - mse: 579.6393 - val_loss: 571.7239 - val_mae: 22.5494 - val_mse: 571.7239
Epoch 2/1000
8/8 [==============================] - 0s 7ms/step - loss: 527.9056 - mae: 21.6268 - mse: 527.9056 - val_loss: 520.5531 - val_mae: 21.3917 - val_mse: 520.5531
...

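However, when we run the training on Vertex AI, there does not seem to be a menu or option for viewing the console output. I am not sure whether it is recorded in the Logs Explorer. Please help me understand how to monitor training progress in real time.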

Answer

You can view the training logs in the GCP Logs Explorer with the following query:

resource.type="ml_job"
resource.labels.job_id="your-training-custom-job-ID"
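
The same query can also be run from a script instead of the console. Below is a minimal sketch, assuming the `google-cloud-logging` Python package is installed and application default credentials are configured. The helper names `build_job_filter` and `print_recent_logs` and the placeholder project ID are illustrative and not part of the original answer.

```python
def build_job_filter(job_id: str) -> str:
    """Return the Logs Explorer query above as a single filter string."""
    # In the Logs Explorer, separate lines are joined with AND implicitly;
    # here the AND is written out explicitly.
    return f'resource.type="ml_job" AND resource.labels.job_id="{job_id}"'


def print_recent_logs(project_id: str, job_id: str, limit: int = 50) -> None:
    """Print the most recent log entries for one training job, newest first."""
    # Imported lazily so build_job_filter works without the client library.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project=project_id)
    entries = client.list_entries(
        filter_=build_job_filter(job_id),
        order_by=cloud_logging.DESCENDING,
        max_results=limit,
    )
    for entry in entries:
        print(entry.timestamp, entry.payload)


# Example call (placeholders, replace with your own values):
# print_recent_logs("your-project-id", "your-training-custom-job-ID")
```

Each Keras epoch line written to the console should appear as the payload of one log entry, so the output resembles what you would see in a local terminal, with timestamps added.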

You can find your-training-custom-job-ID in the GCP console, on the page of the running Vertex AI training job (the original answer showed this in a screenshot).

[Screenshot: Vertex AI training logs in the GCP Logs Explorer, retrieved with the query above]

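Click "Jump to now" to see the latest logs immediately. You can also use the "Stream logs" option to view log data in real time, and you can adjust the buffer window, which involves some trade-offs. Refer to the documentation on streaming logs in the GCP Logs Explorer for more information.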
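
If you prefer to follow the logs from a terminal rather than the console, a simple polling loop can approximate streaming. This is a rough sketch under the same assumptions (the `google-cloud-logging` package and configured credentials). The names `build_tail_filter` and `tail_job_logs`, the 5-minute look-back and the 10-second polling interval are all arbitrary choices for illustration.

```python
import time
from datetime import datetime, timedelta, timezone


def build_tail_filter(job_id: str, since: datetime) -> str:
    """Filter for one job's log entries with a timestamp later than `since`."""
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return (
        f'resource.type="ml_job" AND resource.labels.job_id="{job_id}" '
        f'AND timestamp>"{stamp}"'
    )


def tail_job_logs(project_id: str, job_id: str, poll_seconds: float = 10.0) -> None:
    """Poll Cloud Logging and print new entries until interrupted (Ctrl+C)."""
    # Imported lazily so build_tail_filter works without the client library.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project=project_id)
    # Start slightly in the past so entries logged just before startup show up.
    last_seen = datetime.now(timezone.utc) - timedelta(minutes=5)
    while True:
        entries = client.list_entries(
            filter_=build_tail_filter(job_id, last_seen),
            order_by=cloud_logging.ASCENDING,
        )
        for entry in entries:
            print(entry.timestamp, entry.payload)
            if entry.timestamp and entry.timestamp > last_seen:
                last_seen = entry.timestamp
        # Sketch-level bookkeeping: timestamps are truncated to microseconds,
        # so an entry may occasionally be printed twice.
        time.sleep(poll_seconds)


# Example call (placeholders, replace with your own values):
# tail_job_logs("your-project-id", "your-training-custom-job-ID")
```

A shorter polling interval shows new epochs sooner at the cost of more API calls, and entries that arrive late may still appear out of order.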