Vertex AI - how to monitor training progress?
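Question
Is there a way to monitor the console output of model training progress while a Vertex AI training job is running?
Background
Suppose we have the following TensorFlow/Keras training code: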
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# train_dataset, normed_train_data and train_labels are prepared earlier (not shown).
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])

optimizer = tf.keras.optimizers.RMSprop(0.001)

model.compile(
    loss='mse',
    optimizer=optimizer,
    metrics=['mae', 'mse']
)

EPOCHS = 1000
# Stop once val_loss has not improved for 10 consecutive epochs.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
early_history = model.fit(normed_train_data, train_labels,
                          epochs=EPOCHS, validation_split=0.2,
                          callbacks=[early_stop])
When we train the model from the command line, we can see the progress in the console:
Epoch 1/1000
OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-3
OMP: Info #156: KMP_AFFINITY: 4 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 2 threads/core (2 total cores)
OMP: Info #213: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #249: KMP_AFFINITY: pid 1 tid 17 thread 0 bound to OS proc set 0
OMP: Info #249: KMP_AFFINITY: pid 1 tid 17 thread 1 bound to OS proc set 1
OMP: Info #249: KMP_AFFINITY: pid 1 tid 28 thread 2 bound to OS proc set 2
OMP: Info #249: KMP_AFFINITY: pid 1 tid 29 thread 3 bound to OS proc set 3
OMP: Info #249: KMP_AFFINITY: pid 1 tid 30 thread 4 bound to OS proc set 0
OMP: Info #249: KMP_AFFINITY: pid 1 tid 18 thread 5 bound to OS proc set 1
OMP: Info #249: KMP_AFFINITY: pid 1 tid 31 thread 6 bound to OS proc set 2
OMP: Info #249: KMP_AFFINITY: pid 1 tid 32 thread 7 bound to OS proc set 3
OMP: Info #249: KMP_AFFINITY: pid 1 tid 33 thread 8 bound to OS proc set 0
8/8 [==============================] - 2s 31ms/step - loss: 579.6393 - mae: 22.7661 - mse: 579.6393 - val_loss: 571.7239 - val_mae: 22.5494 - val_mse: 571.7239
Epoch 2/1000
8/8 [==============================] - 0s 7ms/step - loss: 527.9056 - mae: 21.6268 - mse: 527.9056 - val_loss: 520.5531 - val_mae: 21.3917 - val_mse: 520.5531
...
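However, when the same training runs as a Vertex AI Training job, there seems to be no menu or option for viewing the console output, and I am not sure whether it is recorded in Logs Explorer. How can I monitor the training progress in real time?
Answer
You can view the training logs in GCP Logs Explorer with the following query: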
resource.type="ml_job"
resource.labels.job_id="your-training-custom-job-ID"
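You can find your-training-custom-job-ID in the GCP console, on the page of the Vertex AI training job that is in progress.
[Screenshot: logs of a Vertex AI training job in GCP Logs Explorer, retrieved with the query above.]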
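The same filter can also be used outside the console. The following is a minimal sketch, assuming the `google-cloud-logging` client library is installed and Application Default Credentials are configured; `build_job_filter` and `fetch_recent_logs` are hypothetical helper names introduced here for illustration, not part of any Google API.

```python
from typing import List, Tuple


def build_job_filter(job_id: str) -> str:
    """Build the Logs Explorer filter from the answer for one custom job ID."""
    clauses = [
        'resource.type="ml_job"',
        f'resource.labels.job_id="{job_id}"',
    ]
    # Both conditions must hold, so join them with an explicit AND.
    return " AND ".join(clauses)


def fetch_recent_logs(job_id: str, limit: int = 20) -> List[Tuple[object, object]]:
    """Return (timestamp, payload) pairs for the newest log entries of the job."""
    # Imported lazily so build_job_filter works without the client library.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client()  # assumes Application Default Credentials
    entries = client.list_entries(
        filter_=build_job_filter(job_id),
        order_by=cloud_logging.DESCENDING,  # newest entries first
        max_results=limit,
    )
    return [(entry.timestamp, entry.payload) for entry in entries]


if __name__ == "__main__":
    # Placeholder ID; replace it with the ID shown for your training job.
    print(build_job_filter("your-training-custom-job-ID"))
```

Calling `fetch_recent_logs` repeatedly gives a rough poll of the newest entries; for true streaming, the Stream logs option in Logs Explorer is likely the simpler route.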
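Click "Jump to now" to see the latest logs immediately. You can also use the "Stream logs" option to view log data in real time; the buffer window is adjustable, with some trade-offs. For more on streaming logs in GCP Logs Explorer, see the Logs Explorer documentation.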
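Once the log lines are visible, the per-epoch metrics can be pulled out of the Keras progress text if you want to track something like `val_loss` over time. This is only a sketch under the assumption that the log payload contains progress lines shaped like the console output in the question; `parse_keras_metrics` is a hypothetical helper name.

```python
import re
from typing import Dict

# Matches " - name: value" pairs such as " - val_loss: 520.5531".
# Requiring the leading " - " skips unrelated lines like the OMP info messages.
_METRIC_PATTERN = re.compile(
    r" - (\w+): ([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_keras_metrics(line: str) -> Dict[str, float]:
    """Extract metric names and values from one Keras progress line."""
    return {name: float(value) for name, value in _METRIC_PATTERN.findall(line)}


if __name__ == "__main__":
    sample = (
        "8/8 [==============================] - 0s 7ms/step - loss: 527.9056 "
        "- mae: 21.6268 - mse: 527.9056 - val_loss: 520.5531 "
        "- val_mae: 21.3917 - val_mse: 520.5531"
    )
    print(parse_keras_metrics(sample)["val_loss"])  # prints 520.5531
```

Progress bars tend to be noisy in captured logs; passing `verbose=2` to `model.fit` prints one line per epoch instead, which may make the entries easier to read and parse.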