Training Job Running on Google Cloud Platform but Not Consuming Any CPU
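My training job on AI Platform on Google Cloud Platform appears to be running but is not consuming any CPU. The program does not terminate, but some errors do show up when the job first starts running. They look like the following: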
INFO 2020-06-05 04:33:38 +0000 master-replica-0 Create CheckpointSaverHook.
ERROR 2020-06-05 04:33:38 +0000 master-replica-0 I0605 04:33:38.890919 139686838036224 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO 2020-06-05 04:33:41 +0000 worker-replica-0 Graph was finalized.
ERROR 2020-06-05 04:33:41 +0000 worker-replica-0 I0605 04:33:41.006648 140712303798016 monitored_session.py:240] Graph was finalized.
INFO 2020-06-05 04:33:41 +0000 worker-replica-4 Graph was finalized.
ERROR 2020-06-05 04:33:41 +0000 worker-replica-4 I0605 04:33:41.482944 139947128342272 monitored_session.py:240] Graph was finalized.
INFO 2020-06-05 04:33:41 +0000 worker-replica-2 Graph was finalized.
ERROR 2020-06-05 04:33:41 +0000 worker-replica-2 I0605 04:33:41.927765 140284058486528 monitored_session.py:240] Graph was finalized.
INFO 2020-06-05 04:33:41 +0000 master-replica-0 Graph was finalized.
ERROR 2020-06-05 04:33:41 +0000 master-replica-0 I0605 04:33:41.995326 139686838036224 monitored_session.py:240] Graph was finalized.
INFO 2020-06-05 04:33:42 +0000 master-replica-0 Restoring parameters from gs://lasertagger_v1/output/models/wikisplit_experiment_name_2/model.ckpt-0
ERROR 2020-06-05 04:33:42 +0000 master-replica-0 I0605 04:33:42.216852 139686838036224 saver.py:1284] Restoring parameters from gs://lasertagger_v1/output/models/wikisplit_experiment_name_2/model.ckpt-0
INFO 2020-06-05 04:33:43 +0000 worker-replica-3 Done calling model_fn.
ERROR 2020-06-05 04:33:43 +0000 worker-replica-3 I0605 04:33:43.411592 140653000845056 estimator.py:1150] Done calling model_fn.
INFO 2020-06-05 04:33:43 +0000 worker-replica-3 Create CheckpointSaverHook.
ERROR 2020-06-05 04:33:43 +0000 worker-replica-3 I0605 04:33:43.413079 140653000845056 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO 2020-06-05 04:33:44 +0000 worker-replica-1 Done calling model_fn.
ERROR 2020-06-05 04:33:44 +0000 worker-replica-1 I0605 04:33:44.139685 140410730743552 estimator.py:1150] Done calling model_fn.
INFO 2020-06-05 04:33:44 +0000 worker-replica-1 Create CheckpointSaverHook.
ERROR 2020-06-05 04:33:44 +0000 worker-replica-1 I0605 04:33:44.141169 140410730743552 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO 2020-06-05 04:33:47 +0000 worker-replica-1 Graph was finalized.
ERROR 2020-06-05 04:33:47 +0000 worker-replica-1 I0605 04:33:47.280014 140410730743552 monitored_session.py:240] Graph was finalized.
INFO 2020-06-05 04:33:47 +0000 worker-replica-3 Graph was finalized.
ERROR 2020-06-05 04:33:47 +0000 worker-replica-3 I0605 04:33:47.335122 140653000845056 monitored_session.py:240] Graph was finalized.
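Every INFO message is followed by an ERROR message, and I am confused about what is going on with this training job. Thanks!
Here is one of the error entries in more detail: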
2020-06-05 13:12:50.583 EDT
worker-replica-4
I0605 17:12:50.583258 140104498276096 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
{
insertId: "o5flw8f1urq2q"
jsonPayload: {
created: 1591377170.5835383
levelname: "ERROR"
lineno: 328
message: "I0605 17:12:50.583258 140104498276096 basic_session_run_hooks.py:541] Create CheckpointSaverHook."
pathname: "/runcloudml.py"
}
labels: {
compute.googleapis.com/resource_id: "2069730006064940177"
compute.googleapis.com/resource_name: "gke-cml-0605-170056-7fb-n1-highmem-96-9990517e-rvlx"
compute.googleapis.com/zone: "us-east1-c"
ml.googleapis.com/job_id/log_area: "root"
ml.googleapis.com/trial_id: ""
}
logName: "projects/smart-content-summary/logs/worker-replica-4"
receiveTimestamp: "2020-06-05T17:13:00.962017815Z"
resource: {
labels: {…}
type: "ml_job"
}
severity: "ERROR"
timestamp: "2020-06-05T17:12:50.583538292Z"
}
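One thing worth checking in entries like the one above: the message text itself starts with `I0605`, and in the glog/absl log-prefix convention that first letter encodes the severity (`I` = INFO, `W` = WARNING, `E` = ERROR, `F` = FATAL). If the messages follow that convention, the `ERROR` severity on the entry may come from how the line was collected rather than from TensorFlow itself; that is an assumption, not something the log confirms. A minimal sketch for pulling the embedded severity out of a message (the helper name `embedded_severity` is hypothetical):

```python
import re

# glog/absl-style prefix: severity letter, MMDD, time, thread id, file:line]
_PREFIX = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<src>[^\s:]+):(?P<line>\d+)\] (?P<msg>.*)$"
)

_SEVERITY = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def embedded_severity(message):
    """Return (severity, source_file, text) parsed from the prefix, or None.

    None means the message does not start with a glog/absl-style prefix.
    """
    match = _PREFIX.match(message)
    if match is None:
        return None
    return _SEVERITY[match.group("sev")], match.group("src"), match.group("msg")


if __name__ == "__main__":
    sample = (
        "I0605 17:12:50.583258 140104498276096 "
        "basic_session_run_hooks.py:541] Create CheckpointSaverHook."
    )
    print(embedded_severity(sample))
```

Running it over the `message` field of each entry separates lines whose embedded severity is INFO from any that really carry `E` or `F`.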
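I strongly suspect the problem occurs while the model is being saved. It could be caused by:
- running out of memory
- running out of disk space.
Could you share some monitoring metrics for these, or consider:
- increasing the machine memory
- increasing the root partition size?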
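If the monitoring charts are not easy to get at, one way to collect rough numbers for the two suspects above is to log them from inside the training code. The sketch below uses only the Python standard library; it assumes the workers run Linux (it reads `/proc/meminfo`), and the function names are hypothetical, not part of any Google Cloud or TensorFlow API.

```python
import shutil

_GIB = 1024 ** 3


def disk_usage_gib(path="/"):
    """Total, used and free space in GiB for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return {
        "total": usage.total / _GIB,
        "used": usage.used / _GIB,
        "free": usage.free / _GIB,
    }


def mem_available_gib(meminfo_path="/proc/meminfo"):
    """Available memory in GiB on Linux, or None if it cannot be read."""
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    kib = int(line.split()[1])  # value is reported in kB
                    return kib / (1024 ** 2)
    except OSError:
        return None
    return None


def resource_report(path="/"):
    """One-line summary suitable for writing to the job log."""
    disk = disk_usage_gib(path)
    mem = mem_available_gib()
    mem_text = "unknown" if mem is None else f"{mem:.1f} GiB"
    return (
        f"disk free on {path}: {disk['free']:.1f} of {disk['total']:.1f} GiB; "
        f"memory available: {mem_text}"
    )


if __name__ == "__main__":
    print(resource_report())
```

Logging `resource_report()` just before and after a checkpoint is written would show whether free memory or free disk drops sharply at that point.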