Check failed: DeviceNameUtils::ParseFullName(new_base, &parsed_name)
I am trying to run a training job for an object detection model on Google Cloud. It fails after each ps replica logs the following:
Check failed: DeviceNameUtils::ParseFullName(new_base, &parsed_name)
{
insertId: "1am4lt7g2ytgyip"
jsonPayload: {
created: 1532870862.316736
levelname: "CRITICAL"
lineno: 27
message: "Check failed: DeviceNameUtils::ParseFullName(new_base, &parsed_name) "
pathname: "tensorflow/core/common_runtime/renamed_device.cc"
}
labels: {
compute.googleapis.com/resource_id: "8188383009228980271"
compute.googleapis.com/resource_name: "cmle-training-ps-1d73aafb3a-0-7bjnw"
compute.googleapis.com/zone: "us-central1-a"
ml.googleapis.com/job_id: "object_detection_07_29_2018_14_17_36"
ml.googleapis.com/job_id/log_area: "root"
ml.googleapis.com/task_name: "ps-replica-0"
ml.googleapis.com/trial_id: ""
}
logName: "projects/object-detection-210310/logs/ps-replica-0"
receiveTimestamp: "2018-07-29T13:27:48.515404065Z"
resource: {
labels: {
job_id: "object_detection_07_29_2018_14_17_36"
project_id: "object-detection-210310"
task_name: "ps-replica-0"
}
type: "ml_job"
}
severity: "CRITICAL"
timestamp: "2018-07-29T13:27:42.316735982Z"
}
It is followed by this:
ps-replica-1
Command '['python', '-m', u'object_detection.model_main', u'--model_dir=gs://aka_b1/train/', u'--pipeline_config_path=gs://aka_b1/data/ssd_mobilenet_v1_coco.config', '--job-dir', u'gs://aka_b1/train/']' returned non-zero exit status -6
{
insertId: "1d4klnfg3ihl2be"
jsonPayload: {
created: 1532870863.971174
levelname: "ERROR"
lineno: 879
message: "Command '['python', '-m', u'object_detection.model_main', u'--model_dir=gs://aka_b1/train/', u'--pipeline_config_path=gs://aka_b1/data/ssd_mobilenet_v1_coco.config', '--job-dir', u'gs://aka_b1/train/']' returned non-zero exit status -6"
pathname: "/runcloudml.py"
}
labels: {
compute.googleapis.com/resource_id: "7345648913232166992"
compute.googleapis.com/resource_name: "cmle-training-ps-1d73aafb3a-1-tjx4f"
compute.googleapis.com/zone: "us-central1-a"
ml.googleapis.com/job_id: "object_detection_07_29_2018_14_17_36"
ml.googleapis.com/job_id/log_area: "root"
ml.googleapis.com/task_name: "ps-replica-1"
ml.googleapis.com/trial_id: ""
}
logName: "projects/object-detection-210310/logs/ps-replica-1"
receiveTimestamp: "2018-07-29T13:27:47.591698250Z"
resource: {
labels: {
job_id: "object_detection_07_29_2018_14_17_36"
project_id: "object-detection-210310"
task_name: "ps-replica-1"
}
type: "ml_job"
}
severity: "ERROR"
timestamp: "2018-07-29T13:27:43.971174001Z"
}
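I tried again after replacing the tfrecords, the config file, and the ckpt files with the ones from a training job that had run successfully before, but the problem persists. The only difference is the bucket name, which I changed in the config file and in the training job submission command.
Please help.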
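I think I found the problem. I had added TensorFlow to REQUIRED_PACKAGES in setup.py while trying to work around a different issue I ran into earlier. After removing it, the error no longer appears. Thanks a lot, everyone.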
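For anyone hitting the same thing, here is a rough sketch of what the fixed setup.py could look like, plus a small check that catches a TensorFlow entry before the job is submitted. The package list, the package name, and the helper `has_tensorflow_pin` are my own illustrative assumptions, not the exact contents of the original file. My guess is that the pinned TensorFlow clashed with the version the Cloud ML Engine runtime already provides, but I have not confirmed that.

```python
# setup.py (sketch). Package names and versions are illustrative assumptions.

# Extra dependencies for the training workers. TensorFlow is deliberately
# left out: the Cloud ML Engine runtime already ships with it.
REQUIRED_PACKAGES = ["Pillow>=1.0", "Matplotlib>=2.1", "Cython>=0.28.1"]


def has_tensorflow_pin(requirements):
    """Return True if any requirement names tensorflow or tensorflow-gpu.

    Hypothetical helper: it only looks at the distribution name and
    ignores version specifiers, extras, and environment markers.
    """
    for req in requirements:
        name = req.strip().lower()
        # Cut the string at the first specifier, marker, or extras character.
        for sep in "<>=!~ ;[":
            name = name.split(sep)[0]
        if name in ("tensorflow", "tensorflow-gpu"):
            return True
    return False


def main():
    # Imported here so the check above can be used without setuptools.
    from setuptools import find_packages, setup

    if has_tensorflow_pin(REQUIRED_PACKAGES):
        raise SystemExit("Remove tensorflow from REQUIRED_PACKAGES first.")

    setup(
        name="object_detection",  # assumed package name
        version="0.1",
        install_requires=REQUIRED_PACKAGES,
        include_package_data=True,
        packages=find_packages(),
    )
```

In a real setup.py you would call `main()` at the bottom of the file. The check is optional; the actual fix is simply not listing TensorFlow in `install_requires`.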