PAI tutorial example failed to run with '[ExitCode]: 177'
I am following the PAI job tutorial.
Here is my job config:
{
"jobName": "yuan_tensorflow-distributed-jobguid",
"image": "docker.io/openpai/pai.run.tensorflow",
"dataDir": "hdfs://10.11.3.2:9000/yuan/sample/tensorflow",
"outputDir": "$PAI_DEFAULT_FS_URI/yuan/tensorflow-distributed-jobguid/output",
"codeDir": "$PAI_DEFAULT_FS_URI/path/tensorflow-distributed-jobguid/code",
"virtualCluster": "default",
"taskRoles": [
{
"name": "ps_server",
"taskNumber": 2,
"cpuNumber": 2,
"memoryMB": 8192,
"gpuNumber": 0,
"portList": [
{
"label": "http",
"beginAt": 0,
"portNumber": 1
},
{
"label": "ssh",
"beginAt": 0,
"portNumber": 1
}
],
"command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=ps --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX"
},
{
"name": "worker",
"taskNumber": 2,
"cpuNumber": 2,
"memoryMB": 16384,
"gpuNumber": 4,
"portList": [
{
"label": "http",
"beginAt": 0,
"portNumber": 1
},
{
"label": "ssh",
"beginAt": 0,
"portNumber": 1
}
],
"command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=worker --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX"
}
],
"killAllOnCompletedTaskNumber": 2,
"retryCount": 0
}
The job was submitted successfully but failed quickly, after about 4 minutes.
Below is my 'Application Summary'.
Start Time: 6/15/2018, 8:18:01 PM
Finish Time: 6/15/2018, 8:22:31 PM
Exit Diagnostics:
[ExitStatus]: LAUNCHER_EXIT_STATUS_UNDEFINED [ExitCode]: 177
[ExitDiagnostics]: ExitStatus undefined in Launcher, maybe
UserApplication itself failed. [ExitType]: UNKNOWN
________________________________________________________________________________________________________________________________________________________________________________________________________ [ExitCustomizedDiagnostics]: [ExitCode]: 1 [ExitDiagnostics]:
Exception from container-launch. Container id:
container_1529064439409_0003_01_000005 Exit code: 1 Stack trace:
ExitCodeException exitCode=1: at
org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at
org.apache.hadoop.util.Shell.run(Shell.java:456) at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Shell output: [ERROR] EXIT signal received in yarn container, exiting
...
Container exited with a non-zero exit code 1
________________________________________________________________________________________________________________________________________________________________________________________________________ [ExitCustomizedDiagnostics]:
worker: TASK_COMPLETED: [TaskStatus]: { "taskIndex" : 1,
"taskRoleName" : "worker", "taskState" : "TASK_COMPLETED",
"taskRetryPolicyState" : { "retriedCount" : 0, "succeededRetriedCount"
: 0, "transientNormalRetriedCount" : 0,
"transientConflictRetriedCount" : 0, "nonTransientRetriedCount" : 0,
"unKnownRetriedCount" : 0 }, "taskCreatedTimestamp" : 1529065083290,
"taskCompletedTimestamp" : 1529065346772, "taskServiceStatus" : {
"serviceVersion" : 0 }, "containerId" :
"container_1529064439409_0003_01_000005", "containerHost" :
"10.11.1.9", "containerIp" : "10.11.1.9", "containerPorts" :
"http:2938;ssh:2939;", "containerGpus" : 15, "containerLogHttpAddress"
:
"http://10.11.1.9:8042/node/containerlogs/container_1529064439409_0003_01_000005/admin/",
"containerConnectionLostCount" : 0, "containerIsDecommissioning" :
null, "containerLaunchedTimestamp" : 1529065087200,
"containerCompletedTimestamp" : 1529065346768, "containerExitCode" :
1, "containerExitDiagnostics" : "Exception from
container-launch.\nContainer id:
container_1529064439409_0003_01_000005\nExit code: 1\nStack trace:
ExitCodeException exitCode=1: \n\tat
org.apache.hadoop.util.Shell.runCommand(Shell.java:545)\n\tat
org.apache.hadoop.util.Shell.run(Shell.java:456)\n\tat
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)\n\tat
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)\n\tat
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)\n\tat
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)\n\tat
java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
java.lang.Thread.run(Thread.java:748)\n\nShell output: [ERROR] EXIT
signal received in yarn container, exiting ...\n\n\nContainer exited
with a non-zero exit code 1\n", "containerExitType" : "UNKNOWN" }
[ContainerDiagnostics]: Container completed
container_1529064439409_0003_01_000005 on HostName 10.11.1.9.
ContainerLogHttpAddress:
http://10.11.1.9:8042/node/containerlogs/container_1529064439409_0003_01_000005/admin/
AppCacheNetworkPath:
10.11.1.9:/var/lib/hadoopdata/nm-local-dir/usercache/admin/appcache/application_1529064439409_0003
ContainerLogNetworkPath:
10.11.1.9:/var/lib/yarn/userlogs/application_1529064439409_0003/container_1529064439409_0003_01_000005
________________________________________________________________________________________________________________________________________________________________________________________________________ [AMStopReason]:Task worker Completed and KillAllOnAnyCompleted
enabled.
I found more details in the container log:
[INFO] hdfs_ssh_folder is hdfs://10.11.3.2:9000/Container/admin/yuan_tensorflow-distributed-2/ssh/application_1529064439409_0450
[INFO] task_role_no is 0
[INFO] PAI_TASK_INDEX is 1
[INFO] waitting for ssh key ready
[INFO] waitting for ssh key ready
[INFO] ssh key pair ready ...
[INFO] begin to download ssh key pair from hdfs ...
[INFO] start ssh service
 * Restarting OpenBSD Secure Shell server sshd [ OK ]
[INFO] USER COMMAND START
Traceback (most recent call last):
File "code/tf_cnn_benchmarks.py", line 38, in <module>
import benchmark_storage
ImportError: No module named benchmark_storage
[DEBUG] EXIT signal received in docker container, exiting ...
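Conclusion:
The code is incomplete: tf_cnn_benchmarks.py needs some dependencies (the log shows that benchmark_storage cannot be imported).
Below I provide a working config.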
{
"jobName": "tensorflow-cifar10",
"image": "openpai/pai.example.tensorflow",
"dataDir": "/tmp/data",
"outputDir": "/tmp/output",
"taskRoles": [
{
"name": "cifar_train",
"taskNumber": 1,
"cpuNumber": 8,
"memoryMB": 32768,
"gpuNumber": 1,
"command": "git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=$PAI_DATA_DIR && python train_image_classifier.py --batch_size=64 --model_name=inception_v3 --dataset_name=cifar10 --dataset_split_name=train --dataset_dir=$PAI_DATA_DIR --train_dir=$PAI_OUTPUT_DIR"
}
]
}
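In general you need to check the logs of all workers, especially the container that exited first, to see what happened there. Any exited container causes the Launcher to stop the job early, which is why you see the "EXIT signal received in yarn container" message in the application diagnostics.
The logs of a failed job are not deleted. They are moved to HDFS after the job finishes.
From your logs, it seems the code is missing some files. Please download the whole benchmarks folder, not just one or two files (such as the cnn benchmarks script).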
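To make "the container that exited first" concrete, here is a minimal Python sketch. It assumes you have already collected the per-task [TaskStatus] objects from the diagnostics into a list of dicts; the function name first_exited is made up for illustration, and the second entry in the sample list is hypothetical (a task that had not completed yet).

```python
def first_exited(task_statuses):
    """Return the task status with the earliest containerCompletedTimestamp.

    Tasks without a completion timestamp (still running) are ignored.
    Returns None if no task has completed.
    """
    completed = [t for t in task_statuses if t.get("containerCompletedTimestamp")]
    if not completed:
        return None
    return min(completed, key=lambda t: t["containerCompletedTimestamp"])


# Field names follow the [TaskStatus] JSON shown in the diagnostics above.
statuses = [
    {
        "taskRoleName": "worker",
        "taskIndex": 1,
        "containerCompletedTimestamp": 1529065346768,
        "containerExitCode": 1,
        "containerLogHttpAddress": "http://10.11.1.9:8042/node/containerlogs/"
        "container_1529064439409_0003_01_000005/admin/",
    },
    # Hypothetical entry: a task that was still running when the job stopped.
    {
        "taskRoleName": "worker",
        "taskIndex": 0,
        "containerCompletedTimestamp": None,
        "containerExitCode": None,
        "containerLogHttpAddress": None,
    },
]

culprit = first_exited(statuses)
if culprit is not None:
    print(culprit["taskRoleName"], culprit["taskIndex"], culprit["containerExitCode"])
    print("Check this log first:", culprit["containerLogHttpAddress"])
```

The log address it prints is the one to open first; the other containers were most likely just killed by the Launcher afterwards.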
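One way to catch this kind of missing file before submitting is to scan the entry script for imports that resolve neither to a file next to it nor to an installed package. The sketch below is only an illustration: missing_imports is a hypothetical helper, not part of PAI, and it assumes the script parses under the Python version you run the check with and that your local environment roughly matches the Docker image.

```python
import ast
import importlib.util
import os


def missing_imports(script_path):
    """List top-level modules imported by script_path that cannot be found.

    A module counts as found if it sits next to the script (name.py or a
    directory called name) or if it is importable in the current environment.
    """
    code_dir = os.path.dirname(os.path.abspath(script_path))
    with open(script_path) as f:
        tree = ast.parse(f.read(), filename=script_path)

    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])

    missing = []
    for name in sorted(names):
        is_local = os.path.exists(os.path.join(code_dir, name + ".py")) or os.path.isdir(
            os.path.join(code_dir, name)
        )
        if not is_local and importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing
```

For example, calling missing_imports("code/tf_cnn_benchmarks.py") on a code folder that contains only that one file should report benchmark_storage, along with any other sibling modules that were not copied or packages that are not installed locally.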