Where to find more explicit errors given container error status codes?

I am running tasks through a Mesos stack that uses Docker containers.

Sometimes, some of the tasks fail.

Here are some of the relevant TaskStatus messages and reasons:

message: Container exited with status 1 - reason: REASON_COMMAND_EXECUTOR_FAILED
message: Container exited with status 42 - reason: REASON_COMMAND_EXECUTOR_FAILED
message: Container exited with status 137 - reason: REASON_COMMAND_EXECUTOR_FAILED

Is there a table that maps the container exit status codes from the TaskStatus messages to more explicit errors?

You want to review enum Reason in mesos.proto (copied below):

  enum Reason {
    // TODO(jieyu): The default value when a caller doesn't check for
    // presence is 0 and so ideally the 0 reason is not a valid one.
    // Since this is not used anywhere, consider removing this reason.
    REASON_COMMAND_EXECUTOR_FAILED = 0;

    REASON_CONTAINER_LAUNCH_FAILED = 21;
    REASON_CONTAINER_LIMITATION = 19;
    REASON_CONTAINER_LIMITATION_DISK = 20;
    REASON_CONTAINER_LIMITATION_MEMORY = 8;
    REASON_CONTAINER_PREEMPTED = 17;
    REASON_CONTAINER_UPDATE_FAILED = 22;
    REASON_EXECUTOR_REGISTRATION_TIMEOUT = 23;
    REASON_EXECUTOR_REREGISTRATION_TIMEOUT = 24;
    REASON_EXECUTOR_TERMINATED = 1;
    REASON_EXECUTOR_UNREGISTERED = 2;
    REASON_FRAMEWORK_REMOVED = 3;
    REASON_GC_ERROR = 4;
    REASON_INVALID_FRAMEWORKID = 5;
    REASON_INVALID_OFFERS = 6;
    REASON_IO_SWITCHBOARD_EXITED = 27;
    REASON_MASTER_DISCONNECTED = 7;
    REASON_RECONCILIATION = 9;
    REASON_RESOURCES_UNKNOWN = 18;
    REASON_SLAVE_DISCONNECTED = 10;
    REASON_SLAVE_REMOVED = 11;
    REASON_SLAVE_RESTARTED = 12;
    REASON_SLAVE_UNKNOWN = 13;
    REASON_TASK_CHECK_STATUS_UPDATED = 28;
    REASON_TASK_GROUP_INVALID = 25;
    REASON_TASK_GROUP_UNAUTHORIZED = 26;
    REASON_TASK_INVALID = 14;
    REASON_TASK_UNAUTHORIZED = 15;
    REASON_TASK_UNKNOWN = 16;
  }

A command task can fail for many reasons and will set an exit code accordingly. For example, Docker 1.10 sets exit status codes like this (from documentation and this answer):

The exit code from docker run gives information about why the container failed to run or why it exited. When docker run exits with a non-zero code, the exit codes follow the chroot standard, see below:

125 if the error is with Docker daemon itself:

$ docker run --foo busybox; echo $?
# flag provided but not defined: --foo
# See 'docker run --help'.
# 125

126 if the contained command cannot be invoked:

$ docker run busybox /etc; echo $?
# docker: Error response from daemon: Container command '/etc' could not be invoked.
# 126

127 if the contained command cannot be found:

$ docker run busybox foo; echo $?
# docker: Error response from daemon: Container command 'foo' not found or does not exist.
# 127

Exit code of contained command otherwise:

$ docker run busybox /bin/sh -c 'exit 3'; echo $?
# 3
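
The rule quoted above can be written down as a small lookup. This is only a sketch: the function name `docker_run_exit_source` and the wording of its return values are my own, not part of Docker or Mesos. Keep in mind that a contained command can itself exit with 125, 126 or 127, so the code alone cannot fully rule out either source.

```python
# Exit codes that `docker run` reserves for itself, per the documentation quoted above.
DOCKER_RUN_CODES = {
    125: "error with the Docker daemon itself",
    126: "contained command cannot be invoked",
    127: "contained command cannot be found",
}


def docker_run_exit_source(code: int) -> str:
    """Guess whether a `docker run` exit code came from Docker or from the contained command.

    Hypothetical helper for illustration; it only applies the documented convention.
    """
    if code in DOCKER_RUN_CODES:
        return "docker: " + DOCKER_RUN_CODES[code]
    return f"contained command exited with {code}"


if __name__ == "__main__":
    for code in (125, 126, 127, 3):
        print(code, "->", docker_run_exit_source(code))
```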

Another set of exit code conventions can be found here:

| Code  |            Meaning             |         Example         |                                                   Comments                                                   |
|-------|--------------------------------|-------------------------|--------------------------------------------------------------------------------------------------------------|
| 1     | Catchall for general errors    | let "var1 = 1/0"        | Miscellaneous errors, such as "divide by zero" and other impermissible operations                            |
| 2     | Misuse of shell builtins       | empty_function() {}     | Missing keyword or command, or permission problem (and diff return code on a failed binary file comparison). |
| 126   | Command invoked cannot execute | /dev/null               | Permission problem or command is not an executable                                                           |
| 127   | "command not found"            | illegal_command         | Possible problem with $PATH or a typo                                                                        |
| 128   | Invalid argument to exit       | exit 3.14159            | exit takes only integer args in the range 0 - 255 (see first footnote)                                       |
| 128+n | Fatal error signal "n"         | kill -9 $PPID of script | $? returns 137 (128 + 9)                                                                                     |
| 130   | Script terminated by Control-C | Ctl-C                   | Control-C is fatal error signal 2, (130 = 128 + 2, see above)                                                |
| 255*  | Exit status out of range       | exit -1                 | exit takes only integer args in the range 0 - 255                                                            |
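
To apply the table to the codes from the question, something like the following might help. It is a minimal sketch, assuming a POSIX system where Python's `signal` module knows the signal names; `describe_exit_status` and `CONVENTIONAL_CODES` are names I made up for illustration. The meanings are only conventions, so a program is free to use these numbers differently.

```python
import signal

# Conventional meanings taken from the table above (plus 0 for success).
CONVENTIONAL_CODES = {
    0: "success",
    1: "catchall for general errors",
    2: "misuse of shell builtins",
    126: "command invoked cannot execute",
    127: "command not found",
    128: "invalid argument to exit",
    255: "exit status out of range",
}


def describe_exit_status(code: int) -> str:
    """Return the conventional meaning of an exit status (hypothetical helper)."""
    if code in CONVENTIONAL_CODES:
        return CONVENTIONAL_CODES[code]
    if 128 < code < 255:
        # 128+n convention: the process was terminated by signal n.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not correspond to a named signal on this platform.
            return f"terminated by signal {signum}"
        return f"terminated by signal {signum} ({name})"
    return "application-specific exit code"


if __name__ == "__main__":
    for status in (1, 42, 137):
        print(status, "->", describe_exit_status(status))
```

With this reading, 1 is a generic failure, 42 is whatever the application defines it to be, and 137 is 128 + 9, which means the process received signal 9 (SIGKILL).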

Based on your examples:

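If you need more information to interpret a status code, look at the message field of the Mesos TaskStatus update; for example, Mesos puts information about OOM there. The same information can be found in the Mesos logs. To debug why a command returned a non-zero code, inspect the files stored in the executor sandbox, especially stderr/stdout or command-specific logs.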
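
A rough sketch of that debugging step: pull the status number out of the TaskStatus message and print the last lines of the sandbox logs. The message format is the one shown in the question; the helper names and the idea of passing the sandbox directory as a plain path are assumptions of mine, since where the sandbox lives depends on your agent configuration.

```python
import re
from pathlib import Path
from typing import Dict, List, Optional

# Message format as it appears in the TaskStatus updates from the question.
STATUS_PATTERN = re.compile(r"Container exited with status (\d+)")


def exit_status_from_message(message: str) -> Optional[int]:
    """Extract the exit status from a TaskStatus message, or None if absent."""
    match = STATUS_PATTERN.search(message)
    return int(match.group(1)) if match else None


def tail(path: Path, lines: int = 20) -> List[str]:
    """Return the last `lines` lines of a file, or an empty list if it is missing.

    Reads the whole file, which is fine for a sketch but wasteful for huge logs.
    """
    if not path.is_file():
        return []
    return path.read_text(errors="replace").splitlines()[-lines:]


def sandbox_log_tails(sandbox_dir: str, lines: int = 20) -> Dict[str, List[str]]:
    """Collect the tails of stderr and stdout from an executor sandbox directory.

    `sandbox_dir` is supplied by the caller; no particular layout is assumed
    beyond the stderr/stdout files mentioned above.
    """
    root = Path(sandbox_dir)
    return {name: tail(root / name, lines) for name in ("stderr", "stdout")}


if __name__ == "__main__":
    print(exit_status_from_message("Container exited with status 137"))
```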