Why is my Mesos framework not being offered resources?
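I am using Mesos 1.0.1. I added an agent with a new role, docker_gpu_worker, and registered a framework with this role. The framework does not receive any offers. Other frameworks (same Java code) that use other roles work fine. I have not restarted the three Mesos masters. Does anyone have an idea what might be wrong?
At master/frameworks, I see my framework: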
{
  "id": "fd01b1b0-eb73-4d40-8774-009171ae1db1-0701",
  "name": "/data4/Users/mikeb/jobs/999",
  "pid": "scheduler-77345362-b85c-4044-8db5-0106b9015119@x.x.x.x:57617",
  "used_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "offered_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "capabilities": [],
  "hostname": "x-x-x-x.ec2.internal",
  "webui_url": "",
  "active": true,
  "user": "mikeb",
  "failover_timeout": 10080,
  "checkpoint": true,
  "role": "docker_gpu_worker",
  "registered_time": 1507028279.18887,
  "unregistered_time": 0,
  "principal": "test-framework-java",
  "resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "tasks": [],
  "completed_tasks": [],
  "offers": [],
  "executors": []
}
At master/roles, I see my role:
{
  "frameworks": [
    "fd01b1b0-eb73-4d40-8774-009171ae1db1-0701",
    "fd01b1b0-eb73-4d40-8774-009171ae1db1-0673",
    "fd01b1b0-eb73-4d40-8774-009171ae1db1-0335"
  ],
  "name": "docker_gpu_worker",
  "resources": {
    "cpus": 0,
    "disk": 0,
    "gpus": 0,
    "mem": 0
  },
  "weight": 1
}
At master/slaves, I see my agent:
{
  "id": "fd01b1b0-eb73-4d40-8774-009171ae1db1-S5454",
  "pid": "slave(1)@x.x.x.x:5051",
  "hostname": "x-x-x-x.ec2.internal",
  "registered_time": 1506692213.24938,
  "resources": {
    "disk": 35056,
    "mem": 59363,
    "gpus": 4,
    "cpus": 32,
    "ports": "[31000-32000]"
  },
  "used_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "offered_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "reserved_resources": {
    "docker_gpu_worker": {
      "disk": 35056,
      "mem": 59363,
      "gpus": 4,
      "cpus": 32,
      "ports": "[31000-32000]"
    }
  },
  "unreserved_resources": {
    "disk": 0,
    "mem": 0,
    "gpus": 0,
    "cpus": 0
  },
  "attributes": {},
  "active": true,
  "version": "1.0.1",
  "reserved_resources_full": {
    "docker_gpu_worker": [
      {
        "name": "gpus",
        "type": "SCALAR",
        "scalar": {
          "value": 4
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "cpus",
        "type": "SCALAR",
        "scalar": {
          "value": 32
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "mem",
        "type": "SCALAR",
        "scalar": {
          "value": 59363
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "disk",
        "type": "SCALAR",
        "scalar": {
          "value": 35056
        },
        "role": "docker_gpu_worker"
      },
      {
        "name": "ports",
        "type": "RANGES",
        "ranges": {
          "range": [
            {
              "begin": 31000,
              "end": 32000
            }
          ]
        },
        "role": "docker_gpu_worker"
      }
    ]
  },
  "used_resources_full": [],
  "offered_resources_full": []
}
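We have traced the problem to this Mesos agent configuration:
--isolation="filesystem/linux,cgroups/devices,gpu/nvidia"
With it removed, the agent works normally but cannot access the GPU resources. The setting is required by the docs for Nvidia GPU support, and those docs appear to indicate that version 1.0.1 supports it. We are continuing to investigate.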
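You can register roles statically with the master. If you add an agent role at runtime, the master does not know about it, and the Mesos master needs to be restarted before it can see the role. Try restarting the Mesos master.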
The GPU_RESOURCES capability must be enabled for the framework.
As shown at http://mesos.readthedocs.io/en/latest/gpu-support/, this can be done, for example, by specifying --framework_capabilities="GPU_RESOURCES" in the mesos-execute command, or in C++ with code like the following:
FrameworkInfo framework;
framework.add_capabilities()->set_type(
    FrameworkInfo::Capability::GPU_RESOURCES);
For the Marathon framework, the Marathon service must be started with the --enable_features "gpu_resources" option, as shown in Enable GPU resources (CUDA) on DC/OS.
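To make the opt-in rule concrete, here is a minimal C++ sketch of the filtering as I understand it from the GPU support documentation: an agent that advertises GPUs is only offered to frameworks that declare GPU_RESOURCES. The Framework and Agent structs and the mayReceiveOffers function are hypothetical stand-ins for illustration, not Mesos types, and the real allocator takes much more into account (roles, reservations, filters, fairness).

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified view of a registered framework:
// only the capability names it declared at registration.
struct Framework {
  std::set<std::string> capabilities;
};

// Hypothetical, simplified view of an agent: only its GPU count.
struct Agent {
  double gpus;
};

// Simplified model (an assumption, not Mesos source code) of the GPU opt-in
// rule: an agent with GPUs is only offered to GPU-aware frameworks.
bool mayReceiveOffers(const Framework& framework, const Agent& agent) {
  assert(agent.gpus >= 0.0);  // resource amounts are never negative

  if (agent.gpus > 0.0) {
    return framework.capabilities.count("GPU_RESOURCES") > 0;
  }

  // Agents without GPUs are not affected by this rule.
  return true;
}
```

With the values from the question ("capabilities": [] on the framework and "gpus": 4 on the agent), mayReceiveOffers returns false, which is consistent with the empty "offers" list. Adding GPU_RESOURCES to the framework's capabilities makes it return true.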