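Problem with nomad job deployment (raw_exec mode, v1.0.1)
A recent upgrade from Nomad v0.9.6 to Nomad v1.0.1 broke job deployment.
Unfortunately, I cannot get any useful information from the Nomad agent about the "pending or dead" state.
I also checked the trace monitor in the web UI, but without success.
Could you give some advice on how to get the reject/pending reason from the agent?
I use the "raw_exec" driver (unprivileged user, "driver.raw_exec.enable" = "1").
For deployment I use nomad-sdk (version 0.11.3.0).
You can find the job definition (from Nomad's point of view) here: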
OS details:
cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
Linux blade1.lab.bulb.hr 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Nomad agent details:
[root@blade1 ~]# nomad node-status
ID DC Name Class Drain Eligibility Status
5838e8b0 dc1 blade1.lab.bulb.hr <none> false eligible ready
Verbose output:
[root@blade1 ~]# nomad node-status -verbose
ID DC Name Class Address Version Drain Eligibility Status
5838e8b0-ebd3-5c47-a949-df3d601e0da1 dc1 blade1.lab.bulb.hr <none> 192.168.112.31 1.0.1 false eligible ready
[root@blade1 ~]# nomad node-status -verbose 5838e8b0-ebd3-5c47-a949-df3d601e0da1
ID = 5838e8b0-ebd3-5c47-a949-df3d601e0da1
Name = blade1.lab.bulb.hr
Class = <none>
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
CSI Controllers = <none>
CSI Drivers = <none>
Uptime = 1516h1m31s
Drivers
Driver Detected Healthy Message Time
docker false false Failed to connect to docker daemon 2020-12-18T14:37:09+01:00
exec false false Driver must run as root 2020-12-18T14:37:09+01:00
java false false Driver must run as root 2020-12-18T14:37:09+01:00
qemu false false <none> 2020-12-18T14:37:09+01:00
raw_exec true true Healthy 2020-12-18T14:37:09+01:00
Node Events
Time Subsystem Message Details
2020-12-18T14:37:09+01:00 Cluster Node registered <none>
Allocated Resources
CPU Memory Disk
0/18000 MHz 0 B/53 GiB 0 B/70 GiB
Allocation Resource Utilization
CPU Memory
0/18000 MHz 0 B/53 GiB
Host Resource Utilization
CPU Memory Disk
499/20000 MHz 33 GiB/63 GiB (/dev/mapper/vg00-root)
Allocations
No allocations placed
Attributes
consul.datacenter = dacs
consul.revision = 1e03567d3
consul.server = true
consul.version = 1.8.5
cpu.arch = amd64
driver.raw_exec = 1
kernel.name = linux
kernel.version = 3.10.0-693.21.1.el7.x86_64
memory.totalbytes = 67374776320
nomad.advertise.address = 192.168.112.31:5656
nomad.revision = c9c68aa55a7275f22d2338f2df53e67ebfcb9238
nomad.version = 1.0.1
os.name = centos
os.signals = SIGTTIN,SIGUSR2,SIGXCPU,SIGBUS,SIGILL,SIGQUIT,SIGCHLD,SIGIOT,SIGKILL,SIGINT,SIGSTOP,SIGSYS,SIGTTOU,SIGFPE,SIGSEGV,SIGTSTP,SIGURG,SIGWINCH,SIGCONT,SIGIO,SIGTRAP,SIGXFSZ,SIGHUP,SIGPIPE,SIGTERM,SIGPROF,SIGABRT,SIGALRM,SIGUSR1
os.version = 7.4.1708
unique.cgroup.mountpoint = /sys/fs/cgroup/systemd
unique.consul.name = grabber1
unique.hostname = blade1.lab.bulb.hr
unique.network.ip-address = 192.168.112.31
unique.storage.bytesfree = 74604830720
unique.storage.bytestotal = 126698909696
unique.storage.volume = /dev/mapper/vg00-root
Meta
connect.gateway_image = envoyproxy/envoy:v${NOMAD_envoy_version}
connect.log_level = info
connect.proxy_concurrency = 1
connect.sidecar_image = envoyproxy/envoy:v${NOMAD_envoy_version}
Job status details:
[root@blade1 ~]# nomad status
ID Type Priority Status Submit Date
lightningCollector-lightningCollector service 50 pending 2020-12-18T15:06:09+01:00
[root@blade1 ~]# nomad status lightningCollector-lightningCollector
ID = lightningCollector-lightningCollector
Name = lightningCollector-lightningCollector
Submit Date = 2020-12-18T15:06:09+01:00
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Status = pending
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
lightningCollector-lightningCollector-0 0 0 0 0 0 0
Allocations
No allocations placed
Thank you for your effort and time!
Regards,
Ivan
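I tested your job locally and was able to reproduce what you are seeing. I noticed that ParentID is set in the job; Nomad uses that field to track child instances of periodic or dispatch jobs.
After setting ParentID to "", I was able to submit the job, and it was evaluated and scheduled correctly.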
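For illustration, the workaround above could be applied to a job definition before it is submitted. This is only a minimal sketch in Python, not the nomad-sdk code used in the original report: it assumes the job is available as a JSON-style dictionary, either bare or wrapped in a top-level "Job" key as in the HTTP API registration payload. The helper name `clear_parent_id` and the sample ParentID value are hypothetical.

```python
import copy
import json


def clear_parent_id(job_payload):
    """Return a copy of a Nomad job payload with ParentID set to "".

    Accepts either a bare job dictionary or one wrapped in a top-level
    "Job" key. The input dictionary is left unmodified.
    """
    cleaned = copy.deepcopy(job_payload)
    # Use the wrapped job if present, otherwise treat the payload as the job.
    job = cleaned.get("Job", cleaned)
    # Only touch the field when it carries a non-empty value.
    if job.get("ParentID"):
        job["ParentID"] = ""
    return cleaned


if __name__ == "__main__":
    # Hypothetical payload; the ParentID value is made up for this example.
    payload = {
        "Job": {
            "ID": "lightningCollector-lightningCollector",
            "Type": "service",
            "Datacenters": ["dc1"],
            "ParentID": "example-parent",
        }
    }
    print(json.dumps(clear_parent_id(payload), indent=2))
```

The cleaned dictionary can then be serialized and submitted through whichever client is in use. With an SDK, the equivalent step would be to blank the parent ID on the job object before registering it.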
I tested several versions and determined that the behavior changed between 0.12.0 and 0.12.1. I filed hashicorp/nomad #10422 to address this difference in behavior.