ceph pg <ID> query hangs on stuck/unclean PGs
Ceph version: 0.94.1
ceph -s
cluster 30266c5f-5e10-4027-936c-e4409667b409
health HEALTH_WARN
65 pgs stale
22 pgs stuck inactive
65 pgs stuck stale
22 pgs stuck unclean
monmap e7: 7 mons at {kvm1=10.136.8.129:6789/0,kvm2=10.136.8.130:6789/0,kvm3=10.136.8.131:6789/0,kvm4=10.136.8.132:6789/0,kvm5=10.136.8.133:6789/0,kvm6=10.136.8.134:6789/0,kvm7=10.136.8.135:6789/0}
election epoch 122, quorum 0,1,2,3,4,5,6 kvm1,kvm2,kvm3,kvm4,kvm5,kvm6,kvm7
osdmap e368: 14 osds: 14 up, 14 in
pgmap v1072573: 1128 pgs, 8 pools, 186 GB data, 51533 objects
630 GB used, 7330 GB / 8319 GB avail
1041 active+clean
65 stale+active+clean
22 creating
client io 361 kB/s rd, 528 kB/s wr, 48 op/s
ceph osd stat
osdmap e368: 14 osds: 14 up, 14 in
As you can see, I have stale/inactive/unclean PGs. I tried running
ceph pg 0.21 query
and it hangs (0.21 is one of the stale PGs). strace shows:
[pid 4850] futex(0x7f8cd8003984, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f8cd8003980,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
[pid 4855] <... sendmsg resumed> ) = 9
[pid 4850] <... futex resumed> ) = 1
[pid 4855] futex(0x7f8cd8026cd4, FUTEX_WAIT_PRIVATE, 19, NULL <unfinished ...>
[pid 4841] <... futex resumed> ) = 0
[pid 4850] futex(0x7f8cd801e2ac, FUTEX_WAIT_PRIVATE, 11, NULL <unfinished ...>
[pid 4841] futex(0x7f8cd8003900, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 4841] futex(0x7f8cd8003984, FUTEX_WAIT_PRIVATE, 39, NULL <unfinished ...>
[pid 4833] <... select resumed> ) = 0 (Timeout)
[pid 4833] select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
[pid 4833] select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
[pid 4833] select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
[pid 4833] select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
[pid 4833] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 4833] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 4833] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
It never returns any information. Other PGs do return proper JSON data.
I tried restarting osd0 but didn't see any errors.
Does anyone have any ideas?
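While debugging, it can help to put an upper bound on the query so that a script checking several PGs does not block on the first one that hangs. The following is only a sketch: pg_query_bounded, CEPH_BIN and PG_QUERY_TIMEOUT are hypothetical names chosen for illustration (they are not Ceph settings), and it assumes timeout(1) from GNU coreutils is installed.

```shell
#!/bin/sh
# Hypothetical helper: run "ceph pg <pgid> query" with a time limit.
# CEPH_BIN and PG_QUERY_TIMEOUT are illustrative knobs, not Ceph options.
pg_query_bounded() {
    pgid=$1
    # timeout(1) stops the command once the limit is reached and then
    # exits with status 124, which lets the caller tell a hang from an error.
    timeout "${PG_QUERY_TIMEOUT:-10}" "${CEPH_BIN:-ceph}" pg "$pgid" query
}
```

For example, `pg_query_bounded 0.21 || echo "query for 0.21 failed or timed out"` would print the message after ten seconds instead of hanging forever.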
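Most likely your network configuration does not allow some OSDs to communicate with each other. The problem with pg 0.21 dump is probably the same issue. Unlike most ceph commands, which communicate with the MON, pg 0.21 dump will try to communicate directly with the OSD that hosts the PG. Since ceph osd stat reports that all OSDs are up and in, communication between the MONs and the OSDs is fine.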
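To check that theory, one option is to ask the monitors which OSDs the PG maps to and then test connectivity to those hosts. As far as I know, ceph pg map is answered by the MON, so it should not hang the way the query does. The sketch below assumes the output looks like "osdmap e368 pg 0.21 (0.21) -> up [0,3] acting [0,3]" (the OSD ids here are made up, and the format may differ between releases); acting_osds is a hypothetical helper name.

```shell
#!/bin/sh
# Print the acting OSD ids, one per line, from a "ceph pg map" line on stdin.
# Assumed input form: osdmap eN pg X.Y (X.Y) -> up [a,b] acting [c,d]
acting_osds() {
    # Keep only the comma-separated list inside "acting [...]",
    # then put each id on its own line.
    sed -n 's/.*acting \[\([0-9,]*\)\].*/\1/p' | tr ',' '\n'
}
```

Usage would be something like `ceph pg map 0.21 | acting_osds`; each id can then be passed to `ceph osd find <id>` to look up the host to test from the client.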
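I found the problem! It was pools that had no OSDs left after the OSDs were removed via the CRUSH rules. I'm not quite sure why the PGs were created when the rule only allowed the removed OSDs, but that's not material.
After deleting all the empty pools, I'm good to go now.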
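For those who want the procedure for tracking it down:
First:
ceph health detail
to find out which PGs have the problem, then:
ceph pg ls-by-pool <pool name>
to match the PGs to their pools. Then delete the pool: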
ceph osd pool delete <pool name> <pool name> --yes-i-really-really-mean-it
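The matching step can be shortened a bit, because the part of a PG id before the dot is the pool id (pg 0.21 belongs to pool 0). Here is a rough sketch that tallies stuck PGs per pool id from the health output. It assumes lines of the form "pg <pool>.<hash> is stuck ...", which may vary between releases, and stuck_pgs_per_pool is a hypothetical helper name.

```shell
#!/bin/sh
# Count stuck PGs per pool id from "ceph health detail"-style text on stdin.
# Assumed line form: "pg <pool>.<hash> is stuck <state> ..."
stuck_pgs_per_pool() {
    awk '$1 == "pg" && $3 == "is" && $4 == "stuck" && !seen[$2]++ {
             # A PG can be listed under several stuck states; count it once.
             split($2, parts, ".")
             count[parts[1]]++
         }
         END { for (p in count) print p, count[p] }' | sort -n
}
```

Run it as `ceph health detail | stuck_pgs_per_pool`, then compare the pool ids with the output of `ceph osd lspools` to get the pool names. Double-check that a pool really is empty before deleting it, since the delete cannot be undone.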