ceph pg repair doesn't start right away
Every now and then my cluster reports a PG inconsistency error. As the documentation suggests, I run ceph pg repair pg.id, and the command responds with "instructing pg x on osd y to repair", which seems to work as expected. However, the repair doesn't start right away. What could be causing this? I run scrubbing around the clock, so at any given time at least 8-10 PGs are being scrubbed or deep-scrubbed. Do PG operations such as scrubbing and repair form a queue, and is my repair command simply waiting for its turn? Or is something else going on here?
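In case it is relevant, this is roughly how I check what is currently scrubbing and what the per-OSD scrub limit is set to (ceph config get assumes a fairly recent release; on older clusters the same value can be read per daemon with ceph daemon osd.<id> config get osd_max_scrubs):
ceph pg dump pgs_brief 2>/dev/null | grep -E 'scrubbing|repair'
ceph config get osd osd_max_scrubs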
Edit:
Output of ceph health detail:
pg 57.ee is active+clean+inconsistent, acting [16,46,74,59,5]
Output of rados list-inconsistent-obj 57.ee --format=json-pretty:
{
    "epoch": 55281,
    "inconsistents": [
        {
            "object": {
                "name": "10001a447c7.00005b03",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 150876
            },
            "errors": [],
            "union_shard_errors": [
                "read_error"
            ],
            "selected_object_info": {
                "oid": {
                    "oid": "10001a447c7.00005b03",
                    "key": "",
                    "snapid": -2,
                    "hash": 3954101486,
                    "max": 0,
                    "pool": 57,
                    "namespace": ""
                },
                "version": "55268'150876",
                "prior_version": "0'0",
                "last_reqid": "client.42086585.0:355736",
                "user_version": 150876,
                "size": 4194304,
                "mtime": "2021-03-15 21:52:43.651368",
                "local_mtime": "2021-03-15 21:52:45.399035",
                "lost": 0,
                "flags": [
                    "dirty",
                    "data_digest"
                ],
                "truncate_seq": 0,
                "truncate_size": 0,
                "data_digest": "0xf88f1537",
                "omap_digest": "0xffffffff",
                "expected_object_size": 0,
                "expected_write_size": 0,
                "alloc_hint_flags": 0,
                "manifest": {
                    "type": 0
                },
                "watchers": {}
            },
            "shards": [
                {
                    "osd": 5,
                    "primary": false,
                    "shard": 4,
                    "errors": [],
                    "size": 1400832,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x00000000"
                },
                {
                    "osd": 16,
                    "primary": true,
                    "shard": 0,
                    "errors": [],
                    "size": 1400832,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x00000000"
                },
                {
                    "osd": 46,
                    "primary": false,
                    "shard": 1,
                    "errors": [],
                    "size": 1400832,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x00000000"
                },
                {
                    "osd": 59,
                    "primary": false,
                    "shard": 3,
                    "errors": [
                        "read_error"
                    ],
                    "size": 1400832
                },
                {
                    "osd": 74,
                    "primary": false,
                    "shard": 2,
                    "errors": [],
                    "size": 1400832,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x00000000"
                }
            ]
        }
    ]
}
This PG belongs to an EC pool. When I run ceph pg repair 57.ee I get the output:
instructing pg 57.ees0 on osd.16 to repair
However, as you can see from the PG report, the inconsistent shard is on osd.59. I assumed the "s0" at the end of the output refers to the first shard, so I also tried a repair command like this:
ceph pg repair 57.ees3, but I got an error telling me this is an invalid command.
You have I/O errors, which often occur because of failing disks, as you can see from the shard errors:
"errors": [],
"union_shard_errors": [
    "read_error"
]
The problematic shard is on osd.59.
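Since I/O errors usually point at the drive itself, it is worth checking the disk behind osd.59 as well; a rough sketch (the device path is a placeholder for whatever that OSD actually uses):
ceph osd find 59
ceph osd metadata 59
Then, on the host that owns osd.59:
dmesg -T | grep -i 'i/o error'
smartctl -a /dev/sdX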
Try to force a read of the problematic object again:
# rados -p EC_pool get 10001a447c7.00005b03 /dev/null
(rados get expects an output file; writing to /dev/null simply forces the read without keeping the data.)
Scrubbing causes the object to be read, and the read returns an error, which means the object is marked as gone. When that happens, Ceph will try to recover the object from somewhere else (peering, recovery, backfill).
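Once the object has been re-read (or the failing disk replaced), you can verify that the PG is clean again; a quick sketch using this PG id:
ceph pg deep-scrub 57.ee
rados list-inconsistent-obj 57.ee --format=json-pretty
ceph health detail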