DRBD sync fails with ProtocolError
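I currently have a pair of DRBD servers that have decided to stop syncing, and nothing I do seems to get them syncing again. The sync runs over a dedicated crossover cable (1 Gbps copper) between the two servers.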
Here is what I see in the logs on r02:
Aug 9 16:09:44 r02 kernel: [12739.178449] block drbd0: receiver (re)started
Aug 9 16:09:44 r02 kernel: [12739.178454] block drbd0: conn( Unconnected -> WFConnection )
Aug 9 16:09:44 r02 kernel: [12739.912037] block drbd0: Handshake successful: Agreed network protocol version 91
Aug 9 16:09:44 r02 kernel: [12739.912048] block drbd0: conn( WFConnection -> WFReportParams )
Aug 9 16:09:44 r02 kernel: [12739.912074] block drbd0: Starting asender thread (from drbd0_receiver [3740])
Aug 9 16:09:44 r02 kernel: [12739.936681] block drbd0: data-integrity-alg: <not-used>
Aug 9 16:09:44 r02 kernel: [12739.936691] block drbd0: Considerable difference in lower level device sizes: 256503768s vs. 1344982880s
Aug 9 16:09:44 r02 kernel: [12739.942918] block drbd0: drbd_sync_handshake:
Aug 9 16:09:44 r02 kernel: [12739.942923] block drbd0: self E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug 9 16:09:44 r02 kernel: [12739.942928] block drbd0: peer E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug 9 16:09:44 r02 kernel: [12739.942933] block drbd0: uuid_compare()=-1 by rule 50
Aug 9 16:09:44 r02 kernel: [12739.942935] block drbd0: Becoming sync target due to disk states.
Aug 9 16:09:44 r02 kernel: [12739.942946] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 9 16:09:44 r02 kernel: [12740.099597] block drbd0: conn( WFBitMapT -> WFSyncUUID )
Aug 9 16:09:44 r02 kernel: [12740.104324] block drbd0: updated sync uuid BF8D25FBE26085B0:0000000000000000:0000000000000000:0000000000000000
Aug 9 16:09:44 r02 kernel: [12740.104423] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug 9 16:09:44 r02 kernel: [12740.106582] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug 9 16:09:44 r02 kernel: [12740.106591] block drbd0: conn( WFSyncUUID -> SyncTarget )
Aug 9 16:09:44 r02 kernel: [12740.106599] block drbd0: Began resync as SyncTarget (will sync 128250804 KB [32062701 bits set]).
Aug 9 16:09:44 r02 kernel: [12740.140796] block drbd0: meta connection shut down by peer.
Aug 9 16:09:44 r02 kernel: [12740.141304] block drbd0: sock was shut down by peer
Aug 9 16:09:44 r02 kernel: [12740.141309] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug 9 16:09:44 r02 kernel: [12740.141316] block drbd0: short read expecting header on sock: r=0
Aug 9 16:09:44 r02 kernel: [12740.142235] block drbd0: asender terminated
Aug 9 16:09:44 r02 kernel: [12740.142238] block drbd0: Terminating drbd0_asender
Aug 9 16:09:44 r02 kernel: [12740.151561] block drbd0: bitmap WRITE of 979 pages took 2 jiffies
Aug 9 16:09:44 r02 kernel: [12740.151567] block drbd0: 122 GB (32062701 bits) marked out-of-sync by on disk bit-map.
Aug 9 16:09:44 r02 kernel: [12740.151580] block drbd0: Connection closed
Aug 9 16:09:44 r02 kernel: [12740.151586] block drbd0: conn( BrokenPipe -> Unconnected )
Aug 9 16:09:44 r02 kernel: [12740.151592] block drbd0: receiver terminated
And on r01:
Aug 9 16:09:44 r01 kernel: [3438273.766768] block drbd0: receiver (re)started
Aug 9 16:09:44 r01 kernel: [3438273.771898] block drbd0: conn( Unconnected -> WFConnection )
Aug 9 16:09:44 r01 kernel: [3438274.474411] block drbd0: Handshake successful: Agreed network protocol version 91
Aug 9 16:09:44 r01 kernel: [3438274.483299] block drbd0: conn( WFConnection -> WFReportParams )
Aug 9 16:09:44 r01 kernel: [3438274.490420] block drbd0: Starting asender thread (from drbd0_receiver [6366])
Aug 9 16:09:44 r01 kernel: [3438274.498900] block drbd0: data-integrity-alg: <not-used>
Aug 9 16:09:44 r01 kernel: [3438274.505166] block drbd0: Considerable difference in lower level device sizes: 1344982880s vs. 256503768s
Aug 9 16:09:44 r01 kernel: [3438274.516226] block drbd0: max_segment_size ( = BIO size ) = 65536
Aug 9 16:09:44 r01 kernel: [3438274.523385] block drbd0: drbd_sync_handshake:
Aug 9 16:09:44 r01 kernel: [3438274.528677] block drbd0: self E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug 9 16:09:44 r01 kernel: [3438274.541195] block drbd0: peer E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug 9 16:09:44 r01 kernel: [3438274.553710] block drbd0: uuid_compare()=1 by rule 70
Aug 9 16:09:44 r01 kernel: [3438274.559677] block drbd0: Becoming sync source due to disk states.
Aug 9 16:09:44 r01 kernel: [3438274.566897] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Aug 9 16:09:44 r01 kernel: [3438274.666397] block drbd0: conn( WFBitMapS -> SyncSource )
Aug 9 16:09:44 r01 kernel: [3438274.672845] block drbd0: Began resync as SyncSource (will sync 128250804 KB [32062701 bits set]).
Aug 9 16:09:44 r01 kernel: [3438274.683196] block drbd0: /build/buildd-linux-2.6_2.6.32-48squeeze3-amd64-mcoLgp/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/block/drbd/drbd_receiver.c:1932: sector: 0s, size: 65536
Aug 9 16:09:45 r01 kernel: [3438274.702834] block drbd0: error receiving RSDataRequest, l: 24!
Aug 9 16:09:45 r01 kernel: [3438274.702837] block drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError )
Aug 9 16:09:45 r01 kernel: [3438274.703005] block drbd0: asender terminated
Aug 9 16:09:45 r01 kernel: [3438274.703009] block drbd0: Terminating drbd0_asender
Aug 9 16:09:45 r01 kernel: [3438274.711319] block drbd0: Connection closed
Aug 9 16:09:45 r01 kernel: [3438274.711323] block drbd0: conn( ProtocolError -> Unconnected )
Aug 9 16:09:45 r01 kernel: [3438274.711329] block drbd0: receiver terminated
This just repeats over and over.
The configuration should be identical on both servers:
r01:~$ rsync --dry-run --verbose --checksum --itemize-changes 10.0.255.254:/etc/drbd.conf /etc/
sent 11 bytes received 51 bytes 124.00 bytes/sec
total size is 615 speedup is 9.92 (DRY RUN)
Here is what the configuration looks like:
r01:~$ cat /etc/drbd.conf
global {
    usage-count no;
}
resource drbd0 {
    protocol C;
    handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; exit 1"; }
    startup {
        degr-wfc-timeout 60; # 1 minute.
        wfc-timeout 55;
    }
    disk {
        on-io-error detach;
    }
    syncer {
        rate 100M;
        al-extents 257;
    }
    on r01.c07.mtsvc.net {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 10.0.255.253:7788;
        meta-disk internal;
    }
    on r02.c07.mtsvc.net {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p6;
        address 10.0.255.254:7788;
        meta-disk internal;
    }
}
Here is the network configuration on both sides:
r01:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255
eth2 Link encap:Ethernet HWaddr 00:26:55:d6:f8:fc
inet addr:10.0.255.253 Bcast:10.0.255.255 Mask:255.255.255.0
inet6 addr: fe80::226:55ff:fed6:f8fc/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:4062510240 errors:0 dropped:0 overruns:0 frame:0
TX packets:5692251259 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5512604514975 (5.0 TiB) TX bytes:5820995499388 (5.2 TiB)
Interrupt:24 Memory:fbe80000-fbea0000
r02:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255
eth2 Link encap:Ethernet HWaddr 00:1b:78:5c:a8:fd
inet addr:10.0.255.254 Bcast:10.0.255.255 Mask:255.255.255.252
inet6 addr: fe80::21b:78ff:fe5c:a8fd/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:321977747 errors:0 dropped:0 overruns:0 frame:0
TX packets:264683964 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:332813827055 (309.9 GiB) TX bytes:328142295363 (305.6 GiB)
Interrupt:17 Memory:fdfa0000-fdfc0000
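Originally, both r01 and r02 were running Debian Squeeze (drbd 8.3.7). I then rebuilt r02 with Debian Wheezy (drbd 8.3.13). Things ran smoothly for a few days, and then, after a restart of drbd, this problem started. I have several other drbd clusters that I have been upgrading the same way. Some are fully upgraded to Wheezy, others are still half Squeeze, half Wheezy, and they are fine.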
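Here is what I have tried so far to resolve this:
- Wiped the drbd volume on r02 and attempted a resync.
- Wiped, reinstalled, and reconfigured r02.
- Replaced r02 with different hardware and rebuilt it from scratch.
- Replaced the crossover cable (twice).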
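Over the next few days I will be replacing r01 with 100% different hardware. But even if that works, I will still be at a loss. I would really like to understand what is causing this and the proper way to fix it.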
A lot changed in DRBD between 8.3.7 and 8.3.13, including major changes to how resync works: https://blogs.linbit.com/p/128/drbd-sync-rate-controller/
You could try removing any settings you do not need from your resource configuration (so, the syncer{} section) and then adjusting DRBD: # drbdadm adjust all
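As a rough illustration of that suggestion, the sketch below prints a copy of the configuration with the syncer block removed, so it can be reviewed before being installed on both nodes. The function name strip_syncer is hypothetical. The sed range assumes the block opens on its own `syncer {` line and closes on the next line that starts with `}`, with nothing nested inside, as in the configuration posted in the question.

```shell
#!/bin/sh
# Hypothetical helper: print a drbd.conf with the "syncer { ... }" block removed.
# Assumes the block is not nested and ends at the first line starting with "}".
strip_syncer() {
    sed '/^[[:space:]]*syncer[[:space:]]*{/,/^[[:space:]]*}/d' "$1"
}

# Example: write a candidate config next to the original for review.
# strip_syncer /etc/drbd.conf > /tmp/drbd.conf.nosyncer
# diff -u /etc/drbd.conf /tmp/drbd.conf.nosyncer
```

Once the reviewed file is in place as /etc/drbd.conf on both nodes, `drbdadm adjust all` applies the change. Note that al-extents lives in the same block, so dropping the whole section should also return it to the DRBD default.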
If they still cannot connect, you may have to upgrade the older node to get them to sync: http://www.drbd.org/download/drbd/8.3/drbd-8.3.13.tar.gz
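Before and after upgrading the older node, it may help to confirm which DRBD version each loaded kernel module reports. This is a minimal sketch that assumes the first line of /proc/drbd has the usual `version: <x.y.z> (api:.../proto:...)` form. The function name drbd_version is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/drbd style text on stdin and print the
# module version from the first "version:" line, e.g. "8.3.7".
drbd_version() {
    sed -n 's/^version:[[:space:]]*\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Example (run on each node and compare the results):
# drbd_version < /proc/drbd
```

If the two nodes print the same version after the upgrade, the version mismatch is ruled out as the cause of the ProtocolError.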