drbd sync fails with ProtocolError

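I have a pair of drbd servers that have decided to stop syncing, and nothing I do seems to get them syncing again. The sync runs over a dedicated crossover cable (1gbps copper) between the two servers.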

This is what I see in the logs for r02:

Aug  9 16:09:44 r02 kernel: [12739.178449] block drbd0: receiver (re)started
Aug  9 16:09:44 r02 kernel: [12739.178454] block drbd0: conn( Unconnected -> WFConnection ) 
Aug  9 16:09:44 r02 kernel: [12739.912037] block drbd0: Handshake successful: Agreed network protocol version 91
Aug  9 16:09:44 r02 kernel: [12739.912048] block drbd0: conn( WFConnection -> WFReportParams ) 
Aug  9 16:09:44 r02 kernel: [12739.912074] block drbd0: Starting asender thread (from drbd0_receiver [3740])
Aug  9 16:09:44 r02 kernel: [12739.936681] block drbd0: data-integrity-alg: <not-used>
Aug  9 16:09:44 r02 kernel: [12739.936691] block drbd0: Considerable difference in lower level device sizes: 256503768s vs. 1344982880s
Aug  9 16:09:44 r02 kernel: [12739.942918] block drbd0: drbd_sync_handshake:
Aug  9 16:09:44 r02 kernel: [12739.942923] block drbd0: self E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug  9 16:09:44 r02 kernel: [12739.942928] block drbd0: peer E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug  9 16:09:44 r02 kernel: [12739.942933] block drbd0: uuid_compare()=-1 by rule 50
Aug  9 16:09:44 r02 kernel: [12739.942935] block drbd0: Becoming sync target due to disk states.
Aug  9 16:09:44 r02 kernel: [12739.942946] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Aug  9 16:09:44 r02 kernel: [12740.099597] block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
Aug  9 16:09:44 r02 kernel: [12740.104324] block drbd0: updated sync uuid BF8D25FBE26085B0:0000000000000000:0000000000000000:0000000000000000
Aug  9 16:09:44 r02 kernel: [12740.104423] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug  9 16:09:44 r02 kernel: [12740.106582] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug  9 16:09:44 r02 kernel: [12740.106591] block drbd0: conn( WFSyncUUID -> SyncTarget ) 
Aug  9 16:09:44 r02 kernel: [12740.106599] block drbd0: Began resync as SyncTarget (will sync 128250804 KB [32062701 bits set]).
Aug  9 16:09:44 r02 kernel: [12740.140796] block drbd0: meta connection shut down by peer.
Aug  9 16:09:44 r02 kernel: [12740.141304] block drbd0: sock was shut down by peer
Aug  9 16:09:44 r02 kernel: [12740.141309] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Aug  9 16:09:44 r02 kernel: [12740.141316] block drbd0: short read expecting header on sock: r=0
Aug  9 16:09:44 r02 kernel: [12740.142235] block drbd0: asender terminated
Aug  9 16:09:44 r02 kernel: [12740.142238] block drbd0: Terminating drbd0_asender
Aug  9 16:09:44 r02 kernel: [12740.151561] block drbd0: bitmap WRITE of 979 pages took 2 jiffies
Aug  9 16:09:44 r02 kernel: [12740.151567] block drbd0: 122 GB (32062701 bits) marked out-of-sync by on disk bit-map.
Aug  9 16:09:44 r02 kernel: [12740.151580] block drbd0: Connection closed
Aug  9 16:09:44 r02 kernel: [12740.151586] block drbd0: conn( BrokenPipe -> Unconnected ) 
Aug  9 16:09:44 r02 kernel: [12740.151592] block drbd0: receiver terminated

And for r01:

Aug  9 16:09:44 r01 kernel: [3438273.766768] block drbd0: receiver (re)started
Aug  9 16:09:44 r01 kernel: [3438273.771898] block drbd0: conn( Unconnected -> WFConnection ) 
Aug  9 16:09:44 r01 kernel: [3438274.474411] block drbd0: Handshake successful: Agreed network protocol version 91
Aug  9 16:09:44 r01 kernel: [3438274.483299] block drbd0: conn( WFConnection -> WFReportParams ) 
Aug  9 16:09:44 r01 kernel: [3438274.490420] block drbd0: Starting asender thread (from drbd0_receiver [6366])
Aug  9 16:09:44 r01 kernel: [3438274.498900] block drbd0: data-integrity-alg: <not-used>
Aug  9 16:09:44 r01 kernel: [3438274.505166] block drbd0: Considerable difference in lower level device sizes: 1344982880s vs. 256503768s
Aug  9 16:09:44 r01 kernel: [3438274.516226] block drbd0: max_segment_size ( = BIO size ) = 65536
Aug  9 16:09:44 r01 kernel: [3438274.523385] block drbd0: drbd_sync_handshake:
Aug  9 16:09:44 r01 kernel: [3438274.528677] block drbd0: self E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug  9 16:09:44 r01 kernel: [3438274.541195] block drbd0: peer E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug  9 16:09:44 r01 kernel: [3438274.553710] block drbd0: uuid_compare()=1 by rule 70
Aug  9 16:09:44 r01 kernel: [3438274.559677] block drbd0: Becoming sync source due to disk states.
Aug  9 16:09:44 r01 kernel: [3438274.566897] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) 
Aug  9 16:09:44 r01 kernel: [3438274.666397] block drbd0: conn( WFBitMapS -> SyncSource ) 
Aug  9 16:09:44 r01 kernel: [3438274.672845] block drbd0: Began resync as SyncSource (will sync 128250804 KB [32062701 bits set]).
Aug  9 16:09:44 r01 kernel: [3438274.683196] block drbd0: /build/buildd-linux-2.6_2.6.32-48squeeze3-amd64-mcoLgp/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/block/drbd/drbd_receiver.c:1932: sector: 0s, size: 65536
Aug  9 16:09:45 r01 kernel: [3438274.702834] block drbd0: error receiving RSDataRequest, l: 24!
Aug  9 16:09:45 r01 kernel: [3438274.702837] block drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError ) 
Aug  9 16:09:45 r01 kernel: [3438274.703005] block drbd0: asender terminated
Aug  9 16:09:45 r01 kernel: [3438274.703009] block drbd0: Terminating drbd0_asender
Aug  9 16:09:45 r01 kernel: [3438274.711319] block drbd0: Connection closed
Aug  9 16:09:45 r01 kernel: [3438274.711323] block drbd0: conn( ProtocolError -> Unconnected ) 
Aug  9 16:09:45 r01 kernel: [3438274.711329] block drbd0: receiver terminated

This just repeats over and over.

The configuration should be identical on both servers:

r01:~$ rsync --dry-run --verbose --checksum --itemize-changes 10.0.255.254:/etc/drbd.conf /etc/

sent 11 bytes  received 51 bytes  124.00 bytes/sec
total size is 615  speedup is 9.92 (DRY RUN)

This is what the configuration looks like:

r01:~$ cat /etc/drbd.conf
global {
   usage-count no;
}

resource drbd0 {
  protocol C;
  handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; exit 1"; }
  startup {
    degr-wfc-timeout 60;    # 1 minute.
    wfc-timeout 55;
  }

  disk {
    on-io-error   detach;
  }

  syncer {
    rate 100M;
    al-extents 257;
  }

  on r01.c07.mtsvc.net {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p3;
    address    10.0.255.253:7788;
    meta-disk  internal;
  }

  on r02.c07.mtsvc.net {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p6;
    address    10.0.255.254:7788;
    meta-disk  internal;
  }
}

This is the network configuration on both sides:

r01:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255

eth2      Link encap:Ethernet  HWaddr 00:26:55:d6:f8:fc  
          inet addr:10.0.255.253  Bcast:10.0.255.255  Mask:255.255.255.0
          inet6 addr: fe80::226:55ff:fed6:f8fc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4062510240 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5692251259 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5512604514975 (5.0 TiB)  TX bytes:5820995499388 (5.2 TiB)
          Interrupt:24 Memory:fbe80000-fbea0000 

r02:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255

eth2      Link encap:Ethernet  HWaddr 00:1b:78:5c:a8:fd  
          inet addr:10.0.255.254  Bcast:10.0.255.255  Mask:255.255.255.252
          inet6 addr: fe80::21b:78ff:fe5c:a8fd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:321977747 errors:0 dropped:0 overruns:0 frame:0
          TX packets:264683964 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:332813827055 (309.9 GiB)  TX bytes:328142295363 (305.6 GiB)
          Interrupt:17 Memory:fdfa0000-fdfc0000 

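Originally, both r01 and r02 were running Debian Squeeze (drbd 8.3.7). Then I rebuilt r02 with Debian Wheezy (drbd 8.3.13). Things ran smoothly for a few days, and then, after a restart of drbd, this problem started. I have several other drbd clusters that I have been upgrading in the same way. Some are fully upgraded to Wheezy, others are still half Squeeze, half Wheezy, and they are all fine.

So far, I have tried to resolve this issue, without success.

Over the next few days, I will be replacing r01 with 100% different hardware. But even if that works, I am still at a loss. I would really like to understand what caused this problem and the proper way to resolve it.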

A lot changed in DRBD between 8.3.7 and 8.3.13, including major changes to how resync works: https://blogs.linbit.com/p/128/drbd-sync-rate-controller/

You could try removing any unneeded settings from your resource configuration (so, the syncer{} section) and then adjusting DRBD: # drbdadm adjust all
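
As a rough sketch of that edit, the helper below prints a copy of a drbd.conf-style file with the syncer { ... } block removed, so the result can be reviewed before anything is installed. The function name strip_syncer is made up for illustration, and it assumes the block opens on a line containing only "syncer {" and closes at the next line containing only "}", which matches the config posted in the question but not every possible layout.

```shell
#!/bin/sh
# Hypothetical helper (not part of DRBD): print the given config file
# with the "syncer { ... }" block deleted.
# Assumption: the block starts on a line holding only "syncer {" and
# ends at the next line holding only "}".
strip_syncer() {
    sed '/^[[:space:]]*syncer[[:space:]]*{[[:space:]]*$/,/^[[:space:]]*}[[:space:]]*$/d' "$1"
}

# When a file name is passed, print the stripped config to stdout.
if [ "$#" -gt 0 ]; then
    strip_syncer "$1"
fi
```

One way to use it would be to redirect the output to a scratch file, diff that against /etc/drbd.conf, and copy it into place on both nodes only once the diff looks right. Running drbdadm -d adjust all first prints the commands that would be executed without applying them; drbdadm adjust all then applies the change, and cat /proc/drbd shows whether the connection comes back.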

If they still will not connect, you may have to upgrade the old node to get them syncing: http://www.drbd.org/download/drbd/8.3/drbd-8.3.13.tar.gz
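
Before and after upgrading, it may help to confirm which DRBD module version each node is actually running. The snippet below is only a sketch: drbd_version is a hypothetical helper name, and it assumes the first line of /proc/drbd starts with "version:" followed by the version number, as it does on 8.3.

```shell
#!/bin/sh
# Hypothetical helper: print the version number from a /proc/drbd-style
# file (defaults to /proc/drbd when no file is given).
# Assumption: the file contains a line of the form "version: X.Y.Z ...".
drbd_version() {
    sed -n 's/^version:[[:space:]]*\([0-9][0-9.]*\).*/\1/p' "${1:-/proc/drbd}" | head -n 1
}

# Only report when the DRBD module is loaded on this machine.
if [ -r /proc/drbd ]; then
    echo "local DRBD version: $(drbd_version)"
fi
```

Run it on r01 and r02 and compare the two results; once the old node has been upgraded, both should report the same version.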