Hyperledger Fabric | Orderer PODs keeps restarting | Error: Was the raft log corrupted, truncated, or lost?

Question

我是运行 Azure Kubernetes 集群中的 Hyperledger Fabric 网络。我使用单个 Azure 文件卷 (1000 GB) 作为我的持久卷。

但是，我的 Orderer POD 不断重启。

订购者 POD 正在记录以下错误：

2022-02-13 04:40:22.342 UTC 0080 PANI [orderer.consensus.etcdraft] commitTo -> tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost? channel=system-channel node=3
panic: tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost?

以下是来自 Orderer POD 的详细日志：

2022-02-13 04:40:22.342 UTC 007f INFO [orderer.consensus.etcdraft] becomeFollower -> 3 became follower at term 2 channel=system-channel node=3
2022-02-13 04:40:22.342 UTC 0080 PANI [orderer.consensus.etcdraft] commitTo -> tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost? channel=system-channel node=3
panic: tocommit(8) is out of range [lastIndex(5)]. Was the raft log corrupted, truncated, or lost?
 
go.uber.org/zap.(*SugaredLogger).log(0xc000332e20, 0xf0000000000004, 0x10b467e, 0x5d, 0xc000533b00, 0x2, 0x2, 0x0, 0x0, 0x0)
    /go/src/github.com/hyperledger/fabric/vendor/go.uber.org/zap/sugar.go:234 +0xf6
go.uber.org/zap.(*SugaredLogger).Panicf(...)
    /go/src/github.com/hyperledger/fabric/vendor/go.uber.org/zap/sugar.go:159
github.com/hyperledger/fabric/common/flogging.(*FabricLogger).Panicf(0xc000332e28, 0x10b467e, 0x5d, 0xc000533b00, 0x2, 0x2)
    /go/src/github.com/hyperledger/fabric/common/flogging/zap.go:74 +0x7c
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc0001a4310, 0x8)
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/log.go:203 +0x135
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc000aaab40, 0x8, 0x3, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/raft.go:1324 +0x54
go.etcd.io/etcd/raft.stepFollower(0xc000aaab40, 0x8, 0x3, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/raft.go:1269 +0x439
go.etcd.io/etcd/raft.(*raft).Step(0xc000aaab40, 0x8, 0x3, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/raft.go:971 +0x1235
go.etcd.io/etcd/raft.(*node).run(0xc00007e660, 0xc000aaab40)
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/node.go:357 +0xd78
created by go.etcd.io/etcd/raft.StartNode
    /go/src/github.com/hyperledger/fabric/vendor/go.etcd.io/etcd/raft/node.go:233 +0x409
2022-02-13 04:45:28.330 UTC 0001 WARN [localconfig] completeInitialization -> General.GenesisFile should be replaced by General.BootstrapFile
2022-02-13 04:45:28.331 UTC 0002 INFO [localconfig] completeInitialization -> Kafka.Version unset, setting to 0.10.2.0
2022-02-13 04:45:28.331 UTC 0003 INFO [orderer.common.server] prettyPrintStruct -> Orderer config values:
    General.ListenAddress = "0.0.0.0"
    General.ListenPort = 7050
    General.TLS.Enabled = true
    General.TLS.PrivateKey = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.key"
    General.TLS.Certificate = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.crt"
    General.TLS.RootCAs = [/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/ca.crt]
    General.TLS.ClientAuthRequired = false
    General.TLS.ClientRootCAs = []
    General.TLS.TLSHandshakeTimeShift = 0s
    General.Cluster.ListenAddress = ""
    General.Cluster.ListenPort = 0
    General.Cluster.ServerCertificate = ""
    General.Cluster.ServerPrivateKey = ""
    General.Cluster.ClientCertificate = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.crt"
    General.Cluster.ClientPrivateKey = "/organizations/ordererOrganizations/example.com/orderers/orderer3.example.com/tls/server.key"
    General.Cluster.RootCAs = []
    General.Cluster.DialTimeout = 5s
    General.Cluster.RPCTimeout = 7s
    General.Cluster.ReplicationBufferSize = 20971520
    General.Cluster.ReplicationPullTimeout = 5s
    General.Cluster.ReplicationRetryTimeout = 5s
    General.Cluster.ReplicationBackgroundRefreshInterval = 5m0s
    General.Cluster.ReplicationMaxRetries = 12
    General.Cluster.SendBufferSize = 10
    General.Cluster.CertExpirationWarningThreshold = 168h0m0s
    General.Cluster.TLSHandshakeTimeShift = 0s
    General.Keepalive.ServerMinInterval = 1m0s
    General.Keepalive.ServerInterval = 2h0m0s
    General.Keepalive.ServerTimeout = 20s
    General.ConnectionTimeout = 0s
    General.GenesisMethod = "file"
    General.GenesisFile = "/system-genesis-block/genesis.block"
    Kafka.Retry.LongTotal = 12h0m0s
    Kafka.Retry.NetworkTimeouts.DialTimeout = 10s
    Kafka.Retry.NetworkTimeouts.ReadTimeout = 10s
    Kafka.Retry.NetworkTimeouts.WriteTimeout = 10s
    Kafka.Retry.Metadata.RetryMax = 3
    Kafka.Retry.Metadata.RetryBackoff = 250ms
    Kafka.Retry.Producer.RetryMax = 3
    Kafka.Retry.Producer.RetryBackoff = 100ms
    Kafka.Retry.Consumer.RetryBackoff = 2s
    Kafka.Verbose = false
    Kafka.Version = 0.10.2.0
    Kafka.TLS.Enabled = false
    Kafka.TLS.PrivateKey = ""
    Kafka.TLS.Certificate = ""
    Kafka.TLS.RootCAs = []
    Kafka.TLS.ClientAuthRequired = false
    Kafka.TLS.ClientRootCAs = []
    Kafka.TLS.TLSHandshakeTimeShift = 0s
    Kafka.SASLPlain.Enabled = false
    Kafka.SASLPlain.User = ""
    Kafka.SASLPlain.Password = ""
    Kafka.Topic.ReplicationFactor = 3
    Debug.BroadcastTraceDir = ""
    Debug.DeliverTraceDir = ""
    Consensus = map[SnapDir:/var/hyperledger/production/orderer/etcdraft/snapshot WALDir:/var/hyperledger/production/orderer/etcdraft/wal]
    Operations.ListenAddress = "127.0.0.1:8443"
    Operations.TLS.Enabled = false
    Operations.TLS.PrivateKey = ""
    Operations.TLS.Certificate = ""
    Operations.TLS.RootCAs = []
    Operations.TLS.ClientAuthRequired = false
    Operations.TLS.ClientRootCAs = []
    Operations.TLS.TLSHandshakeTimeShift = 0s
    Metrics.Provider = "disabled"
    Metrics.Statsd.Network = "udp"
    Metrics.Statsd.Address = "127.0.0.1:8125"
    Metrics.Statsd.WriteInterval = 30s
    Metrics.Statsd.Prefix = ""
    ChannelParticipation.Enabled = false
    ChannelParticipation.MaxRequestBodySize = 1048576
    Admin.ListenAddress = "127.0.0.1:9443"
    Admin.TLS.Enabled = false
    Admin.TLS.PrivateKey = ""
    Admin.TLS.Certificate = ""
    Admin.TLS.RootCAs = []
    Admin.TLS.ClientAuthRequired = true
    Admin.TLS.ClientRootCAs = []
    Admin.TLS.TLSHandshakeTimeShift = 0s
2022-02-13 04:45:28.773 UTC 0004 INFO [orderer.common.server] initializeServerConfig -> Starting orderer with TLS enabled
2022-02-13 04:45:28.822 UTC 0005 INFO [blkstorage] NewProvider -> Creating new file ledger directory at /var/hyperledger/production/orderer/chains
2022-02-13 04:45:28.870 UTC 0006 INFO [orderer.common.server] Main -> Bootstrapping the system channel
2022-02-13 04:45:28.880 UTC 0007 INFO [blkstorage] newBlockfileMgr -> Getting block information from block storage
2022-02-13 04:45:28.920 UTC 0008 INFO [orderer.common.server] initializeBootstrapChannel -> Initialized the system channel 'system-channel' from bootstrap block
2022-02-13 04:45:28.923 UTC 0009 INFO [orderer.common.server] extractSystemChannel -> Found system channel config block, number: 0
2022-02-13 04:45:28.923 UTC 000a INFO [orderer.common.server] selectClusterBootBlock -> Cluster boot block is bootstrap (genesis) block; Blocks Header.Number system-channel=0, bootstrap=0
2022-02-13 04:45:28.926 UTC 000b INFO [orderer.common.server] Main -> Starting with system channel: system-channel, consensus type: etcdraft
2022-02-13 04:45:28.926 UTC 000c INFO [orderer.common.server] Main -> Setting up cluster
2022-02-13 04:45:28.926 UTC 000d INFO [orderer.common.server] reuseListener -> Cluster listener is not configured, defaulting to use the general listener on port 7050
2022-02-13 04:45:28.959 UTC 000e INFO [orderer.common.cluster] loadVerifier -> Loaded verifier for channel system-channel from config block at index 0
2022-02-13 04:45:28.960 UTC 000f INFO [certmonitor] trackCertExpiration -> The enrollment certificate will expire on 2023-02-10 09:12:00 +0000 UTC
2022-02-13 04:45:28.960 UTC 0010 INFO [certmonitor] trackCertExpiration -> The server TLS certificate will expire on 2023-02-10 09:13:00 +0000 UTC
2022-02-13 04:45:28.960 UTC 0011 INFO [certmonitor] trackCertExpiration -> The client TLS certificate will expire on 2023-02-10 09:13:00 +0000 UTC
2022-02-13 04:45:28.966 UTC 0012 INFO [orderer.consensus.etcdraft] HandleChain -> EvictionSuspicion not set, defaulting to 10m0s
2022-02-13 04:45:28.966 UTC 0013 INFO [orderer.consensus.etcdraft] HandleChain -> With system channel: after eviction InactiveChainRegistry.TrackChain will be called
2022-02-13 04:45:28.966 UTC 0014 INFO [orderer.consensus.etcdraft] createOrReadWAL -> No WAL data found, creating new WAL at path '/var/hyperledger/production/orderer/etcdraft/wal/system-channel' channel=system-channel node=3
2022-02-13 04:45:28.983 UTC 0015 INFO [orderer.commmon.multichannel] initSystemChannel -> Starting system channel 'system-channel' with genesis block hash ac270210ec2258b99948adc06552a14d49463c3457933b1e24a151502c6487e5 and orderer type etcdraft
2022-02-13 04:45:28.983 UTC 0016 INFO [orderer.consensus.etcdraft] Start -> Starting Raft node channel=system-channel node=3
2022-02-13 04:45:28.983 UTC 0017 INFO [orderer.common.cluster] Configure -> Entering, channel: system-channel, nodes: [ID: 1,

我们可以在 configtx.yaml 或 orderer-deployment.yaml 中设置任何 config/env 变量来避免这个问题吗？
是否有任何超时等。我们可以在configtx.yaml或orderer-deployment.yaml中increase/set来避免这个问题？
我的另一个观察是，当我们（手动）删除 Orderer POD（一个或两个）并且当它自动启动新的 Orderer PODs 时，它们将保留重新启动时出现同样的错误。

Answer 1

原来我的 WAL 日志目录被删除了。任何登陆此问题的人，请在您的订购者部署中设置以下（如果尚未设置）ENV 变量：

- name: ORDERER_CONSENSUS_SNAPDIR
  value: /var/hyperledger/production/orderer/etcdraft/snapshot
- name: ORDERER_CONSENSUS_WALDIR
  value: /var/hyperledger/production/orderer/etcdraft/wal

如果您已经在 Orderer 部署中设置了上述 ENV 变量，请检查目录是否为空，以便在上述位置重新启动 Orderer。如果它们为空，则表示您的 WAL 日志已被删除。如果 WAL 日志不可用，订购者将无法恢复。

Hyperledger Fabric | Orderer PODs keeps restarting | Error: Was the raft log corrupted, truncated, or lost?

Hyperledger Fabric | Orderer PODs keeps restarting | Error: Was the raft log corrupted, truncated, or lost?

azure

raft

kubernetes

hyperledger-fabric