ElasticSearch nodes cannot see the other nodes after a network break

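We are currently running failover tests on our ElasticSearch setup. Here is the setup we use:

We have 4 ElasticSearch machines running; let's call them ES1, ES2, ES3 and ES4. They hold several indices, each with 5 shards and 1 replica, so each index has 10 shards in total (5 primaries plus 5 replica copies). Everything is spread evenly across the nodes, so if one node fails everything should keep working.

The 4 nodes run on Windows 7 64-bit with 8 GB of RAM. The nodes discover each other using the cluster name.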

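I unplugged the ES1 machine to see whether everything kept working, and it did. Hooray!

But now the strange part: when we plugged ES1 back in, it did not rejoin the cluster (named wc2014, FYI). Instead it appears to be on its own in a separate cluster that is also named wc2014.

Here is some of the information I found in the logs:

When we unplugged the machine (this looks normal to me):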

org.elasticsearch.transport.NodeDisconnectedException: [IPDIRECTOR-119][inet[/10.194.1.119:9300]][cluster:monitor/nodes/info[n]] disconnected
[2015-08-12 11:27:04,619][DEBUG][action.admin.indices.stats] [IPDIRECTOR-118] [wc2014_nearline][4], node[fxTcr9-FR52jecm5a2adRg], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1b999e6c]
org.elasticsearch.transport.NodeDisconnectedException: [IPDIRECTOR-119][inet[/10.194.1.119:9300]][indices:monitor/stats[s]] disconnected
[2015-08-12 11:27:04,619][DEBUG][action.admin.indices.stats] [IPDIRECTOR-118] [wc2014_mediaresource][4], node[fxTcr9-FR52jecm5a2adRg], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1b999e6c]
org.elasticsearch.transport.NodeDisconnectedException: [IPDIRECTOR-119][inet[/10.194.1.119:9300]][indices:monitor/stats[s]] disconnected
[2015-08-12 11:27:04,619][DEBUG][action.admin.indices.stats] [IPDIRECTOR-118] [wc2014_edit][4], node[fxTcr9-FR52jecm5a2adRg], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1b999e6c]
org.elasticsearch.transport.NodeDisconnectedException: [IPDIRECTOR-119][inet[/10.194.1.119:9300]][indices:monitor/stats[s]] disconnected
[2015-08-12 11:27:04,619][DEBUG][action.admin.indices.stats] [IPDIRECTOR-118] [wc2014_log][4], node[fxTcr9-FR52jecm5a2adRg], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1b999e6c]
org.elasticsearch.transport.NodeDisconnectedException: [IPDIRECTOR-119][inet[/10.194.1.119:9300]][indices:monitor/stats[s]] disconnected
[2015-08-12 11:27:04,619][DEBUG][action.admin.indices.stats] [IPDIRECTOR-118] [wc2014_metadata][4], node[fxTcr9-FR52jecm5a2adRg], [R], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1b999e6c]
org.elasticsearch.transport.NodeDisconnectedException: [IPDIRECTOR-119][inet[/10.194.1.119:9300]][indices:monitor/stats[s]] disconnected
[2015-08-12 11:27:04,619][DEBUG][action.admin.indices.stats] [IPDIRECTOR-118] [wc2014_ipwsedit][4], node[fxTcr9-FR52jecm5a2adRg], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1b999e6c]

Then I get different errors like this one:

[2015-08-12 11:27:09,797][DEBUG][action.admin.cluster.node.info] [IPDIRECTOR-118] failed to execute on node [fxTcr9-FR52jecm5a2adRg]
org.elasticsearch.transport.SendRequestTransportException: [IPDIRECTOR-119][inet[/10.194.1.119:9300]][cluster:monitor/nodes/info[n]]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:165)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access0(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:70)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
    at org.elasticsearch.client.node.NodeClusterAdminClient.execute(NodeClusterAdminClient.java:77)
    at org.elasticsearch.client.FilterClient$ClusterAdmin.execute(FilterClient.java:161)
    at org.elasticsearch.rest.BaseRestHandler$HeadersAndContextCopyClient$ClusterAdmin.execute(BaseRestHandler.java:125)
    at org.elasticsearch.client.support.AbstractClusterAdminClient.nodesInfo(AbstractClusterAdminClient.java:187)
    at org.elasticsearch.rest.action.admin.cluster.node.info.RestNodesInfoAction.handleRequest(RestNodesInfoAction.java:102)
    at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:53)
    at org.elasticsearch.rest.RestController.executeHandler(RestController.java:225)
    at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:170)
    at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:121)
    at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:83)
    at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:329)
    at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:63)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.http.netty.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:60)
    at org.elasticsearch.common.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.handler.codec.http.HttpContentEncoder.messageReceived(HttpContentEncoder.java:82)
    at org.elasticsearch.common.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:145)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
    at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
    at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [IPDIRECTOR-119][inet[/10.194.1.119:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:936)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:629)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
    ... 58 more

When we plugged the node back in:

[2015-08-12 11:39:59,177][INFO ][cluster.service          ] [IPDIRECTOR-118] added {[IPDIRECTOR-119][3kybxeb7TMm30Pzh7rrmhA][Ipdirector-119][inet[/10.194.1.119:9300]],}, reason: zen-disco-receive(from master [[IPDIRECTOR-121][BX8BT6OgRjWM5YEhlxt9mQ][Ipdirector-121][inet[/10.194.1.121:9300]]])
[2015-08-12 11:48:07,768][INFO ][discovery.zen            ] [IPDIRECTOR-118] master_left [[IPDIRECTOR-121][BX8BT6OgRjWM5YEhlxt9mQ][Ipdirector-121][inet[/10.194.1.121:9300]]], reason [transport disconnected]
[2015-08-12 11:48:07,769][WARN ][discovery.zen            ] [IPDIRECTOR-118] master left (reason = transport disconnected), current nodes: {[IPDIRECTOR-118][Z9UA4kJxTIa6B3tY4F-_vw][Ipdirector-118][inet[/10.194.1.118:9300]],[IPDIRECTOR-119][3kybxeb7TMm30Pzh7rrmhA][Ipdirector-119][inet[/10.194.1.119:9300]],[IPDIRECTOR-120][EQzx7BprQa6EVOT3V6zlqQ][Ipdirector-120][inet[/10.194.1.120:9300]],}
[2015-08-12 11:48:07,769][INFO ][cluster.service          ] [IPDIRECTOR-118] removed {[IPDIRECTOR-121][BX8BT6OgRjWM5YEhlxt9mQ][Ipdirector-121][inet[/10.194.1.121:9300]],}, reason: zen-disco-master_failed ([IPDIRECTOR-121][BX8BT6OgRjWM5YEhlxt9mQ][Ipdirector-121][inet[/10.194.1.121:9300]])
[2015-08-12 11:48:11,541][WARN ][discovery.zen.ping.unicast] [IPDIRECTOR-118] failed to send ping to [[IPDIRECTOR-119][3kybxeb7TMm30Pzh7rrmhA][Ipdirector-119][inet[/10.194.1.119:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [IPDIRECTOR-119][inet[/10.194.1.119:9300]][internal:discovery/zen/unicast] request_id [124460] timed out after [3750ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
[2015-08-12 11:48:11,541][WARN ][discovery.zen.ping.unicast] [IPDIRECTOR-118] failed to send ping to [[IPDIRECTOR-120][EQzx7BprQa6EVOT3V6zlqQ][Ipdirector-120][inet[/10.194.1.120:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [IPDIRECTOR-120][inet[/10.194.1.120:9300]][internal:discovery/zen/unicast] request_id [124461] timed out after [3750ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

A few more timeouts follow, and then many errors like these:

[2015-08-12 11:48:26,677][WARN ][gateway.local            ] [IPDIRECTOR-118] [wc2014_clip][4]: failed to list shard stores on node [EQzx7BprQa6EVOT3V6zlqQ]
org.elasticsearch.action.FailedNodeException: Failed node [EQzx7BprQa6EVOT3V6zlqQ]
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access00(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.handleException(TransportNodesOperationAction.java:178)
    at org.elasticsearch.transport.TransportService$Adapter.run(TransportService.java:468)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [IPDIRECTOR-120][inet[/10.194.1.120:9300]][internal:cluster/nodes/indices/shard/store[n]] disconnected
[2015-08-12 11:48:26,677][WARN ][gateway.local            ] [IPDIRECTOR-118] [wc2014_clip][4]: failed to list shard stores on node [3kybxeb7TMm30Pzh7rrmhA]
org.elasticsearch.action.FailedNodeException: Failed node [3kybxeb7TMm30Pzh7rrmhA]
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access00(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.handleException(TransportNodesOperationAction.java:178)
    at org.elasticsearch.transport.TransportService$Adapter.run(TransportService.java:468)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [IPDIRECTOR-119][inet[/10.194.1.119:9300]][internal:cluster/nodes/indices/shard/store[n]] disconnected
[2015-08-12 11:48:27,081][WARN ][gateway.local            ] [IPDIRECTOR-118] [wc2014_clip][3]: failed to list shard stores on node [EQzx7BprQa6EVOT3V6zlqQ]
org.elasticsearch.action.FailedNodeException: Failed node [EQzx7BprQa6EVOT3V6zlqQ]
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access00(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.handleException(TransportNodesOperationAction.java:178)
    at org.elasticsearch.transport.TransportService.run(TransportService.java:290)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.transport.SendRequestTransportException: [IPDIRECTOR-120][inet[/10.194.1.120:9300]][internal:cluster/nodes/indices/shard/store[n]]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:165)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access0(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:70)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:55)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.list(TransportNodesListShardStoreMetaData.java:79)
    at org.elasticsearch.gateway.local.LocalGatewayAllocator.buildShardStores(LocalGatewayAllocator.java:458)
    at org.elasticsearch.gateway.local.LocalGatewayAllocator.allocateUnassigned(LocalGatewayAllocator.java:292)
    at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:74)
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:219)
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:162)
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:148)
    at org.elasticsearch.discovery.zen.ZenDiscovery.execute(ZenDiscovery.java:387)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:365)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
    ... 3 more
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [IPDIRECTOR-120][inet[/10.194.1.120:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:936)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:629)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)

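To fix this I have to restart the node manually, and then everything returns to normal.

Shouldn't the node talk to ES2, ES3 and ES4 automatically and rejoin the cluster without any manual intervention on my part?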

Thanks, Matthias.

Check the elasticsearch.yml file:

/etc/elasticsearch/elasticsearch.yml

You need to verify that the discovery type matches the environment you are running in, e.g. EC2.
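
As a rough illustration of that check, the discovery section of elasticsearch.yml for a fixed set of on-premises machines might look like the sketch below. It assumes an ElasticSearch 1.x release (which the log format in the question suggests) and unicast zen discovery. The host addresses are taken from the logs in the question; disabling multicast is an assumption, not something the question states.

```yaml
# Sketch only: assumes ElasticSearch 1.x with zen discovery on fixed hosts.
cluster.name: wc2014

# Assumption: use an explicit host list instead of multicast discovery.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts:
  - "10.194.1.118:9300"
  - "10.194.1.119:9300"
  - "10.194.1.120:9300"
  - "10.194.1.121:9300"
```

On EC2, by contrast, discovery would typically go through the cloud plugin's `discovery.type: ec2` setting rather than a static host list.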

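OK, we have found the solution to our problem. We had 4 ElasticSearch machines, but only one of them was configured as a master node, so when the network went down, 2 clusters started running side by side.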
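
For anyone hitting the same thing, here is a minimal sketch of settings that guard against this kind of split. It assumes ElasticSearch 1.x and that all 4 nodes are made master-eligible; the post does not show the final configuration, so the values are illustrative. With 4 master-eligible nodes, a quorum is 4 / 2 + 1 = 3, so a node that is isolated from the others cannot elect itself master and form a second cluster.

```yaml
# Sketch only: apply the same values on each of the 4 nodes.
# Assumption: every node is both master-eligible and a data node.
node.master: true
node.data: true

# Quorum of master-eligible nodes: (4 / 2) + 1 = 3.
discovery.zen.minimum_master_nodes: 3
```

If the number of master-eligible nodes changes, this value would need to be recomputed with the same formula.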