Cassandra node can't complete joining operation

Question

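I am trying to add a new node to an existing C* 2.1.11 cluster. The node appears to have completed the streaming phase of bootstrap, but I can't find an explanation for why it has not moved out of the JOINING state. The Cassandra logs on all nodes showed no errors at any point during streaming.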

nodetool status, run from any node, reports the new node as UJ, with a larger load than the rest of the nodes:

# nodetool status
Datacenter: us-east-vpc
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  xx.xx.xx.78   564.96 GB  256     ?       xxxx-f3c7d9d40e92  1d
UN  xx.xx.xx.110  534.63 GB  256     ?       xxxx-9419faa478ca  1a
UN  xx.xx.xx.171  557.13 GB  256     ?       xxxx-7a5b2723e438  1a
UN  xx.xx.xx.203  406.98 GB  256     ?       xxxx-1331d9c44992  1a
UN  xx.xx.xx.26   579.55 GB  256     ?       xxxx-88b202a8cedc  1c
UN  xx.xx.xx.122  603.39 GB  256     ?       xxxx-b0b81ebabeb2  1d
UN  xx.xx.xx.233  565.3 GB   256     ?       xxxx-a2fa9ad67741  1c
UJ  xx.xx.xx.56   881.91 GB  256     ?       xxxx-9863c7799fad  1d
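
When checking this repeatedly from several nodes, it can help to filter the status output down to the joining nodes only. A minimal sketch, assuming the column layout shown above (state in the first column, address in the second); the function name joining_nodes is made up for illustration:

```shell
# Print the address of every node that nodetool status reports as UJ (Up/Joining).
# Reads nodetool status output on stdin.
joining_nodes() {
    awk '$1 == "UJ" { print $2 }'
}
```

Usage would be `nodetool status | joining_nodes`; empty output means no node is still joining from that node's point of view.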

nodetool netstats shows no activity on the other nodes, but on the new node it shows the bootstrap session with an empty list of files to transfer:

# nodetool netstats
Mode: JOINING
Bootstrap xxxx-8d0c340f238b
    /xx.xx.xx.233
    /xx.xx.xx.122
    /xx.xx.xx.171
    /xx.xx.xx.78
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         0             50
Responses                       n/a         0          64941

nodetool info throws an error when it tries to retrieve the token information:

# nodetool info
ID                     : xxxx-9863c7799fad
Gossip active          : true
Thrift active          : false
Native Transport active: false
Load                   : 881.91 GB
Generation No          : 1475450119
Uptime (seconds)       : 12081
Heap Memory (MB)       : 1480.71 / 1996.00
Off Heap Memory (MB)   : 204.47
Data Center            : us-east-vpc
Rack                   : 1d
Exceptions             : 2
Key Cache              : entries 3262, size 788.43 KB, capacity 99 MB, 43 hits, 3249 requests, 0.013 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 49 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
error: null
-- StackTrace --
java.lang.AssertionError
    at org.apache.cassandra.locator.TokenMetadata.getTokens(TokenMetadata.java:474)
    at org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2263)
    at org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2252)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
    at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
    at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:83)
    at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:206)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:647)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1445)
    at javax.management.remote.rmi.RMIConnectionImpl.access0(RMIConnectionImpl.java:76)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
    at javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:639)
    at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:324)
    at sun.rmi.transport.Transport.run(Transport.java:200)
    at sun.rmi.transport.Transport.run(Transport.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Any help would be greatly appreciated.

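Edit, Oct 3: We found that the instance was running out of space; in the end we got an error saying there was not enough space to complete a compaction. We expanded the partition and cleared the /data folder to bootstrap from scratch. After the disk was expanded, streaming completed, but the node still does not move from UJ to UN. There are no errors in the logs, nodetool tpstats shows no pending tasks, and nodetool netstats reports no pending activity, with the same bootstrap UUID as before: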

# nodetool netstats
Mode: JOINING
Bootstrap xxxx-8d0c340f238b
    /xx.xx.xx.233
    /xx.xx.xx.122
    /xx.xx.xx.171
    /xx.xx.xx.78
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         0            130
Responses                       n/a         0         256088

One more question: why would the load on that node be higher than on the others?

Answer 1

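Since no errors were reported and the streaming process had completed, we assumed the node was ready to join the cluster.

We added the auto_bootstrap: False directive to the cassandra.yaml file and restarted the service on the node, and it joined the cluster.

Once it had joined the cluster, we ran a full repair and a cleanup.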
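
The config change can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name set_auto_bootstrap_false is made up, and the path to cassandra.yaml is passed in because it differs between installs. With auto_bootstrap disabled the node joins without streaming any data, which is why the repair afterwards matters.

```shell
# Set auto_bootstrap to false in the cassandra.yaml file given as $1.
# Rewrites an existing auto_bootstrap line, or appends one if the
# setting is absent.
set_auto_bootstrap_false() {
    conf="$1"
    if grep -q '^auto_bootstrap:' "$conf"; then
        # Write to a temp file and move it into place, for portability.
        sed 's/^auto_bootstrap:.*/auto_bootstrap: false/' "$conf" > "$conf.tmp" &&
            mv "$conf.tmp" "$conf"
    else
        printf '%s\n' 'auto_bootstrap: false' >> "$conf"
    fi
}
```

Usage, assuming a package install with the config under /etc/cassandra (an assumption, adjust to your layout): `set_auto_bootstrap_false /etc/cassandra/cassandra.yaml`, then restart the Cassandra service on the node. Once nodetool status shows it as UN, `nodetool repair` and `nodetool cleanup` are the commands for the repair and cleanup steps.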
