Restarted namenode suffers from block report storm
When our standby namenode (running Hadoop v2.4.1) was restarted after a failure, we found that once it left safemode it was too busy to respond in time. We took several thread dumps, and they all looked like this:
Thread 212 (IPC Server handler 148 on 8020):
State: WAITING
Blocked count: 66
Waited count: 598
Waiting on java.util.concurrent.locks.ReentrantLock$FairSync@60ea5634
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:229)
java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeLock(FSNamesystem.java:1378)
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1676)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1019)
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:152)
org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService.callBlockingMethod(DatanodeProtocolProtos.java:28061)
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2013)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2009)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:415)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Almost all server handlers are waiting on FSNamesystem#writeLock to process incremental/full block reports!
Configuration:
dfs.blockreport.initialDelay: 120
dfs.blockreport.intervalMsec: 6h
server handler count: 200
number of datanodes: 400
The namenode takes 0.5~1s to process one block report.
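The numbers above can be turned into a rough capacity check. This is our own back-of-the-envelope arithmetic, not measured data, and it assumes report processing is fully serialized by the write lock:

```python
# Back-of-the-envelope model of block report load after a namenode restart.
# Assumption: BlockManager#processReport holds the FSNamesystem write lock,
# so full reports are effectively processed one at a time.

DATANODES = 400
PROCESS_TIME_S = 1.0       # upper bound observed: 0.5~1s per report
INITIAL_DELAY_S = 120      # dfs.blockreport.initialDelay

# After a restart, every datanode sends its full report within the
# initialDelay window, so the serialized work queued at the namenode is:
total_work_s = DATANODES * PROCESS_TIME_S     # seconds of lock-held work

# Work still queued when the arrival window closes:
backlog_s = total_work_s - INITIAL_DELAY_S

print(f"serialized work: {total_work_s:.0f}s arriving in {INITIAL_DELAY_S}s")
print(f"backlog when the window closes: ~{backlog_s:.0f}s")

# A report near the back of the queue can wait minutes for the write lock.
# If the datanode-side RPC timeout is shorter than that wait, the datanode
# retries and the same storage gets processed again, growing the queue further.
```

Under these assumptions the backlog alone explains multi-minute RPC waits, which in turn triggers the retry amplification described below.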
The datanode logs show many offerService IOExceptions and retries.
The NN log shows that some datanode storages were processed more than twice, some even up to 10 times:
blockLog.info("BLOCK* processReport: from storage " + storage.getStorageID()
+ " node " + nodeID + ", blocks: " + newReport.getNumberOfBlocks()
+ ", processing time: " + (endTime - startTime) + " msecs");
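The duplicates can be confirmed by counting that log line per storage ID. A quick sketch (the sample log lines and storage IDs below are synthetic, for illustration only):

```python
# Count how many times each storage's full block report was processed,
# keying on the BlockManager log line quoted above.
import re
from collections import Counter

PATTERN = re.compile(r"BLOCK\* processReport: from storage (\S+)")

def count_reports(lines):
    """Return a Counter mapping storage ID -> number of processed reports."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Synthetic sample: DS-1 appears twice, i.e. a retried/duplicated report.
log = [
    "INFO BLOCK* processReport: from storage DS-1 node dn1:50010, blocks: 10, processing time: 800 msecs",
    "INFO BLOCK* processReport: from storage DS-1 node dn1:50010, blocks: 10, processing time: 950 msecs",
    "INFO BLOCK* processReport: from storage DS-2 node dn2:50010, blocks: 12, processing time: 700 msecs",
]
print(count_reports(log))
```

Any storage with a count above 1 within a single report interval indicates a retried report being reprocessed.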
Has anyone run into the same problem? Any ideas?
In the end, we reduced the number of IPC handlers on the namenode to match our servers' capacity, which solved the problem. Hope this helps anyone who hits the same issue!
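For reference, the namenode's handler pool is controlled by dfs.namenode.handler.count in hdfs-site.xml. The value below is illustrative only, not the one we actually used; tune it to your own hardware:

```xml
<!-- hdfs-site.xml: illustrative value, not a recommendation -->
<property>
  <name>dfs.namenode.handler.count</name>
  <!-- Fewer handlers means fewer threads parked on the write lock, so
       excess datanode RPCs queue (or fail fast) at the server instead of
       timing out mid-flight and being retried as duplicate reports. -->
  <value>64</value>
</property>
```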