A very common error log:
2015-03-05 03:10:35,461 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog: Error while AsyncSyncer sync, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1540478979-192.168.5.117-1409220943611:blk_1098635649_24898817 does not exist or is not under Constructionblk_1098635649_24900382
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:5956)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6023)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:645)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:874)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$
2015-03-05 03:10:35,460 FATAL [regionserver60020-WAL.AsyncSyncer2] wal.FSHLog: Error while AsyncSyncer sync, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1540478979-192.168.5.117-1409220943611:blk_1098635649_24898817 does not exist or is not under Constructionblk_1098635649_24900382
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:5956)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6023)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:645)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:874)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server
2015-03-05 03:10:35,459 FATAL [regionserver60020-WAL.AsyncSyncer3] wal.FSHLog: Error while AsyncSyncer sync, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1540478979-192.168.5.117-1409220943611:blk_1098635649_24898817 does not exist or is not under Constructionblk_1098635649_24900382
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:5956)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6023)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:645)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:874)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
2015-03-05 03:10:35,463 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server hostxxx,60020,1424960304895: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hostxxx,60020,1424960304895 as dead server
at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:73)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hostxxx,60020,1424960304895 as dead server
at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hostxxx,60020,1424960304895 as dead server
at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
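The "Server REPORT rejected ... as dead server" message is produced by ServerManager on the master, so the HMaster log should show when it expired this RegionServer. A rough check (the log path and file-name pattern are hypothetical, adjust to your deployment):

# Look for the moment the master started treating this RegionServer as dead.
grep -n 'hostxxx,60020,1424960304895' /var/log/hbase/hbase-*-master-*.log | grep -iE 'dead|expired'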
Looking a bit further back in the log, you will find GC warnings:
2015-03-05 03:08:25,092 DEBUG [LruStats #0] hfile.LruBlockCache: Total=12.67 GB, free=125.43 MB, max=12.79 GB, blocks=203552, accesses=65532282, hits=33745890, hitRatio=51.50%, , cachingAccesses=34739624, cachingHits=28976512, cachingHitsRatio=83.41%, evictions=536, evicted=5540234, evictedPerRun=10336.2578125
2015-03-05 03:10:35,390 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 124408ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-03-05 03:10:35,390 INFO [regionserver60020-SendThread(host141:42181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 127580ms for sessionid 0x34add7868858eeb, closing socket connection and attempting reconnect
2015-03-05 03:10:35,390 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 124408ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-03-05 03:10:35,390 WARN [RpcServer.handler=14,port=60020] ipc.RpcServer: RpcServer.respondercallId: 3046 service: ClientService methodName: Scan size: 23 connection: 192.168.5.186:3366: output error
2015-03-05 03:10:35,390 WARN [regionserver60020] util.Sleeper: We slept 121547ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
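The Sleeper warnings already point at a pause of roughly 124 s. With GC logging enabled (as in the JVM options at the end of this post), the pause should also show up in the GC log. A quick way to confirm, assuming the gc-hbase.log path used below (an assumption about your layout):

# List the longest stop-the-world times in the GC log;
# 'real=' is the wall-clock pause printed by -XX:+PrintGCDetails.
grep -oE 'real=[0-9]+\.[0-9]+' $HBASE_HOME/hbase-0.98.1-cdh5.1.0/logs/gc-hbase.log \
    | sort -t= -k2 -n | tail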
Summary of the cause:
The GC pause was far too long (either a GC problem itself, or CPU contention that kept the GC threads from getting scheduled).
As a result, ZooKeeper considered the RegionServer already dead.
ZooKeeper reported the dead RegionServer to the master, and the master reassigned that server's regions to other RegionServers.
Those RegionServers replayed the WAL to recover the regions, and deleted the WAL files once replay was done.
When the "dead" RegionServer finished its GC and resumed, it could no longer find its WAL and produced the errors above.
It then learned from ZooKeeper that it had been declared dead, and shut itself down.
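Whether a pause of this length is fatal depends on the ZooKeeper session timeout the RegionServer runs with. A quick check, assuming a typical /etc/hbase/conf location (the path is an assumption about your deployment; the effective value is also capped by ZooKeeper's own maxSessionTimeout):

# The pause above (~124-127 s) exceeded the session timeout, so the master expired the server.
# Check the configured value in milliseconds; no output usually means the default applies.
grep -A1 'zookeeper.session.timeout' /etc/hbase/conf/hbase-site.xml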
My fix:
Set SurvivorRatio=2 to enlarge the survivor spaces, which slows the growth of the old generation and reduces how often CMS is triggered.
Set -XX:CMSInitiatingOccupancyFraction=60 so that CMS starts collecting the old generation earlier, keeping each CMS cycle short.
This prevents the old generation from filling up while a collection is still in progress and falling back to a full GC (i.e. it avoids "promotion failed" and "concurrent mode failure").
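After restarting with the new settings it is worth confirming the running RegionServer actually picked them up. A minimal sketch using the stock JDK tools (the jps/jinfo lookup and the process-matching pattern are my own assumptions, not part of the original fix):

# Find the RegionServer JVM and print the two tuned flags.
RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
jinfo -flag CMSInitiatingOccupancyFraction "$RS_PID"
jinfo -flag SurvivorRatio "$RS_PID"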
The final HBase RegionServer JVM options:
export HBASE_REGIONSERVER_OPTS="-Xmx33g -Xms33g -Xmn2g -XX:SurvivorRatio=1 \
  -XX:PermSize=128M -XX:MaxPermSize=128M \
  -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv6Addresses=false \
  -XX:MaxTenuringThreshold=15 -XX:+CMSParallelRemarkEnabled -XX:+UseFastAccessorMethods \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 \
  -XX:+HeapDumpOnOutOfMemoryError \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution \
  -Xloggc:$HBASE_HOME/hbase-0.98.1-cdh5.1.0/logs/gc-hbase.log"
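To check that the tuning actually avoids the failure mode described above, watch the GC log (the path comes from the -Xloggc option above) for the two fallback events the flags are meant to prevent; no output means CMS has not fallen back to a full GC:

# Both strings are printed by the JVM when CMS falls back to a stop-the-world full GC.
grep -E 'promotion failed|concurrent mode failure' \
    "$HBASE_HOME/hbase-0.98.1-cdh5.1.0/logs/gc-hbase.log"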