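2014-05-12 was destined to be an eventful day. Our hadoop server, with a fault-free record of 595 (presumably days), finally ran into trouble. Nobody had logged in to or operated on the server beforehand, so the failure happened on its own, and the cause is still unknown.

Details: the namenode fails to start. Here is the error output first: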

2014-05-12 07:17:39,447 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = DC.aws/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.205.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 -r 1179940; compiled by 'hortonfo' on Fri Oct  7 06:20:32 UTC 2011
************************************************************/
2014-05-12 07:17:39,600 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-05-12 07:17:39,613 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2014-05-12 07:17:39,614 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-05-12 07:17:39,614 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
2014-05-12 07:17:39,764 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2014-05-12 07:17:39,773 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source jvm registered.
2014-05-12 07:17:39,774 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source NameNode registered.
2014-05-12 07:17:39,800 INFO org.apache.hadoop.hdfs.util.GSet: VM type       = 64-bit
2014-05-12 07:17:39,800 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 17.77875 MB
2014-05-12 07:17:39,800 INFO org.apache.hadoop.hdfs.util.GSet: capacity      = 2^21 = 2097152 entries
2014-05-12 07:17:39,800 INFO org.apache.hadoop.hdfs.util.GSet: recommended=2097152, actual=2097152
2014-05-12 07:17:39,823 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=root
2014-05-12 07:17:39,823 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2014-05-12 07:17:39,823 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2014-05-12 07:17:39,829 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100
2014-05-12 07:17:39,829 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
2014-05-12 07:17:40,045 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean
2014-05-12 07:17:40,065 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
2014-05-12 07:17:40,078 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 3349287
2014-05-12 07:18:01,677 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:817)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:362)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:384)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:358)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:497)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1268)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1277)
2014-05-12 07:18:01,678 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:817)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:362)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:384)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:358)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:497)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1268)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1277)
2014-05-12 07:18:01,679 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at DC.aws/127.0.0.1
************************************************************/

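I searched for a long time and found no solution. We run a pseudo-distributed environment, so what can be done here?

   Formatting is the nuclear option, and one I simply cannot afford...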

Additional notes:

   The fsimage file on my namenode is 445M.

   The fsimage file on my secondarynamenode is 281M.

   The two clearly differ. I now have a lead and am working on rescuing the server.

……………………………………………………………………………………………………

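After more than two hours of struggle, it is finally fixed... Most of that time went into communicating with the developer who had since left the company.

I checked the sizes of the current and image directories under the SNN and the NN and found that they differed, which already shows that data has been lost. In this situation, the only option is the approach below, which limits the data loss and gets the application back to normal as quickly as possible.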
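The size comparison described above can be sketched in Python. This is only an illustration: the function names are made up for this example, and the two directory arguments stand for whatever dfs.name.dir and fs.checkpoint.dir point to on your machine.

```python
import os


def dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def compare_metadata_dirs(nn_dir, snn_dir, subdirs=("current", "image")):
    """Return {subdir: (nn_bytes, snn_bytes)} for each metadata subdirectory.

    nn_dir and snn_dir are assumptions: the NN and SNN storage directories.
    A missing subdirectory simply counts as 0 bytes.
    """
    report = {}
    for sub in subdirs:
        report[sub] = (
            dir_size(os.path.join(nn_dir, sub)),
            dir_size(os.path.join(snn_dir, sub)),
        )
    return report
```

Calling compare_metadata_dirs with the NN and SNN directories and printing the result gives the byte counts side by side. A size difference alone does not say which copy is intact; here the failed startup had already shown that the NN copy could not be loaded.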

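Core of the solution:

   Overwrite the NN's current and image directories with the SNN's current and image directories. Of course, backing up before overwriting is something ops must always do, and you must talk it through with the developers and the boss and confirm the risks before going ahead.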
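The back-up-then-overwrite step can be sketched as follows. This is a minimal sketch, not the exact commands I ran: the function name, the backup naming scheme, and all paths are assumptions, and it should only be run with the NameNode stopped.

```python
import os
import shutil
import time


def restore_from_checkpoint(nn_dir, snn_dir, backup_root,
                            subdirs=("current", "image")):
    """Back up the NN metadata directories, then replace them with the SNN copies.

    nn_dir, snn_dir and backup_root are hypothetical paths supplied by the caller.
    Returns the directory that holds the backup of the old NN metadata.
    """
    # Refuse to touch anything unless the SNN copy is complete.
    for sub in subdirs:
        if not os.path.isdir(os.path.join(snn_dir, sub)):
            raise FileNotFoundError("missing SNN directory: " + sub)

    # Step 1: back up whatever the NN currently has.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_dir = os.path.join(backup_root, "nn-backup-" + stamp)
    os.makedirs(backup_dir)
    for sub in subdirs:
        src = os.path.join(nn_dir, sub)
        if os.path.isdir(src):
            shutil.copytree(src, os.path.join(backup_dir, sub))

    # Step 2: replace the NN directories with the SNN copies.
    for sub in subdirs:
        target = os.path.join(nn_dir, sub)
        if os.path.isdir(target):
            shutil.rmtree(target)
        shutil.copytree(os.path.join(snn_dir, sub), target)
    return backup_dir
```

The SNN directories are checked before anything is modified, so a wrong path fails early and leaves the NN data untouched. The backup is kept in case the damaged files need to be examined later.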


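Drawback: the data cannot be 100% recovered. Anything written after the SNN's last checkpoint is lost, so some data loss is inevitable.

Improvement: move to a truly distributed architecture to avoid the single point of storage. Alternatively, change the architecture: work with the developers to replace hdfs with an ext4 file system and develop new supporting code.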


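   Finally, thanks to 广州-no-python (QQ withheld for now, as I do not have permission to share it) and 北京-乾坤-运维 (QQ withheld for now, as I do not have permission to share it) for their generous help. These two experts spent their own valuable time giving me invaluable pointers during the troubleshooting, and I admire them greatly! I will follow their example and help other ops colleagues who run into trouble!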





This article comes from the “技术成就未来” blog; please be sure to keep this source link: http://jishuweiwang.blog.51cto.com/6977090/1409901