The FSImage and edits log files hold the namenode's metadata: they persist the HDFS namespace, i.e. the relationships between files, directories, and data blocks. FSImage is a file on disk recording the cluster's directory tree and, for each file, its list of blocks (block-to-datanode locations are, for the most part, not persisted; they are rebuilt from datanode block reports at startup). Presumably for performance reasons, FSImage is not updated in real time to reflect the current state of the namespace; instead, every file and directory operation is appended as a record to the edits log. To keep restart downtime short, a separate Secondary NameNode (not a failover standby, despite the name) periodically merges the edits log into FSImage, transferring the image and edits files over HTTP and coordinating the log roll-over with the namenode over IPC. On the next restart, the namenode can then rebuild the in-memory namespace from FSImage plus a small edits log in relatively little time. This feels somewhat like incremental checkpointing in Oracle.
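For reference, on a Hadoop 1.x namenode these files live under the directory configured by dfs.name.dir. A sketch of the typical layout (names from 1.x; details vary across versions):

    ${dfs.name.dir}/current/
        VERSION   -- layout version, namespaceID, etc.
        fsimage   -- the most recent namespace checkpoint
        edits     -- operations logged since that checkpoint
        fstime    -- timestamp of the last checkpoint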
Given that FSImage stores every file and directory name in the cluster along with each file's block list, what does its on-disk structure actually look like? Reading org.apache.hadoop.hdfs.server.namenode.FSImage gives a good picture.
From Hadoop 1.2.1:
boolean loadFSImage(File curFile) throws IOException {
  FSNamesystem fsNamesys = FSNamesystem.getFSNamesystem();
  FSDirectory fsDir = fsNamesys.dir;

  //
  // Load in bits
  //
  boolean needToSave = true;
  DataInputStream in = new DataInputStream(new BufferedInputStream(
                            new FileInputStream(curFile)));
  try {
    // read image version: first appeared in version -1
    // (the image layout version)
    int imgVersion = in.readInt();
    // read namespaceID: first appeared in version -2
    this.namespaceID = in.readInt();

    // read number of files
    // (number of files and directories; field width depends on the layout version)
    long numFiles;
    if (imgVersion <= -16) {
      numFiles = in.readLong();
    } else {
      numFiles = in.readInt();
    }

    this.layoutVersion = imgVersion;
    // read in the last generation stamp.
    if (imgVersion <= -12) {
      long genstamp = in.readLong();
      fsNamesys.setGenerationStamp(genstamp);
    }

    needToSave = (imgVersion != FSConstants.LAYOUT_VERSION);

    // read file info
    short replication = FSNamesystem.getFSNamesystem().getDefaultReplication();

    LOG.info("Number of files = " + numFiles);

    String path;
    String parentPath = "";
    INodeDirectory parentINode = fsDir.rootDir;
    // rebuild the directory tree, one inode record at a time
    for (long i = 0; i < numFiles; i++) {
      long modificationTime = 0;
      long atime = 0;
      long blockSize = 0;
      path = readString(in);              // path of the file or directory
      replication = in.readShort();       // replication factor, default 3, configurable (0 for a directory)
      replication = FSEditLog.adjustReplication(replication);
      modificationTime = in.readLong();   // mtime
      if (imgVersion <= -17) {
        atime = in.readLong();            // atime
      }
      if (imgVersion <= -8) {
        blockSize = in.readLong();        // block size (0 for a directory)
      }
      int numBlocks = in.readInt();       // number of blocks in the file
                                          // (-1 for a directory in current layouts, 0 in older ones)
      Block blocks[] = null;

      // for older versions, a blocklist of size 0
      // indicates a directory.
      if ((-9 <= imgVersion && numBlocks > 0) ||
          (imgVersion < -9 && numBlocks >= 0)) {
        blocks = new Block[numBlocks];
        for (int j = 0; j < numBlocks; j++) {
          blocks[j] = new Block();
          if (-14 < imgVersion) {
            blocks[j].set(in.readLong(), in.readLong(),
                          Block.GRANDFATHER_GENERATION_STAMP);
          } else {
            blocks[j].readFields(in);
          }
        }
      }
      // Older versions of HDFS do not store the block size in inode.
      // If the file has more than one block, use the size of the
      // first block as the blocksize. Otherwise use the default block size.
      //
      if (-8 <= imgVersion && blockSize == 0) {
        if (numBlocks > 1) {
          blockSize = blocks[0].getNumBytes();
        } else {
          long first = ((numBlocks == 1) ? blocks[0].getNumBytes(): 0);
          blockSize = Math.max(fsNamesys.getDefaultBlockSize(), first);
        }
      }

      // get quota only when the node is a directory
      long nsQuota = -1L;
      if (imgVersion <= -16 && blocks == null) {
        nsQuota = in.readLong();
      }
      long dsQuota = -1L;
      if (imgVersion <= -18 && blocks == null) {
        dsQuota = in.readLong();
      }

      PermissionStatus permissions = fsNamesys.getUpgradePermission();
      if (imgVersion <= -11) {
        permissions = PermissionStatus.read(in);
      }
      if (path.length() == 0) { // it is the root
        // update the root's attributes
        if (nsQuota != -1 || dsQuota != -1) {
          fsDir.rootDir.setQuota(nsQuota, dsQuota);
        }
        fsDir.rootDir.setModificationTime(modificationTime);
        fsDir.rootDir.setPermissionStatus(permissions);
        continue;
      }
      // check if the new inode belongs to the same parent
      if(!isParent(path, parentPath)) {
        parentINode = null;
        parentPath = getParent(path);
      }
      // add new inode
      parentINode = fsDir.addToParent(path, parentINode, permissions,
                                      blocks, replication, modificationTime,
                                      atime, nsQuota, dsQuota, blockSize);
    }

    // load datanode info
    this.loadDatanodes(imgVersion, in);

    // load Files Under Construction
    this.loadFilesUnderConstruction(imgVersion, in, fsNamesys);

    this.loadSecretManagerState(imgVersion, in, fsNamesys);

  } finally {
    in.close();
  }

  return needToSave;
}

// Block.set: a block is serialized as three longs (id, length, generation stamp)
public void set(long blkid, long len, long genStamp) {
  this.blockId = blkid;
  this.numBytes = len;
  this.generationStamp = genStamp;
}
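Note that the loop above can rebuild the directory tree in a single pass only because inodes appear to be written parent-before-child: the root (stored as an empty path) comes first, and every directory precedes its contents, which is what makes the isParent/addToParent bookkeeping work. A sketch of the record order for a small, hypothetical namespace:

    ""                  (the root, stored as an empty path)
    /user
    /user/a.txt
    /user/logs
    /user/logs/part-0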
The on-disk layout inferred from the code above:

imgVersion(int): layout version of this image
namespaceID(int): namespace ID, assigned when the namenode is formatted and shared by all nodes in the cluster
numFiles(long): total number of files and directories in the filesystem
genStamp(long): generation stamp of the image
then, per inode:
path(String): path of the file or directory
replication(short): replication factor (0 for a directory)
mtime(long): modification time
atime(long): access time
blockSize(long): block size (always 0 for a directory)
numBlocks(int): number of blocks in the file (-1 for a directory)
if (numBlocks > 0) {
    blockid(long): ID of one block belonging to the file
    numBytes(long): size of that block
    genStamp(long): generation stamp of that block
}
nsQuota(long): namespace quota, or -1 if no quota is set (directories only)
dsQuota(long): disk space quota, or -1 if none (directories only)
... (other fields)
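As an illustration, a hypothetical record for a file /user/a.txt consisting of a single block, assuming the current 1.2.1 layout:

    path        = "/user/a.txt"   (String)
    replication = 3               (short)
    mtime       = ...             (long)
    atime       = ...             (long)
    blockSize   = 67108864        (long, the file's 64 MB block size)
    numBlocks   = 1               (int)
      blockid   = ...             (long)
      numBytes  = ...             (long)
      genStamp  = ...             (long)
    permissions                   (PermissionStatus)

A directory record instead carries replication = 0, blockSize = 0 and numBlocks = -1, followed by nsQuota and dsQuota rather than a block list.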
Remark: the actual code contains a number of branches to handle older versions of the FSImage format.
I have not found an official description of the FSImage file structure, so the above is inferred from the source code. Once the structure of FSImage and the edits log is understood, it should be possible to analyze HDFS offline, without going through the Hadoop client or API: for example, reading FSImage directly to obtain the full list of files in HDFS, and perhaps even locating the blocks that make up a file on the relevant datanodes.
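As a small proof of concept, here is a minimal sketch (my own code, not part of Hadoop) that dumps just the image header offline, following the same version branches as loadFSImage above:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Reads the fsimage header with no Hadoop classes involved.
// Assumes the Hadoop 1.x layout reconstructed above.
public class FsImageHeaderDump {
    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(
                new FileInputStream(args[0])));
        try {
            int imgVersion = in.readInt();        // layout version (a negative int)
            int namespaceID = in.readInt();       // cluster namespace ID
            long numFiles = (imgVersion <= -16)   // long in newer layouts, int before
                    ? in.readLong() : in.readInt();
            // the generation stamp only appears in layout version -12 and newer
            long genStamp = (imgVersion <= -12) ? in.readLong() : -1;

            System.out.println("imgVersion  = " + imgVersion);
            System.out.println("namespaceID = " + namespaceID);
            System.out.println("numFiles    = " + numFiles);
            System.out.println("genStamp    = " + genStamp);
        } finally {
            in.close();
        }
    }
}

Run it against a copy of ${dfs.name.dir}/current/fsimage taken from a stopped namenode, e.g. java FsImageHeaderDump fsimage.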