Formatting Hadoop Multiple Times: Hadoop fsimage

Reposted

架构思维大师 2023-07-14 14:29:49

Overview

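The Offline Image Viewer is a tool that dumps the contents of HDFS fsimage files to a human-readable format and provides a read-only WebHDFS API, allowing offline analysis and examination of a Hadoop cluster's namespace. The tool can process very large fsimage files relatively quickly. It handles the layout formats included with Hadoop 2.4 and later. To process older layout formats, use the Offline Image Viewer of Hadoop 2.3 or the oiv_legacy command. If the tool cannot process an fsimage file, it exits cleanly. The Offline Image Viewer does not require a running Hadoop cluster; it operates entirely offline.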

The Offline Image Viewer provides several output processors:

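Web is the default output processor. It launches an HTTP server that exposes a read-only WebHDFS API. Users can investigate the namespace interactively through the HTTP REST API.
XML creates an XML document of the fsimage that includes all the information in the fsimage. The output of this processor is suitable for automated processing and analysis with XML tools. Because XML syntax is verbose, this processor also generates the largest amount of output.
FileDistribution is a tool for analyzing file sizes in the namespace image. To run it, define an integer range [0, maxSize] by specifying maxSize and a step. The range is divided into segments of size step, [0, s[1], ..., s[n-1], maxSize], and the processor counts how many files in the system fall into each segment [s[i-1], s[i]]. Files larger than maxSize always fall into the last segment. By default, the output file is a tab-separated two-column table with the columns Size and NumFiles, where Size identifies the segment by a boundary value in bytes (its upper bound in the sample output later in this article) and NumFiles is the number of files in the image whose size falls in that segment. With the -format option, the output is formatted in a human-readable fashion instead of showing raw byte counts in the Size column, and the Size column becomes a Size Range column.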
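
The segment counting described for FileDistribution can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: it assumes the segment index is the file size divided by step, rounded up, which is the rule that reproduces the sample output shown later in this article. The function name `bucket_sizes` and the file sizes are hypothetical.

```python
from collections import Counter


def bucket_sizes(file_sizes, max_size, step):
    """Count files per segment, keyed by the segment's upper bound in bytes.

    Assumption: a file of `size` bytes lands in segment ceil(size / step);
    anything larger than max_size lands in the last segment.
    """
    last_bucket = (max_size + step - 1) // step  # index of the final segment
    counts = Counter()
    for size in file_sizes:
        if size > max_size:
            bucket = last_bucket
        else:
            bucket = (size + step - 1) // step  # integer ceiling division
        counts[bucket * step] += 1
    return dict(sorted(counts.items()))


# Hypothetical file sizes in bytes; 21 exceeds max_size and joins the last segment.
sizes = [3, 10, 14, 21]
hist = bucket_sizes(sizes, max_size=20, step=4)
for upper_bound, num_files in hist.items():
    print(f"{upper_bound}\t{num_files}")
```

With these inputs the sketch prints the same four Size/NumFiles rows as the FileDistribution sample output in the Usage section.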

Usage

Web Processor

The Web processor launches an HTTP server that exposes a read-only WebHDFS API. Users can specify the address to listen on (localhost:5978 by default).

bash$ bin/hdfs oiv -i fsimage
   14/04/07 13:25:14 INFO offlineImageViewer.WebImageViewer: WebImageViewer
   started. Listening on /127.0.0.1:5978. Press Ctrl+C to stop the viewer.

Users can access the viewer and get information about the fsimage with the following shell command:

bash$ bin/hdfs dfs -ls webhdfs://127.0.0.1:5978/
   Found 2 items
   drwxrwx--* - root supergroup          0 2014-03-26 20:16 webhdfs://127.0.0.1:5978/tmp
   drwxr-xr-x   - root supergroup          0 2014-03-31 14:08 webhdfs://127.0.0.1:5978/user

To get information about all files and directories, use the following command:

bash$ bin/hdfs dfs -ls -R webhdfs://127.0.0.1:5978/

Users can also get JSON-formatted FileStatuses through the HTTP REST API.

bash$ curl -i "http://127.0.0.1:5978/webhdfs/v1/?op=liststatus"
   HTTP/1.1 200 OK
   Content-Type: application/json
   Content-Length: 252

   {"FileStatuses":{"FileStatus":[
   {"fileId":16386,"accessTime":0,"replication":0,"owner":"theuser","length":0,"permission":"755","blockSize":0,"modificationTime":1392772497282,"type":"DIRECTORY","group":"supergroup","childrenNum":1,"pathSuffix":"user"}
   ]}}
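
A response like the one above is straightforward to consume from a script. The following is a minimal sketch that parses the JSON body shown in the sample; the function name `list_entries` is hypothetical, and the body is embedded as a string so the example runs without a viewer. Against a running viewer, the same body could presumably be fetched with `urllib.request.urlopen` from the URL used in the curl command.

```python
import json

# JSON body copied from the sample liststatus response above.
SAMPLE_BODY = """{"FileStatuses":{"FileStatus":[
{"fileId":16386,"accessTime":0,"replication":0,"owner":"theuser","length":0,"permission":"755","blockSize":0,"modificationTime":1392772497282,"type":"DIRECTORY","group":"supergroup","childrenNum":1,"pathSuffix":"user"}
]}}"""


def list_entries(body):
    """Return (type, owner, permission, name) for each entry in a liststatus body."""
    statuses = json.loads(body)["FileStatuses"]["FileStatus"]
    return [
        (s["type"], s["owner"], s["permission"], s["pathSuffix"])
        for s in statuses
    ]


entries = list_entries(SAMPLE_BODY)
for entry_type, owner, permission, name in entries:
    print(entry_type, owner, permission, name)
```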

The Web processor currently supports the following operations:

LISTSTATUS
GETFILESTATUS
GETACLSTATUS
GETXATTRS
LISTXATTRS
CONTENTSUMMARY (see "Get Content Summary of a Directory" in the WebHDFS documentation)

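XML Processor

The XML processor dumps all the contents of the fsimage. Users can specify the input and output files with the -i and -o command-line options.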

bash$ bin/hdfs oiv -p XML -i fsimage -o fsimage.xml

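This creates a file named fsimage.xml that contains all the information in the fsimage. For very large fsimage files, this process may take several minutes.

Running the Offline Image Viewer with the XML processor produces output like the following: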

<?xml version="1.0"?>
   <fsimage>
   <NameSection>
     <genstampV1>1000</genstampV1>
     <genstampV2>1002</genstampV2>
     <genstampV1Limit>0</genstampV1Limit>
     <lastAllocatedBlockId>1073741826</lastAllocatedBlockId>
     <txid>37</txid>
   </NameSection>
   <INodeSection>
     <lastInodeId>16400</lastInodeId>
     <inode>
       <id>16385</id>
       <type>DIRECTORY</type>
       <name></name>
       <mtime>1392772497282</mtime>
       <permission>theuser:supergroup:rwxr-xr-x</permission>
       <nsquota>9223372036854775807</nsquota>
       <dsquota>-1</dsquota>
     </inode>
   ...remaining output omitted...
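
Because the XML output can be very large, a streaming parser is a reasonable way to analyze it. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library to count inodes by type and owner. It relies only on the element names visible in the sample above; the embedded document is a shortened, hypothetical stand-in (closing tags added, and a second inode invented for illustration), and `count_inodes` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened stand-in for fsimage.xml; the second inode is invented for illustration.
SAMPLE_XML = b"""<?xml version="1.0"?>
<fsimage>
<INodeSection>
  <lastInodeId>16400</lastInodeId>
  <inode>
    <id>16385</id><type>DIRECTORY</type><name></name>
    <mtime>1392772497282</mtime>
    <permission>theuser:supergroup:rwxr-xr-x</permission>
  </inode>
  <inode>
    <id>16386</id><type>DIRECTORY</type><name>user</name>
    <mtime>1392772497282</mtime>
    <permission>theuser:supergroup:rwxr-xr-x</permission>
  </inode>
</INodeSection>
</fsimage>
"""


def count_inodes(source):
    """Stream over <inode> elements and count them by (type, owner)."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "inode":
            # The permission field has the form owner:group:mode.
            owner = elem.findtext("permission", default="").split(":")[0]
            counts[(elem.findtext("type"), owner)] += 1
            elem.clear()  # drop the inode's children once it has been counted
    return dict(counts)


result = count_inodes(io.BytesIO(SAMPLE_XML))
print(result)
```

For a real dump, the `io.BytesIO` object would be replaced by the path of the file written with -o, for example `count_inodes("fsimage.xml")`.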

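ReverseXML Processor

The ReverseXML processor is the opposite of the XML processor. Users can specify the input XML file and the output fsimage file with the -i and -o command-line options.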

bash$ bin/hdfs oiv -p ReverseXML -i fsimage.xml -o fsimage

This reconstructs an fsimage from an XML file.

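FileDistribution Processor

The FileDistribution processor analyzes file sizes in the namespace image. Users can specify maxSize (128 GB by default) and step (2 MB by default) with the -maxSize and -step command-line options.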

bash$ bin/hdfs oiv -p FileDistribution -maxSize maxSize -step size -i fsimage -o output

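The processor counts how many files in the system fall into each segment. The output file is formatted as a tab-separated two-column table, as in the following output: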

Size	NumFiles
   4	1
   12	1
   16	1
   20	1
   totalFiles = 4
   totalDirectories = 2
   totalBlocks = 4
   totalSpace = 48
   maxFileSize = 21
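
Output in this shape is easy to load for further analysis. The following sketch splits it into a histogram and a dictionary of the summary counters; the sample text is the output shown above, and the function name `parse_file_distribution` is hypothetical.

```python
# Sample text copied from the FileDistribution output above.
SAMPLE_OUTPUT = """Size\tNumFiles
4\t1
12\t1
16\t1
20\t1
totalFiles = 4
totalDirectories = 2
totalBlocks = 4
totalSpace = 48
maxFileSize = 21
"""


def parse_file_distribution(text):
    """Split the output into a {size: count} histogram and a summary dict."""
    histogram, summary = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("Size"):
            continue  # skip blank lines and the header row
        if "=" in line:
            key, value = line.split("=", 1)
            summary[key.strip()] = int(value)
        else:
            size, count = line.split("\t")
            histogram[int(size)] = int(count)
    return histogram, summary


histogram, summary = parse_file_distribution(SAMPLE_OUTPUT)
# Sanity check: the histogram rows should add up to the reported file count.
print(sum(histogram.values()) == summary["totalFiles"])
```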

To make the output more readable, users can also specify the -format option.

bash$ bin/hdfs oiv -p FileDistribution -maxSize maxSize -step size -format -i fsimage -o output

This produces the following output:

Size Range	NumFiles
   (0 B, 4 B]	1
   (8 B, 12 B]	1
   (12 B, 16 B]	1
   (16 B, 21 B]	1
   totalFiles = 4
   totalDirectories = 2
   totalBlocks = 4
   totalSpace = 48
   maxFileSize = 21

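Delimited Processor

The Delimited processor generates a text representation of the fsimage, with the fields of each entry separated by a delimiter string (\t by default). Users can specify a different delimiter string with the -delimiter option.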

bash$ bin/hdfs oiv -p Delimited -delimiter delimiterString -i fsimage -o output

In addition, users can specify a temporary directory for caching intermediate results with the following command:

bash$ bin/hdfs oiv -p Delimited -delimiter delimiterString -t temporaryDir -i fsimage -o output

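If the temporary directory is not set, the Delimited processor constructs the namespace in memory before writing the text output. The output of this processor should look similar to the following: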

Path	Replication	ModificationTime	AccessTime	PreferredBlockSize	BlocksCount	FileSize	NSQUOTA	DSQUOTA	Permission	UserName	GroupName
   /	0	2017-02-13 10:39	1970-01-01 08:00	0	0	0	9223372036854775807	-1	drwxr-xr-x	root	supergroup
   /dir0	0	2017-02-13 10:39	1970-01-01 08:00	0	0	0	-1	-1	drwxr-xr-x	root	supergroup
   /dir0/file0	1	2017-02-13 10:39	2017-02-13 10:39	134217728	1	1	0	0	-rw-r--r--	root	supergroup
   /dir0/file1	1	2017-02-13 10:39	2017-02-13 10:39	134217728	1	1	0	0	-rw-r--r--	root	supergroup
   /dir0/file2	1	2017-02-13 10:39	2017-02-13 10:39	134217728	1	1	0	0	-rw-r--r--	root	supergroup
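
The Delimited output maps directly onto a tab-separated reader. The sketch below lists files smaller than a threshold, a common small-files check. The rows are the ones shown above, rebuilt in code so the example runs without an fsimage; the function name `small_files` and the 1 MB threshold are assumptions made for illustration.

```python
import csv
import io

HEADER = ["Path", "Replication", "ModificationTime", "AccessTime",
          "PreferredBlockSize", "BlocksCount", "FileSize", "NSQUOTA",
          "DSQUOTA", "Permission", "UserName", "GroupName"]
DIR_TIMES = ["2017-02-13 10:39", "1970-01-01 08:00"]
FILE_TIMES = ["2017-02-13 10:39", "2017-02-13 10:39"]
ROWS = [
    ["/", "0", *DIR_TIMES, "0", "0", "0", "9223372036854775807", "-1",
     "drwxr-xr-x", "root", "supergroup"],
    ["/dir0", "0", *DIR_TIMES, "0", "0", "0", "-1", "-1",
     "drwxr-xr-x", "root", "supergroup"],
] + [
    [f"/dir0/file{i}", "1", *FILE_TIMES, "134217728", "1", "1", "0", "0",
     "-rw-r--r--", "root", "supergroup"]
    for i in range(3)
]
SAMPLE = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def small_files(lines, threshold_bytes):
    """Return paths of regular files whose FileSize is below the threshold."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        row["Path"]
        for row in reader
        # Directories carry a leading "d" in the Permission column.
        if not row["Permission"].startswith("d")
        and int(row["FileSize"]) < threshold_bytes
    ]


paths = small_files(io.StringIO(SAMPLE), threshold_bytes=1024 * 1024)
print(paths)
```

For a real run, `io.StringIO(SAMPLE)` would be replaced by an open handle on the file written with -o.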

Options

Flag	Description
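`-i`, `--inputFile` input file	Input fsimage file to process (an XML file when the ReverseXML processor is used).
`-o`, `--outputFile` output file	Output file name, for processors that write one.
`-p`, `--processor` processor	Processor to apply to the fsimage: Web (default), XML, ReverseXML, FileDistribution, or Delimited.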
`-addr` address	Specify the address (host:port) to listen on (localhost:5978 by default). This option is used with the Web processor.
`-maxSize` size	Specify the range [0, maxSize] of file sizes to be analyzed in bytes (128GB by default). This option is used with FileDistribution processor.
`-step` size	Specify the granularity of the distribution in bytes (2MB by default). This option is used with FileDistribution processor.
`-format`	Format the output result in a human-readable fashion rather than a number of bytes. (false by default). This option is used with FileDistribution processor.
`-delimiter` arg	Delimiting string to use with Delimited processor.
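`-t`, `--temp` temporary dir	Temporary directory for caching intermediate results of the Delimited processor. If not set, the Delimited processor constructs the namespace in memory before writing the text output.
`-h`, `--help`	Display usage information and exit.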

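Analyzing Results

The Offline Image Viewer makes it easy to gather large amounts of data about the HDFS namespace. This information can then be used to explore file system usage patterns, to find specific files that match arbitrary criteria, and for other kinds of namespace analysis.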

oiv_legacy Command

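Because of the internal layout changes introduced by the ProtocolBuffer-based fsimage (HDFS-5698), the Offline Image Viewer consumes a large amount of memory and no longer offers some functions, such as the Indented processor. To process an fsimage without a large amount of memory, or to use those processors, use the oiv_legacy command (the same as oiv in Hadoop 2.3).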

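Usage

Set dfs.namenode.legacy-oiv-image.dir to an appropriate directory so that the Standby NameNode or SecondaryNameNode saves its namespace in the old fsimage format during checkpointing.
Run the oiv_legacy command against the old-format fsimage.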

bash$ bin/hdfs oiv_legacy -i fsimage_old -o output

Options

Flag	Description
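`-i`, `--inputFile` input file	Input fsimage file (old format) to process.
`-o`, `--outputFile` output file	Output file name.
`-p`, `--processor` processor	Processor to apply to the fsimage.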
`-maxSize` size	Specify the range [0, maxSize] of file sizes to be analyzed in bytes (128GB by default). This option is used with FileDistribution processor.
`-step` size	Specify the granularity of the distribution in bytes (2MB by default). This option is used with FileDistribution processor.
`-format`	Format the output result in a human-readable fashion rather than a number of bytes. (false by default). This option is used with FileDistribution processor.
`-skipBlocks`	Do not enumerate individual blocks within files. This may save processing time and output file space on namespaces with very large files. The Ls processor reads the blocks to correctly determine file sizes and ignores this option.
`-printToScreen`	Pipe output of processor to console as well as specified file. On extremely large namespaces, this may increase processing time by an order of magnitude.
`-delimiter` arg	When used in conjunction with the Delimited processor, replaces the default tab delimiter with the string specified by arg.
`-h`, `--help`	Display usage information and exit.