A: scp: copying files between two remote hosts
Push: copy the user file on 111 to the home directory on 112.
A directory requires -r; a plain file does not.
[root@bigdata111 ~]# scp -r user root@bigdata112:/root/
itstar 100% 121 0.1KB/s 00:00
aa 100% 0 0.0KB/s 00:00
Pull: copy the /root/plus1 directory on 111 to the home directory of the local host, 112.
A directory requires -r; a plain file does not.
[root@bigdata112 ~]# scp -r root@bigdata111:/root/plus1 /root/
test 100% 0 0.0KB/s 00:00
123 100% 0 0.0KB/s 00:00
456 100% 5395 5.3KB/s 00:00
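Both transfers above follow the same `scp [-r] <source> <destination>` shape, where either side may be `user@host:path`. A minimal local sketch of that rule (`scp_cmd` is a hypothetical helper that only prints the command; the remote host is the example value from this section):

```shell
# Hypothetical helper: build the scp command line, adding -r only for
# directories (plain files need no flag). It prints instead of running.
scp_cmd() {
  local src=$1 dest=$2 flag=""
  [ -d "$src" ] && flag="-r"
  echo scp $flag "$src" "$dest"
}

mkdir -p /tmp/demo_dir && touch /tmp/demo_file
scp_cmd /tmp/demo_dir  root@bigdata112:/root/   # directory -> gets -r
scp_cmd /tmp/demo_file root@bigdata112:/root/   # file -> no flag
```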
B: Archiving
Hadoop archives (Hadoop is poorly suited to storing small files)
1) Overview
Every file is stored in blocks, and each block's metadata lives in the namenode's memory, so storing small files in Hadoop is very inefficient: a large number of small files will consume most of the namenode's memory. Note, however, that the disk space needed to store small files is no greater than that needed for their raw content. For example, a 1 MB file stored with a 128 MB block size uses 1 MB of disk space, not 128 MB.
A Hadoop archive, or HAR file, is a more efficient file-archiving tool: it packs files into HDFS blocks, reducing namenode memory usage while still allowing transparent access to the files. In particular, a Hadoop archive can be used as MapReduce input.
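The memory pressure can be sketched with rough arithmetic. A commonly cited rule of thumb (an assumption here, not an exact figure) is about 150 bytes of namenode heap per file or block object:

```shell
# Rule-of-thumb sketch (assumed: ~150 bytes of namenode heap per object).
# A million small files cost one file object + one block object each;
# packed into a single HAR, the object count becomes negligible.
OBJ_BYTES=150
NUM_FILES=1000000
echo "$(( NUM_FILES * 2 * OBJ_BYTES / 1024 / 1024 )) MB of namenode heap"
```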
2) Hands-on example
(1) Start the YARN daemons first
start-yarn.sh
(2) Archive the files
The archive is produced as a directory named xxx.har that contains the data files. Treat the xxx.har directory as a single unit: the directory itself is the archive.
Usage: hadoop archive -archiveName <archive name> -p <parent dir> [-r <replication factor>] <src> (one or more) <dest>
Example: hadoop archive -archiveName foo.har -p /plus -r 3 a b c /
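Filled in with the example values from the usage line (archive foo.har, parent /plus, replication 3, sources a b c, destination /), the template expands as below; the shell variables are only for illustration:

```shell
# Substituting the usage template with this section's example values.
NAME=foo.har PARENT=/plus REPL=3 SRCS="a b c" DEST=/
echo "hadoop archive -archiveName $NAME -p $PARENT -r $REPL $SRCS $DEST"
```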
1. First, the files must exist; the last entry below (test2.java) is the one we will archive:
[root@bigdata111 ~]# ll
total 527772
-rwx-wx---. 2 liqing root 5500 Sep 28 21:52 123
-rwx-wx---. 2 liqing root 5500 Sep 28 21:52 aa
drwxrwxrwx. 3 root root 15 Sep 17 12:30 aaaaa
drwxr-xr-x. 2 root root 6 Sep 19 22:07 aaaaaaaa
-rw-r--r--. 1 root root 194 Sep 17 19:42 aa.zip
-rw-------. 1 root root 1536 Jul 28 19:07 anaconda-ks.cfg
-rwxrwxrwx. 1 liqing liqing 27 Sep 28 21:52 bb
lrwxrwxrwx. 1 root root 2 Sep 18 18:55 blianjie -> bb
-rw-r--r--. 1 root root 28 Oct 4 21:14 cc
-rw-r--r--. 1 root root 189 Sep 17 14:40 dd1.gz
-rw-r--r--. 1 root root 1583 Sep 18 16:35 ddddddd
-rw-r--r--. 1 root root 189 Sep 17 14:36 dd.gz
-rw-r--r--. 1 root root 564 Sep 17 14:37 ff.gz
-rw-r--r--. 1 root root 4 Sep 17 23:19 gg
-rwxrwxrwx. 1 root root 16 Sep 28 23:32 hh
drwxr-xr-x. 4 root root 28 Sep 18 14:45 itstar
drwxr-xr-x. 3 root root 46 Aug 2 19:16 liqing
lrwxrwxrwx. 1 root root 2 Sep 17 12:32 mm -> aa
drwxr-xr-x. 2 root root 29 Sep 18 14:17 mod222
-rw-r--r--. 1 root root 108 Sep 17 14:33 mod.gz
-rw-r--r--. 1 root root 10 Sep 28 23:40 ooo
drwxr-xr-x. 2 root root 30 Oct 8 18:02 plus
drwxr-xr-x. 4 root root 29 Sep 18 14:35 plus1
-rw-r--r--. 1 root root 540330028 Aug 23 20:31 Python素材.rar
-rw-r--r--. 1 root root 15650 Oct 8 22:59 ss
-rwxrwxrwx. 1 root root 0 Aug 7 11:15 test1.java
drwxr-xr-x. 2 1001 1001 43 Oct 8 18:29 test2.java
2. Upload the file to the HDFS cluster:
[root@bigdata111 ~]# hdfs dfs -put test2.java /
3. Archive it:
[root@bigdata111 ~]# hadoop archive -archiveName foo1.har -p /test2.java /
4. Under the hood this runs a MapReduce job:
Log output:
19/10/08 22:57:15 INFO client.RMProxy: Connecting to ResourceManager at bigdata112/192.168.1.122:8032
19/10/08 22:57:17 INFO client.RMProxy: Connecting to ResourceManager at bigdata112/192.168.1.122:8032
19/10/08 22:57:17 INFO client.RMProxy: Connecting to ResourceManager at bigdata112/192.168.1.122:8032
19/10/08 22:57:18 INFO mapreduce.JobSubmitter: number of splits:1
19/10/08 22:57:18 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1570522163334_0004
19/10/08 22:57:19 INFO impl.YarnClientImpl: Submitted application application_1570522163334_0004
19/10/08 22:57:19 INFO mapreduce.Job: The url to track the job: http://bigdata112:8088/proxy/application_1570522163334_0004/
19/10/08 22:57:19 INFO mapreduce.Job: Running job: job_1570522163334_0004
19/10/08 22:57:36 INFO mapreduce.Job: Job job_1570522163334_0004 running in uber mode : false
19/10/08 22:57:36 INFO mapreduce.Job: map 0% reduce 0%
19/10/08 22:57:51 INFO mapreduce.Job: map 100% reduce 0%
19/10/08 22:58:05 INFO mapreduce.Job: map 100% reduce 100%
19/10/08 22:58:06 INFO mapreduce.Job: Job job_1570522163334_0004 completed successfully
19/10/08 22:58:06 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=292
FILE: Number of bytes written=319701
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=14643
HDFS: Number of bytes written=14489
HDFS: Number of read operations=19
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=13339
Total time spent by all reduces in occupied slots (ms)=10992
Total time spent by all map tasks (ms)=13339
Total time spent by all reduce tasks (ms)=10992
Total vcore-milliseconds taken by all map tasks=13339
Total vcore-milliseconds taken by all reduce tasks=10992
Total megabyte-milliseconds taken by all map tasks=13659136
Total megabyte-milliseconds taken by all reduce tasks=11255808
Map-Reduce Framework
Map input records=4
Map output records=4
Map output bytes=278
Map output materialized bytes=292
Input split bytes=116
Combine input records=0
Combine output records=0
Reduce input groups=4
Reduce shuffle bytes=292
Reduce input records=4
Reduce output records=0
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=405
CPU time spent (ms)=4800
Physical memory (bytes) snapshot=318402560
Virtual memory (bytes) snapshot=4166209536
Total committed heap usage (bytes)=182063104
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=323
File Output Format Counters
Bytes Written=0
5. The generated archive, foo1.har, appears in the web UI:
drwxr-xr-x root supergroup 0 B Oct 08 22:58 0 0 B foo1.har
6. Inside the foo1.har directory are the generated files:
A Hadoop archive is an archive in a special format. It maps to a filesystem directory, and an archive always carries the *.har extension.
The archive directory contains metadata (the _index and _masterindex files)
and the data files (part-*).
The _index file records each archived file's name and its location within the part files.
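Conceptually, _index gives a lookup from each archived name to a (part file, offset, length) triple. A toy illustration of that mapping, using a simplified, made-up line format rather than the real on-disk layout:

```shell
# Toy model of the name -> (part file, offset, length) lookup that _index
# provides; the input format here is hypothetical and simplified.
printf '%s\n' '/aa part-0 0 9' '/test1.java part-0 9 1907' |
awk '{ printf "%s -> %s, offset %s, %s bytes\n", $1, $2, $3, $4 }'
```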
-rw-r--r-- root supergroup 0 B Oct 08 22:58 3 128 MB _SUCCESS
-rw-r--r-- root supergroup 262 B Oct 08 22:58 3 128 MB _index
-rw-r--r-- root supergroup 23 B Oct 08 22:58 3 128 MB _masterindex
-rw-r--r-- root supergroup 13.87 KB Oct 08 22:57 3 512 MB part-0
C: Unarchiving (archiving saves namenode metadata memory, but the original data must be deleted by hand)
Method 1:
1. The receiving directory must already exist in HDFS.
Extract foo1.har into the HDFS /itstar directory (effectively a copy into it).
List the archive: hadoop fs -lsr har:///foo1.har
[root@bigdata111 ~]# hadoop fs -lsr har:///foo1.har
lsr: DEPRECATED: Please use 'ls -R' instead.
-rw-r--r-- 3 root supergroup 12288 2019-10-08 22:56 har:///foo1.har/.swp
-rw-r--r-- 3 root supergroup 9 2019-10-08 22:56 har:///foo1.har/aa
-rw-r--r-- 3 root supergroup 1907 2019-10-08 22:56 har:///foo1.har/test1.java
[root@bigdata111 ~]# hadoop fs -cp har:///foo1.har/* /itstar
Method 2:
1. Here the target directory must not exist yet (this runs a MapReduce job):
[root@bigdata111 ~]# hadoop distcp har:/foo1.har /123
2. Log output:
19/10/08 23:26:54 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[har:/foo1.har], targetPath=/123, targetPathExists=false, filtersFile='null'}
19/10/08 23:26:54 INFO client.RMProxy: Connecting to ResourceManager at bigdata112/192.168.1.122:8032
19/10/08 23:26:55 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 4; dirCnt = 1
19/10/08 23:26:55 INFO tools.SimpleCopyListing: Build file listing completed.
19/10/08 23:26:55 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
19/10/08 23:26:55 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
19/10/08 23:26:55 INFO tools.DistCp: Number of paths in the copy list: 4
19/10/08 23:26:55 INFO tools.DistCp: Number of paths in the copy list: 4
19/10/08 23:26:55 INFO client.RMProxy: Connecting to ResourceManager at bigdata112/192.168.1.122:8032
19/10/08 23:26:57 INFO mapreduce.JobSubmitter: number of splits:4
19/10/08 23:26:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1570522163334_0005
19/10/08 23:26:58 INFO impl.YarnClientImpl: Submitted application application_1570522163334_0005
19/10/08 23:26:58 INFO mapreduce.Job: The url to track the job: http://bigdata112:8088/proxy/application_1570522163334_0005/
19/10/08 23:26:58 INFO tools.DistCp: DistCp job-id: job_1570522163334_0005
19/10/08 23:26:58 INFO mapreduce.Job: Running job: job_1570522163334_0005
19/10/08 23:27:12 INFO mapreduce.Job: Job job_1570522163334_0005 running in uber mode : false
19/10/08 23:27:12 INFO mapreduce.Job: map 0% reduce 0%
19/10/08 23:27:24 INFO mapreduce.Job: map 25% reduce 0%
19/10/08 23:27:25 INFO mapreduce.Job: map 50% reduce 0%
19/10/08 23:27:31 INFO mapreduce.Job: map 100% reduce 0%
19/10/08 23:27:32 INFO mapreduce.Job: Job job_1570522163334_0005 completed successfully
19/10/08 23:27:32 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=643036
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=17110
HDFS: Number of bytes written=14204
HDFS: Number of read operations=117
HDFS: Number of large read operations=0
HDFS: Number of write operations=15
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=51372
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=51372
Total vcore-milliseconds taken by all map tasks=51372
Total megabyte-milliseconds taken by all map tasks=52604928
Map-Reduce Framework
Map input records=4
Map output records=0
Input split bytes=536
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=499
CPU time spent (ms)=3530
Physical memory (bytes) snapshot=418123776
Virtual memory (bytes) snapshot=8321691648
Total committed heap usage (bytes)=138149888
File Input Format Counters
Bytes Read=1230
File Output Format Counters
Bytes Written=0
DistCp Counters
Bytes Copied=14204
Bytes Expected=14204
Files Copied=4
3. Check in the web UI:
drwxr-xr-x root supergroup 0 B Oct 08 23:27 0 0 B 123
4. The data size is unchanged; archiving only reduces the namenode's metadata memory footprint:
-rw-r--r-- root supergroup 12 KB Oct 08 23:27 3 128 MB .swp
-rw-r--r-- root supergroup 9 B Oct 08 23:27 3 128 MB aa
-rw-r--r-- root supergroup 1.86 KB Oct 08 23:27 3 128 MB test1.java
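As a sanity check, the three extracted file sizes from the listing above sum exactly to the `Bytes Copied=14204` counter in the distcp log (and to the 13.87 KB part-0 payload of the archive):

```shell
# .swp (12288) + aa (9) + test1.java (1907), sizes taken from this section.
echo "$(( 12288 + 9 + 1907 )) bytes"
```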