A: scp: copying files between two remote hosts
Push: copy the user file on 111 to the home directory on 112.
A directory requires -r; a plain file does not.
[root@bigdata111 ~]# scp -r user root@bigdata112:/root/
itstar 100% 121 0.1KB/s 00:00
aa 100% 0 0.0KB/s 00:00
Pull: copy the /root/plus1 directory on 111 to the home directory of the local host, 112.
A directory requires -r; a plain file does not.
[root@bigdata112 ~]# scp -r root@bigdata111:/root/plus1 /root/
test 100% 0 0.0KB/s 00:00
123 100% 0 0.0KB/s 00:00
456 100% 5395 5.3KB/s 00:00
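Both transfers above follow the same `scp [-r] <source> <destination>` shape, where either side may be `user@host:path`. A minimal local sketch of that rule (`scp_cmd` is a hypothetical helper that only prints the command; the remote host is the example value from this section):

```shell
# Hypothetical helper: build the scp command line, adding -r only for
# directories (plain files need no flag). It prints instead of running.
scp_cmd() {
  local src=$1 dest=$2 flag=""
  [ -d "$src" ] && flag="-r"
  echo scp $flag "$src" "$dest"
}

mkdir -p /tmp/demo_dir && touch /tmp/demo_file
scp_cmd /tmp/demo_dir  root@bigdata112:/root/   # directory -> gets -r
scp_cmd /tmp/demo_file root@bigdata112:/root/   # file -> no flag
```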
B: Archiving
Hadoop archives (Hadoop is poorly suited to storing small files)
1) Overview
Every file is stored in blocks, and each block's metadata lives in the namenode's memory, so storing small files in Hadoop is very inefficient: a large number of small files will consume most of the namenode's memory. Note, however, that the disk space needed to store small files is no greater than that needed for their raw content. For example, a 1 MB file stored with a 128 MB block size uses 1 MB of disk space, not 128 MB.
A Hadoop archive, or HAR file, is a more efficient file-archiving tool: it packs files into HDFS blocks, reducing namenode memory usage while still allowing transparent access to the files. In particular, a Hadoop archive can be used as MapReduce input.
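The memory pressure can be sketched with rough arithmetic. A commonly cited rule of thumb (an assumption here, not an exact figure) is about 150 bytes of namenode heap per file or block object:

```shell
# Rule-of-thumb sketch (assumed: ~150 bytes of namenode heap per object).
# A million small files cost one file object + one block object each;
# packed into a single HAR, the object count becomes negligible.
OBJ_BYTES=150
NUM_FILES=1000000
echo "$(( NUM_FILES * 2 * OBJ_BYTES / 1024 / 1024 )) MB of namenode heap"
```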
2) Hands-on example
(1) Start the YARN daemons first
start-yarn.sh
(2) Archive the files
The archive is produced as a directory named xxx.har that contains the data files. Treat the xxx.har directory as a single unit: the directory itself is the archive.
Usage: hadoop archive -archiveName <archive name> -p <parent dir> [-r <replication factor>] <src> (one or more) <dest>
Example: hadoop archive -archiveName foo.har -p /plus -r 3 a b c /
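Filled in with the example values from the usage line (archive foo.har, parent /plus, replication 3, sources a b c, destination /), the template expands as below; the shell variables are only for illustration:

```shell
# Substituting the usage template with this section's example values.
NAME=foo.har PARENT=/plus REPL=3 SRCS="a b c" DEST=/
echo "hadoop archive -archiveName $NAME -p $PARENT -r $REPL $SRCS $DEST"
```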
1. First, the files must exist; the last entry below (test2.java) is the one we will archive:
[root@bigdata111 ~]# ll
total 527772
-rwx-wx---. 2 liqing root 5500 Sep 28 21:52 123
-rwx-wx---. 2 liqing root 5500 Sep 28 21:52 aa
drwxrwxrwx. 3 root root 15 Sep 17 12:30 aaaaa
drwxr-xr-x. 2 root root 6 Sep 19 22:07 aaaaaaaa
-rw-r--r--. 1 root root 194 Sep 17 19:42 aa.zip
-rw-------. 1 root root 1536 Jul 28 19:07 anaconda-ks.cfg
-rwxrwxrwx. 1 liqing liqing 27 Sep 28 21:52 bb
lrwxrwxrwx. 1 root root 2 Sep 18 18:55 blianjie -> bb
-rw-r--r--. 1 root root 28 Oct 4 21:14 cc
-rw-r--r--. 1 root root 189 Sep 17 14:40 dd1.gz
-rw-r--r--. 1 root root 1583 Sep 18 16:35 ddddddd
-rw-r--r--. 1 root root 189 Sep 17 14:36 dd.gz
-rw-r--r--. 1 root root 564 Sep 17 14:37 ff.gz
-rw-r--r--. 1 root root 4 Sep 17 23:19 gg
-rwxrwxrwx. 1 root root 16 Sep 28 23:32 hh
drwxr-xr-x. 4 root root 28 Sep 18 14:45 itstar
drwxr-xr-x. 3 root root 46 Aug 2 19:16 liqing
lrwxrwxrwx. 1 root root 2 Sep 17 12:32 mm -> aa
drwxr-xr-x. 2 root root 29 Sep 18 14:17 mod222
-rw-r--r--. 1 root root 108 Sep 17 14:33 mod.gz
-rw-r--r--. 1 root root 10 Sep 28 23:40 ooo
drwxr-xr-x. 2 root root 30 Oct 8 18:02 plus
drwxr-xr-x. 4 root root 29 Sep 18 14:35 plus1
-rw-r--r--. 1 root root 540330028 Aug 23 20:31 Python素材.rar
-rw-r--r--. 1 root root 15650 Oct 8 22:59 ss
-rwxrwxrwx. 1 root root 0 Aug 7 11:15 test1.java
drwxr-xr-x. 2 1001 1001 43 Oct 8 18:29 test2.java
2. Upload the file to the HDFS cluster:
[root@bigdata111 ~]# hdfs dfs -put test2.java /
3. Archive it:
[root@bigdata111 ~]# hadoop archive -archiveName foo1.har -p /test2.java /
4. Under the hood this runs a MapReduce job:
Log output:
19/10/08 22:57:15 INFO client.RMProxy: Connecting to ResourceManager at bigdata112/192.168.1.122:8032
19/10/08 22:57:17 INFO client.RMProxy: Connecting to ResourceManager at bigdata112/192.168.1.122:8032
19/10/08 22:57:17 INFO client.RMProxy: Connecting to ResourceManager at bigdata112/192.168.1.122:8032
19/10/08 22:57:18 INFO mapreduce.JobSubmitter: number of splits:1
19/10/08 22:57:18 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1570522163334_0004
19/10/08 22:57:19 INFO impl.YarnClientImpl: Submitted application application_1570522163334_0004
19/10/08 22:57:19 INFO mapreduce.Job: The url to track the job: http://bigdata112:8088/proxy/application_1570522163334_0004/
19/10/08 22:57:19 INFO mapreduce.Job: Running job: job_1570522163334_0004
19/10/08 22:57:36 INFO mapreduce.Job: Job job_1570522163334_0004 running in uber mode : false
19/10/08 22:57:36 INFO mapreduce.Job: map 0% reduce 0%
19/10/08 22:57:51 INFO mapreduce.Job: map 100% reduce 0%
19/10/08 22:58:05 INFO mapreduce.Job: map 100% reduce 100%
19/10/08 22:58:06 INFO mapreduce.Job: Job job_1570522163334_0004 completed successfully
19/10/08 22:58:06 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=292
FILE: Number of bytes written=319701
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=14643
HDFS: Number of bytes written=14489
HDFS: Number of read operations=19
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=13339
Total time spent by all reduces in occupied slots (ms)=10992
Total time spent by all map tasks (ms)=13339
Total time spent by all reduce tasks (ms)=10992
Total vcore-milliseconds taken by all map tasks=13339
Total vcore-milliseconds taken by all reduce tasks=10992
Total megabyte-milliseconds taken by all map tasks=13659136
Total megabyte-milliseconds taken by all reduce tasks=11255808
Map-Reduce Framework
Map input records=4
Map output records=4
Map output bytes=278
Map output materialized bytes=292
Input split bytes=116
Combine input records=0
Combine output records=0
Reduce input groups=4
Reduce shuffle bytes=292
Reduce input records=4
Reduce output records=0
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=405
CPU time spent (ms)=4800
Physical memory (bytes) snapshot=318402560
Virtual memory (bytes) snapshot=4166209536
Total committed heap usage (bytes)=182063104
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=323
File Output Format Counters
Bytes Written=0
5. The generated archive, foo1.har, appears in the web UI:
drwxr-xr-x root supergroup 0 B Oct 08 22:58 0 0 B foo1.har
6. Inside the foo1.har directory are the generated files:
A Hadoop archive is an archive in a special format. It maps to a filesystem directory, and an archive always carries the *.har extension.
The archive directory contains metadata (the _index and _masterindex files)
and the data files (part-*).
The _index file records each archived file's name and its location within the part files.
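Conceptually, _index gives a lookup from each archived name to a (part file, offset, length) triple. A toy illustration of that mapping, using a simplified, made-up line format rather than the real on-disk layout:

```shell
# Toy model of the name -> (part file, offset, length) lookup that _index
# provides; the input format here is hypothetical and simplified.
printf '%s\n' '/aa part-0 0 9' '/test1.java part-0 9 1907' |
awk '{ printf "%s -> %s, offset %s, %s bytes\n", $1, $2, $3, $4 }'
```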
-rw-r--r-- root supergroup 0 B Oct 08 22:58 3 128 MB _SUCCESS
-rw-r--r-- root supergroup 262 B Oct 08 22:58 3 128 MB _index
-rw-r--r-- root supergroup 23 B Oct 08 22:58 3 128 MB _masterindex
-rw-r--r-- root supergroup 13.87 KB Oct 08 22:57 3 512 MB part-0
C: Unarchiving (archiving saves namenode metadata memory, but the original data must be deleted by hand)
Method 1:
1. The receiving directory must already exist in HDFS.
Extract foo1.har into the HDFS /itstar directory (effectively a copy into it).
List the archive: hadoop fs -lsr har:///foo1.har
[root@bigdata111 ~]# hadoop fs -lsr har:///foo1.har
lsr: DEPRECATED: Please use 'ls -R' instead.
-rw-r--r-- 3 root supergroup 12288 2019-10-08 22:56 har:///foo1.har/.swp
-rw-r--r-- 3 root supergroup 9 2019-10-08 22:56 har:///foo1.har/aa
-rw-r--r-- 3 root supergroup 1907 2019-10-08 22:56 har:///foo1.har/test1.java
[root@bigdata111 ~]# hadoop fs -cp har:///foo1.har/* /itstar
Method 2:
1. Here the target directory must not exist yet (this runs a MapReduce job):
[root@bigdata111 ~]# hadoop distcp har:/foo1.har /123
2. Log output:
19/10/08 23:26:54 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[har:/foo1.har], targetPath=/123, targetPathExists=false, filtersFile='null'}
19/10/08 23:26:54 INFO client.RMProxy: Connecting to ResourceManager at bigdata112/192.168.1.122:8032
19/10/08 23:26:55 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 4; dirCnt = 1
19/10/08 23:26:55 INFO tools.SimpleCopyListing: Build file listing completed.
19/10/08 23:26:55 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
19/10/08 23:26:55 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
19/10/08 23:26:55 INFO tools.DistCp: Number of paths in the copy list: 4
19/10/08 23:26:55 INFO tools.DistCp: Number of paths in the copy list: 4
19/10/08 23:26:55 INFO client.RMProxy: Connecting to ResourceManager at bigdata112/192.168.1.122:8032
19/10/08 23:26:57 INFO mapreduce.JobSubmitter: number of splits:4
19/10/08 23:26:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1570522163334_0005
19/10/08 23:26:58 INFO impl.YarnClientImpl: Submitted application application_1570522163334_0005
19/10/08 23:26:58 INFO mapreduce.Job: The url to track the job: http://bigdata112:8088/proxy/application_1570522163334_0005/
19/10/08 23:26:58 INFO tools.DistCp: DistCp job-id: job_1570522163334_0005
19/10/08 23:26:58 INFO mapreduce.Job: Running job: job_1570522163334_0005
19/10/08 23:27:12 INFO mapreduce.Job: Job job_1570522163334_0005 running in uber mode : false
19/10/08 23:27:12 INFO mapreduce.Job: map 0% reduce 0%
19/10/08 23:27:24 INFO mapreduce.Job: map 25% reduce 0%
19/10/08 23:27:25 INFO mapreduce.Job: map 50% reduce 0%
19/10/08 23:27:31 INFO mapreduce.Job: map 100% reduce 0%
19/10/08 23:27:32 INFO mapreduce.Job: Job job_1570522163334_0005 completed successfully
19/10/08 23:27:32 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=643036
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=17110
HDFS: Number of bytes written=14204
HDFS: Number of read operations=117
HDFS: Number of large read operations=0
HDFS: Number of write operations=15
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=51372
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=51372
Total vcore-milliseconds taken by all map tasks=51372
Total megabyte-milliseconds taken by all map tasks=52604928
Map-Reduce Framework
Map input records=4
Map output records=0
Input split bytes=536
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=499
CPU time spent (ms)=3530
Physical memory (bytes) snapshot=418123776
Virtual memory (bytes) snapshot=8321691648
Total committed heap usage (bytes)=138149888
File Input Format Counters
Bytes Read=1230
File Output Format Counters
Bytes Written=0
DistCp Counters
Bytes Copied=14204
Bytes Expected=14204
Files Copied=4
3. Check in the web UI:
drwxr-xr-x root supergroup 0 B Oct 08 23:27 0 0 B 123
4. The data size is unchanged; archiving only reduces the namenode's metadata memory footprint:
-rw-r--r-- root supergroup 12 KB Oct 08 23:27 3 128 MB .swp
-rw-r--r-- root supergroup 9 B Oct 08 23:27 3 128 MB aa
-rw-r--r-- root supergroup 1.86 KB Oct 08 23:27 3 128 MB test1.java
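As a sanity check, the three extracted file sizes from the listing above sum exactly to the `Bytes Copied=14204` counter in the distcp log (and to the 13.87 KB part-0 payload of the archive):

```shell
# .swp (12288) + aa (9) + test1.java (1907), sizes taken from this section.
echo "$(( 12288 + 9 + 1907 )) bytes"
```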