Hadoop does not support the LZO compression format out of the box.
LZO is a compression format whose files can be split for processing even when they exceed the HDFS block size (once an index has been created for them).
Drawbacks of LZO compression:
1. An index must be created for each .lzo file, either manually or with a shell script;
2. For the same query, Hive runs several times slower on LZO files than on other formats.
Summary: LZO compression is not recommended for now.
Environment:
OS: CentOS 7.5
Hadoop: hadoop-2.6.0-cdh5.7.0.tar.gz (built from hadoop-2.6.0-cdh5.7.0-src.tar.gz)
Components required for LZO:
lzo
lzop
hadoop-gpl-packaging: gpl-packaging is mainly used to create indexes for compressed .lzo files; without an index, a .lzo file always gets exactly one map task, no matter how much larger it is than the HDFS block size.
Build and install lzo:
#Install build dependencies
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool

#Build lzo
cd /usr/local/src
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar -zxvf lzo-2.06.tar.gz
cd lzo-2.06
./configure --enable-shared --prefix=/usr/local/lzo
make
make install

Note: after the build, a few directories are created under /usr/local/lzo/:
[root@hadoop004 lzo]# ll
total 0
drwxr-xr-x 3 root root  17 Apr 19 10:30 include
drwxr-xr-x 2 root root 103 Apr 19 10:30 lib
drwxr-xr-x 3 root root  17 Apr 19 10:30 share

#Check the lzop command:
[root@hadoop004 lzo]# which lzop
/usr/bin/lzop

#lzop usage and a compression test
Compress:   lzop -v filename
Decompress: lzop -dv filename
[root@hadoop004 hadoop]# du -sh *
69M  access.log
16M  access.log.lzo
Note: the .lzo file is the compressed output; the compression ratio is quite good (69M down to 16M).
Install hadoop-lzo:
#Preparation for building hadoop-lzo
cd /usr/local/src
wget https://github.com/twitter/hadoop-lzo/archive/master.zip
unzip master.zip
cd hadoop-lzo-master/

Note: since our Hadoop is 2.6.0, set the version to 2.6.0 in pom.xml:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>2.6.0</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>

#Start the build
mvn clean package -Dmaven.test.skip=true

Note: a successful build ends with:
[INFO] Building jar: /usr/local/src/hadoop-lzo-master/target/hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:08 min
[INFO] Finished at: 2019-04-19T10:39:55+08:00
[INFO] Final Memory: 28M/136M
[INFO] ------------------------------------------------------------------------

#The build produces hadoop-lzo-0.4.21-SNAPSHOT.jar (under the target directory), which is the artifact that matters
[root@hadoop004 target]# pwd
/usr/local/src/hadoop-lzo-master/target
[root@hadoop004 target]# ll
total 432
drwxr-xr-x 2 root root   4096 Apr 19 10:39 antrun
drwxr-xr-x 5 root root   4096 Apr 19 10:39 apidocs
drwxr-xr-x 5 root root     77 Apr 19 10:39 classes
drwxr-xr-x 3 root root     25 Apr 19 10:39 generated-sources
-rw-r--r-- 1 root root 188906 Apr 19 10:39 hadoop-lzo-0.4.21-SNAPSHOT.jar
-rw-r--r-- 1 root root 185078 Apr 19 10:39 hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar
-rw-r--r-- 1 root root  52021 Apr 19 10:39 hadoop-lzo-0.4.21-SNAPSHOT-sources.jar
drwxr-xr-x 2 root root     71 Apr 19 10:39 javadoc-bundle-options
drwxr-xr-x 2 root root     28 Apr 19 10:39 maven-archiver
drwxr-xr-x 3 root root     28 Apr 19 10:39 native
drwxr-xr-x 3 root root     18 Apr 19 10:39 test-classes
Note: hadoop-lzo-0.4.21-SNAPSHOT.jar is the file we ultimately need.
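If you script the build, the pom.xml edit above can also be done non-interactively with sed. The snippet below is a sketch demonstrated on a stub file under /tmp; for real use, point it at pom.xml inside hadoop-lzo-master/:

```shell
# Sketch: patch <hadoop.current.version> with sed instead of editing by hand.
# Demonstrated on a stub file; the real target is pom.xml in hadoop-lzo-master/.
printf '<hadoop.current.version>2.5.0</hadoop.current.version>\n' > /tmp/pom-snippet.xml
sed -i 's#<hadoop.current.version>[^<]*#<hadoop.current.version>2.6.0#' /tmp/pom-snippet.xml
cat /tmp/pom-snippet.xml
```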
Note: copy hadoop-lzo-0.4.21-SNAPSHOT.jar to ${HADOOP_HOME}/share/hadoop/common/ on every Hadoop node; otherwise Hadoop cannot use it.
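Copying the jar to every node can be scripted. This is a sketch with placeholder hostnames; the echo makes it a dry run that only prints the commands:

```shell
# Sketch: push the jar to the common/ directory on each node.
# The hostnames below are placeholders for your actual node list.
JAR=hadoop-lzo-0.4.21-SNAPSHOT.jar
for host in hadoop001 hadoop002 hadoop003; do
  echo scp "target/$JAR" "$host:\$HADOOP_HOME/share/hadoop/common/"  # drop 'echo' to copy for real
done
```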
Configure each Hadoop node to support LZO:
Note: first run ./stop-all.sh on the master node to stop the processes on all nodes.
#core-site.xml — register the supported compression codecs
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
           org.apache.hadoop.io.compress.DefaultCodec,
           com.hadoop.compression.lzo.LzoCodec,
           com.hadoop.compression.lzo.LzopCodec,
           org.apache.hadoop.io.compress.BZip2Codec
    </value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

#mapred-site.xml — compression for each MapReduce stage
Compression of the map output (intermediate data; mapred.compress.map.output and mapred.map.output.compression.codec are the old deprecated names for mapreduce.map.output.compress and mapreduce.map.output.compress.codec):
<property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
</property>
<property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Compression of the final job output:
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
Note: once the configuration is done, run ./start-all.sh on the master node to start the processes on each node again.
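Once the cluster is back up, a quick sanity check (a suggestion beyond the original steps) is to stream an .lzo file through hadoop fs -text, which decompresses via the configured codecs and should print readable text rather than raw bytes. Shown as a dry run; the path is hypothetical:

```shell
# Sketch: compose the verification command; the .lzo path is an example.
CMD='hadoop fs -text /input/access.log.lzo'
echo "$CMD"   # run the printed command on the cluster to verify the codec is wired in
```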
Run a job to test whether LZO splitting works:
Change the HDFS block size to 10 M:
#hdfs-site.xml
<property>
    <name>dfs.blocksize</name>
    <value>10485760</value>
</property>
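As a quick arithmetic check, 10485760 bytes is exactly 10 MB:

```shell
# dfs.blocksize above, expressed as 10 * 1024 * 1024 bytes.
echo $(( 10 * 1024 * 1024 ))
```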
Prepare a test file, 78.7 M in size:
[hadoop@hadoop002 hadoop]$ hadoop fs -ls /input
Found 1 items
-rw-r--r-- 3 hadoop hadoop 82552172 2019-04-19 11:46 /input/hadoop_data_000.txt
[hadoop@hadoop002 hadoop]$ hadoop fs -du -s -h /input
78.7 M  236.2 M  /input
Run the MapReduce job and observe:
[hadoop@hadoop001 train]$ hadoop jar hadoop_train-1.0.jar com.g609.hadoop.etl.driver.LogETLDriver /input /output
19/04/19 11:46:43 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/19 11:46:44 INFO input.FileInputFormat: Total input paths to process : 1
19/04/19 11:46:44 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
19/04/19 11:46:44 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
19/04/19 11:46:44 INFO mapreduce.JobSubmitter: number of splits:8
19/04/19 11:46:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555643762064_0001
19/04/19 11:46:45 INFO impl.YarnClientImpl: Submitted application application_1555643762064_0001
19/04/19 11:46:45 INFO mapreduce.Job: The url to track the job: http://hadoop001:8088/proxy/application_1555643762064_0001/
19/04/19 11:46:45 INFO mapreduce.Job: Running job: job_1555643762064_0001
19/04/19 11:46:53 INFO mapreduce.Job: Job job_1555643762064_0001 running in uber mode : false
19/04/19 11:46:53 INFO mapreduce.Job:  map 0% reduce 0%
19/04/19 11:47:05 INFO mapreduce.Job:  map 38% reduce 0%
19/04/19 11:47:06 INFO mapreduce.Job:  map 50% reduce 0%
19/04/19 11:47:14 INFO mapreduce.Job:  map 75% reduce 0%
19/04/19 11:47:15 INFO mapreduce.Job:  map 88% reduce 0%
19/04/19 11:47:18 INFO mapreduce.Job:  map 88% reduce 29%
19/04/19 11:47:20 INFO mapreduce.Job:  map 100% reduce 29%
19/04/19 11:47:24 INFO mapreduce.Job:  map 100% reduce 72%
19/04/19 11:47:27 INFO mapreduce.Job:  map 100% reduce 79%
19/04/19 11:47:30 INFO mapreduce.Job:  map 100% reduce 86%
19/04/19 11:47:33 INFO mapreduce.Job:  map 100% reduce 92%
19/04/19 11:47:36 INFO mapreduce.Job:  map 100% reduce 100%
19/04/19 11:47:37 INFO mapreduce.Job: Job job_1555643762064_0001 completed successfully
Note: the log shows the native LZO library was loaded, and the 78.7 M input was split into 8 map tasks with the 10 M block size, which is the expected result.
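The map count can be checked by hand: for splittable input, the number of splits is roughly ceil(file_size / block_size). Using the sizes from this run:

```shell
# Split-count arithmetic for the job above.
FILE_BYTES=82552172     # /input/hadoop_data_000.txt (78.7 M)
BLOCK_BYTES=10485760    # dfs.blocksize = 10 M
# Integer ceiling division: (a + b - 1) / b
SPLITS=$(( (FILE_BYTES + BLOCK_BYTES - 1) / BLOCK_BYTES ))
echo "$SPLITS"          # matches "number of splits:8" in the job log
```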
Check the output: the result is in .bz2 format, consistent with the final-output codec we configured in mapred-site.xml.
[hadoop@hadoop002 hadoop]$ hadoop fs -ls /output
Found 2 items
-rw-r--r-- 3 hadoop hadoop        0 2019-04-19 11:47 /output/_SUCCESS
-rw-r--r-- 3 hadoop hadoop 14507599 2019-04-19 11:47 /output/part-r-00000.bz2
Test splitting through a Hive query:
#Prepare the test data part-r-00000.lzo, 28.6 M in size (the block size was already set to 10 M above)
[hadoop@hadoop002 hadoop]$ hadoop fs -du -s -h /compress-lzo/day=20180717/*
28.6 M  85.8 M  /compress-lzo/day=20180717/part-r-00000.lzo

#Create a partitioned Hive external table over the test data on HDFS
CREATE EXTERNAL TABLE g6_access_lzo (
    cdn string,
    region string,
    level string,
    time string,
    ip string,
    domain string,
    url string,
    traffic bigint)
PARTITIONED BY (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/compress-lzo';

#Refresh the metadata
alter table g6_access_lzo add if not exists partition(day='20180717');

#Run the query for the first time and check the number of map tasks: it is 1
MapReduce Total cumulative CPU time: 4 seconds 660 msec
Ended Job = job_1555643762064_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.66 sec   HDFS Read: 29983202 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 660 msec
OK
500000
Time taken: 22.037 seconds, Fetched: 1 row(s)

#Create an index for the part-r-00000.lzo file
hadoop jar hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /compress-lzo/day=20180717/part-r-00000.lzo
[hadoop@hadoop001 ~]$ hadoop jar hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /compress-lzo/day=20180717/part-r-00000.lzo
19/04/19 15:26:56 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
19/04/19 15:26:56 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
19/04/19 15:26:57 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /compress-lzo/day=20180717/part-r-00000.lzo, size 0.03 GB...
19/04/19 15:26:58 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
19/04/19 15:26:58 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.36 seconds (78.54 MB/s).  Index size is 2.03 KB.
#Check the generated index file
[hadoop@hadoop001 ~]$ hadoop fs -ls /compress-lzo/day=20180717/*
-rw-r--r-- 3 hadoop hadoop 29975501 2019-04-19 15:19 /compress-lzo/day=20180717/part-r-00000.lzo
-rw-r--r-- 3 hadoop hadoop     2080 2019-04-19 15:26 /compress-lzo/day=20180717/part-r-00000.lzo.index

#Run the same query a second time and check the number of map tasks: it is now 3
MapReduce Total cumulative CPU time: 11 seconds 260 msec
Ended Job = job_1555643762064_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 11.26 sec   HDFS Read: 30222902 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 260 msec
OK
500000
Time taken: 22.792 seconds, Fetched: 1 row(s)
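Besides the single-file LzoIndexer used above, hadoop-lzo also ships com.hadoop.compression.lzo.DistributedLzoIndexer, which indexes all .lzo files under a path as a MapReduce job, which is handy for whole partitions. Sketched here as a dry run that only composes the command:

```shell
# Sketch: index every .lzo file under a directory in one distributed job.
# The jar name and path match this guide's example environment.
JAR=hadoop-lzo-0.4.21-SNAPSHOT.jar
DIR=/compress-lzo/day=20180717
CMD="hadoop jar $JAR com.hadoop.compression.lzo.DistributedLzoIndexer $DIR"
echo "$CMD"   # run the printed command on the cluster to actually submit the job
```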
Before creating the index: 1 map task.
After creating the index: 3 map tasks (28.6 M of data with a 10 M block size gives 3 splits).