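Tez is Apache's open-source computing engine for DAG jobs, proposed as a way to reduce the latency of Hive jobs. Hortonworks has used Tez to optimize the Hive engine, with reported performance gains of about 100x in testing. Hive on Tez still builds on MapReduce-style map and reduce processing, but it prunes the dependencies in the DAG and merges multiple small jobs into one large job. This reduces the number of jobs and also greatly reduces the number of writes to HDFS. Tez has the following features:
(1) A rich dataflow (NOT streaming!) programming interface;
(2) A highly extensible "Input-Processor-Output" runtime model;
(3) Simplified deployment (Tez makes full use of the YARN framework; Tez itself is only a client-side library, so no services need to be deployed in advance);
(4) Better performance than MapReduce;
(5) Optimized resource management (it runs directly on top of the YARN resource management system);
(6) Dynamic generation of the physical dataflow.
The differences between Tez and MapReduce are shown in the following figure:
[Figure: differences between Tez and MapReduce]
1. Installing from Source
1.1 Dependency packages
The operating system used in this article is Oracle Linux 7.4. Install the following dependency packages: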
[root@hdp01 ~]# yum -y install git bzip2 redhat-lsb
1.2 Install protobuf
[root@hdp01 src]# wget https://github.com/google/protobuf/releases/download/v3.5.1/protobuf-all-3.5.1.tar.gz
[root@hdp01 software]# tar -xzf /u02/software/src/protobuf-all-3.5.1.tar.gz
[root@hdp01 software]# cd /u02/software/protobuf-3.5.1 && ./configure && make && make install
-- After the build and installation finish, run protoc; the following output indicates that the installation succeeded:
[root@hdp01 src]# protoc --version
libprotoc 3.5.1
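Because the Tez build in the next section expects protoc 2.5.0 by default, it can help to compare the installed version against the expected one in a script rather than by eye. The following is a minimal sketch; the function names `protoc_version` and `protoc_matches` are hypothetical helpers introduced here, not part of protobuf or Tez.

```shell
# Extract the version number from the output of `protoc --version`,
# e.g. "libprotoc 3.5.1" -> "3.5.1". (Hypothetical helper.)
protoc_version() {
    printf '%s\n' "$1" | awk '{print $2}'
}

# Succeed (exit status 0) when the version in $1 equals the expected version $2.
# $1 is the full `protoc --version` output, $2 a plain version such as "2.5.0".
protoc_matches() {
    [ "$(protoc_version "$1")" = "$2" ]
}
```

A possible use is `protoc_matches "$(protoc --version)" 2.5.0 || echo "protoc differs from the Tez default; adjust pom.xml before building"`.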
1.3 Build and install Tez
[hadoop@hdp01 src]$ wget http://mirrors.hust.edu.cn/apache/tez/0.9.0/apache-tez-0.9.0-src.tar.gz
[hadoop@hdp01 software]$ tar -xzf /u02/software/src/apache-tez-0.9.0-src.tar.gz
[hadoop@hdp01 software]$ cd apache-tez-0.9.0-src
-- If protoc is not version 2.5.0, you must edit pom.xml in the source directory and change the protoc version to the one currently installed on the system.
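The pom.xml edit can also be scripted. The sketch below assumes the version is held on a single line in a `<protobuf.version>` property, so check your copy of pom.xml before relying on it. The function name `set_protobuf_version` is a hypothetical helper, and it keeps a `.bak` copy of the original file.

```shell
# Rewrite the <protobuf.version> property of a pom.xml.
# $1: path to pom.xml, $2: new version string (for example 3.5.1).
# Assumption: the property is written on one line as
#   <protobuf.version>X</protobuf.version>
set_protobuf_version() {
    pom_file=$1
    new_version=$2
    cp "$pom_file" "$pom_file.bak"   # keep a backup of the original
    sed "s#<protobuf.version>[^<]*</protobuf.version>#<protobuf.version>${new_version}</protobuf.version>#" \
        "$pom_file.bak" > "$pom_file"
}
```

For example, `set_protobuf_version pom.xml "$(protoc --version | awk '{print $2}')"` would align the property with whatever protoc is installed.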
[hadoop@hdp01 apache-tez-0.9.0-src]$ mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true
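The build takes a fairly long time. A successful build looks like the following figure:
[Figure: output of a successful build]
1.4 Configure Tez
The build produces tez-dist/target/tez-0.9.0.tar.gz, which is the binary package needed for the following steps.
1.4.1 Upload the binary package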
[hadoop@hdp01 ~]$ hdfs dfs -mkdir /user/tez
[hadoop@hdp01 ~]$ hdfs dfs -put /u02/software/apache-tez-0.9.0-src/tez-dist/target/tez-0.9.0.tar.gz /user/tez
1.4.2 Extract the archive
[hadoop@hdp01 u01]$ tar -xzf /u02/software/apache-tez-0.9.0-src/tez-dist/target/tez-0.9.0.tar.gz
[hadoop@hdp01 u01]$ mv tez-0.9.0 tez
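1.4.3 Create tez-site.xml
On the Hadoop master node, create tez-site.xml in the $HADOOP_HOME/etc/hadoop/ directory (it only needs to be created on the master node; step 1.4.6 copies it to the other nodes), with the following content: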
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>tez.lib.uris</name>
<value>${fs.defaultFS}/user/tez/tez-0.9.0.tar.gz</value>
</property>
<property>
<name>tez.container.max.java.heap.fraction</name>
<value>0.3</value>
</property>
</configuration>
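The file name in tez.lib.uris has to match the archive that was uploaded to /user/tez, so one option is to generate the file from a script that takes the archive name as a parameter. This is a minimal sketch; `write_tez_site` is a hypothetical helper name. Note the `\$` in the here-document: it keeps `${fs.defaultFS}` literal so that Hadoop, not the shell, resolves it.

```shell
# Write a minimal tez-site.xml whose tez.lib.uris points at the uploaded archive.
# $1: output file, $2: archive file name as stored under /user/tez in HDFS.
write_tez_site() {
    out_file=$1
    archive_name=$2
    cat > "$out_file" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>\${fs.defaultFS}/user/tez/${archive_name}</value>
  </property>
  <property>
    <name>tez.container.max.java.heap.fraction</name>
    <value>0.3</value>
  </property>
</configuration>
EOF
}
```

For the setup in this article, the call would be `write_tez_site /u01/hadoop/etc/hadoop/tez-site.xml tez-0.9.0.tar.gz`.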
1.4.4 Edit mapred-site.xml
Change the value of mapreduce.framework.name from yarn to yarn-tez.
1.4.5 Modify hadoop-env.sh
Append the following lines:
export TEZ_CONF_DIR=/u01/hadoop/etc/hadoop
export TEZ_JARS=/u01/tez
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*
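To confirm that the Tez entries actually ended up on the classpath, the classpath string can be split and filtered. This is only a sketch; `tez_classpath_entries` is a hypothetical helper, and it simply matches the substring "tez" in each entry.

```shell
# Print the entries of a colon-separated classpath that mention "tez",
# one entry per line. $1: the classpath string.
tez_classpath_entries() {
    printf '%s\n' "$1" | tr ':' '\n' | grep -i 'tez'
}
```

On a node with Hadoop installed, `tez_classpath_entries "$(hadoop classpath)"` should list the /u01/tez entries once hadoop-env.sh has been updated.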
1.4.6 Synchronize files
Copy tez-site.xml, mapred-site.xml, hadoop-env.sh, and the /u01/tez directory to the other nodes in the cluster:
[hadoop@hdp01 hadoop]$ for i in {2..4};do scp hadoop-env.sh hdp0$i:/u01/hadoop/etc/hadoop/;done
[hadoop@hdp01 hadoop]$ for i in {2..4};do scp mapred-site.xml hdp0$i:/u01/hadoop/etc/hadoop/;done
[hadoop@hdp01 hadoop]$ for i in {2..4};do scp tez-site.xml hdp0$i:/u01/hadoop/etc/hadoop/;done
[hadoop@hdp01 hadoop]$ for i in {2..4};do scp -r /u01/tez hdp0$i:/u01;done
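The four loops above can be folded into one function that takes the target hosts as arguments. The following is a sketch under the paths used in this article; `sync_tez` and the `DRY_RUN` variable are hypothetical names introduced for illustration. Setting `DRY_RUN` to any non-empty value prints the scp commands instead of running them.

```shell
# Copy the Tez-related config files and the Tez directory to each host given
# as an argument. With DRY_RUN set to a non-empty value, the scp commands
# are printed rather than executed.
sync_tez() {
    conf_dir=/u01/hadoop/etc/hadoop
    for host in "$@"; do
        for f in hadoop-env.sh mapred-site.xml tez-site.xml; do
            ${DRY_RUN:+echo} scp "$conf_dir/$f" "$host:$conf_dir/"
        done
        ${DRY_RUN:+echo} scp -r /u01/tez "$host:/u01"
    done
}
```

A dry run such as `DRY_RUN=1 sync_tez hdp02 hdp03 hdp04` shows what would be copied before anything is transferred.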
1.4.7 Restart the Hadoop cluster
[hadoop@hdp01 hadoop]$ stop-yarn.sh;stop-dfs.sh
[hadoop@hdp01 hadoop]$ start-dfs.sh;start-yarn.sh
This completes the Tez installation.
2. Testing and Verification
2.1 Prepare test files
[hadoop@hdp01 ~]$ echo "Hello World Hello Tez" > file01
[hadoop@hdp01 ~]$ echo "Hello World Goodbye Tez" > file02
[hadoop@hdp01 ~]$ hdfs dfs -mkdir /user/tez/input
[hadoop@hdp01 ~]$ hdfs dfs -mkdir /user/tez/output
[hadoop@hdp01 ~]$ hdfs dfs -put file0* /user/tez/input
2.2 Verify with the following command
[hadoop@hdp01 ~]$ cd /u01/tez
[hadoop@hdp01 tez]$ hadoop jar tez-examples-0.9.0.jar orderedwordcount /user/tez/input /user/tez/output
17/12/26 11:49:47 INFO shim.HadoopShimsLoader: Trying to locate HadoopShimProvider for hadoopVersion=2.7.4, majorVersion=2, minorVersion=7
17/12/26 11:49:47 INFO shim.HadoopShimsLoader: Picked HadoopShim org.apache.tez.hadoop.shim.HadoopShim27, providerName=org.apache.tez.hadoop.shim.HadoopShim25_26_27Provider, overrideProviderViaConfig=null, hadoopVersion=2.7.4, majorVersion=2, minorVersion=7
17/12/26 11:49:47 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.9.0, revision=0873a0118a895ca84cbdd221d8ef56fedc4b43d0, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2017-07-18T05:41:23Z ]
17/12/26 11:49:48 INFO examples.OrderedWordCount: Running OrderedWordCount
17/12/26 11:49:48 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
17/12/26 11:49:48 INFO client.TezClient: Submitting DAG application with id: application_1513929521869_0023
17/12/26 11:49:48 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs://hdp01:9000/user/tez/tez.tar.gz
17/12/26 11:49:48 INFO client.TezClientUtils: Using tez.lib.uris.classpath value from configuration: null
17/12/26 11:49:48 INFO client.TezClient: Tez system stage directory hdfs://hdp01:9000/tmp/hadoop/tez/staging/.tez/application_1513929521869_0023 doesn't exist and is created
17/12/26 11:49:49 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1513929521869_0023, dagName=OrderedWordCount, callerContext={ context=TezExamples, callerType=null, callerId=null }
17/12/26 11:49:49 INFO impl.YarnClientImpl: Submitted application application_1513929521869_0023
17/12/26 11:49:49 INFO client.TezClient: The url to track the Tez AM: http://hdp04:8088/proxy/application_1513929521869_0023/
17/12/26 11:49:53 INFO client.DAGClientImpl: DAG initialized: CurrentState=Running
17/12/26 11:49:53 INFO client.DAGClientImpl: DAG: State: RUNNING Progress: 0% TotalTasks: 3 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
17/12/26 11:49:53 INFO client.DAGClientImpl: VertexStatus: VertexName: Tokenizer Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
17/12/26 11:49:53 INFO client.DAGClientImpl: VertexStatus: VertexName: Summation Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
17/12/26 11:49:53 INFO client.DAGClientImpl: VertexStatus: VertexName: Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
17/12/26 11:49:58 INFO client.DAGClientImpl: DAG: State: RUNNING Progress: 33.33% TotalTasks: 3 Succeeded: 1 Running: 1 Failed: 0 Killed: 0
17/12/26 11:49:58 INFO client.DAGClientImpl: VertexStatus: VertexName: Tokenizer Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0
17/12/26 11:49:58 INFO client.DAGClientImpl: VertexStatus: VertexName: Summation Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 1 Failed: 0 Killed: 0
17/12/26 11:49:58 INFO client.DAGClientImpl: VertexStatus: VertexName: Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
17/12/26 11:49:58 INFO client.DAGClientImpl: DAG: State: SUCCEEDED Progress: 100% TotalTasks: 3 Succeeded: 3 Running: 0 Failed: 0 Killed: 0
17/12/26 11:49:58 INFO client.DAGClientImpl: VertexStatus: VertexName: Tokenizer Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0
17/12/26 11:49:58 INFO client.DAGClientImpl: VertexStatus: VertexName: Summation Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0
17/12/26 11:49:58 INFO client.DAGClientImpl: VertexStatus: VertexName: Sorter Progress: 100% TotalTasks: 1 Succeeded: 1 Running: 0 Failed: 0 Killed: 0
17/12/26 11:49:58 INFO client.DAGClientImpl: DAG completed. FinalState=SUCCEEDED
After the job succeeds, check the files under output:
[hadoop@hdp02 ~]$ hdfs dfs -ls /user/tez/output
Found 2 items
-rw-r--r-- 3 hadoop supergroup 0 2017-12-26 11:49 /user/tez/output/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 32 2017-12-26 11:49 /user/tez/output/part-v002-o000-r-00000
[hadoop@hdp02 ~]$ hdfs dfs -text /user/tez/output/part-v002-o000-r-00000
Goodbye 1
Tez 2
World 2
Hello 3
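As a cross-check, the same counts can be reproduced locally from file01 and file02 with standard tools. This is a rough local equivalent, not how Tez computes the result; `local_ordered_wordcount` is a hypothetical helper, and breaking ties alphabetically is an assumption made here to get a stable order.

```shell
# Count whitespace-separated words in the given files and print "word count"
# lines, ordered by ascending count; ties are broken alphabetically (C locale).
local_ordered_wordcount() {
    cat "$@" | tr -s ' \t' '\n' | LC_ALL=C sort | uniq -c \
        | LC_ALL=C sort -k1,1n -k2,2 | awk 'NF == 2 {print $2, $1}'
}
```

Running `local_ordered_wordcount file01 file02` in the directory where the test files were created should list the same words and counts as the job output above.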
3. Verifying with Hive
In the Hive console, set the execution engine to tez; the default is mr (MapReduce).
hive> set hive.execution.engine=tez;
hive> use hivedb;
hive> select count(*) from xj_student;
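The same check can be run non-interactively from the shell by passing the engine setting on the Hive command line, which affects only that invocation. This is a minimal sketch; `hive_on_tez` is a hypothetical wrapper name, and it relies on the Hive CLI options `--hiveconf`, `--database`, and `-e`.

```shell
# Run one HiveQL statement with Tez as the execution engine for this
# invocation only. $1: database name, $2: HiveQL statement.
hive_on_tez() {
    hive --hiveconf hive.execution.engine=tez --database "$1" -e "$2"
}
```

For the example above, the call would be `hive_on_tez hivedb 'select count(*) from xj_student;'`.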
To make tez the default, edit hive-site.xml, set hive.execution.engine to tez, and restart the Hive service.