1. Prerequisites

Environment: CentOS 7, JDK 1.8, hive-2.3.6, hadoop-2.7.7, spark-2.0.0-bin-hadoop2-without-hive

2. Background

The Hive on Spark documentation lists, for each Hive release, the Spark version it was built and tested against; Hive 2.3.x pairs with Spark 2.0.0, which is why that Spark release is used here.

2.1 Building Spark manually

Spark source download: https://archive.apache.org/dist/spark/spark-2.0.0/

The source tarball is only about 12 MB. Once downloaded, unpack it and build a distribution without the Hive module:

# Unpack the source
[xiaokang@hadoop01 ~]$ tar -zxvf spark-2.0.0.tgz
# Build from the spark-2.0.0 source root
[xiaokang@hadoop01 spark-2.0.0]$ ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

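With --tgz, make-distribution.sh produces spark-2.0.0-bin-hadoop2-without-hive.tgz in the source root. A minimal sketch of installing it to the /opt/software/spark-2.0.0 path that the configuration below expects (the extract and rename steps are assumptions, not part of the original write-up):

# Install the freshly built (or downloaded) package to the path used by the configs below
[xiaokang@hadoop01 spark-2.0.0]$ tar -zxvf spark-2.0.0-bin-hadoop2-without-hive.tgz -C /opt/software/
[xiaokang@hadoop01 spark-2.0.0]$ mv /opt/software/spark-2.0.0-bin-hadoop2-without-hive /opt/software/spark-2.0.0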

2.2 Pre-built package

Link: https://pan.baidu.com/s/15dkf-DMc6CB0-oifQUy9OA

Extraction code: 6y4e

3. Switching Hive to the Spark engine

3.1 hive-site.xml

Add the following properties on top of the existing configuration:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>hive.enable.spark.execution.engine</name>
  <value>true</value>
</property>
<property>
  <name>spark.home</name>
  <value>/opt/software/spark-2.0.0</value>
</property>
<property>
  <name>spark.master</name>
  <value>yarn</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>hdfs://hacluster:8020/spark-hive-jobhistory</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512m</value>
</property>
<property>
  <name>spark.driver.memory</name>
  <value>512m</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://hacluster:8020/spark-jars/*</value>
</property>
<property>
  <name>hive.spark.client.server.connect.timeout</name>
  <value>300000</value>
</property>
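Note that spark.eventLog.dir must point to an HDFS directory that already exists, otherwise Spark will refuse to start the session once event logging is enabled. A quick sketch, using the hacluster nameservice from the config above:

# Pre-create the Spark event log directory referenced by spark.eventLog.dir
[xiaokang@hadoop01 ~]$ hdfs dfs -mkdir -p /spark-hive-jobhistory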

3.2 spark-env.sh
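spark-env.sh does not exist out of the box; like the slaves file in 3.3, it is created from the template that ships with Spark, and the settings below are then appended to it (a minimal sketch):

[xiaokang@hadoop01 conf]$ cp spark-env.sh.template spark-env.sh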

export JAVA_HOME=/opt/moudle/jdk1.8.0_191
export SCALA_HOME=/opt/moudle/scala-2.11.12
export HADOOP_HOME=/opt/software/hadoop-2.7.7
export HADOOP_CONF_DIR=/opt/software/hadoop-2.7.7/etc/hadoop
export HADOOP_YARN_CONF_DIR=/opt/software/hadoop-2.7.7/etc/hadoop
export SPARK_HOME=/opt/software/spark-2.0.0
export SPARK_WORKER_MEMORY=512m
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_DRIVER_MEMORY=512m
export SPARK_DIST_CLASSPATH=$(/opt/software/hadoop-2.7.7/bin/hadoop classpath)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop01:2181,hadoop02:2181,hadoop03:2181 -Dspark.deploy.zookeeper.dir=/ha-on-spark"

3.3 slaves

[xiaokang@hadoop01 conf]$ cp slaves.template slaves

List the worker nodes in slaves:

hadoop01
hadoop02
hadoop03

3.4 Copying jars and XML files

Copy the following jars from Hive's lib directory into Spark's jars directory:

hive-beeline-2.3.6.jar
hive-cli-2.3.6.jar
hive-exec-2.3.6.jar
hive-jdbc-2.3.6.jar
hive-metastore-2.3.6.jar

[xiaokang@hadoop01 lib]$ cp hive-beeline-2.3.6.jar hive-cli-2.3.6.jar hive-exec-2.3.6.jar hive-jdbc-2.3.6.jar hive-metastore-2.3.6.jar /opt/software/spark-2.0.0/jars/

Copy the following jars from Spark's jars directory into Hive's lib directory:

spark-network-common_2.11-2.0.0.jar
spark-core_2.11-2.0.0.jar
scala-library-2.11.8.jar
chill-java-0.8.0.jar
chill_2.11-0.8.0.jar
jackson-module-paranamer-2.6.5.jar
jackson-module-scala_2.11-2.6.5.jar
jersey-container-servlet-core-2.22.2.jar
jersey-server-2.22.2.jar
json4s-ast_2.11-3.2.11.jar
kryo-shaded-3.0.3.jar
minlog-1.3.0.jar
scala-xml_2.11-1.0.2.jar
spark-launcher_2.11-2.0.0.jar
spark-network-shuffle_2.11-2.0.0.jar
spark-unsafe_2.11-2.0.0.jar
xbean-asm5-shaded-4.4.jar

[xiaokang@hadoop01 jars]$ cp spark-network-common_2.11-2.0.0.jar spark-core_2.11-2.0.0.jar scala-library-2.11.8.jar chill-java-0.8.0.jar chill_2.11-0.8.0.jar jackson-module-paranamer-2.6.5.jar jackson-module-scala_2.11-2.6.5.jar jersey-container-servlet-core-2.22.2.jar jersey-server-2.22.2.jar json4s-ast_2.11-3.2.11.jar kryo-shaded-3.0.3.jar minlog-1.3.0.jar scala-xml_2.11-1.0.2.jar spark-launcher_2.11-2.0.0.jar spark-network-shuffle_2.11-2.0.0.jar spark-unsafe_2.11-2.0.0.jar xbean-asm5-shaded-4.4.jar /opt/software/hive-2.3.6/lib/

Copy yarn-site.xml and hdfs-site.xml from Hadoop, together with Hive's hive-site.xml, into Spark's conf directory:

[xiaokang@hadoop01 ~]$ cp /opt/software/hadoop-2.7.7/etc/hadoop/hdfs-site.xml /opt/software/hadoop-2.7.7/etc/hadoop/yarn-site.xml /opt/software/hive-2.3.6/conf/hive-site.xml /opt/software/spark-2.0.0/conf/

3.5 Uploading the dependencies to HDFS

So that every node can run computations with the Spark engine, upload all the dependency jars from Spark's jars directory to HDFS; the target path must match spark.yarn.jars in hive-site.xml.

# Create a directory on HDFS to hold the Spark dependency jars
[xiaokang@hadoop01 ~]$ hdfs dfs -mkdir /spark-jars
# Upload all the dependency jars
[xiaokang@hadoop01 ~]$ hdfs dfs -put /opt/software/spark-2.0.0/jars/*.jar /spark-jars
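A quick sanity check that the upload matches the spark.yarn.jars glob configured in hive-site.xml (just a sketch; the exact listing depends on your build):

# The jars on HDFS should mirror the local jars directory
[xiaokang@hadoop01 ~]$ hdfs dfs -ls /spark-jars | tail -3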

3.6 Distributing Spark to the other nodes

[xiaokang@hadoop01 ~]$ distribution.sh /opt/software/spark-2.0.0
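distribution.sh is the author's own helper script for syncing a directory to the other nodes. If you do not have such a script, a minimal rsync-based sketch (hostnames taken from the slaves file, passwordless SSH assumed) is:

# Sync the Spark installation to the other nodes
[xiaokang@hadoop01 ~]$ for host in hadoop02 hadoop03; do rsync -a /opt/software/spark-2.0.0 ${host}:/opt/software/; done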

3.7 Starting the HA Spark cluster

# On hadoop01, start the Spark cluster and the history server
[xiaokang@hadoop01 ~]$ /opt/software/spark-2.0.0/sbin/start-all.sh
[xiaokang@hadoop01 ~]$ /opt/software/spark-2.0.0/sbin/start-history-server.sh
# On hadoop02, start the standby Master
[xiaokang@hadoop02 ~]$ /opt/software/spark-2.0.0/sbin/start-master.sh
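To confirm the cluster is up, check the daemons with jps: Master, Worker and HistoryServer should be running on hadoop01, and the standby Master on hadoop02 (the Master web UI defaults to port 8080). A quick check:

# Spark standalone daemons show up under these class names in jps
[xiaokang@hadoop01 ~]$ jps | grep -E 'Master|Worker|HistoryServer'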

4. Testing

# Start the HiveServer2 service
[xiaokang@hadoop01 ~]$ nohup hiveserver2 >/dev/null 2>&1 &

Run a test query from DBeaverEE:

-- Count each user's logins per day
select userid,dt,count(*) as loginTimes
from game_login
group by dt,userid;
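
The same query can also be run from the command line with beeline instead of DBeaverEE (a sketch; the default HiveServer2 port 10000 is assumed):

# Connect to HiveServer2 with beeline
[xiaokang@hadoop01 ~]$ beeline -u jdbc:hive2://hadoop01:10000 -n xiaokang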

Once the query runs, it shows up as a Spark application in both the Spark web UI and the YARN ResourceManager UI (screenshots omitted).