Environment: CDH 6.2.1

1. Enable the Hive on Spark configuration:

In the Hive service configuration, search for spark and change the following settings:

Spark On YARN Service: select the Spark service.

Default execution engine (hive.execution.engine): spark

Initial number of executors (spark.dynamicAllocation.initialExecutors): 2. The default number of executors launched when the SparkSession starts. If set spark.executor.instances=3 is configured (the number of executors to launch), the SparkSession starts with 3 executors instead.

Lower bound on executors (spark.dynamicAllocation.minExecutors): 1. The minimum number of executors: after dynamic resource allocation removes idle executors, at least 1 executor remains.

Upper bound on executors (spark.dynamicAllocation.maxExecutors): 4. The maximum number of executors the SparkSession can scale out to.

Spark executors per application (spark.executor.instances): 2. The number of executors launched when the SparkSession starts, i.e. the initial executor count; the value must lie between minExecutors and maxExecutors.
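
For a quick sanity check, the same properties can also be overridden per session with standard Hive set commands (only property names already listed above are used here); note that if a SparkSession is already running, Hive may have to start a new one before changed spark.* values take effect:

set hive.execution.engine=spark;
set spark.executor.instances=3;   -- start with 3 executors instead of the configured 2
select count(*) from a;           -- the next Spark query runs with the overridden values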

2. Test Hive on Spark in Hue

select * from a;  -- sql1
select count(*) from a;  -- sql2
create temporary function pinjie as 'com.topnet.MyUDF' using jar 'hdfs:///user/hive/udf/hiveUDF-1.0.jar';  -- sql3
select pinjie(name) from a;  -- sql4
drop function pinjie;  -- sql5

Because sql1 does not create an application under the hood and returns data directly, the query comes back very fast.

sql2 does create an app under the hood and requests resources from YARN. Since the SparkSession is being created for the first time, this takes quite a while. Running the count query again returns data quickly, because after the first SparkSession is created it is not closed: it stays resident in memory, so when the next query plan arrives, the session startup time is skipped and the SQL statement is executed directly.

sql3, sql4 and sql5 create, use, and drop the temporary UDF.
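
The source of com.topnet.MyUDF is not included in this article; the following is only a minimal, hypothetical sketch of what such a UDF could look like (the class name, function name and jar path come from sql3 above; "pinjie" suggests simple string concatenation, so the sketch just prepends a fixed prefix):

package com.topnet;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical sketch; the real com.topnet.MyUDF is not shown in the article.
public class MyUDF extends UDF {
    // Hive calls evaluate() once per input row; null in, null out.
    public Text evaluate(Text name) {
        if (name == null) {
            return null;
        }
        // Example behavior only: concatenate a fixed prefix onto the value.
        return new Text("hello_" + name.toString());
    }
}

Packaged as hiveUDF-1.0.jar and uploaded to hdfs:///user/hive/udf/, it can then be registered with the create temporary function statement shown above.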

3. Connect to Hive via Beeline

[root@hadoop163 ~]# beeline -u jdbc:hive2://hadoop162:10000 -n hive
0: jdbc:hive2://hadoop162:10000> select * from a;
+-------+---------+
| a.id  | a.name  |
+-------+---------+
| 1     | 小明      |
+-------+---------+
1 row selected (1.293 seconds)
0: jdbc:hive2://hadoop162:10000> select count(*) from a;
INFO  : Compiling command(queryId=hive_20210330102334_8c2258a0-2c5f-42ab-8d47-65f033967bc8): select count(*) from a
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20210330102334_8c2258a0-2c5f-42ab-8d47-65f033967bc8); Time taken: 0.101 seconds
INFO  : Executing command(queryId=hive_20210330102334_8c2258a0-2c5f-42ab-8d47-65f033967bc8): select count(*) from a
INFO  : Query ID = hive_20210330102334_8c2258a0-2c5f-42ab-8d47-65f033967bc8
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Running with YARN Application = application_1616659825062_0057
INFO  : Kill Command = /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/bin/yarn application -kill application_1616659825062_0057
INFO  : Hive on Spark Session Web UI URL: http://hadoop161:44976
INFO  : 
Query Hive on Spark job[0] stages: [0, 1]
INFO  : Spark job[0] status = RUNNING
INFO  : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
INFO  : 2021-03-30 10:24:38,990	Stage-0_0: 0(+1)/1	Stage-1_0: 0/1	
INFO  : 2021-03-30 10:24:42,005	Stage-0_0: 0(+1)/1	Stage-1_0: 0/1	
INFO  : 2021-03-30 10:24:43,010	Stage-0_0: 1/1 Finished	Stage-1_0: 0/1	
INFO  : 2021-03-30 10:24:44,013	Stage-0_0: 1/1 Finished	Stage-1_0: 1/1 Finished	
INFO  : Spark job[0] finished successfully in 9.09 second(s)
INFO  : Completed executing command(queryId=hive_20210330102334_8c2258a0-2c5f-42ab-8d47-65f033967bc8); Time taken: 69.551 seconds
INFO  : OK
+------+
| _c0  |
+------+
| 1    |
+------+
1 row selected (69.711 seconds)

Looking at YARN, you can see the launched app: its type is SPARK, and the resources it uses are visible.
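
The same application can also be listed from the shell with the standard YARN CLI (the -appTypes filter limits the output to Spark applications):

[root@hadoop163 ~]# yarn application -list -appTypes SPARK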

Running the count statement again takes only a bit over 1 second; you can see that the time to start the SparkSession is saved.

0: jdbc:hive2://hadoop162:10000> select count(*) from a;
INFO  : Compiling command(queryId=hive_20210330102952_17ea110d-01c4-4bba-8378-3876fbcf1cb0): select count(*) from a
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20210330102952_17ea110d-01c4-4bba-8378-3876fbcf1cb0); Time taken: 0.108 seconds
INFO  : Executing command(queryId=hive_20210330102952_17ea110d-01c4-4bba-8378-3876fbcf1cb0): select count(*) from a
INFO  : Query ID = hive_20210330102952_17ea110d-01c4-4bba-8378-3876fbcf1cb0
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : 2021-03-30 10:29:54,097	Stage-2_0: 1/1 Finished	Stage-3_0: 1/1 Finished	
INFO  : Spark job[1] finished successfully in 1.00 second(s)
INFO  : Completed executing command(queryId=hive_20210330102952_17ea110d-01c4-4bba-8378-3876fbcf1cb0); Time taken: 1.109 seconds
INFO  : OK
+------+
| _c0  |
+------+
| 1    |
+------+
1 row selected (1.271 seconds)
0: jdbc:hive2://hadoop162:10000>

4. Connect to Hive on Spark from Java code, using hive-jdbc

Add the dependency to the pom file:

<dependency>
   <groupId>org.apache.hive</groupId>
   <artifactId>hive-jdbc</artifactId>
   <version>2.1.1</version>
</dependency>
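
A minimal interactive client: it reads SQL statements from standard input in a loop and prints the first column of every result row.
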
import java.sql.*;
import java.util.Scanner;

public class HiveJDBC {
    public static void main(String[] args) throws ClassNotFoundException, SQLException, InterruptedException {
        String driverName = "org.apache.hive.jdbc.HiveDriver";
        String url = "jdbc:hive2://hadoop162:10000";
        String dbName = "default";
        Connection con = null;
        Statement state = null;
        ResultSet res = null;

        Class.forName(driverName);
        // A username must be specified (the password may be an empty string); without it, starting the SparkSession fails with an error.
        con= DriverManager.getConnection(url+"/"+dbName,"hive","");
        state = con.createStatement();
        Scanner scan = new Scanner(System.in);
        String sql=null;
        // Register a temporary UDF (optional)
        state.execute("create temporary function pinjie as 'com.topnet.MyUDF' using jar 'hdfs:///user/hive/udf/hiveUDF-1.0.jar'");
        while (true){
            System.out.println("亲输入sql:");
            if(scan.hasNext()){
                sql=scan.nextLine();
            }
            System.out.println(sql);
            res = state.executeQuery(sql);
            while (res.next()) {
                System.out.println(res.getString(1));
            }
            Thread.sleep(100);
        }
    }
}

5. Fetching the execution log of a running SQL statement

When running queries through hive-jdbc, the execution log of a statement can be fetched through the following HiveStatement methods:

List<String> getQueryLog()
List<String> getQueryLog(boolean incremental, int fetchSize)
boolean hasMoreLogs()

Since a single Hive query can sometimes run for a very long time, these three methods also make it possible to display the query's progress in real time.

To display the query progress in real time, a separate thread has to be started to fetch and print the log.

import org.apache.hive.jdbc.HiveStatement;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.sql.*;
import java.util.List;
import java.util.Scanner;

public class HiveJDBC {
    public static void main(String[] args) throws ClassNotFoundException, SQLException, InterruptedException {
        Logger log = LoggerFactory.getLogger(HiveJDBC.class);
        String driverName = "org.apache.hive.jdbc.HiveDriver";
        String url = "jdbc:hive2://hadoop162:10000";
        String dbName = "default";
        Connection con = null;
        Statement state = null;
        ResultSet res = null;

        Class.forName(driverName);
        // A username must be specified (the password may be an empty string); without it, starting the SparkSession fails with an error.
        con = DriverManager.getConnection(url + "/" + dbName, "hive", "");
        state = con.createStatement();
        Scanner scan = new Scanner(System.in);
        String sql = null;
        
        // Start a daemon thread that fetches and prints the SQL execution log
        Thread logThread = new Thread(new HiveLog((HiveStatement) state));
        logThread.setDaemon(true);
        logThread.start();

        // Register the temporary UDF (optional)
        state.execute("create temporary function pinjie as 'com.topnet.MyUDF' using jar 'hdfs:///user/hive/udf/hiveUDF-1.0.jar'");
        while (true) {
            System.out.println("亲输入sql:");
            if (scan.hasNext()) {
                sql = scan.nextLine();
            }
            log.error("打印日志sql语句:" + sql);
            res = state.executeQuery(sql);
            while (res.next()) {
                System.out.println(res.getString(1));
            }
            Thread.sleep(100);
        }
    }

    static class HiveLog implements Runnable {
        private final HiveStatement state;

        HiveLog(HiveStatement state) {
            this.state = state;
        }
        
        private void updateQueryLog() {
            try {
                List<String> queryLogs = state.getQueryLog();
                for (String log : queryLogs) {
                    System.out.println("进度信息-->" + log);
                }
            } catch (Exception e) {
                // Ignore transient fetch errors; the next poll will try again.
            }
        }
        
        @Override
        public void run() {
            try {
                // Poll continuously for new SQL execution log lines
                while (true) {
                    if (state.hasMoreLogs()) {
                        updateQueryLog();
                    }
                    Thread.sleep(100);
                }
            } catch (InterruptedException e) {
                // Restore the interrupt flag and let the logging thread exit.
                Thread.currentThread().interrupt();
            }
        }
    }
}
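
Two details in this sketch are worth noting: the log thread is a daemon, so it does not keep the JVM alive on its own, and the no-argument getQueryLog() fetches incrementally, so each call should return only the log lines that have not been fetched yet.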

The SQL execution progress log is now printed while the query runs.

In the YARN management UI the application is visible.

The Spark UI shows the corresponding jobs and stages.
