Two Ways to Use Variables in Hive Development

peishuai1987, 2023-07-06 09:38:24

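© Copyright belongs to the author: this is an original work by 51CTO blog author peishuai1987. Contact the author for permission to repost; otherwise legal action will be pursued.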

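When developing data-analysis code in Hive, you often need to change run-time parameters. A typical case is the date value in a select statement: you may want to look at a different day's data on each run, so the date has to be changeable dynamically. With a large code base and many parameters, replacing literal values with variables becomes essential. This article summarizes two ways to pass parameters into Hive SQL to meet that need.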

Preparing the test table and test data

First, prepare the test table and the test data used in the tests that follow:

hive> create database test;
OK
Time taken: 2.606 seconds

Then run the SQL file that creates the table and loads the data:

[czt@www.crazyant.net testHivePara]$ hive -f student.sql
Hive history file=/tmp/crazyant.net/hive_job_log_czt_201309131615_1720869864.txt
OK
Time taken: 2.131 seconds
OK
Time taken: 0.878 seconds
Copying data from file:/home/users/czt/testdata_student
Copying file: file:/home/users/czt/testdata_student
Loading data to table test.student
OK
Time taken: 1.76 seconds

The contents of student.sql are:

use test;

-- Student information table
create table IF NOT EXISTS student (
    sno bigint comment 'student number',
    sname string comment 'name',
    sage bigint comment 'age',
    pdate string comment 'enrollment date'
)
COMMENT 'student information table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH
'/home/users/czt/testdata_student'
INTO TABLE student;

The test data file testdata_student contains the following rows (fields are tab-separated, matching the table definition):

1 name1 21 20130901
2 name2 22 20130901
3 name3 23 20130901
4 name4 24 20130901
5 name5 25 20130902
6 name6 26 20130902
7 name7 27 20130902
8 name8 28 20130902
9 name9 29 20130903
10 name10 30 20130903
11 name11 31 20130903
12 name12 32 20130904
13 name13 33 20130904

Method 1: set variables in the shell and use them directly in hive -e

The test shell script (shellhive.sh):

#!/bin/bash
tablename="student"
limitcount="8"

hive -S -e "use test; select * from ${tablename} limit ${limitcount};"

Result:

[czt@www.crazyant.net testHivePara]$ sh -x shellhive.sh
+ tablename=student
+ limitcount=8
+ hive -S -e 'use test; select * from student limit 8;'
1        name1      21        20130901
2        name2      22        20130901
3        name3      23        20130901
4        name4      24        20130901
5        name5      25        20130902
6        name6      26        20130902
7        name7      27        20130902
8        name8      28        20130902

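Hive's language is SQL-like and lacks the flexibility and flow control of a shell, so a shell + Hive development style is very common: define a variable in the shell, and reference it directly in the hive -e statement.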
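Because the shell pastes the variable values straight into the SQL text, it can help to check them before building the query. The following is a minimal sketch of that idea, not part of the original script: the helper name build_query and its validation rules are assumptions made for illustration, and the script only prints the query instead of calling hive.

```shell
#!/bin/bash
# Hypothetical helper: build the query text from shell variables.
build_query() {
    local table="$1" limit="$2"
    # Accept only letters, digits and underscores as a table name.
    case "$table" in
        ''|*[!A-Za-z0-9_]*) echo "invalid table name: $table" >&2; return 1 ;;
    esac
    # Accept only a non-empty string of digits as the limit.
    case "$limit" in
        ''|*[!0-9]*) echo "invalid limit: $limit" >&2; return 1 ;;
    esac
    printf 'use test; select * from %s limit %s;' "$table" "$limit"
}

query="$(build_query student 8)"
echo "$query"
# To run the query, pass it to Hive:
# hive -S -e "$query"
```

Keeping the query in a variable also makes it easy to print and inspect what Hive will receive, which is the same information that sh -x shows in the trace.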

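Note: variables defined with -hiveconf cannot be referenced this way in hive -e, because the shell expands the double-quoted string before Hive sees it.

Modify the shell script above so that the date parameter is defined with -hiveconf: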

#!/bin/bash
tablename="student"
limitcount="8"

hive -S \
     -hiveconf enter_school_date="20130902" \
     -hiveconf min_age="26" \
     -e \
     "    use test; \
        select * from ${tablename} \
        where \
            pdate='${hiveconf:enter_school_date}' \
            and \
            sage>'${hiveconf:min_age}' \
        limit ${limitcount};"

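This run fails. The script executes in a shell, so the shell itself tries to expand ${hiveconf:enter_school_date} and ${hiveconf:min_age} before Hive ever sees them (bash reads each one as a substring expansion of a shell variable named hiveconf). No such shell variable is defined, so an empty string is substituted at each position.

At run time the SQL statement is expanded into the following: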

+ hive -S -hiveconf enter_school_date=20130902 -hiveconf min_age=26 -e 'use test; explain select * from student where pdate='\'''\'' and sage>'\'''\'' limit 8;'
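One possible workaround, sketched here as an assumption rather than something the original article tested, is to escape the dollar sign as \$ so the shell leaves the ${hiveconf:...} references intact for Hive to substitute. The script below only builds and prints the query text, so the effect of the escaping can be checked without a Hive installation.

```shell
#!/bin/bash
tablename="student"
limitcount="8"

# \$ keeps the dollar sign literal inside double quotes, so the
# ${hiveconf:...} references reach Hive unexpanded, while
# ${tablename} and ${limitcount} are still expanded by the shell.
query="use test; select * from ${tablename} where pdate='\${hiveconf:enter_school_date}' and sage>'\${hiveconf:min_age}' limit ${limitcount};"
echo "$query"

# To run it, pass the same -hiveconf definitions as before:
# hive -S -hiveconf enter_school_date="20130902" -hiveconf min_age="26" -e "$query"
```

The printed text still contains ${hiveconf:enter_school_date} and ${hiveconf:min_age}, which is what Hive needs to see in order to apply the -hiveconf values.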

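Method 2: define variables with -hiveconf and use them in a SQL file

Line breaks and quoting make hive -e awkward, so it only suits small amounts of SQL. In practice most code lives in HQL files that are invoked with hive -f; in that case you can define variables with -hiveconf and use them directly in the SQL.

First write the calling shell script: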

#!/bin/bash

hive -hiveconf enter_school_date="20130902" -hiveconf min_ag="26" -f testvar.sql

The contents of the testvar.sql file it calls:

use test;

select * from student
where
    pdate = '${hiveconf:enter_school_date}'
    and
    sage > '${hiveconf:min_ag}'
limit 8;

Execution:

[czt@www.crazyant.net testHivePara]$ sh -x shellhive.sh
+ hive -hiveconf enter_school_date=20130902 -hiveconf min_ag=26 -f testvar.sql
Hive history file=/tmp/czt/hive_job_log_czt_201309131651_2035045625.txt
OK
Time taken: 2.143 seconds
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Kill Command = hadoop job -kill job_20130911213659_42303
2013-09-13 16:52:00,300 Stage-1 map = 0%,  reduce = 0%
2013-09-13 16:52:14,609 Stage-1 map = 28%,  reduce = 0%
2013-09-13 16:52:24,642 Stage-1 map = 71%,  reduce = 0%
2013-09-13 16:52:34,639 Stage-1 map = 98%,  reduce = 0%
Ended Job = job_20130911213659_42303
OK
7        name7    27        20130902
8        name8    28        20130902
Time taken: 54.268 seconds
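The calling script above hard-codes both values. A small variation, sketched below, takes them as optional arguments so the same testvar.sql can be run for different dates. The helper name build_hive_cmd and the choice of defaults are assumptions for illustration; the sketch prints the command line rather than invoking hive.

```shell
#!/bin/bash
# Hypothetical helper: assemble the hive command line for testvar.sql.
# Both arguments are optional and fall back to the values used above.
build_hive_cmd() {
    local enter_school_date="${1:-20130902}"
    local min_age="${2:-26}"
    # min_ag is the variable name that testvar.sql references.
    printf 'hive -hiveconf enter_school_date=%s -hiveconf min_ag=%s -f testvar.sql' \
        "$enter_school_date" "$min_age"
}

# Print the default command line, then one for a different date and age.
build_hive_cmd; echo
build_hive_cmd 20130903 28; echo
```

To execute the query instead of printing the command, call hive directly with the same -hiveconf arguments.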

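Summary

This article described two ways to use variables in Hive. The first defines variables in the shell and references them as ${var_name} directly in the SQL passed to hive -e. The second uses the form hive -hiveconf key=value -f run.sql to set variables with -hiveconf and references them in the SQL file as ${hiveconf:varname}. Together they cover the need to pass parameters to Hive during development, and they help improve development efficiency and code quality.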