hive执行sql脚本参数 hive执行sql脚本文件带日期

转载

mob64ca141275de 2023-09-18 21:28:17

文章标签 hive执行sql脚本参数 hive 字段 unix 文章分类 Hive 大数据

目前的工作需要在公司平台上通过hive导出数据，定时任务的权限没有开放给我，所以只能每天手动导，手动导数时又要天天手工改日期，很麻烦，所以想要where子句能够每天自动有对应的日期。以上是前提，有太多槽点，请忽略。

首先，要思考的是日期字段的数据类型，以及hive是否也有隐式转换

参考这篇文章 HIVE 数据类型转换

	bl	tinyint	si	int	bigint	float	double	dm	string	vc	ts	date	ba
boolean	true	false	false	false	false	false	false	false	false	false	false	false	false
tinyint	false	true	true	true	true	true	true	true	true	true	false	false	false
smallint	false	false	true	true	true	true	true	true	true	true	false	false	false
int	false	false	false	true	true	true	true	true	true	true	false	false	false
bigint	false	false	false	false	true	true	true	true	true	true	false	false	false
float	false	false	false	false	false	true	true	true	true	true	false	false	false
double	false	false	false	false	false	false	true	true	true	true	false	false	false
decimal	false	false	false	false	false	false	false	true	true	true	false	false	false
string	false	false	false	false	false	false	true	true	true	true	false	false	false
varchar	false	false	false	false	false	false	true	true	true	true	false	false	false
ts	false	false	false	false	false	false	false	false	true	true	true	false	false
date	false	false	false	false	false	false	false	false	true	true	false	true	false
binary	false	false	false	false	false	false	false	false	false	false	false	false	true

对应表中的日期字段day是以string的类型保存的，类似 '20190618',同时这个字段还是分区字段。。

1、datediff函数的问题

当执行时间为周一时，我希望日期限制的是上周五到周日，其他执行时间则仅昨日。那么加入个if判断以下执行时间是否为周一吧。

用pmod(datediff(current_date, '2012-01-01'), 7)判断，0~6对应周日至周一，没什么问题吧？

然后就出问题了，出来的星期不对。经过排查，锁定以下问题

select datediff(current_date, '2019-06-18')
        , datediff('2019-06-19', '2019-06-18')
        , current_date, current_date = '2019-06-19'

结果如下

0	1	2019-06-19	true

我百思不得其解，只能百度。

--------------------> Hive SQL中的datediff、current_date使用问题 <---------------------------

[ https://issues.apache.org/jira/browse/HIVE-18304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297824#comment-16297824
] 
 
Hengyu Dai commented on HIVE-18304:
-----------------------------------
 
SimpleDateFormat.parse(String source) method will convert String type(UTC) to java.util.Date
type(use current JVM timezone), this may lead deviations in time when JVM timezone is not
UTC, my environment is GMT+8,  8 hours is added comparing to the UTC time.
while for a  date type argument, the default JVM timezone is used. 
The patch uploaded treats String type and Date type at the same way to remove the deviations.
 
> datediff() UDF returns a wrong result when dealing with a (date, string) input
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-18304
>                 URL: https://issues.apache.org/jira/browse/HIVE-18304
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Hengyu Dai
>            Assignee: Hengyu Dai
>            Priority: Minor
>         Attachments: 0001.patch
>
>
> for date type argument, datediff() use DateConverter to convert input to a java Date
object, 
> for example, a '2017-12-18' will get 2017-12-18T00:00:00.000+0800
> for string type argument, datediff() use TextConverter to convert a string to date,
> for '2012-01-01' we will get 2012-01-01T08:00:00.000+0800
> now, datediff() will return a number less than the real date diff
> we should use TextConverter to deal with date input too.
> reproduce:
> {code:java}
> select datediff(cast('2017-12-18' as date), '2012-01-01'); --2177
> select datediff('2017-12-18', '2012-01-01'); --2178
> {code}

大体意思就是对，date类型和string类型的两个参数，hive调用了不同的函数处理，处理出来的带时间的日期数据相减。

而要想解决的话很简单，统一两个参数的数据类型就行了。

select datediff(current_date, cast('2019-06-18' as date))
        , datediff(to_date(current_date), '2019-06-18')

2、使用hiveconf变量

把这么长的式子写在where后面实在太难看，所以我决定使用变量传参的方式

找到了以下文章

hive中的参数

set end_day = '20190618';
select * where day = ${hiveconf:end_day};

另，当某些时候用参数传递给分区字段报错时，加入下面这段

SET hive.exec.dynamic.partition.mode = nonstrict;

3、避免科学计数法

开始时，我使用的是这种方法，

select from_unixtime(unix_timestamp(), 'yyyyMMdd') - 1

结果得到的是 2.0190618E7

用这个作为where分区限制时，报错

Error while compiling statement: FAILED: SemanticException [Error 10041]: No partition predicate found

我猜测又是类型的问题

所以我使用cast显式转换格式成string，但科学计数法表示的数字转换成字符，那画面实在太美。

所以，为了解决这种计算结果改变数据类型的情况，

cast(bigint(from_unixtime(unix_timestamp(), 'yyyyMMdd') - 1 )as string)

我套了一个bigint

好了，没问题了，但是仍然报错。

ps: string形式形如’20190618‘格式的分区字段，用

cast(bigint(from_unixtime(unix_timestamp(), 'yyyyMMdd') - 1 )as string) 报错

用 cast(bigint('20190619'- 1 )as string) 没事。可能还是数据类型的问题，但我已放弃治疗

3、最后的解决方案

SET start_day = if(pmod(datediff(to_date(current_date), '2012-01-01'), 7) = 1, regexp_replace(date_sub(current_date, 3), '-', ''), regexp_replace(date_sub(current_date, 1), '-', ''));
SET end_day = regexp_replace(date_sub(current_date, 1), '-', '');

hive的数据类型实在是个坑

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。