Hive String to Array / Hive String to Number

Repost

mob6454cc72f29c 2023-07-10 13:30:15


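When Hive's floating-point types (float and double) hold large values, the client always displays them in scientific notation. Functions that return double, such as pow, behave the same way even when the result is a whole number. For example: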

hive> select pow(10,8) from dual;
OK
1.0E8

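Plain notation and scientific notation are just two conventions for the same value, so this display is harmless on its own. What is surprising is that when a numeric value is cast to a string, Hive produces the scientific-notation text rather than the plain digits of the number. For example: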

hive> select cast(pow(10,8) as string) from dual;
OK
1.0E8
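
Hive runs on the JVM, and the output above has the same shape as the text produced by Java's Double.toString. Assuming that is where the format comes from (an assumption, not something the Hive output itself confirms), the small sketch below shows when Java switches notation: values of at least 10^7 or below 10^-3 in magnitude are printed in scientific notation. The class name DoubleToStringDemo and the method render are made up for this illustration.

```java
class DoubleToStringDemo {
    // Returns Java's default text form of a double value.
    static String render(double d) {
        return Double.toString(d);
    }

    public static void main(String[] args) {
        System.out.println(render(Math.pow(10, 8))); // 1.0E8
        System.out.println(render(9999999.0));       // 9999999.0 (below 10^7, plain notation)
        System.out.println(render(0.0001));          // 1.0E-4 (below 10^-3, scientific notation)
    }
}
```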

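This is where the trouble starts. Business requirements often demand that numbers be stored as strings, but a string in scientific notation no longer carries the plain value of the number. For example, casting such a string back to a numeric type gives the wrong result: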

hive> select cast('1.0E8' as int) from dual;
OK
1

So we need a way to stop Hive from using scientific notation when it converts a numeric value to a string.

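I searched through a lot of material and found no Hive setting, such as a configuration parameter, that controls this. The conversion therefore has to be done some other way.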

When the values to convert are whole numbers with no fractional part, we can cast them to bigint first. A bigint value is never shown in scientific notation. For example:

hive> select cast(pow(10,8) as bigint) from dual;
OK
100000000
hive> select cast(cast(pow(10,8) as bigint) as string) from dual;
OK
100000000

However, bigint can only hold integers, so this approach does not work for floating-point values.

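That left me with the most primitive approach: parsing the string.

Below is the SQL expression I wrote to convert a string in scientific notation into a string in plain notation: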

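case
-- Strings that are not in scientific notation: return as-is
-- ('str' stands for the string or column being converted)
when length(regexp_extract('str','([0-9]+\\.)([0-9]+)(E-*[0-9]+)',2))=0
then 'str'
-- Integers: the exponent covers every fractional digit, so pad with zeros
when length(regexp_extract('str','([0-9]+\\.)([0-9]+)(E[0-9]+)',2))<=cast(regexp_extract('str','(E)([0-9]+)',2) as int)
then rpad(regexp_replace(regexp_extract('str','([^E]+)',1),'\\.',''),cast(regexp_extract('str','(E)([0-9]+)',2) as int)+1,'0')
-- Decimals: move the decimal point right by the exponent
when length(regexp_extract('str','([0-9]+\\.)([0-9]+)(E[0-9]+)',2))>cast(regexp_extract('str','(E)([0-9]+)',2) as int)
then concat(substr(regexp_replace(regexp_extract('str','([^E]+)',1),'\\.',''),1,cast(regexp_extract('str','(E)([0-9]+)',2) as int)+1),'.',
substr(regexp_replace(regexp_extract('str','([^E]+)',1),'\\.',''),cast(regexp_extract('str','(E)([0-9]+)',2) as int)+2))
-- Strings with a negative exponent, such as "3.4E-6"
when 'str' regexp 'E-'
then concat('0.',repeat('0',cast(regexp_extract('str','(E)(-)([0-9]+)',3) as int)-1),regexp_replace(regexp_extract('str','(.+)(E)',1),'\\.',''))
else 'str'
end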

Applied to a real column (wdir_ratio), the full statement looks like this:

insert overwrite local directory '/data/temp/temptxt/wdir27scope'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select lon,lat,direc_scope,
(case
when length(regexp_extract(wdir_ratio,'([0-9]+\\.)([0-9]+)(E-*[0-9]+)',2))=0
then wdir_ratio
when length(regexp_extract(wdir_ratio,'([0-9]+\\.)([0-9]+)(E[0-9]+)',2))<=cast(regexp_extract(wdir_ratio,'(E)([0-9]+)',2) as int)
then rpad(regexp_replace(regexp_extract(wdir_ratio,'([^E]+)',1),'\\.',''),cast(regexp_extract(wdir_ratio,'(E)([0-9]+)',2) as int)+1,'0')
when length(regexp_extract(wdir_ratio,'([0-9]+\\.)([0-9]+)(E[0-9]+)',2))>cast(regexp_extract(wdir_ratio,'(E)([0-9]+)',2) as int)
then concat(substr(regexp_replace(regexp_extract(wdir_ratio,'([^E]+)',1),'\\.',''),1,cast(regexp_extract(wdir_ratio,'(E)([0-9]+)',2) as int)+1),'.',
substr(regexp_replace(regexp_extract(wdir_ratio,'([^E]+)',1),'\\.',''),cast(regexp_extract(wdir_ratio,'(E)([0-9]+)',2) as int)+2))
when wdir_ratio regexp 'E-'
then concat('0.',repeat('0',cast(regexp_extract(wdir_ratio,'(E)(-)([0-9]+)',3) as int)-1),regexp_replace(regexp_extract(wdir_ratio,'(.+)(E)',1),'\\.',''))
else wdir_ratio
end)
from wdir27_scope_result4
where direc_scope <> 'error';

Of course, this logic is best wrapped in a UDF, which keeps the queries concise and makes the code much more readable.
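
As a rough idea of what the core of such a UDF might look like, here is a minimal Java sketch. It does not port the regular-expression logic above. Instead it swaps in java.math.BigDecimal, whose toPlainString method renders a number without an exponent. The class name PlainNumberString and the method toPlain are hypothetical, and the Hive wrapper is left out because it needs the Hive libraries on the classpath, so treat this as a starting point rather than a finished UDF.

```java
import java.math.BigDecimal;

class PlainNumberString {
    // Converts text such as "1.0E8" or "3.4E-6" to plain decimal notation.
    // Input that is null or not a valid number is returned unchanged,
    // mirroring the "else" branch of the SQL expression.
    static String toPlain(String s) {
        if (s == null) {
            return null;
        }
        try {
            // BigDecimal parses scientific notation; toPlainString never emits an exponent.
            return new BigDecimal(s.trim()).toPlainString();
        } catch (NumberFormatException e) {
            return s;
        }
    }

    public static void main(String[] args) {
        System.out.println(toPlain("1.0E8"));    // 100000000
        System.out.println(toPlain("1.2345E2")); // 123.45
        System.out.println(toPlain("3.4E-6"));   // 0.0000034
    }
}
```

Unlike the SQL expression, this sketch also copes with negative values such as "-1.0E8", because the sign is parsed as part of the number instead of being counted as a character when padding.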
