Hive UDF Function List and Hive Unicode Functions

Reposted

mob6454cc7b19b2 2023-07-12 10:10:40


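Hive functions and performance optimization

Hive function categories
Built-in functions

Standard functions

String functions
Type conversion functions
Math functions
Date functions
Collection functions
Conditional functions

Aggregate functions
Table-generating functions

Hive UDF: user-defined standard functions

Hive UDF implementation workflow

Hive transactions

Overview
Features and limitations of Hive transactions
Enabling and configuring Hive transactions
Hive PLSQL
Hive performance tuning tool - EXPLAIN
Hive performance tuning tool - ANALYZE

Hive optimization design

Job optimization

Local mode
JVM reuse
Parallel execution

Query optimization
Compression algorithms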

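Hive function categories

By input and output
Standard functions: take one or more columns of a single row as input and return a single value
Aggregate functions: take zero or more columns across multiple rows as input and return a single value
Table-generating functions: take zero or more inputs and return multiple columns or multiple rows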
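
The three input/output shapes can be mirrored in plain Java. This is only an illustrative sketch: the class and method names are hypothetical and no Hive API is involved. Each method imitates the shape of one category, loosely modelled on upper(), sum(), and explode().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Hive.
class FunctionShapes {
    // Standard function shape: one value from one row in, one value out.
    static String standard(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    // Aggregate function shape: values from many rows in, one value out.
    static int aggregate(List<Integer> rows) {
        return rows.stream().mapToInt(Integer::intValue).sum();
    }

    // Table-generating function shape: one value in, many rows out.
    static List<String> tableGenerating(String csv) {
        return Arrays.stream(csv.split(",")).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(standard("hive"));                  // HIVE
        System.out.println(aggregate(Arrays.asList(1, 2, 3))); // 6
        System.out.println(tableGenerating("a,b,c"));          // [a, b, c]
    }
}
```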
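By implementation

1. Built-in functions
2. User-defined functions:

2.1. UDF: user-defined standard function
2.2. UDAF: user-defined aggregate function
2.3. UDTF: user-defined table-generating function

Built-in functions

Hive ships with a large set of built-in functions for developers

Standard functions

String functions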


Example:

-- Convert every customer's full name in the customers table to upper case
select upper(concat(customer_fname, "·", customer_lname)) from customers;

Type conversion functions

Example:

-- binary() takes a single string or binary argument and casts it to binary
select customer_street, binary(customer_street) as rst from customers;

Math functions

Example:

-- Round the item subtotal in the order_items table to two decimal places
select round(order_item_subtotal, 2) from order_items;

Date functions

Example:

select from_unixtime(1600740000, "yyyy-MM-dd HH:mm:ss.S") as rst1, unix_timestamp() as rst2, unix_timestamp("1970-01-01 08:00:00") as rst3,
       to_date("2020-09-22 09:43:20") as rst4, datediff("2020-09-22 09:43:20", "2020-09-22 23:43:20") as rst5,
       date_add("2020-09-22 09:43:20", -1) as rst6, date_format("2020-09-22 09:43:20", "yyyy/MM/dd HH:mm:ss") as rst7;

Collection functions

Conditional functions

Example:

select if(1=2, 1, 2) as rst1, nvl(null, "abc") as rst2, isnull(null) as rst3, isnotnull("null") as rst4;

Aggregate functions

count, sum, max, min, avg, var_samp, and so on
Example:

-- Count the number of orders per month in the orders table
select date_format(order_date, "yyyy-MM"), count(distinct order_id) from orders group by date_format(order_date, "yyyy-MM");

Table-generating functions

Their output can be used as a table

Example:

select explode(str_to_map(customer_street, ":", " ")) from customers;
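
To make the one-row-in, many-rows-out behaviour concrete, here is a rough plain-Java imitation of str_to_map followed by explode. It is a sketch under assumptions: the class and method names are hypothetical, the delimiters are treated as literal text (a simplification), and the sample input uses "," and "=" rather than the delimiters in the query above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Hive.
class StrToMapSketch {
    // Imitates str_to_map(text, pairDelim, kvDelim): split the text into
    // pairs, then split each pair into a key and a value.
    static Map<String, String> strToMap(String text, String pairDelim, String kvDelim) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : text.split(Pattern.quote(pairDelim))) {
            String[] kv = pair.split(Pattern.quote(kvDelim), 2);
            // A pair without the key/value delimiter gets a null value here.
            result.put(kv[0], kv.length > 1 ? kv[1] : null);
        }
        return result;
    }

    public static void main(String[] args) {
        // Exploding a map yields one (key, value) row per entry.
        for (Map.Entry<String, String> e : strToMap("a=1,b=2", ",", "=").entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

One input string becomes two output rows in this run, which is the defining trait of a table-generating function.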


Hive UDF: user-defined standard functions

Hive UDF implementation workflow

1. Create a Java Maven project and write a class that extends UDF and implements the evaluate() method (in this example, one evaluate() method per class)
Dependencies in pom.xml:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.1</version>
</dependency>

Java code:

TestUDF class:

package cn.kgc.kb09.testudf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
/**
 * @Qianchun
 * @Date 2020-09-22
 * @Description
 */
public class TestUDF extends UDF {
    public Text evaluate(Text str) {
        if (null == str) {
            return null;
        }
        return new Text(str.toString().toUpperCase());
    }
    //    public static void main(String[] args) {
//        TestUDF tu=new TestUDF();
//        Text rst=tu.evaluate(new Text(args[0]));
//        System.out.println(rst);
//    }
}

AddHour class:

package cn.kgc.kb09.testudf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import java.text.SimpleDateFormat;
import java.util.Date;
/**
 * @Qianchun
 * @Date 2020-09-22
 * @Description
 */
public class AddHour extends UDF {
    public Text evaluate(Text beforeDate, IntWritable hours) throws Exception {
        // Hive passes NULL column values as null, so return null instead of throwing
        if (beforeDate == null || hours == null) {
            return null;
        }
        // Convert the Hive string into a Java Date
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date date = sdf.parse(beforeDate.toString());
        // Use long arithmetic so a large hour count does not overflow int
        String rst = sdf.format(new Date(date.getTime() + hours.get() * 3600000L));
        return new Text(rst);
    }
//    public static void main(String[] args) throws Exception {
//        AddHour ah=new AddHour();
//        Text t=new Text("2020-09-22 10:20:00");
//        IntWritable iw=new IntWritable(3);
//        System.out.println(ah.evaluate(t, iw));
//    }
}

HourDiff class:

package cn.kgc.kb09.testudf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import java.text.SimpleDateFormat;
import java.util.Date;
/**
 * @Qianchun
 * @Date 2020-09-22
 * @Description
 */
public class HourDiff extends UDF  {
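    public IntWritable evaluate(Text date1, Text date2) throws Exception {
        // Hive passes NULL column values as null, so return null instead of throwing
        if (date1 == null || date2 == null) {
            return null;
        }
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date dt1 = sdf.parse(date1.toString());
        Date dt2 = sdf.parse(date2.toString());
        // Difference in whole hours (date1 minus date2), truncated toward zero
        long diff = dt1.getTime() - dt2.getTime();
        return new IntWritable((int) (diff / 1000 / 60 / 60));
    }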
//    public static void main(String[] args) throws Exception{
//        HourDiff hd=new HourDiff();
//        System.out.println(hd.evaluate(new Text("2020-09-21 12:00:00"),
//                new Text("2020-09-22 23:00:00")));
//    }
}
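
The date arithmetic in AddHour and HourDiff can be tried out without a Hive classpath. The sketch below is an assumption-laden variant, not the code from the article: it uses java.time instead of SimpleDateFormat, works on plain String and int values instead of Text and IntWritable, and the class and method names are hypothetical. Because LocalDateTime ignores time zones, results can differ from the SimpleDateFormat version around daylight-saving changes.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper class; it does not extend Hive's UDF class.
class DateUdfLogic {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Same idea as AddHour.evaluate: shift a timestamp by a number of hours.
    static String addHours(String dateTime, int hours) {
        if (dateTime == null) {
            return null; // mirror SQL NULL handling
        }
        return LocalDateTime.parse(dateTime, FMT).plusHours(hours).format(FMT);
    }

    // Same idea as HourDiff.evaluate: whole hours of (d1 - d2).
    static Integer hourDiff(String d1, String d2) {
        if (d1 == null || d2 == null) {
            return null;
        }
        Duration diff = Duration.between(LocalDateTime.parse(d2, FMT),
                                         LocalDateTime.parse(d1, FMT));
        return (int) diff.toHours();
    }

    public static void main(String[] args) {
        System.out.println(addHours("2020-09-22 10:20:00", 3));                    // 2020-09-22 13:20:00
        System.out.println(hourDiff("2020-09-21 12:00:00", "2020-09-22 23:00:00")); // -35
    }
}
```

Keeping the logic in a plain helper like this makes it easy to unit test before wrapping it in an evaluate() method.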

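2. Build a fat jar (one that bundles all dependencies)
3. Upload the jar to the Linux host
Preparation: install zip, then delete the signature files from the jar's META-INF directory, because stale signatures in a repackaged fat jar can cause the JVM to reject it:

yum install -y zip

zip -d testudf.jar 'META-INF/*.SF' 'META-INF/*.RSA' 'META-INF/*SF'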

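Temporary UDF

4. In the Hive CLI, run add jar followed by the jar path to load the jar for the current session

add jar /root/testudf.jar;

5. create temporary function <function name> as '<fully qualified class name>';

create temporary function add_hour as 'cn.kgc.kb09.testudf.AddHour';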

Permanent UDF

4. From the Linux shell, upload the jar to an HDFS path with the hdfs command

hdfs dfs -mkdir -p /apps/hive/functions
hdfs dfs -put testudf.jar /apps/hive/functions

5. create function <function name> as '<fully qualified class name>' using jar '<HDFS path of the jar>';

create function demo as 'cn.kgc.kb09.testudf.TestUDF' using jar 'hdfs:///apps/hive/functions/testudf.jar';

create function hour_diff as 'cn.kgc.kb09.testudf.HourDiff' using jar 'hdfs:///apps/hive/functions/testudf.jar';

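Hive transactions

Overview

A transaction is a group of operations treated as a single unit: either all of them are executed or none of them are
ACID properties
Atomicity
Consistency
Isolation
Durability

Features and limitations of Hive transactions

Row-level transactions are supported starting with v0.14
INSERT, DELETE, and UPDATE are supported (MERGE is supported starting with v2.2.0)
ORC is the only supported file format
Limitations
The table must be bucketed
Transactions cost extra time, resources, and storage
BEGIN, COMMIT, and ROLLBACK are not supported, and neither are updates to bucketing or partition columns
Locks are either shared or exclusive (operations are serialized rather than run concurrently)
Reading or writing an ACID table from a non-ACID session is not allowed
Rarely used in practice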

Enabling and configuring Hive transactions

Set from the Hive command line, effective for the current session only

-- Enable transactions from the command line
set hive.support.concurrency = true;
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1;

Set in the configuration file, effective globally

<!-- Set in the hive-site.xml configuration file -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>

Set through a UI tool (such as Ambari)

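Hive PLSQL

Hive PLSQL: stored procedures for Hive (available since v2.0)
Also works with SparkSQL and Impala
Compatible with Oracle, DB2, MySQL, and T-SQL syntax
Makes migrating existing procedures to Hive simple and efficient
Lets you write UDFs without Java skills
Slightly slower than a Java UDF
A relatively new feature
Run ./hplsql from the Hive2 bin directory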

Hive performance tuning tool - EXPLAIN

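Hive performance tuning tool - ANALYZE

ANALYZE: analyzes table data and provides input for choosing an execution plan
Collects table statistics such as row count and maximum values
The optimizer uses these statistics to speed up queries
Syntax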

ANALYZE TABLE employee COMPUTE STATISTICS; 
ANALYZE TABLE employee_partitioned 
PARTITION(year=2014, month=12) COMPUTE STATISTICS;
ANALYZE TABLE employee_id COMPUTE STATISTICS 
FOR COLUMNS employee_id;

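Hive optimization design

Use partitioned and bucketed tables
Use indexes (note that index support was removed in Hive 3.0)
Use a suitable file format, such as ORC, Avro, or Parquet
Use a suitable compression codec, such as Snappy
Consider data locality, for example by adding replicas
Avoid small files
Use the Tez engine instead of MapReduce
Use Hive LLAP (in-memory caching for reads)
Consider turning off concurrency when it is not needed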

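Job optimization

Local mode

Hive can automatically convert a job to run in local mode
When the data to process is small, starting a fully distributed job takes longer than the processing itself

-- Enable local mode with the following settings
SET hive.exec.mode.local.auto=true; -- default false
SET hive.exec.mode.local.auto.inputbytes.max=50000000;
SET hive.exec.mode.local.auto.input.files.max=5; -- default 4

A job must meet all of the following conditions to run in local mode
Total input size is less than hive.exec.mode.local.auto.inputbytes.max
Total number of map tasks is less than hive.exec.mode.local.auto.input.files.max
Total number of reduce tasks required is 1 or 0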
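
The three conditions combine into a single yes/no decision. The following sketch spells that out in Java; it is not Hive's actual implementation, the class, method, and parameter names are hypothetical, and the strict "less than" comparisons simply follow the wording of the list above.

```java
// Hypothetical illustration of the local-mode eligibility rule; not Hive code.
class LocalModeCheck {
    static boolean canRunLocally(long inputBytes, int mapTasks, int reduceTasks,
                                 long maxInputBytes, int maxInputFiles) {
        return inputBytes < maxInputBytes   // total input size under the byte limit
                && mapTasks < maxInputFiles // map task count under the file limit
                && reduceTasks <= 1;        // at most one reduce task
    }

    public static void main(String[] args) {
        // Limits taken from the SET statements above: 50000000 bytes and 5 files.
        System.out.println(canRunLocally(10_000_000L, 3, 1, 50_000_000L, 5)); // true
        System.out.println(canRunLocally(60_000_000L, 3, 1, 50_000_000L, 5)); // false
    }
}
```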

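JVM reuse

JVM reuse cuts the cost of starting JVMs
By default, each map or reduce task starts a new JVM
When map or reduce tasks are short-lived, JVM startup is a large share of the total cost
A shared JVM is reused to run the tasks of a MapReduce job one after another
This applies to map or reduce tasks within the same job
Tasks from different jobs always run in separate JVMs

-- Enable JVM reuse with the following setting
set mapred.job.reuse.jvm.num.tasks = 5;  -- default 1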

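Parallel execution

Parallel execution can improve cluster utilization
A Hive query is usually translated into several stages that run one after another by default
These stages do not always depend on each other
Independent stages can run in parallel to shorten the overall job run time
If cluster utilization is already high, parallel execution helps little

-- Enable parallel execution with the following settings
SET hive.exec.parallel=true;  -- default false
SET hive.exec.parallel.thread.number=16;  -- default 8, the maximum number of stages that run in parallel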
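
The idea of running independent stages side by side can be sketched with a thread pool. This is an analogy only, not how Hive schedules stages internally: the class name, the stage values, and the pool size parameter (standing in for hive.exec.parallel.thread.number) are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration; stage results are placeholder numbers.
class ParallelStages {
    static int runStages(int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Stage 1 and stage 2 do not depend on each other, so both are submitted at once.
            CompletableFuture<Integer> stage1 = CompletableFuture.supplyAsync(() -> 10, pool);
            CompletableFuture<Integer> stage2 = CompletableFuture.supplyAsync(() -> 32, pool);
            // Stage 3 depends on both results and runs only after they finish.
            return stage1.thenCombine(stage2, Integer::sum).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runStages(2)); // 42
    }
}
```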

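Query optimization

Enable automatic map-side joins
Prevent data skew

set hive.optimize.skewjoin=true;

Enable the CBO (cost-based optimizer)

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true; -- fetch column statistics
set hive.stats.fetch.partition.stats=true; -- fetch partition statistics

Follow sound coding practices such as CTEs, temporary tables, and window functions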

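Compression algorithms

Reducing the amount of data transferred greatly improves MapReduce performance
Data compression is an effective way to reduce data volume

[Figure: comparison of common compression methods]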
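
As a small illustration of how much compression can shrink repetitive data, the sketch below gzips a block of sample rows using only the JDK. Several things here are assumptions: gzip stands in for whichever codec a cluster actually uses, the class name and the sample rows are made up, and the ratio for real table data will differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not tied to any Hive or Hadoop codec class.
class CompressionSketch {
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Build 1000 identical sample rows, similar to a low-cardinality column.
        StringBuilder rows = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            rows.append("2020-09-22,COMPLETE\n");
        }
        byte[] raw = rows.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println(raw.length + " bytes raw, " + packed.length + " bytes gzipped");
    }
}
```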
