Hive user-defined functions: querying and viewing custom function information

Reposted

技术博客达人 2024-05-31 09:16:19


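1. Why use user-defined functions

When Hive's built-in functions cannot cover a business requirement, you need a user-defined function (UDF). How does a UDF run inside Hive?
You package it as a jar and upload it to the cluster, register the function, and Hive loads the class through a class loader; Hive then resolves the function name when it compiles the SQL and calls it while the query runs.
Put more plainly: I write the code myself, package it as a jar, run create function in the Hive client, and from then on I can call the function I wrote.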

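2. The three kinds of Hive user-defined functions (and how they differ):
UDF (user-defined function)
User-defined function: processes one row and returns one row;
UDAF (user-defined aggregation function)
User-defined aggregation function: processes multiple rows and returns one row;
UDTF (user-defined table-generating function)
User-defined table-generating function: processes one row and returns multiple rows;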
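
To make the row shapes concrete, here is a minimal plain-Java analogy. It uses no Hive classes at all; the class name FunctionShapes and its three methods are hypothetical and only illustrate "one in, one out", "many in, one out", and "one in, many out".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java analogy for the three function shapes; no Hive classes involved. */
final class FunctionShapes {

    // UDF shape: one input value in, one output value out.
    static String upper(String value) {
        return value.toUpperCase();
    }

    // UDAF shape: many input rows in, one aggregated value out.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // UDTF shape: one input value in, several output rows out.
    static List<String> explode(String csv) {
        List<String> rows = new ArrayList<>();
        for (String part : csv.split(",")) {
            rows.add(part);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(upper("cn"));                 // CN
        System.out.println(sum(Arrays.asList(1, 2, 3))); // 6
        System.out.println(explode("a,b,c"));            // [a, b, c]
    }
}
```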
3. Steps and workflow for implementing Hive user-defined functions:

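(1) UDF workflow:
Define a Java class -> extend GenericUDF -> override the inherited methods (this is where the core business logic goes) -> package the code as a jar -> create the function in the Hive client -> call the function from HiveQL

The requirement: a UDF that converts a country code into the Chinese country name.
Setup: open IDEA, configure Maven ..., add the dependencies to pom.xml ..., and first place the code-to-name data file country.dict in the resources directory.
Create the package and the class: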

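package com.cd;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.lazy.LazyString;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

public class CountryCode2CountryNameUDF extends GenericUDF {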
    // Map from country code to Chinese country name
    static Map<String, String> map = new HashMap<String, String>();

    // Load country.dict from the resources directory once, when the class is loaded.
    // getResourceAsStream returns an InputStream for a file on the classpath.
    static {
        // Wrap the byte stream in a character reader so the file can be read line by line.
        // Assumption: country.dict is UTF-8 encoded.
        try (InputStream in = CountryCode2CountryNameUDF.class.getResourceAsStream("/country.dict");
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line is expected to be: code<TAB>name
                String[] arr = line.split("\t");
                if (arr.length < 2) {
                    continue; // skip blank or malformed lines
                }
                map.put(arr[0], arr[1]);
            }
        } catch (Exception e) {
            // A missing or unreadable file leaves the map empty; the stack trace shows why
            e.printStackTrace();
        }
    }
    // Overridden methods
    // (1) initialize: validates the argument types and declares the return type
    @Override
    public ObjectInspector initialize(ObjectInspector[] objectInspectors) throws UDFArgumentException {
        // Check the number of arguments
        if (objectInspectors.length != 1) {
            throw new UDFArgumentException("Exactly one argument is required");
        }
        ObjectInspector inspector = objectInspectors[0];
        // Check the broad category first.
        // Category values: PRIMITIVE, LIST, MAP, STRUCT, UNION
        if (inspector.getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("The argument must be a PRIMITIVE type");
        }
        // Then check the specific primitive type (string, int, bigint, double, ...)
        PrimitiveObjectInspector primitive = (PrimitiveObjectInspector) inspector;
        if (primitive.getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentException("The argument must be of primitive type string");
        }
        // Declare that the function returns a string (as a Text writable)
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }
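    // Reused for every row so evaluate does not allocate a new Text each time
    Text output = new Text();

    // (2) evaluate: the core method, called once per input row; the business logic lives here
    @Override
    public Object evaluate(DeferredObject[] deferredObjects) throws HiveException {
        Object obj = deferredObjects[0].get();
        if (obj == null) {
            return null; // NULL input yields NULL output
        }
        // A Hive string can arrive in three forms: LazyString, Text, or java.lang.String
        String code;
        if (obj instanceof LazyString) {
            // The form seen in the author's tests, e.g. code = "CN"
            code = ((LazyString) obj).getWritableObject().toString();
        } else {
            // Text and String both convert with toString()
            code = obj.toString();
        }
        // Translate the country code; unknown codes yield NULL instead of an exception
        String countryName = map.get(code);
        if (countryName == null) {
            return null;
        }
        // initialize declared writableStringObjectInspector, so return Text rather than String
        output.set(countryName);
        return output;
    }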
    // (3) getDisplayString: returns the text shown for this function call, e.g. in EXPLAIN output
    @Override
    public String getDisplayString(String[] strings) {
        return Arrays.toString(strings);
    }
}

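Package the code as a jar, upload it to a known location on the server, then add the jar in the Hive client:

-- add jar linux_jar_path;
add jar /home/hadoop/Hive_Pro.jar;

Create the function (it can be temporary or permanent):

-- temporary makes the function session-scoped; class_name is the fully qualified class name
-- create temporary function function_name as 'class_name';
create temporary function fanyi as 'com.cd.CountryCode2CountryNameUDF';

Verify with a query:

-- select function_name(column_name) from table_name;
select fanyi(country) from xxx;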

(2) UDTF

Requirement:

id name_nickname
1 name1#n1;name2#n2
2 name3#n3;name4#n4;name5#n5

Output the data above as multiple rows and multiple columns:

id name nickname
1 name1 n1
1 name2 n2
2 name3 n3
2 name4 n4
2 name5 n5

Workflow on the Hive side:

-- create a table that matches the data
create table udtf_table(
    id int,
    name_nickname string
)row format delimited fields terminated by '\t';
-- load the data file into the table
load data local inpath '/home/hadoop/udtf' into table udtf_table;
-- call a UDTF (here named splitname) that outputs multiple rows and columns
select splitname(name_nickname) from udtf_table;

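UDTF workflow:

① Extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and override three methods: initialize, process, and close.

② initialize is called first; it validates the arguments.

③ After initialization, process is called for each input row and holds the core business logic. Inside process, every call to forward() emits one output row; to emit several columns, put the column values into one array and pass that array to forward().

④ Finally close() is called, where you release anything that needs cleaning up.

Then package the jar and create the function in Hive.

(Code omitted.)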
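
Since the original omits the code, here is a minimal sketch of the splitting logic that process could run for the name_nickname column. It is plain Java with no Hive classes: the class name SplitNameSketch is hypothetical, and the Consumer parameter is only a stand-in for GenericUDTF's forward(). A real UDTF would also need initialize to declare the two output columns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Plain-Java sketch of the row-splitting logic a UDTF's process() could run.
 * The Consumer stands in for GenericUDTF.forward(); no Hive classes are used.
 */
final class SplitNameSketch {

    // Emits one {name, nickname} row per "name#nickname" pair in the input.
    static void process(String nameNickname, Consumer<Object[]> forward) {
        if (nameNickname == null || nameNickname.isEmpty()) {
            return; // nothing to emit for empty input
        }
        for (String pair : nameNickname.split(";")) {
            String[] fields = pair.split("#");
            if (fields.length != 2) {
                continue; // skip malformed pairs instead of failing the whole query
            }
            // Several columns travel together in one array per forwarded row.
            forward.accept(new Object[] {fields[0], fields[1]});
        }
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        process("name1#n1;name2#n2", rows::add);
        for (Object[] row : rows) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

Keeping the string handling in a method like this makes it possible to check the splitting rules locally before wiring them into a GenericUDTF subclass and packaging the jar.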

Two ways to use a UDTF:

① Directly after select

-- you can supply your own output column names
select udtf_func(properties) as (col1,col2) from tablename;
-- or use the column names defined in the UDTF code
select udtf_func(properties) from tablename;

Note:

Cases where a UDTF cannot be used this way:

-- a UDTF cannot be selected together with other columns (a UDF can)
select id,udtfsplit(name_nickname) from udtf_table;

-- UDTFs cannot be nested
select udtfsplit(udtfsplit(name_nickname)) from udtf_table;

-- a UDTF cannot be combined with group by/cluster by/distribute by/sort by
select udtfsplit(name_nickname) from udtf_table group by name,nickname;

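② Together with lateral view (the same pattern used with the built-in explode function to turn one row into several)

lateral view makes it easy to take the rows a UDTF generates and combine them with the source row so they can be queried together;

using lateral view with a UDTF works around the cases above where a UDTF cannot be used directly.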

-- ① use lateral view to see the table data
-- first, udtf_func is called for each row of table_name and splits it into one or more rows
-- then the results are combined into a virtual table that can be referenced by the alias tableAlias
-- select table_name.id, tableAlias.col1, tableAlias.col2 from table_name lateral view udtf_func(properties) tableAlias as col1,col2;
select t1.id,t2.name,t2.nickname from udtf_table t1 lateral view udtf_split(name_nickname) t2 as name,nickname;

-- ② group to count how many records each id has
select t3.id,count(*) from (select t1.id,t2.name,t2.nickname from udtf_table t1 lateral view udtf_split(name_nickname) t2 as name,nickname) t3 group by t3.id;

In essence, lateral view treats the UDTF output as a virtual table and joins it with the columns of the source table.
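
The "virtual table joined to each source row" idea can be sketched in plain Java, assuming the sample data from the requirement above. Nothing here is Hive API: the class LateralViewSketch and its methods are hypothetical, with split standing in for the UDTF and countById for the group by query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of lateral view semantics; no Hive classes involved. */
final class LateralViewSketch {

    // Stand-in for the UDTF: splits "name#nick;name#nick" into {name, nickname} rows.
    static List<String[]> split(String nameNickname) {
        List<String[]> rows = new ArrayList<>();
        for (String pair : nameNickname.split(";")) {
            String[] fields = pair.split("#");
            if (fields.length == 2) {
                rows.add(fields);
            }
        }
        return rows;
    }

    // Lateral view: pair each source row's id with every row generated from that row.
    static List<String[]> lateralView(Map<Integer, String> table) {
        List<String[]> joined = new ArrayList<>();
        for (Map.Entry<Integer, String> source : table.entrySet()) {
            for (String[] generated : split(source.getValue())) {
                joined.add(new String[] {
                    String.valueOf(source.getKey()), generated[0], generated[1]});
            }
        }
        return joined;
    }

    // Equivalent of: select id, count(*) ... group by id
    static Map<String, Integer> countById(List<String[]> joined) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] row : joined) {
            counts.merge(row[0], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new LinkedHashMap<>();
        table.put(1, "name1#n1;name2#n2");
        table.put(2, "name3#n3;name4#n4;name5#n5");

        List<String[]> joined = lateralView(table);
        for (String[] row : joined) {
            System.out.println(String.join("\t", row)); // id, name, nickname
        }
        System.out.println(countById(joined)); // {1=2, 2=3}
    }
}
```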
