Table of Contents

  • I. Window Functions
  • 1. window as
  • 2. with as
  • 3. Common functions
  • 4. Window frames
  • II. User-Defined Functions
  • 1. UDF: one row in, one row out
  • 2. UDTF: one row in, multiple rows out



I know, I know

You're with me from the other side of the world






I. Window Functions



1. window as

select *,rank() over(partition by clazz order by score desc) as scorerank
from students;

The over() clauses in a query are often identical, so they can be factored out and given a name with a window clause:

window o as (partition by clazz order by score desc)

The named window goes after the from clause, and the original query simplifies to:

select *, rank() over o as scorerank
from students
window o as (partition by clazz order by score desc);



2. with as

Subqueries can likewise be pulled out and given names, using with ... as (common table expressions):

select id,max(continuity)
from
    (select id,count(origin) as continuity
    from
        (select id,day,row1,date_sub(day,row1) as origin
        from
            (select 
            id,day,row_number() over(partition by id order by day) as row1
            from login) as t1
        ) as t2
    group by id,origin
    ) as t3
group by id;

Peeling the layers off one by one:

with t1 as 
(select id,day,row_number() over(partition by id order by day) as row1
from login)
with t2 as
(select id,day,row1,date_sub(day,row1) as origin from t1)
with t3 as
(select id,count(origin) as continuity from t2 group by id,origin)

This finally simplifies to the form below
(multiple CTEs are separated by commas, and with is written only once, before the first one):

with t1 as
(select id,day,row_number() over(partition by id order by day) as row1 from login),
t2 as
(select id,day,row1,date_sub(day,row1) as origin from t1),
t3 as
(select id,count(origin) as continuity from t2 group by id,origin)

select id,max(continuity) from t3 group by id;



3. Common functions

row_number:
ranks without ties
1 2 3

dense_rank:
ties share a rank, and the rank increases by one after a tie
1 1 2 2 3 3

rank:
ties share a rank, and the rank skips ahead after a tie
1 1 3 3 5 5

LAG(col,n):
the value of col from the row n rows before the current row

LEAD(col,n):
the value of col from the row n rows after the current row

FIRST_VALUE:
after sorting within the partition, the first value up to the current row

LAST_VALUE:
after sorting within the partition, the last value up to the current row; for tied rows, the last of the tied values is taken
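The three ranking functions, together with LAG and LEAD, can be compared side by side on the students table from the earlier examples. A sketch, not a definitive query; the aliases rn, dr, rk, prev_score, and next_score are illustrative:

```sql
select id, clazz, score,
       row_number() over w as rn,            -- no ties: 1 2 3 4
       dense_rank() over w as dr,            -- ties share a rank: 1 1 2 2 3
       rank()       over w as rk,            -- ties leave gaps:   1 1 3 3 5
       lag(score, 1)  over w as prev_score,  -- score of the previous row, NULL on the first
       lead(score, 1) over w as next_score   -- score of the next row, NULL on the last
from students
window w as (partition by clazz order by score desc);
```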

4. Window frames

A window function can read a specific range of rows at a time; this is done with a window frame.
The frame slides over the data row by row, and its bounds determine which rows are read.

Hive provides two ways to define a frame: ROWS and RANGE.

Frames can only be used with the window functions max, min, avg, count, sum, FIRST_VALUE, and LAST_VALUE.

1. The previous row, the current row, and the next row

over (rows between 1 preceding and 1 following)

2. From the first row of the partition to the current row

over (rows between unbounded preceding and current row)

3. Values in the range [current value - 3, current value + 11]
   e.g. if the current row's value is 22, rows with values in [19, 33] are read

over (range between 3 preceding and 11 following)
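Put into a complete query, the second frame above ("from the first row to the current row") produces a per-class running total over the same students table. A sketch; running_total is an illustrative alias:

```sql
select id, clazz, score,
       sum(score) over (
           partition by clazz
           order by score desc
           rows between unbounded preceding and current row
       ) as running_total
from students;
```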



II. User-Defined Functions

Hive lets users define and use their own functions.
They fall into three broad categories: UDF, UDTF, and UDAF.

First, import the dependency:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>



1. UDF: one row in, one row out

Write a class that extends org.apache.hadoop.hive.ql.exec.UDF
and implement an evaluate method containing your own logic.

A UDF receives a single value at a time and hands the result back through Java's return,
hence "one in, one out".

import org.apache.hadoop.hive.ql.exec.UDF;

public class HiveUDF extends UDF {
    // hadoop => #hadoop$
    public String evaluate(String col1) {
        // prepend # and append $ to the incoming value
        String result = "#" + col1 + "$";
        return result;
    }
}

Package the jar in Java, upload it to Linux, then load it into Hive:

add jar /usr/local/data/hive-1.0-SNAPSHOT.jar;

Register a function name, pointing it at the class to use (assuming the HiveUDF class above lives in the day47.UDF package):

create temporary function funudf as 'day47.UDF.HiveUDF';

Try it out:

select funudf(name) from students;



2. UDTF: one row in, multiple rows out

A UDTF emits output rows itself, by calling forward(),
so a single input row can produce multiple output rows.

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

public class Test extends GenericUDTF {

    // declare the names and types of the output columns
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs)
            throws UDFArgumentException {
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();

        fieldNames.add("col1");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        fieldNames.add("col2");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // processing logic, e.g. my_udtf("key1:value1,key2:value2,key3:value3")
    @Override
    public void process(Object[] objects) throws HiveException {
        // objects holds the N columns passed in
        String col = objects[0].toString();
        // key1:value1  key2:value2  key3:value3
        String[] splits = col.split(",");
        for (String str : splits) {
            String[] cols = str.split(":");
            // emit one output row
            forward(cols);
        }
    }

    @Override
    public void close() throws HiveException {
    }
}

Package, upload, and register it the same way, then try it out:

select funudtf('key1:value1,key2:value2,key3:value3');

Output:

key1	value1
key2	value2
key3	value3
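Note that a UDTF called on its own cannot be mixed with ordinary columns in the same select list; to attach its output rows to each source row, Hive uses lateral view. A sketch, where the kvs column (a string in key1:value1,key2:value2 format) is hypothetical:

```sql
select s.id, t.col1, t.col2
from students s
lateral view funudtf(s.kvs) t as col1, col2;
```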