文章目录
- 一、窗口函数
- 1、window as
- 2、with as
- 3、常用函数
- 4、窗口帧
- 二、自定义函数
- 1、UDF:一进一出
- 2、UDTF:一进多出
I know, i know
地球另一端有你陪我
一、窗口函数
1、window as
select *,rank() over(partition by clazz order by score desc) as scorerank
from students;
这里的 over() 中的语句大多相似,可以独立拆下来,用变量名代替
window o as(partition by clazz order by score desc)
原句可以简化为
select *,rank() o as scorerank
from students;
2、with as
查询语句同样可以用变量名代替
select id,max(continuity)
from
(select id,count(origin) as continuity
from
(select id,day,row1,date_sub(day,row1) as origin
from
(select
id,day,row_number() over(partition by id order by day) as row1
from login) as t1
) as t2
group by id,origin
)as t3
group by id;
一层层剥下来
with t1 as
(select id,day,row_number() over(partition by id order by day) as row1
from login)
with t2 as
(select id,day,row1,date_sub(day,row1) as origin from t1)
with t3 as
(select id,count(origin) as continuity from t2 group by id,origin)
最终简化为
(多个 with as 之间用 , 隔开,且省略后面的 with)
with t1 as
(select id,day,row_number() over(partition by id order by day) as row1 from login),
t2 as
(select id,day,row1,date_sub(day,row1) as origin from t1),
t3 as
(select id,count(origin) as continuity from t2 group by id,origin)
select id,max(continuity) from t3 group by id;
3、常用函数
row_number:
无并列排名
1 2 3
dense_rank:
有并列排名,并且依次递增
1 1 2 2 3 3
rank:
有并列排名,不依次递增
1 1 3 3 5 5
LAG(col,n):
往前第n行数据
LEAD(col,n):
往后第n行数据
FIRST_VALUE:
取分组内排序后,截止到当前行,第一个值
LAST_VALUE:
取分组内排序后,截止到当前行,最后一个值,对于并列的排名,取最后一个
4、窗口帧
窗口函数允许一次读取特定范围的数据,通过窗口帧来实现
窗口帧会一行行扫过数据,根据条件来选择读取的范围
Hive 提供了两种定义窗口帧的形式:ROWS
和 RANGE
只能运用在max、min、avg、count、sum、FIRST_VALUE、LAST_VALUE这几个窗口函数上
1 前一行、当前行和后一行
over rows between 1 preceding and 1 following
2 从起始行到现在行
over rows between unbounded preceding and current row
3 范围 [当前值减去3 和当前值加上11]
例:当前读取的值为 22,即寻找表中数据范围是[19,11]的
over range between 3 PRECEDING and 11 FOLLOWING
二、自定义函数
hive 允许使用者自定义函数进行使用
大致分为三类UDF
、 UDTF
和UDAF
导个包先
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.2.1</version>
</dependency>
1、UDF:一进一出
编写代码,继承org.apache.hadoop.hive.ql.exec.UDF
实现evaluate方法,在evaluate方法中实现自己的逻辑
用于一次只进入一个对象,并且是依赖 java 的 return
所以叫一进一出
import org.apache.hadoop.hive.ql.exec.UDF;
public class HiveUDF extends UDF {
// hadoop => #hadoop$
public String evaluate(String col1) {
// 给传进来的数据 左边加上 # 号 右边加上 $
String result = "#" + col1 + "$";
return result;
}
}
java 中打包,传到 linux 中,再导入到 hive
add jar /usr/local/data/hive-1.0-SNAPSHOT.jar;
注册函数名,指名使用的类
create temporary function funudf as 'day47.UDF.Test';
玩一下
select funudf(name) from students;
2、UDTF:一进多出
通过自己的方法输出对象
可以实现一进多出
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive
.PrimitiveObjectInspectorFactory;
import java.util.ArrayList;
public class Test extends GenericUDTF {
// 指定输出的列名 及 类型
@Override
public StructObjectInspector initialize(StructObjectInspector argOIs)
throws UDFArgumentException {
ArrayList<String> filedNames = new ArrayList<String>();
ArrayList<ObjectInspector> filedObj
= new ArrayList<ObjectInspector>();
filedNames.add("col1");
filedObj.add(PrimitiveObjectInspectorFactory
.javaStringObjectInspector);
filedNames.add("col2");
filedObj
.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
return ObjectInspectorFactory
.getStandardStructObjectInspector(filedNames, filedObj);
}
// 处理逻辑 my_udtf(col1,col2,col3)
// "key1:value1,key2:value2,key3:value3"
// my_udtf("key1:value1,key2:value2,key3:value3")
public void process(Object[] objects) throws HiveException {
// objects 表示传入的N列
String col = objects[0].toString();
// key1:value1 key2:value2 key3:value3
String[] splits = col.split(",");
for (String str : splits) {
String[] cols = str.split(":");
// 将数据输出
forward(cols);
}
}
public void close() throws HiveException {
}
}
一样要导包和注册函数名,玩一下
select funudtf('key1:value1,key2:value2,key3:value3');
输出结果
key1 value1
key2 value2
key3 value3