1. UDAF
The previous two sections covered basic UDFs and UDTFs; this section introduces the most complex kind of user-defined function: the user-defined aggregate function (UDAF). A UDAF takes zero or more columns from zero or more rows and returns a single value, like sum() or count(). To implement a UDAF, we need to work with the following classes:
- org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver
- org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator
AbstractGenericUDAFResolver checks the input parameters and decides which evaluator to use. In AbstractGenericUDAFResolver only one method has to be implemented:
public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException;
The main processing logic, however, lives in the evaluator. We need to extend GenericUDAFEvaluator and implement the following methods:
// Initialize the evaluator; input and output are described by ObjectInspectors
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException;
// The AggregationBuffer holds the intermediate result of the aggregation
abstract AggregationBuffer getNewAggregationBuffer() throws HiveException;
// Reset the AggregationBuffer
public void reset(AggregationBuffer agg) throws HiveException;
// Process one input row
public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException;
// Return a partial aggregation of the data processed so far
public Object terminatePartial(AggregationBuffer agg) throws HiveException;
// Merge a partial aggregation into the current buffer
public void merge(AggregationBuffer agg, Object partial) throws HiveException;
// Return the final result
public Object terminate(AggregationBuffer agg) throws HiveException;
Before diving into the processing, look at the UDAF enum GenericUDAFEvaluator.Mode. Mode has four values (see the plain-Java sketch after this list for the data flow):
- PARTIAL1: the Mapper phase. Goes from raw data to a partial aggregation; iterate() and terminatePartial() are called.
- PARTIAL2: the Combiner phase, which merges Mapper output on the Mapper side. Goes from a partial aggregation to another partial aggregation; merge() and terminatePartial() are called.
- FINAL: the Reducer phase. Goes from partial aggregations to the full aggregation; merge() and terminate() are called.
- COMPLETE: this mode appears when the MapReduce job has only a Mapper and no Reducer, so the Mapper emits the final result directly. Goes from raw data to the full aggregation; iterate() and terminate() are called.
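To make the mode-to-method mapping concrete, here is a plain-Java analogy (it is not Hive API; the SumAggregator class and its buffer are made up for illustration). It simulates two map tasks producing partial aggregations (PARTIAL1) and one reduce task merging them (FINAL):

import java.util.Arrays;
import java.util.List;

// Illustrative analogy only: a sum-style aggregation written as plain Java,
// mirroring the iterate/terminatePartial/merge/terminate life cycle of a Hive UDAF.
public class SumAggregator {
    // Plays the role of the AggregationBuffer.
    static class Buffer { long sum; }

    void iterate(Buffer buf, long value) { buf.sum += value; }     // PARTIAL1 / COMPLETE
    Long terminatePartial(Buffer buf)    { return buf.sum; }       // PARTIAL1 / PARTIAL2
    void merge(Buffer buf, Long partial) { buf.sum += partial; }   // PARTIAL2 / FINAL
    Long terminate(Buffer buf)           { return buf.sum; }       // FINAL / COMPLETE

    public static void main(String[] args) {
        SumAggregator udaf = new SumAggregator();

        // PARTIAL1: each "map task" aggregates its own split of the rows.
        List<List<Long>> splits = Arrays.asList(Arrays.asList(1L, 2L, 3L), Arrays.asList(10L, 20L));
        Long[] partials = new Long[splits.size()];
        for (int i = 0; i < splits.size(); i++) {
            Buffer buf = new Buffer();
            for (long v : splits.get(i)) udaf.iterate(buf, v);
            partials[i] = udaf.terminatePartial(buf);
        }

        // FINAL: the "reduce task" merges the partial results and emits the answer.
        Buffer finalBuf = new Buffer();
        for (Long p : partials) udaf.merge(finalBuf, p);
        System.out.println(udaf.terminate(finalBuf));   // prints 36
    }
}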
2. Example
Let's look at an example that collects the values of a column into a list; combined with the concat_ws() function it reproduces the behavior of MySQL's group_concat() function. The code is as follows:
package edu.wzm.hive.udaf;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

@Description(
        name = "collect",
        value = "_FUNC_(col) - The parameter is a column name. "
                + "The return value is a set of the column.",
        extended = "Example:\n"
                + " > SELECT _FUNC_(col) from src;")
public class GenericUDAFCollect extends AbstractGenericUDAFResolver {

    private static final Log LOG = LogFactory.getLog(GenericUDAFCollect.class.getName());

    public GenericUDAFCollect() {
    }

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
            throws SemanticException {
        // Exactly one primitive-typed argument is accepted.
        if (parameters.length != 1) {
            throw new UDFArgumentTypeException(parameters.length - 1,
                    "Exactly one argument is expected.");
        }
        if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0,
                    "Only primitive type arguments are accepted but "
                    + parameters[0].getTypeName() + " was passed as parameter 1.");
        }
        return new GenericUDAFCollectEvaluator();
    }

    @SuppressWarnings("deprecation")
    public static class GenericUDAFCollectEvaluator extends GenericUDAFEvaluator {

        // ObjectInspector of the original (primitive) input column
        private PrimitiveObjectInspector inputOI;
        // ObjectInspector of the partial result (a list) received in merge()
        private StandardListObjectInspector internalMergeOI;
        // ObjectInspector of the output list
        private StandardListObjectInspector loi;

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                throws HiveException {
            super.init(m, parameters);
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
                // Input is the raw column; output is a list of its standard copies.
                inputOI = (PrimitiveObjectInspector) parameters[0];
                return ObjectInspectorFactory.getStandardListObjectInspector(
                        (PrimitiveObjectInspector) ObjectInspectorUtils
                                .getStandardObjectInspector(inputOI));
            } else if (m == Mode.PARTIAL2 || m == Mode.FINAL) {
                // Input is a partial aggregation (a list); output is again a list.
                internalMergeOI = (StandardListObjectInspector) parameters[0];
                inputOI = (PrimitiveObjectInspector) internalMergeOI.getListElementObjectInspector();
                loi = ObjectInspectorFactory.getStandardListObjectInspector(inputOI);
                return loi;
            }
            return null;
        }

        // The aggregation buffer simply accumulates the collected values.
        static class ArrayAggregationBuffer implements AggregationBuffer {
            List<Object> container;
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            ArrayAggregationBuffer ret = new ArrayAggregationBuffer();
            reset(ret);
            return ret;
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            ((ArrayAggregationBuffer) agg).container = new ArrayList<Object>();
        }

        @Override
        public void iterate(AggregationBuffer agg, Object[] param)
                throws HiveException {
            Object p = param[0];
            if (p != null) {
                putIntoList(p, (ArrayAggregationBuffer) agg);
            }
        }

        @Override
        public void merge(AggregationBuffer agg, Object partial)
                throws HiveException {
            ArrayAggregationBuffer myAgg = (ArrayAggregationBuffer) agg;
            ArrayList<Object> partialResult = (ArrayList<Object>) this.internalMergeOI.getList(partial);
            for (Object obj : partialResult) {
                putIntoList(obj, myAgg);
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            ArrayAggregationBuffer myAgg = (ArrayAggregationBuffer) agg;
            ArrayList<Object> list = new ArrayList<Object>();
            list.addAll(myAgg.container);
            return list;
        }

        @Override
        public Object terminatePartial(AggregationBuffer agg)
                throws HiveException {
            ArrayAggregationBuffer myAgg = (ArrayAggregationBuffer) agg;
            ArrayList<Object> list = new ArrayList<Object>();
            list.addAll(myAgg.container);
            return list;
        }

        // Copy the value into the buffer as a standard (detached) object.
        public void putIntoList(Object param, ArrayAggregationBuffer myAgg) {
            Object pCopy = ObjectInspectorUtils.copyToStandardObject(param, this.inputOI);
            myAgg.container.add(pCopy);
        }
    }
}
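Before deploying the jar, the evaluator can also be exercised from a plain Java main() method. The following is a minimal sketch, not part of the original article: it assumes the GenericUDAFCollect class above is on the classpath (along with the Hive libraries) and drives the evaluator in COMPLETE mode, which only needs iterate() and terminate():

import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import edu.wzm.hive.udaf.GenericUDAFCollect.GenericUDAFCollectEvaluator;

// Hypothetical local driver for a quick sanity check of the evaluator (COMPLETE mode).
public class CollectEvaluatorDriver {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {
        GenericUDAFCollectEvaluator eval = new GenericUDAFCollectEvaluator();

        // COMPLETE mode: raw input goes in, the final result comes out.
        ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
        eval.init(GenericUDAFEvaluator.Mode.COMPLETE, new ObjectInspector[]{ stringOI });

        GenericUDAFEvaluator.AggregationBuffer buf = eval.getNewAggregationBuffer();
        for (String name : new String[]{ "John Doe", "Mary Smith", "Todd Jones" }) {
            eval.iterate(buf, new Object[]{ name });   // one call per input row
        }

        // Expected output: [John Doe, Mary Smith, Todd Jones]
        System.out.println(eval.terminate(buf));
    }
}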
Next we add the compiled and packaged jar file to the CLASSPATH, create the function collect(), and once again query the employee table used in the first section:
hive (mydb)> ADD jar /root/experiment/hive/hive-0.0.1-SNAPSHOT.jar;
hive (mydb)> CREATE TEMPORARY FUNCTION collect AS "edu.wzm.hive.udaf.GenericUDAFCollect";
hive (mydb)> SELECT collect(name) FROM employee;
Query ID = root_20160117221111_c8b88dc9-170c-4957-b665-15b99eb9655a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1453096763931_0001, Tracking URL = http://master:8088/proxy/application_1453096763931_0001/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453096763931_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-17 22:11:49,360 Stage-1 map = 0%, reduce = 0%
2016-01-17 22:12:01,388 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.76 sec
2016-01-17 22:12:16,830 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.95 sec
MapReduce Total cumulative CPU time: 2 seconds 950 msec
Ended Job = job_1453096763931_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.95 sec HDFS Read: 1040 HDFS Write: 80 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 950 msec
OK
["John Doe","Mary Smith","Todd Jones","Bill King","Boss Man","Fred Finance","Stacy Accountant"]
Time taken: 44.302 seconds, Fetched: 1 row(s)
Then, combining concat_ws(',', collect(name)) with GROUP BY achieves the effect of MySQL's group_concat() function. The following query lists the employees that share the same salary:
hive (mydb)> SELECT salary,concat_ws(',', collect(name)) FROM employee GROUP BY salary;
Query ID = root_20160117222121_dedd4981-e050-4aac-81cb-c449639c721b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1453096763931_0003, Tracking URL = http://master:8088/proxy/application_1453096763931_0003/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453096763931_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-17 22:21:59,627 Stage-1 map = 0%, reduce = 0%
2016-01-17 22:22:07,207 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.2 sec
2016-01-17 22:22:14,700 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.8 sec
MapReduce Total cumulative CPU time: 2 seconds 800 msec
Ended Job = job_1453096763931_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.8 sec HDFS Read: 1040 HDFS Write: 131 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 800 msec
OK
60000.0 Bill King,Stacy Accountant
70000.0 Todd Jones
80000.0 Mary Smith
100000.0 John Doe
150000.0 Fred Finance
200000.0 Boss Man
Time taken: 24.928 seconds, Fetched: 6 row(s)
3. The UDAF Pattern
When implementing a UDAF, the following methods do the main work:
- init(): executed when the UDAF's evaluator is instantiated; it determines the input and output data types.
- iterate(): processes an input row and puts the result into the in-memory aggregation buffer (AggregationBuffer); this is the typical Mapper-side work.
- terminatePartial(): called when a round of iterate() calls has finished; it returns the data aggregated so far, similar to a Combiner.
- merge(): receives the results of terminatePartial() and merges these partial results together.
- terminate(): returns the final result.
iterate() and terminatePartial() both run on the Mapper side.
merge() and terminate() both run on the Reducer side.
The AggregationBuffer stores intermediate or final results. By defining our own aggregation buffer, we can handle data of any type.
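As an illustration of that last point, here is a minimal sketch, not taken from the article, of a buffer for an average-style UDAF; the AverageAggregationBuffer name and its helper methods are made up for the example:

import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.AggregationBuffer;

// Hypothetical buffer for an average-style UDAF. Instead of collecting every value,
// it only keeps a running sum and count; terminatePartial() would emit the pair
// (sum, count) and terminate() would return sum / count.
@SuppressWarnings("deprecation")
public class AverageAggregationBuffer implements AggregationBuffer {
    double sum;   // running total of the non-null values seen so far
    long count;   // number of non-null values seen so far

    void add(double value) {                                  // used by an iterate()-style method
        sum += value;
        count++;
    }

    void addPartial(double partialSum, long partialCount) {   // used by a merge()-style method
        sum += partialSum;
        count += partialCount;
    }

    Double average() {                                        // used by terminate(); null when no rows were seen
        return count == 0 ? null : sum / count;
    }
}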