Hive谓词解析过程分析

转载

wx63118e2bb7416 2022-09-16 23:08:28 博主文章分类：Java

文章标签 hive java 大数据 lua 语法树 文章分类 虚拟化云计算

where col1 = 100 and abs(col2) > 0在Hive中的处理过程

where过滤条件称为谓词predicate。

以上where过滤条件在经过Hive的语法解析后，生成如下的语法树：

TOK_WHERE
AND
=
TOK_TABLE_OR_COL
c1
100
>
TOK_FUNCTION
ABS
TOK_TABLE_OR_COL
c2
0
有了语法树之后，最终的目的是生成predicate每个节点对应的ExprNodeDesc，即描述对应的节点：

public Map<ASTNode, ExprNodeDesc> genAllExprNodeDesc(ASTNode expr, RowResolver input,
TypeCheckCtx tcCtx) throws SemanticException {
…
Map<ASTNode, ExprNodeDesc> nodeOutputs =
TypeCheckProcFactory.genExprNode(expr, tcCtx);
…
生成的过程是对上述语法树的一个深度优先遍历的过程，Hive中大量对树的遍历的代码，在遍历过程中根据指定的规则或对语法树进行修改，或输出相应的结果。

Hive中有一个默认的深度优先遍历的实现DefaultGraphWalker。

这个遍历器的实现部分代码如下：

public void walk(Node nd) throws SemanticException {
// Push the node in the stack
opStack.push(nd);

// While there are still nodes to dispatch...
while (!opStack.empty()) {
  Node node = opStack.peek();

  if (node.getChildren() == null ||
          getDispatchedList().containsAll(node.getChildren())) {
    // Dispatch current node
    if (!getDispatchedList().contains(node)) {
      dispatch(node, opStack);
      opQueue.add(node);
    }
    opStack.pop();
    continue;
  }

  // Add a single child and restart the loop
  for (Node childNode : node.getChildren()) {
    if (!getDispatchedList().contains(childNode)) {
      opStack.push(childNode);
      break;
    }
  }
} // end while

}
先将当前节点放到待处理的栈opStack中，然后从opStack取节点出来，如果取出来的节点没有Children，或者Children已经全部处理完毕，才对当前节点进行处理（dispatch），如果当前节点有Children且还没有处理完，则将当前节点的Children放到栈顶，然后重新从栈中取节点进行处理。这是很基础的深度优先遍历的实现。

那在遍历的过程中，如何针对不同的节点进行不同的处理呢？

在遍历之前，先预置一些针对不同的节点不同规则的处理器，然后在遍历过程中，通过分发器Dispatcher选择最合适的处理器进行处理。

生成ExprNodeDesc的遍历中一共先预置了8个规则Rule，每个规则对应一个处理器NodeProcessor：

Map<Rule, NodeProcessor> opRules = new LinkedHashMap<Rule, NodeProcessor>();

opRules.put(new RuleRegExp("R1", HiveParser.TOK_NULL + "%"),
    tf.getNullExprProcessor());
opRules.put(new RuleRegExp("R2", HiveParser.Number + "%|" +
    HiveParser.TinyintLiteral + "%|" +
    HiveParser.SmallintLiteral + "%|" +
    HiveParser.BigintLiteral + "%|" +
    HiveParser.DecimalLiteral + "%"),
    tf.getNumExprProcessor());
opRules
    .put(new RuleRegExp("R3", HiveParser.Identifier + "%|"
    + HiveParser.StringLiteral + "%|" + HiveParser.TOK_CHARSETLITERAL + "%|"
    + HiveParser.TOK_STRINGLITERALSEQUENCE + "%|"
    + "%|" + HiveParser.KW_IF + "%|" + HiveParser.KW_CASE + "%|"
    + HiveParser.KW_WHEN + "%|" + HiveParser.KW_IN + "%|"
    + HiveParser.KW_ARRAY + "%|" + HiveParser.KW_MAP + "%|"
    + HiveParser.KW_STRUCT + "%|" + HiveParser.KW_EXISTS + "%|"
    + HiveParser.KW_GROUPING + "%|"
    + HiveParser.TOK_SUBQUERY_OP_NOTIN + "%"),
    tf.getStrExprProcessor());
opRules.put(new RuleRegExp("R4", HiveParser.KW_TRUE + "%|"
    + HiveParser.KW_FALSE + "%"), tf.getBoolExprProcessor());
opRules.put(new RuleRegExp("R5", HiveParser.TOK_DATELITERAL + "%|"
    + HiveParser.TOK_TIMESTAMPLITERAL + "%"), tf.getDateTimeExprProcessor());
opRules.put(new RuleRegExp("R6",
    HiveParser.TOK_INTERVAL_YEAR_MONTH_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_DAY_TIME_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_YEAR_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_MONTH_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_DAY_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_HOUR_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_MINUTE_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_SECOND_LITERAL + "%"), tf.getIntervalExprProcessor());
opRules.put(new RuleRegExp("R7", HiveParser.TOK_TABLE_OR_COL + "%"),
    tf.getColumnExprProcessor());
opRules.put(new RuleRegExp("R8", HiveParser.TOK_SUBQUERY_OP + "%"),
    tf.getSubQueryExprProcessor());

这里使用的分发器Dispatcher是DefaultRuleDispatcher，DefaultRuleDispatcher选择处理器的逻辑如下：

// find the firing rule
// find the rule from the stack specified
Rule rule = null;
int minCost = Integer.MAX_VALUE;
for (Rule r : procRules.keySet()) {
  int cost = r.cost(ndStack);
  if ((cost >= 0) && (cost <= minCost)) {
    minCost = cost;
    rule = r;
  }
}

NodeProcessor proc;

if (rule == null) {
  proc = defaultProc;
} else {
  proc = procRules.get(rule);
}

// Do nothing in case proc is null
if (proc != null) {
  // Call the process function
  return proc.process(nd, ndStack, procCtx, nodeOutputs);
} else {
  return null;
}

遍历所有的规则Rule，调用每个规则的cost方法计算cost，找其中cost最小的规则对应的处理器，如果没有找到，则使用默认处理器，如果没有设置默认处理器，则不做任何事情。

那么每个规则的cost是如何计算的？

– 没太看懂==|| (后续再理理)

WHERE条件语法树每个节点对应的处理器如下：

TOK_WHERE
AND --> TypeCheckProcFactory.DefaultExprProcessor
= --> TypeCheckProcFactory.DefaultExprProcessor
TOK_TABLE_OR_COL --> TypeCheckProcFactory.ColumnExprProcessor
c1 --> TypeCheckProcFactory.StrExprProcessor
100 --> TypeCheckProcFactory.NumExprProcessor
> --> TypeCheckProcFactory.DefaultExprProcessor
TOK_FUNCTION --> TypeCheckProcFactory.DefaultExprProcessor
ABS --> TypeCheckProcFactory.StrExprProcessor
TOK_TABLE_OR_COL --> TypeCheckProcFactory.ColumnExprProcessor
c2 --> TypeCheckProcFactory.StrExprProcessor
0 --> TypeCheckProcFactory.NumExprProcessor

TypeCheckProcFactory.StrExprProcessor 生成ExprNodeConstantDesc
TypeCheckProcFactory.ColumnExprProcessor 处理column，生成ExprNodeColumnDesc
TypeCheckProcFactory.NumExprProcessor生成ExprNodeConstantDesc
TypeCheckProcFactory.DefaultExprProcessor生成ExprNodeGenericFuncDesc
在深度优先遍历完WHERE语法树后，每个节点都会生成一个ExprNodeDesc，但是其实除了最顶层的AND节点生成的ExprNodeDesc有用，其他的节点生成的都是中间结果，最终都会包含在AND节点生成的ExprNodeDesc中。所以在遍历WHERE树后，通过AND节点生成的ExprNodeDesc构造FilterDesc：

new FilterDesc(genExprNodeDesc(condn, inputRR, useCaching), false)
有了FilterDesc后，就能够构造出FilterOperator了，然后再将生成的FilterOperator加入到Operator树中：

Operator ret = get((Class) conf.getClass());
ret.setConf(conf);
至此，where过滤条件对应的FilterOperator构造完毕。

接下来仔细看下AND生成的ExprNodeDesc，它其实是一个ExprNodeGenericFuncDesc：

// genericUDF是GenericUDFOPAnd，就是对应AND操作符
private GenericUDF genericUDF;
// AND是一个二元操作符，children里存的是对应的操作符
// 根据WHERE语法树，可以知道children[0]肯定又是一个ExprNodeGenericFuncDesc，而且是一个=函 // 数，而children[1]也是一个肯定又是一个ExprNodeGenericFuncDesc，而且是一个>函数，以此类 // 推，每个ExprNodeGenericFuncDesc都有对应的children
private List chidren;
// UDF的名字，这里是and
private transient String funcText;
/**

This class uses a writableObjectInspector rather than a TypeInfo to store
the canonical type information for this NodeDesc.
*/
private transient ObjectInspector writableObjectInspector;
每个ExprNodeDesc都对应有一个ExprNodeEvaluator，来对每个ExprNodeDesc进行实际的计算。看下ExprNodeEvaluator类的基本方法：

public abstract class ExprNodeEvaluator {
// 对应的ExprNodeDesc
protected final T expr;
// 在经过这个Evaluator计算后，它的输出值该如何解析的ObjectInspector
protected ObjectInspector outputOI;
…
/**

Initialize should be called once and only once. Return the ObjectInspector
for the return value, given the rowInspector.
初始化方法，传入一个ObjectInspector，即传入的数据应该如何解析的ObjectInspector
而需要返回经过这个Evaluator计算后的输出值的解析ObjectInspector
*/
public abstract ObjectInspector initialize(ObjectInspector rowInspector) throws HiveException;

// evaluate方法，调用来对row数据进行解析
public Object evaluate(Object row) throws HiveException {
return evaluate(row, -1);
}

/**

Evaluate the expression given the row. This method should use the
rowInspector passed in from initialize to inspect the row object. The
return value will be inspected by the return value of initialize.
If this evaluator is referenced by others, store it for them
*/
protected Object evaluate(Object row, int version) throws HiveException {
if (version < 0 || version != this.version) {
this.version = version;
return evaluation = _evaluate(row, version);
}
return evaluation;
}
// 由各个子类实现的方法的_evaluate方法，结合上面的evaluate方法，这里实际使用了设计模式的模板 // 方法模式
protected abstract Object _evaluate(Object row, int version) throws HiveException;
…
}
通过ExprNodeEvaluatorFactory获取到每个ExprNodeDesc对应的ExprNodeEvaluator：

public static ExprNodeEvaluator get(ExprNodeDesc desc) throws HiveException {
// Constant node
if (desc instanceof ExprNodeConstantDesc) {
return new ExprNodeConstantEvaluator((ExprNodeConstantDesc) desc);
}

// Column-reference node, e.g. a column in the input row
if (desc instanceof ExprNodeColumnDesc) {
  return new ExprNodeColumnEvaluator((ExprNodeColumnDesc) desc);
}
// Generic Function node, e.g. CASE, an operator or a UDF node
if (desc instanceof ExprNodeGenericFuncDesc) {
  return new ExprNodeGenericFuncEvaluator((ExprNodeGenericFuncDesc) desc);
}
// Field node, e.g. get a.myfield1 from a
if (desc instanceof ExprNodeFieldDesc) {
  return new ExprNodeFieldEvaluator((ExprNodeFieldDesc) desc);
}
throw new RuntimeException(
    "Cannot find ExprNodeEvaluator for the exprNodeDesc = " + desc);

}
看下FilterOperator中如何使用ExprNodeEvaluator对数据进行过滤的。

首先在FilterOperator的initializeOp方法中，获取到ExprNodeEvaluator：

conditionEvaluator = ExprNodeEvaluatorFactory.get(conf.getPredicate());
然后在process方法中，调用initialize方法后，调用eveluate方法获取到整个where过滤的结果：

conditionInspector = (PrimitiveObjectInspector) conditionEvaluator
.initialize(rowInspector);
…
Object condition = conditionEvaluator.evaluate(row);
…
Boolean ret = (Boolean) conditionInspector
.getPrimitiveJavaObject(condition);
// 如果结果是true，则forward到下一个operator继续处理
if (Boolean.TRUE.equals(ret)) {
forward(row, rowInspector);
}
再来看下GenericUDFOPAnd的evaluate方法实现：

@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
boolean bool_a0 = false, bool_a1 = false;
Object a0 = arguments[0].get();
if (a0 != null) {
bool_a0 = boi0.get(a0);
if (bool_a0 == false) {
result.set(false);
return result;
}
}

Object a1 = arguments[1].get();
if (a1 != null) {
  bool_a1 = boi1.get(a1);
  if (bool_a1 == false) {
    result.set(false);
    return result;
  }
}

if ((a0 != null && bool_a0 == true) && (a1 != null && bool_a1 == true)) {
  result.set(true);
  return result;
}

return null;

}
从以上代码知道，在进行AND的计算时，如果左边条件返回false，则不会进行右边条件的计算，所以AND的顺序其实是影响实际的效率的。类似的还有OR也是一样的，如果左边条件返回true，则不会进行右边条件的计算。