使用impala查询引擎的时候,如果count(disticnt column)数据量过大会报错,报错内容

正在提取遇到以下错误的结果:

org.apache.hive.service.cli.HiveSQLException: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT (if(url = 'grouponList', user_id, NULL))); deviating function: count(DISTINCT if(url = 'grouponDetail', user_id, NULL)) Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to perform this rewrite automatically.

简单来说就是count(distinct column)让你换成NDV(column),换了之后查询会快很多,正常使用需要设置参数,

set APPX_COUNT_DISTINCT=true;

但是这个去重不是准确的,这个去重原理我也看不懂,下面是谷歌翻译的文档。

NDV功能
一个聚合函数,返回一个类似于COUNT(DISTINCT col )结果的近似值,即“不同值的数量” 。 它比COUNT和DISTINCT的组合快得多,并且使用恒定的内存量,因此对于具有高基数的列而言,内存密集度较低。

句法:

 NDV([DISTINCT | ALL] expression ) 
使用说明:

这是COMPUTE STATS语句在内部使用的机制,用于计算列中不同值的数量。

由于此数字是估计值,因此可能无法反映列中不同值的精确数量,尤其是如果基数非常低或非常高。 如果估计的数量高于表中的行数,Impala会在查询计划期间在内部调整值。

返回类型: Impala 2.0及更高版本中的DOUBLE ; 早期版本中的STRING

复杂类型考虑:

要在聚合函数中访问具有复杂类型( ARRAY , STRUCT或MAP )的列,请使用查询中的连接表示法解压缩单个元素,然后将该函数应用于最终标量项,字段,键或值位于列中任何嵌套类型层次结构的底部。 有关在Impala中使用复杂类型的详细信息,请参阅复杂类型(仅限Impala 2.3或更高版本) 。

下面的示例演示如何使用包含嵌套复杂类型( STRUCT项的ARRAY )的列中的值调用多个聚合函数。 使用连接表示法在查询内解压缩数组。 使用ITEM伪列引用数组元素,并使用点表示法引用数组元素内的结构字段。 使用数字R_NATIONKEY字段计算诸如SUM()和AVG()的数值,并且从字符串N_NAME字段计算通用MAX()和MIN()值。
 describe region; +-------------+-------------------------+---------+ | name | type | comment | +-------------+-------------------------+---------+ | r_regionkey | smallint | | | r_name | string | | | r_comment | string | | | r_nations | array<struct< | | | | n_nationkey:smallint, | | | | n_name:string, | | | | n_comment:string | | | | >> | | +-------------+-------------------------+---------+ select r_name, r_nations.item.n_nationkey from region, region.r_nations as r_nations order by r_name, r_nations.item.n_nationkey; +-------------+------------------+ | r_name | item.n_nationkey | +-------------+------------------+ | AFRICA | 0 | | AFRICA | 5 | | AFRICA | 14 | | AFRICA | 15 | | AFRICA | 16 | | AMERICA | 1 | | AMERICA | 2 | | AMERICA | 3 | | AMERICA | 17 | | AMERICA | 24 | | ASIA | 8 | | ASIA | 9 | | ASIA | 12 | | ASIA | 18 | | ASIA | 21 | | EUROPE | 6 | | EUROPE | 7 | | EUROPE | 19 | | EUROPE | 22 | | EUROPE | 23 | | MIDDLE EAST | 4 | | MIDDLE EAST | 10 | | MIDDLE EAST | 11 | | MIDDLE EAST | 13 | | MIDDLE EAST | 20 | +-------------+------------------+ select r_name, count(r_nations.item.n_nationkey) as count, sum(r_nations.item.n_nationkey) as sum, avg(r_nations.item.n_nationkey) as avg, min(r_nations.item.n_name) as minimum, max(r_nations.item.n_name) as maximum, ndv(r_nations.item.n_nationkey) as distinct_vals from region, region.r_nations as r_nations group by r_name order by r_name; +-------------+-------+-----+------+-----------+----------------+---------------+ | r_name | count | sum | avg | minimum | maximum | distinct_vals | +-------------+-------+-----+------+-----------+----------------+---------------+ | AFRICA | 5 | 50 | 10 | ALGERIA | MOZAMBIQUE | 5 | | AMERICA | 5 | 47 | 9.4 | ARGENTINA | UNITED STATES | 5 | | ASIA | 5 | 68 | 13.6 | CHINA | VIETNAM | 5 | | EUROPE | 5 | 77 | 15.4 | FRANCE | UNITED KINGDOM | 5 | | MIDDLE EAST | 5 | 58 | 11.6 | EGYPT | SAUDI ARABIA | 5 | +-------------+-------+-----+------+-----------+----------------+---------------+ 
限制:

此功能无法在分析上下文中使用。 也就是说,此函数根本不允许使用OVER()子句。

例子:

以下示例查询十亿行表以说明COUNT(DISTINCT)和NDV()的相对性能。 它显示了COUNT(DISTINCT)如何给出精确答案,但对于近似结果足够的大规模数据来说效率很低。 NDV()函数给出了近似结果,但速度更快。

 select count(distinct col1) from sample_data; +---------------------+ | count(distinct col1)| +---------------------+ | 100000 | +---------------------+ Fetched 1 row(s) in 20.13s select cast(ndv(col1) as bigint) as col1 from sample_data; +----------+ | col1 | +----------+ | 139017 | +----------+ Fetched 1 row(s) in 8.91s 
以下示例显示了如何在单个查询中编写多个NDV()调用,以便轻松了解哪些列具有更多或更少的不同值。 此技术比使用COUNT(DISTINCT)调用运行一系列查询更快。

 select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2, cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4 from sample_data; +----------+-----------+------------+-----------+ | col1 | col2 | col3 | col4 | +----------+-----------+------------+-----------+ | 139017 | 282 | 46 | 145636240 | +----------+-----------+------------+-----------+ Fetched 1 row(s) in 34.97s select count(distinct col1) from sample_data; +---------------------+ | count(distinct col1)| +---------------------+ | 100000 | +---------------------+ Fetched 1 row(s) in 20.13s select count(distinct col2) from sample_data; +----------------------+ | count(distinct col2) | +----------------------+ | 278 | +----------------------+ Fetched 1 row(s) in 20.09s select count(distinct col3) from sample_data; +-----------------------+ | count(distinct col3) | +-----------------------+ | 46 | +-----------------------+ Fetched 1 row(s) in 19.12s select count(distinct col4) from sample_data; +----------------------+ | count(distinct col4) | +----------------------+ | 147135880 | +----------------------+ Fetched 1 row(s) in 266.95s 
父主题: Impala聚合函数