drill 学习 四 drill参考案例

具体案例来自官方文档,只是简单的说明下,详细的可以参考官方文档

环境准备

  • 运行drill
    测试环境可以通过docker运行,简单方便

简单查询

  • 查询json文件
 
SELECT * FROM cp.`employee.json` LIMIT 3;
  • 查询parquet文件
SELECT * FROM dfs.`Users/drilluser/apache-drill/sample-data/region.parquet`;

分析Yelp Academic Dataset

注意数据下载地址好像不存在或者变动了

  • 查看业务数据
select * from dfs.`<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json
  • 总数统计
select sum(review_count) as totalreviews from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`;
  • top 10 统计
select state, city, count(*) totalreviews from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` group by state, city order by count(*) desc limit 10
  • 统计平均值
    trunc 函数的意思是返回包含小数点.0 而不是返回一个整数
 
select stars,trunc(avg(review_count)) reviewsavg 
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
group by stars order by stars desc;
  • 查询json 数组包含Restaurants的统计
select name,repeated_count(categories) as categorycount, categories from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants') order by repeated_count(categories) desc limit 10;
  • json 数据的扁平化
    json 数组转为table
 
select name, flatten(categories) as category 
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`  limit 20;

分析动态数据

核心还是一个分析json文件的,只是我们的json是一个比较复杂的,我们需要基于drill提供的一些函数方便的解析数据

  • 参考数据格式
check-in
{
    'type': 'checkin',
    'business_id': (encrypted business id),
    'checkin_info': {
        '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
        '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
        ...
        '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
        ...
        '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
    }, # if there was no checkin for a hour-day block it will not be in the dataset
}
  • key-value 映射
    基于KVGEN函数
 
SELECT KVGEN(checkin_info) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` LIMIT 2;
  • 统计总数
    集合KVGEN,FLATTEN 以及SUM函数
 
SELECT SUM(checkintbl.checkins.`value`) AS TotalCheckins FROM (
SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl;
  • 统计一天每个小时的总数
SELECT SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) AS HourOfTheDay ,SUM(checkintbl.checkins.`value`) TotalCheckins FROM (
SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl GROUP BY SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) ORDER BY TotalCheckins;

窗口函数使用

drill 支持的窗口函数:
使用over 从句的partition by
sum,max,min,count,avg sql 函数
分析函数first_value,last_value,Lead,Lag,Ntitle,Row_number,Rank

  • 基于window函数的简单查询
    统计每个城市基于浏览数的信息
 
SELECT name, city, review_count, row_number()
OVER (PARTITION BY city ORDER BY review_count DESC) AS rownum 
FROM `business.json` LIMIT 15; 
  • 复杂高级查询
    基于with 语句
    按评论数列出前10个城市及其排名最高的企业
 
WITH X
AS
(SELECT name, city, review_count,
RANK()
OVER (PARTITION BY city
ORDER BY review_count DESC) AS review_rank
FROM `business.json`)
SELECT X.name, X.city, X.review_count
FROM X
WHERE X.review_rank =1 ORDER BY review_count DESC LIMIT 10;
  • 将每个企业的评论数与城市中排名最高和最低的评论数进行比较
SELECT name, city, review_count,
FIRST_VALUE(review_count)
OVER(PARTITION BY city ORDER BY review_count DESC) AS top_review_count,
LAST_VALUE(review_count)
OVER(PARTITION BY city ORDER BY review_count DESC) AS bottom_review_count
FROM `business.json` limit 15;
  • 将评论数与上一个和下一个业务的评论数进行比较
SELECT city, review_count, name,
LAG(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC) 
AS preceding_count,
LEAD(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC) 
AS following_count
FROM `business.json` limit 15;

说明

drill 提供的sql函数还是很强大的,丰富而且灵活,更多的我们可以参考官方文档

参考资料

https://drill.apache.org/docs/tutorials-introduction/

posted on 2020-10-04 11:00  荣锋亮  阅读(220)  评论(0编辑  收藏  举报