drill 学习 四 drill参考案例
具体案例来自官方文档,只是简单的说明下,详细的可以参考官方文档
环境准备
- 运行drill
测试环境可以通过docker运行,简单方便
简单查询
- 查询json文件
SELECT * FROM cp.`employee.json` LIMIT 3;
- 查询parquet文件
SELECT * FROM dfs.`Users/drilluser/apache-drill/sample-data/region.parquet`;
分析Yelp Academic Dataset
注意数据下载地址好像不存在或者变动了
- 查看业务数据
select * from dfs.`<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json
- 总数统计
select sum(review_count) as totalreviews from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`;
- top 10 统计
select state, city, count(*) totalreviews from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` group by state, city order by count(*) desc limit 10
- 统计平均值
trunc 函数的意思是返回包含小数点.0
而不是返回一个整数
select stars,trunc(avg(review_count)) reviewsavg
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
group by stars order by stars desc;
- 查询json 数组包含Restaurants的统计
select name,repeated_count(categories) as categorycount, categories from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants') order by repeated_count(categories) desc limit 10;
- json 数据的扁平化
json 数组转为table
select name, flatten(categories) as category
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` limit 20;
分析动态数据
核心还是一个分析json文件的,只是我们的json是一个比较复杂的,我们需要基于drill提供的一些函数方便的解析数据
- 参考数据格式
check-in
{
'type': 'checkin',
'business_id': (encrypted business id),
'checkin_info': {
'0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
'1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
...
'14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
...
'23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
}, # if there was no checkin for a hour-day block it will not be in the dataset
}
- key-value 映射
基于KVGEN函数
SELECT KVGEN(checkin_info) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` LIMIT 2;
- 统计总数
集合KVGEN,FLATTEN 以及SUM函数
SELECT SUM(checkintbl.checkins.`value`) AS TotalCheckins FROM (
SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl;
- 统计一天每个小时的总数
SELECT SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) AS HourOfTheDay ,SUM(checkintbl.checkins.`value`) TotalCheckins FROM (
SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl GROUP BY SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) ORDER BY TotalCheckins;
窗口函数使用
drill 支持的窗口函数:
使用over 从句的partition by
sum,max,min,count,avg sql 函数
分析函数first_value,last_value,Lead,Lag,Ntitle,Row_number,Rank
- 基于window函数的简单查询
统计每个城市基于浏览数的信息
SELECT name, city, review_count, row_number()
OVER (PARTITION BY city ORDER BY review_count DESC) AS rownum
FROM `business.json` LIMIT 15;
- 复杂高级查询
基于with 语句
按评论数列出前10个城市及其排名最高的企业
WITH X
AS
(SELECT name, city, review_count,
RANK()
OVER (PARTITION BY city
ORDER BY review_count DESC) AS review_rank
FROM `business.json`)
SELECT X.name, X.city, X.review_count
FROM X
WHERE X.review_rank =1 ORDER BY review_count DESC LIMIT 10;
- 将每个企业的评论数与城市中排名最高和最低的评论数进行比较
SELECT name, city, review_count,
FIRST_VALUE(review_count)
OVER(PARTITION BY city ORDER BY review_count DESC) AS top_review_count,
LAST_VALUE(review_count)
OVER(PARTITION BY city ORDER BY review_count DESC) AS bottom_review_count
FROM `business.json` limit 15;
- 将评论数与上一个和下一个业务的评论数进行比较
SELECT city, review_count, name,
LAG(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC)
AS preceding_count,
LEAD(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC)
AS following_count
FROM `business.json` limit 15;
说明
drill 提供的sql函数还是很强大的,丰富而且灵活,更多的我们可以参考官方文档