1,查询某一日期是否在当月的最后三天
cast(updated_at as timestamp)
between date_add('day',-3,date_trunc('month',date_add('month',1,cast(updated_at as timestamp))))
and date_add('day',-1,date_trunc('month',date_add('month',1,cast(updated_at as timestamp))))
updated_at是varchar类型,date_add与date_trunc函数都是需要date,time类型,不支持字符串格式。
date_trunc日期截取函数
2,同一业务ID下,最后一次流转后员工ID在历史记录中出现过,则该业务ID不出现表结构如上,业务ID=2即为所需,1需排除
select * from table where dt = '2019-09-30'
and rank = 1
and carid not in --反向选取 (select distinct t01.carid from--找到同一carid下,最后一次员工ID出现在其他记录的carid (select * from table where dt = '2019-09-30'
)t01
inner join
(select * from table where dt = '2019-09-30'
)t02
on t02.carid = t01.carid
and t01.rank < t02.rank
and t01.rank = 1
and (t01.after_id = t02.after_id or t01.after_id = t02.before_id))
经验:正向走不通时,反向处理可能是一种解法
3,python 向SQL中传递参数
conn = presto_client.connect(,,,)
supplier= ['江苏出版社','人民出版社']
sql = '''select distinct name as "名称",address as "地址"from tablewhere name in%(supplier)s'''
df = pd.read_sql_query(sql,conn,params = {'supplier':supplier[1]})
read_sql_query :在sql引用%(var_name)s,调用时使用params参数,格式为params = {'name':'value'}
4,presto SQL:多条件模糊匹配
多条件模糊匹配:regexp_like(字段名, '(str1|str2|...)')
示例如下:
where talbe.name not like '%测试%'
and talbe.name not like '%test%'
-------------可转换为------where not regexp_like(talbe.name, '(测试|test)')
右模糊匹配:regexp_like(col, '^(str1|str2|...)')
左模糊匹配:regexp_like(col, '(str1|str2|...)$')
5,SQL函数总结
1.1 字符
substr(staff_leave_date,1,10)
concat(string1, ..., stringN) → varchar--拼接多个字符串 length(string) → bigint
length(cast(start_time as varchar))
replace(string, search, replace) → varchar
trim(string) → varchar--删除左右两侧的空格 split_part(ip,':',1)--以冒号分割取第二部分 upper/lower(string) → varchar--大小写 ###mid(str,pos,len)--从指定位置截取指定长度的字符串 --错误 substr(str,start,length)--截取指定长度字符串1.2 日期
current_date -> date
now() → timestamp
from_unixtime(create_time)--int/bigint→ timestamp substr(staff_leave_date,1,10)
date_format(from_unixtime(status_update_time),'%Y-%m-%d')--调整时期格式'2017-11-25' date_format(cast(audit_time as timestamp),'%Y-%m-%d %H:%i:%s')--2018-04-13 22:09:16 date_add('day', -1, CURRENT_DATE)--当前日期-1 date_trunc('month',from_unixtime(create_time))--当前月份的第一天,2001-08-22 03:04:05.321→2001-08-01 00:00:00.000
date_diff(unit, timestamp1, timestamp2) → bigint
date_diff('day',cast(substr(audit_at,1,10) as date),cast(current_timestamp as date)) as "库存时间"
day_of_week(x) → bigint
Returns the ISO day of the week from x. The value ranges from 1 (Monday) to 7 (Sunday).
week_of_year(x) → bigint
This is an alias for week()
1.3 json
json_extract(json, json_path) → json
json_extract_scalar(json, json_path) → varchar
{"chang_status_remark":["ChangeApply.id = 56798765","Apply.id = 345876"],"source_status":[3,0],"status_update_time":[98765,234589]}
json_extract(change_info,'$.source_status') --[3,0] json_extract_scalar(change_info,'$.source_status[1]')--0 json_extract_scalar(json, '$.store.book[0].author')
1.4 关联
表间关联
t01
inner/left/right join
t02
on t01.id = t02.id
left/right 未关联到的字段置为null
多部分数据拼接 union:对结果去重且排序 union all:直接返回合并的结果,速度快
union中的每个查询必须包含相同的列,表达式及聚合函数;
只能使用一条order by字句,且必须放在最后一个查询后面
1.5 条件
两分类:
if(staff_status =1,'在职','离职') as "在职状态"
等价于
case when staff_status=1 then '在职'
else '离职'
end as "在职状态"
多分类:
case when level=1 then 'A'
when level=2 then 'B'
when level=3 then 'C'
else 'D'
end as "等级"
coalesce(value1, value2[, ...])--返回第一个非NULL的值 Coalesce(vehicle_finance_loan_order_is_micro_loan, 0)
1.6 正则
store_main_store_name not like '%测试%'
and store_main_store_name not like '%test%'
等价于
not regexp_like(store_main_store_name, '(测试|test)')
regexp_like(location_name, '(徐州|宿迁|盐城|苏州|南京|淮安|常州|无锡|南通|扬州|泰州|镇江|连云港)')
regexp_like(字段名, '(str1|str2|...)')--全模糊匹配 regexp_like(字段名, '^(str1|str2|...)')--右模糊匹配 regexp_like(字段名, '(str1|str2|...)$')--左模糊匹配 regexp_extract_all(string, pattern) → array
regexp_extract_all('1a 2b 14m', '\d+'); -- [1, 2, 14] regexp_extract(string, pattern) → varchar
regexp_extract('1a 2b 14m', '\d+'); -- 11.7 转换
cast(value AS type) → type
cast('2017-01-01' as timestamp)--2017-01-01 00:00:00.0 cast(from_unixtime(appoint_actual_time) as timestamp)--2018-04-13 15:55:55.0 cast(1.22222 as decimal(10,2))--1.22 cast(substr(audit_at,1,10) as date)--2018-04-13 cast(business_key_ as int)--32971656 cast(start_time as varchar)
try_cast(value AS type) → type
将value转换为type指定的类型,如果转换错误,则返回NULL。
typeof(expr) → varchar
返回表达式expr数据类型的名称。
typeof(123); -- integer1.8 聚合统计
group by user_id having
count()→ bigint
sum()
max()
min()
mean()
distinct id
variance
stddev(x) → double
variance(x) → double
1.9 排序
order by score asc/desc
dense_rank() → bigint
percent_rank() → bigint
rank() → bigint
row_number
row_number() over(partition by user_id order by score asc)
sum(totalprice) over (partition by clerk order by orderdate) as rolling_sum
6,创建时间的下一个整点
date_trunc('hour', date_add('hour', 1, cast(created_at as timestamp)))--created_at为string
7,记录个问题,from_unixtime(create_time/1000)--单位为秒,如何取到毫秒?
create_time from_unixtime(create_time/1000)
1602381660211 2020-10-11 10:01:00.0
1602381660381 2020-10-11 10:01:00.0
找到了解决办法!!!format_datetime
select create_time as bigint
,from_unixtime(create_time/1000) as "秒"--ms与s进制是1000,format_datetime(from_unixtime(action_time/1000),'yyyy-MM-dd HH:mm:ss.mmm') as "毫秒"bigint秒毫秒
16032362589922020-10-21 07:24:18.02020-10-21 07:24:18.024
create_time 秒 毫秒
1602381660211 2020-10-11 10:01:00.0 2020-10-11 10:01:00.001
1602381660381 2020-10-11 10:01:00.0 2020-10-11 10:01:00.001
好像没起作用。。。
8,presto sql 如何将sql脚本作为where查询条件
select *
from table
where query = 'select aa from tt where name ='北京' limit 1'
答:当作普通字符串处理就好,单引号记得用 ''转义一下,举例如下
select *
from table
where query = 'select aa from tt where name = ''北京'' limit 1'
9,分组排序,选TOP
需求:如何在系统日志中找到每个员工访问量前三的IP,限定每个IP下访问量1K及以上
select user_id
,ip
,num
from
(
select user_id
,ip
,count(*) as num
,row_number () over(partition by user_id order by count(*) desc) rank
from table table
where dt = '2020-11-30'
group by user_id,ip
having count(*) >= 1000
) tt
where rank <=3
order by user_id,num desc
此时rank 是对每个员工下不同IP分组计数降序排列,再筛选前三即可
这里注意,窗口函数分组不可以写为
row_number ()over(partitionby user_id,iporderbycount(*)desc) rank
因为它的结果是对每个员工每个IP统计排序,rank 只有一种取值,等于1。
10,between..and ..注解
看网上提不同数据库对between..and ..的操作方式存在不同,便自己去试了下
select distinct dt from table
where dt between '2020-11-01' and '2020-11-03'
结果如下:
dt
2020-11-02
2020-11-01
2020-11-03
说明presto sql对于between.and ..是按照闭区间方式处理的([2020-11-01,2020-11-03])
mysql也是包含边界的
另between还有取反的操作
where dt not between 1 and 3
等价于dt<1或dt>3