1,查询某一日期是否在当月的最后三天

cast(updated_at as timestamp)

between date_add('day',-3,date_trunc('month',date_add('month',1,cast(updated_at as timestamp))))

and date_add('day',-1,date_trunc('month',date_add('month',1,cast(updated_at as timestamp))))

updated_at是varchar类型,date_add与date_trunc函数都是需要date,time类型,不支持字符串格式。

date_trunc日期截取函数

2,同一业务ID下,最后一次流转后员工ID在历史记录中出现过,则该业务ID不出现表结构如上,业务ID=2即为所需,1需排除

select * from table where dt = '2019-09-30'

and rank = 1

and carid not in --反向选取 (select distinct t01.carid from--找到同一carid下,最后一次员工ID出现在其他记录的carid (select * from table where dt = '2019-09-30'

)t01

inner join

(select * from table where dt = '2019-09-30'

)t02

on t02.carid = t01.carid

and t01.rank < t02.rank

and t01.rank = 1

and (t01.after_id = t02.after_id or t01.after_id = t02.before_id))

经验:正向走不通时,反向处理可能是一种解法

3,python 向SQL中传递参数

conn = presto_client.connect(,,,)

supplier= ['江苏出版社','人民出版社']

sql = '''select distinct name as "名称",address as "地址"from tablewhere name in%(supplier)s'''

df = pd.read_sql_query(sql,conn,params = {'supplier':supplier[1]})

read_sql_query :在sql引用%(var_name)s,调用时使用params参数,格式为params = {'name':'value'}

4,presto SQL:多条件模糊匹配

多条件模糊匹配:regexp_like(字段名, '(str1|str2|...)')

示例如下:

where talbe.name not like '%测试%'

and talbe.name not like '%test%'

-------------可转换为------where not regexp_like(talbe.name, '(测试|test)')

右模糊匹配:regexp_like(col, '^(str1|str2|...)')

左模糊匹配:regexp_like(col, '(str1|str2|...)$')

5,SQL函数总结

1.1 字符

substr(staff_leave_date,1,10)

concat(string1, ..., stringN) → varchar--拼接多个字符串 length(string) → bigint

length(cast(start_time as varchar))

replace(string, search, replace) → varchar

trim(string) → varchar--删除左右两侧的空格 split_part(ip,':',1)--以冒号分割取第二部分 upper/lower(string) → varchar--大小写 ###mid(str,pos,len)--从指定位置截取指定长度的字符串 --错误 substr(str,start,length)--截取指定长度字符串1.2 日期

current_date -> date

now() → timestamp

from_unixtime(create_time)--int/bigint→ timestamp substr(staff_leave_date,1,10)

date_format(from_unixtime(status_update_time),'%Y-%m-%d')--调整时期格式'2017-11-25' date_format(cast(audit_time as timestamp),'%Y-%m-%d %H:%i:%s')--2018-04-13 22:09:16 date_add('day', -1, CURRENT_DATE)--当前日期-1 date_trunc('month',from_unixtime(create_time))--当前月份的第一天,2001-08-22 03:04:05.321→2001-08-01 00:00:00.000

date_diff(unit, timestamp1, timestamp2) → bigint

date_diff('day',cast(substr(audit_at,1,10) as date),cast(current_timestamp as date)) as "库存时间"

day_of_week(x) → bigint

Returns the ISO day of the week from x. The value ranges from 1 (Monday) to 7 (Sunday).

week_of_year(x) → bigint

This is an alias for week()

1.3 json

json_extract(json, json_path) → json

json_extract_scalar(json, json_path) → varchar

{"chang_status_remark":["ChangeApply.id = 56798765","Apply.id = 345876"],"source_status":[3,0],"status_update_time":[98765,234589]}

json_extract(change_info,'$.source_status') --[3,0] json_extract_scalar(change_info,'$.source_status[1]')--0 json_extract_scalar(json, '$.store.book[0].author')

1.4 关联

表间关联

t01

inner/left/right join

t02

on t01.id = t02.id

left/right 未关联到的字段置为null

多部分数据拼接 union:对结果去重且排序 union all:直接返回合并的结果,速度快

union中的每个查询必须包含相同的列,表达式及聚合函数;

只能使用一条order by字句,且必须放在最后一个查询后面

1.5 条件

两分类:

if(staff_status =1,'在职','离职') as "在职状态"

等价于

case when staff_status=1 then '在职'

else '离职'

end as "在职状态"

多分类:

case when level=1 then 'A'

when level=2 then 'B'

when level=3 then 'C'

else 'D'

end as "等级"

coalesce(value1, value2[, ...])--返回第一个非NULL的值 Coalesce(vehicle_finance_loan_order_is_micro_loan, 0)

1.6 正则

store_main_store_name not like '%测试%'

and store_main_store_name not like '%test%'

等价于

not regexp_like(store_main_store_name, '(测试|test)')

regexp_like(location_name, '(徐州|宿迁|盐城|苏州|南京|淮安|常州|无锡|南通|扬州|泰州|镇江|连云港)')

regexp_like(字段名, '(str1|str2|...)')--全模糊匹配 regexp_like(字段名, '^(str1|str2|...)')--右模糊匹配 regexp_like(字段名, '(str1|str2|...)$')--左模糊匹配 regexp_extract_all(string, pattern) → array

regexp_extract_all('1a 2b 14m', '\d+'); -- [1, 2, 14] regexp_extract(string, pattern) → varchar

regexp_extract('1a 2b 14m', '\d+'); -- 11.7 转换

cast(value AS type) → type

cast('2017-01-01' as timestamp)--2017-01-01 00:00:00.0 cast(from_unixtime(appoint_actual_time) as timestamp)--2018-04-13 15:55:55.0 cast(1.22222 as decimal(10,2))--1.22 cast(substr(audit_at,1,10) as date)--2018-04-13 cast(business_key_ as int)--32971656 cast(start_time as varchar)

try_cast(value AS type) → type

将value转换为type指定的类型,如果转换错误,则返回NULL。

typeof(expr) → varchar

返回表达式expr数据类型的名称。

typeof(123); -- integer1.8 聚合统计

group by user_id having

count()→ bigint

sum()

max()

min()

mean()

distinct id

variance

stddev(x) → double

variance(x) → double

1.9 排序

order by score asc/desc

dense_rank() → bigint

percent_rank() → bigint

rank() → bigint

row_number

row_number() over(partition by user_id order by score asc)

sum(totalprice) over (partition by clerk order by orderdate) as rolling_sum

6,创建时间的下一个整点

date_trunc('hour', date_add('hour', 1, cast(created_at as timestamp)))--created_at为string

7,记录个问题,from_unixtime(create_time/1000)--单位为秒,如何取到毫秒?

create_time from_unixtime(create_time/1000)

1602381660211 2020-10-11 10:01:00.0

1602381660381 2020-10-11 10:01:00.0

找到了解决办法!!!format_datetime

select create_time as bigint

,from_unixtime(create_time/1000) as "秒"--ms与s进制是1000,format_datetime(from_unixtime(action_time/1000),'yyyy-MM-dd HH:mm:ss.mmm') as "毫秒"bigint秒毫秒

16032362589922020-10-21 07:24:18.02020-10-21 07:24:18.024

create_time 秒 毫秒

1602381660211 2020-10-11 10:01:00.0 2020-10-11 10:01:00.001

1602381660381 2020-10-11 10:01:00.0 2020-10-11 10:01:00.001

好像没起作用。。。

8,presto sql 如何将sql脚本作为where查询条件

select *

from table

where query = 'select aa from tt where name ='北京' limit 1'

答:当作普通字符串处理就好,单引号记得用 ''转义一下,举例如下

select *

from table

where query = 'select aa from tt where name = ''北京'' limit 1'

9,分组排序,选TOP

需求:如何在系统日志中找到每个员工访问量前三的IP,限定每个IP下访问量1K及以上

select user_id

,ip

,num

from

(

select user_id

,ip

,count(*) as num

,row_number () over(partition by user_id order by count(*) desc) rank

from table table

where dt = '2020-11-30'

group by user_id,ip

having count(*) >= 1000

) tt

where rank <=3

order by user_id,num desc

此时rank 是对每个员工下不同IP分组计数降序排列,再筛选前三即可

这里注意,窗口函数分组不可以写为

row_number ()over(partitionby user_id,iporderbycount(*)desc) rank

因为它的结果是对每个员工每个IP统计排序,rank 只有一种取值,等于1。

10,between..and ..注解

看网上提不同数据库对between..and ..的操作方式存在不同,便自己去试了下

select distinct dt from table

where dt between '2020-11-01' and '2020-11-03'

结果如下:

dt

2020-11-02

2020-11-01

2020-11-03

说明presto sql对于between.and ..是按照闭区间方式处理的([2020-11-01,2020-11-03])

mysql也是包含边界的

另between还有取反的操作

where dt not between 1 and 3

等价于dt<1或dt>3