SQL是一种冗长的语言,其中最冗长的特性之一是窗口函数.
在.最近遇到的堆栈溢出问题,有人要求计算某一特定日期的时间序列中的第一个值和最后一个值之间的差额:
输入
volume tstamp
---------------------------
29011 2012-12-28 09:00:00
28701 2012-12-28 10:00:00
28830 2012-12-28 11:00:00
28353 2012-12-28 12:00:00
28642 2012-12-28 13:00:00
28583 2012-12-28 14:00:00
28800 2012-12-29 09:00:00
28751 2012-12-29 10:00:00
28670 2012-12-29 11:00:00
28621 2012-12-29 12:00:00
28599 2012-12-29 13:00:00
28278 2012-12-29 14:00:00
期望输出
first last difference date
------------------------------------
29011 28583 428 2012-12-28
28800 28278 522 2012-12-29
如何编写查询
请注意,值和时间戳级数可能不相关。所以,没有一条规定如果Timestamp2 > Timestamp1
然后Value2 < Value1
。否则,这个简单的查询就能工作(使用PostgreSQL语法):
SELECT
max(volume) AS first,
min(volume) AS last,
max(volume) - min(volume) AS difference,
CAST(tstamp AS DATE) AS date
FROM t
GROUP BY CAST(tstamp AS DATE);
有几种方法可以在不涉及窗口函数的组中找到第一个和最后一个值。例如:
- 在Oracle中,可以使用第一和最后函数,由于某些神秘原因,这些函数没有编写。
FIRST(...) WITHIN GROUP (ORDER BY ...)
或LAST(...) WITHIN GROUP (ORDER BY ...)
,与其他排序集聚合函数一样,但是some_aggregate_function(...) KEEP (DENSE_RANK FIRST ORDER BY ...)
。围棋数字 - 在PostgreSQL中,可以使用
DISTINCT ON
语法与ORDER BY
和LIMIT
有关各种方法的更多细节可以在这里找到:
https://blog.jooq.org/2017/09/22/how-to-write-efficient-top-n-queries-in-sql
最好的方法是使用像Oracle这样的聚合函数,但是很少有数据库具有这种功能。所以,我们将使用FIRST_VALUE
和LAST_VALUE
窗口函数:
SELECT DISTINCT
first_value(volume) OVER (
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first,
last_value(volume) OVER (
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last,
first_value(volume) OVER (
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
- last_value(volume) OVER (
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS diff,
CAST(tstamp AS DATE) AS date
FROM t
ORDER BY CAST(tstamp AS DATE)
哎呀。
看上去不太容易读。但它将产生正确的结果。当然,我们可以包装列的定义。FIRST
和LAST
在派生表中,但这仍然会给我们留下两次窗口定义的重复:
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
援救窗口条款
幸运的是,至少有3个数据库实现了sql标准。WINDOW
条款:
- MySQL
- PostgreSQL
- Sybase SQL Anywhere
上面的查询可以重构为这个查询:
SELECT DISTINCT
first_value(volume) OVER w AS first,
last_value(volume) OVER w AS last,
first_value(volume) OVER w
- last_value(volume) OVER w AS diff,
CAST(tstamp AS DATE) AS date
FROM t
WINDOW w AS (
PARTITION BY CAST(tstamp AS DATE)
ORDER BY tstamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
ORDER BY CAST(tstamp AS DATE)
请注意,如何使用窗口规范来指定窗口名称,就像定义公共表达式一样(WITH
条款):
WINDOW
<window-name> AS (<window-specification>)
{ ,<window-name> AS (<window-specification>)... }
我不仅可以重用整个规范,还可以根据部分规范构建规范,并且只重用部分规范。我以前的查询可以这样重写:
SELECT DISTINCT
first_value(volume) OVER w3 AS first,
last_value(volume) OVER w3 AS last,
first_value(volume) OVER w3
- last_value(volume) OVER w3 AS diff,
CAST(tstamp AS DATE) AS date
FROM t
WINDOW
w1 AS (PARTITION BY CAST(tstamp AS DATE)),
w2 AS (w1 ORDER BY tstamp),
w3 AS (w2 ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
ORDER BY CAST(tstamp AS DATE)
每个窗口规范可以从头创建,也可以基于先前定义的窗口规范。注在引用窗口定义时也是如此。如果我想重用PARTITION BY
条款和ORDER BY
子句,但请更改FRAME
条款(ROWS ...
),那么我就可以这样写了:
SELECT DISTINCT
first_value(volume) OVER (
w2 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS first,
last_value(volume) OVER (
w2 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) AS last,
first_value(volume) OVER (
w2 ROWS UNBOUNDED PRECEDING
) - last_value(volume) OVER (
w2 ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING
) AS diff,
CAST(tstamp AS DATE) AS date
FROM t
WINDOW
w1 AS (PARTITION BY CAST(tstamp AS DATE)),
w2 AS (w1 ORDER BY tstamp)
ORDER BY CAST(tstamp AS DATE)