SQL案例分析:移动平均值与累计求和

许多常见的聚合函数也可以作为窗口函数使用,包括AVG()、SUM()、COUNT()、MAX()以及MIN()函数等。

关于聚合函数的语法可以参考这篇文章。

示例表sales_monthly中存储了不同产品(苹果、香蕉以及桔子)每个月份的销量情况,以下是该表的创建脚本和数据:

复制
-- 创建销量表sales_monthly -- product表示产品名称,ym表示年月,amount表示销售金额(元) CREATE TABLE sales_monthly(product VARCHAR(20), ym VARCHAR(10), amount NUMERIC(10, 2)); -- 生成测试数据 INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201801,10159.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201802,10211.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201803,10247.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201804,10376.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201805,10400.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201806,10565.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201807,10613.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201808,10696.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201809,10751.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201810,10842.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201811,10900.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201812,10972.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201901,11155.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201902,11202.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201903,11260.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201904,11341.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201905,11459.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (苹果,201906,11560.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201801,10138.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201802,10194.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201803,10328.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201804,10322.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201805,10481.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201806,10502.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201807,10589.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201808,10681.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201809,10798.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201810,10829.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201811,10913.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201812,11056.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201901,11161.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201902,11173.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201903,11288.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201904,11408.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201905,11469.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (香蕉,201906,11528.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201801,10154.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201802,10183.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201803,10245.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201804,10325.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201805,10465.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201806,10505.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201807,10578.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201808,10680.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201809,10788.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201810,10838.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201811,10942.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201812,10988.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201901,11099.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201902,11181.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201903,11302.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201904,11327.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201905,11423.00); INSERT INTO sales_monthly (product,ym,amount) VALUES (桔子,201906,11524.00);1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.54.55.56.57.58.59.
移动平均值

AVG()函数作为窗口函数使用时,可以用于计算随着当前行移动的窗口内数据行的平均值。例如,以下语句用于查找不同产品截止到每个月份为止、最近3个月的平均销量:

复制
SELECT product AS "产品", ym "年月", amount "销量", AVG(amount) OVER ( PARTITION BY product ORDER BY ym ROWS BETWEEN 2PRECEDING AND CURRENT ROW ) AS "最近平均销量" FROM sales_monthly ORDER BY product, ym;1.2.3.4.5.6.7.8.

AVG()函数OVER子句中的PARTITION BY选项表示按照产品进行分区,ORDERBY选项表示按照月份进行排序,ROWS BETWEEN 2 PRECEDING AND CURRENT ROW表示窗口从当前行的前2行开始直到当前行结束。

该查询返回的结果如下:

复制
产品|年月 |销量 |最近平均销量 ---|------|--------|------------ 桔子|201801|10154.00|10154.000000 桔子|201802|10183.00|10168.500000 桔子|201803|10245.00|10194.000000 桔子|201804|10325.00|10251.000000 桔子|201805|10465.00|10345.000000 桔子|201806|10505.00|10431.666667 ...1.2.3.4.5.6.7.8.9.10.

对于“桔子”,第一个月份的分析窗口只有1行数据,因此平均销量为10154。第二个月份的分析窗口为第1行和第2行数据,因此平均销量为10168.5((10154+10183)/2)。第三个月份的分析窗口为第1行到第3行数据,因此平均销量为10194((10154+10183+10245)/3)。

依此类推,直到计算完“桔子”所有月份的平均销量,然后开始计算其他产品的平均销量。

累计求和

SUM()函数作为窗口函数时,可以用于统计指定窗口内的累计值。例如,以下语句用于查找不同产品截止到当前月份为止的累计销量:

复制
SELECT product AS "产品", ym "年月", amount "销量", SUM(amount) OVER ( PARTITION BY product ORDER BY ym ROWS BETWEENUNBOUNDED PRECEDING AND CURRENT ROW ) AS "累计销量" FROM sales_monthly ORDER BY product, ym;1.2.3.4.5.6.7.8.9.

SUM()函数OVER子句中的PARTITION BY选项表示按照产品进行分区,ORDERBY选项表示按照月份进行排序,ROWS BETWEENUNBOUNDED PRECEDING AND CURRENT ROW表示窗口从当前分区第1行开始直到当前行结束。

该查询返回的结果如下:

复制
产品|年月 |销量 |累计销量 ---|------|--------|--------- 桔子|201801|10154.00| 10154.00 桔子|201802|10183.00| 20337.00 桔子|201803|10245.00| 30582.00 桔子|201804|10325.00| 40907.00 桔子|201805|10465.00| 51372.00 桔子|201806|10505.00| 61877.00 ...1.2.3.4.5.6.7.8.9.10.

对于“桔子”,第一个月份的分析窗口只有1行数据,因此累计销量为10154。第二个月份的分析窗口为第1行和第2行数据,因此累计销量为20337(10154+10183)。第三个月份的分析窗口为第1行到第3行数据,因此累计销量为30582(10154+10183+10245)。

依此类推,直到计算完“桔子”所有月份的累计销量,然后开始计算其他产品的累计销量。

提示:对于聚合窗口函数,如果我们没有指定ORDER BY选项,默认的窗口大小就是整个分区。如果我们指定了ORDERBY选项,默认的窗口大小就是分区的第一行直到当前行。因此,以上示例语句的中ROWSBETWEEN UNBOUNDED PRECEDING AND CURRENT ROW选项可以省略。

除了使用ROWS关键字以数据行为单位指定窗口的偏移量之外,我们也可以使用RANGE关键字以数值为单位指定窗口的偏移量。

示例表transfer_log中记录了一些银行账号的交易日志,以下是该表创建脚本:

复制
-- 创建银行交易日志表transfer_log -- Oracle、MySQL、PostgreSQL以及SQLite CREATE TABLE transfer_log ( log_id INTEGER NOT NULL PRIMARY KEY, -- 交易日志编号 log_ts TIMESTAMP NOT NULL, -- 交易时间 from_user VARCHAR(50) NOT NULL, -- 交易发起账号 to_user VARCHAR(50), -- 交易接收账号 type VARCHAR(10) NOT NULL, -- 交易类型 amount NUMERIC(10) NOT NULL -- 交易金额(元) ); -- SQL Server CREATE TABLE transfer_log ( log_id INTEGER NOT NULL PRIMARY KEY, -- 交易日志编号 log_ts DATETIME2 NOT NULL, -- 交易时间 from_user VARCHAR(50) NOT NULL, -- 交易发起账号 to_user VARCHAR(50), -- 交易接收账号 type VARCHAR(10) NOT NULL, -- 交易类型 amount NUMERIC(10) NOT NULL -- 交易金额(元) ); -- 生成测试数据 -- Oracle 需要执行以下ALTER语句 -- ALTER SESSION SET nls_timestamp_format = YYYY-MM-DD HH24:MI:SS; INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (1,2021-01-02 10:31:40,62221234567890,NULL,存款,50000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (2,2021-01-02 10:32:15,62221234567890,NULL,存款,100000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (3,2021-01-03 08:14:29,62221234567890,62226666666666,转账,200000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (4,2021-01-05 13:55:38,62221234567890,62226666666666,转账,150000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (5,2021-01-07 20:00:31,62221234567890,62227777777777,转账,300000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (6,2021-01-09 17:28:07,62221234567890,62227777777777,转账,500000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (7,2021-01-10 07:46:02,62221234567890,62227777777777,转账,100000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (8,2021-01-11 09:36:53,62221234567890,NULL,存款,40000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (9,2021-01-12 07:10:01,62221234567890,62228888888881,转账,10000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (10,2021-01-12 07:11:12,62221234567890,62228888888882,转账,8000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (11,2021-01-12 07:12:36,62221234567890,62228888888883,转账,5000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (12,2021-01-12 07:13:55,62221234567890,62228888888884,转账,6000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (13,2021-01-12 07:14:24,62221234567890,62228888888885,转账,7000); INSERT INTO transfer_log (log_id,log_ts,from_user,to_user,type,amount) VALUES (14,2021-01-21 12:11:16,62221234567890,62228888888885,转账,70000);1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.

以下语句用于查找短期之内(5天)累计转账超过一百万元的账号:

复制
-- Oracle、MySQL以及PostgreSQL SELECT log_ts, from_user,total_amount FROM ( SELECT log_ts, from_user, SUM(amount) OVER ( PARTITION BY from_user ORDER BY log_ts RANGE INTERVAL 5 DAYPRECEDING ) AS total_amount FROM transfer_log WHERE TYPE = 转账 ) t WHERE total_amount >= 1000000;1.2.3.4.5.6.7.8.9.10.11.12.13.14.

其中,SUM()函数OVER子句中的RANGE选项指定了一个5天之内的时间窗口。该查询返回的结果如下。

复制
log_ts |from_user |total_amount -------------------|--------------|------------ 2021-01-10 07:46:02|62221234567890| 10500001.2.3.4.

账号“62221234567890”截止2021年01月10日07点46份02秒在最近5天之内累计转账105万。

SQLite不支持INTERVAL时间常量,我们可以将时间戳数据转换为整数后使用。例如:

复制
-- SQLite WITH tl(log_ts, unix, from_user,amount) AS ( SELECT log_ts, CAST(STRFTIME(%s,log_ts) AS INT), from_user, amount FROM transfer_log WHERE type = 转账 ) SELECT log_ts, from_user, total_amount FROM ( SELECT log_ts, from_user, SUM(amount) OVER ( PARTITION BY from_user ORDER BY unix RANGE 5 * 86400PRECEDING ) AS total_amount FROM tl ) t WHERE total_amount >= 1000000;1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.

我们首先定义了一个CTE,字段unix是将log_ts转换为1970年1月1日以来的整数秒。然后我们在SUM()函数中通过RANGE选项指定了一个5天(5*86400秒)之内的时间窗口。

Microsoft SQL Server中的RANGE窗口大小选项只能指定UNBOUNDED PRECEDING、UNBOUNDED FOLLOWING或者CURRENT ROW,不能指定一个具体的数值,因此无法实现以上查询。

阅读剩余
THE END