Sliding 1-hour periods aggregation query

Answer a question In Postgres 9.2, I have a table containing measures taken at specific points in time: CREATE TABLE measures (dt timestamptz, measure integer); INSERT INTO measures VALUES ('2015-01-1

Postgredaxiang

24人浏览 · 2022-09-30 10:59:28

Postgredaxiang · 2022-09-30 10:59:28 发布

Answer a question

In Postgres 9.2, I have a table containing measures taken at specific points in time:

CREATE TABLE measures (dt timestamptz, measure integer);

INSERT INTO measures VALUES
('2015-01-13 12:05', 10), 
('2015-01-13 12:30', 8), 
('2015-01-13 13:02', 16), 
('2015-01-13 13:30', 12), 
('2015-01-13 14:15', 7);

I want to calculate the average and count of lines for 1-hour periods, which I can do this way:

SELECT date_trunc('hour', dt) as d, max(measure), count(*)
FROM measures group by d order by d;

But instead of 1-hour periods starting at 12:00, 13:00 etc. I want 1-hour periods following an event. That's one period from 12:05 to 13:05 and the next from 13:30 to 14:30 in this case.

Is this possible in PostgreSQL?

Answers

Recursive CTE

Plain SQL with a recursive CTE works:

WITH RECURSIVE cte AS (
   SELECT t.dt, m.measure
   FROM  (SELECT dt FROM measures ORDER BY 1 LIMIT 1) t -- no lower bound
   JOIN   measures m ON m.dt < t.dt + interval '1h'  -- excl. upper bound

   UNION ALL
   SELECT t.dt, m.measure
   FROM  (
      SELECT m.dt
      FROM  (SELECT dt FROM cte LIMIT 1) c
      JOIN   measures m ON m.dt >= c.dt + interval '1h'
      ORDER  BY 1
      LIMIT  1
      ) t
   JOIN   measures m ON m.dt >= t.dt                 -- incl. lower bound
                    AND m.dt <  t.dt + interval '1h' -- excl. upper bound
   )
SELECT dt AS hour_start
     , round(avg(measure), 2) AS avg_measure, count(*) AS ct
FROM   cte
GROUP  BY 1
ORDER  BY 1;

Returns:

hour_start          | avg_measure | ct
--------------------+-------------+----
2015-01-13 13:05:00 | 11.33       | 3
2015-01-13 14:30:00 | 9.50        | 2

db<>fiddle here (with added test on big table with index and selected time frame)
Old sqlfiddle

It performs decently with an index on dt - or better a multicolumn index to allow index-only scans in Postgres 9.2+:

CREATE INDEX measures_foo_idx ON measures (dt, measure);

That's standard SQL including the recursive CTE except for LIMIT. Postgres supports the standard keywords FETCH FIRST as well, if you need it all standard SQL.

Window function?

Not possible with a single window function

While the result of a window function is an aggregate over the window frame, the frame definition itself cannot reference other rows. In your case, granularity is dynamically determined by considering all rows from first to last. This is not possible with a single window function.

But!

We can still get rolling hourly averages for every row using a window frame with the RANGE clause bounded by an interval - requires Postgres 11 or later.

SELECT *, avg(measure) OVER (ORDER BY dt
                             RANGE BETWEEN CURRENT ROW AND '1 hour' FOLLOWING)
FROM   measures;

That generates aggregates for every row cheaply. Then we need to filter every start of a new period dynamically. We can use the row count and skip ahead by the number of rows in each hourly period - a PL/pgSQL cursor lends itself to the task naturally:

CREATE OR REPLACE FUNCTION f_dynamic_hourly_avg()
  RETURNS TABLE(hour_start timestamp, avg_measure numeric, ct int)
  LANGUAGE plpgsql AS
$func$
DECLARE
    _cursor CURSOR FOR
      SELECT dt, round(avg(measure) OVER w, 2), count(*) OVER w 
      FROM   measures
      WINDOW w AS (ORDER BY dt RANGE BETWEEN CURRENT ROW AND '1 hour' FOLLOWING);
BEGIN
    OPEN _cursor;
    FETCH _cursor INTO hour_start, avg_measure, ct;
    WHILE FOUND
    LOOP
      RETURN NEXT;
      FETCH RELATIVE ct FROM _cursor INTO hour_start, avg_measure, ct;
    END LOOP;
END
$func$;

Call:

SELECT * FROM f_dynamic_hourly_avg();

This turns out to be very efficient with only few rows per period. It will fall of with too many rows per period. Hard to pin down a number. Turns out to be 1000x faster in a quick benchmark with < 10 rows per period each.

db<>fiddle here

We could even work with a dynamic cursor and pass table and columns names to make it work for any table ...

Optimize performance

You basically need to iterate over all rows, which can be faster with a procedural solution: a FOR loop in a plpgsql function. Which will be faster?

The recursive query for few hours with many rows each.
The function for many hours with few rows each.
Update: the added function with a cursor over a query with window functions beats the rest by a long shot (while there are not too many rows per period?)

Related PL/pgSQL solutions:

Group by repeating attribute
GROUP BY and aggregate sequential numeric values
SQL Query where I get most recent rows from timestamp from another table

PostgreSQL

PostgreSQL社区为您提供最前沿的新闻资讯和知识内容

更多推荐

PostgreSQL 计数查询效率,物化视图 [重复]

问题:PostgreSQL 计数查询效率,物化视图 [重复] 可能重复: PostgreSQL 计数查询优化使用 PostgreSQL 9.2,我们试图弄清楚是否有一种方法可以跟踪查询的结果数量,并以有效的方式返回该数字。这个查询应该每秒执行几次(可能几十到几百甚至几千次)。我们现在的查询看起来像这样,但我们想知道这是否效率低下: -- Get # of rows that do not hav

PostgreSQL

多对多中的唯一性

问题:多对多中的唯一性我无法弄清楚谷歌的哪些术语,所以帮助标记这个问题或只是以相关问题的方式向我指出会有所帮助。我相信我有一个典型的多对多关系: CREATE TABLE groups ( id integer PRIMARY KEY); CREATE TABLE elements ( id integer PRIMARY KEY); CREATE TABLE groups_elements

PostgreSQL

Django 与 postgresql - manage.py syncdb 返回错误

问题:Django 与 postgresql - manage.py syncdb 返回错误我从 Django 开始。我设置了一些使用 SQLite 工作的站点,但是在将 DB 引擎更改为 postgresql manage.py syncdb 后返回错误。我已经用谷歌搜索了 2 天,但对我仍然没有任何作用。Postgres 用户 'joe' 具有超级用户权限和本地 'joe ' 数据库存在。