Answer a question

In Postgres 9.2, I have a table containing measures taken at specific points in time:

CREATE TABLE measures (dt timestamptz, measure integer);

INSERT INTO measures VALUES
('2015-01-13 12:05', 10), 
('2015-01-13 12:30', 8), 
('2015-01-13 13:02', 16), 
('2015-01-13 13:30', 12), 
('2015-01-13 14:15', 7);

I want to calculate the average and count of lines for 1-hour periods, which I can do this way:

SELECT date_trunc('hour', dt) as d, max(measure), count(*)
FROM measures group by d order by d;

But instead of 1-hour periods starting at 12:00, 13:00 etc. I want 1-hour periods following an event. That's one period from 12:05 to 13:05 and the next from 13:30 to 14:30 in this case.

Is this possible in PostgreSQL?

Answers

Recursive CTE

Plain SQL with a recursive CTE works:

WITH RECURSIVE cte AS (
   SELECT t.dt, m.measure
   FROM  (SELECT dt FROM measures ORDER BY 1 LIMIT 1) t -- no lower bound
   JOIN   measures m ON m.dt < t.dt + interval '1h'  -- excl. upper bound

   UNION ALL
   SELECT t.dt, m.measure
   FROM  (
      SELECT m.dt
      FROM  (SELECT dt FROM cte LIMIT 1) c
      JOIN   measures m ON m.dt >= c.dt + interval '1h'
      ORDER  BY 1
      LIMIT  1
      ) t
   JOIN   measures m ON m.dt >= t.dt                 -- incl. lower bound
                    AND m.dt <  t.dt + interval '1h' -- excl. upper bound
   )
SELECT dt AS hour_start
     , round(avg(measure), 2) AS avg_measure, count(*) AS ct
FROM   cte
GROUP  BY 1
ORDER  BY 1;

Returns:

hour_start          | avg_measure | ct
--------------------+-------------+----
2015-01-13 13:05:00 | 11.33       | 3
2015-01-13 14:30:00 | 9.50        | 2

db<>fiddle here (with added test on big table with index and selected time frame)
Old sqlfiddle

It performs decently with an index on dt - or better a multicolumn index to allow index-only scans in Postgres 9.2+:

CREATE INDEX measures_foo_idx ON measures (dt, measure);

That's standard SQL including the recursive CTE except for LIMIT. Postgres supports the standard keywords FETCH FIRST as well, if you need it all standard SQL.

Window function?

Not possible with a single window function

While the result of a window function is an aggregate over the window frame, the frame definition itself cannot reference other rows. In your case, granularity is dynamically determined by considering all rows from first to last. This is not possible with a single window function.

But!

We can still get rolling hourly averages for every row using a window frame with the RANGE clause bounded by an interval - requires Postgres 11 or later.

SELECT *, avg(measure) OVER (ORDER BY dt
                             RANGE BETWEEN CURRENT ROW AND '1 hour' FOLLOWING)
FROM   measures;

That generates aggregates for every row cheaply. Then we need to filter every start of a new period dynamically. We can use the row count and skip ahead by the number of rows in each hourly period - a PL/pgSQL cursor lends itself to the task naturally:

CREATE OR REPLACE FUNCTION f_dynamic_hourly_avg()
  RETURNS TABLE(hour_start timestamp, avg_measure numeric, ct int)
  LANGUAGE plpgsql AS
$func$
DECLARE
    _cursor CURSOR FOR
      SELECT dt, round(avg(measure) OVER w, 2), count(*) OVER w 
      FROM   measures
      WINDOW w AS (ORDER BY dt RANGE BETWEEN CURRENT ROW AND '1 hour' FOLLOWING);
BEGIN
    OPEN _cursor;
    FETCH _cursor INTO hour_start, avg_measure, ct;
    WHILE FOUND
    LOOP
      RETURN NEXT;
      FETCH RELATIVE ct FROM _cursor INTO hour_start, avg_measure, ct;
    END LOOP;
END
$func$;

Call:

SELECT * FROM f_dynamic_hourly_avg();

This turns out to be very efficient with only few rows per period. It will fall of with too many rows per period. Hard to pin down a number. Turns out to be 1000x faster in a quick benchmark with < 10 rows per period each.

db<>fiddle here

We could even work with a dynamic cursor and pass table and columns names to make it work for any table ...

Optimize performance

You basically need to iterate over all rows, which can be faster with a procedural solution: a FOR loop in a plpgsql function. Which will be faster?

  • The recursive query for few hours with many rows each.
  • The function for many hours with few rows each.
  • Update: the added function with a cursor over a query with window functions beats the rest by a long shot (while there are not too many rows per period?)

Related PL/pgSQL solutions:

  • Group by repeating attribute
  • GROUP BY and aggregate sequential numeric values
  • SQL Query where I get most recent rows from timestamp from another table
Logo

PostgreSQL社区为您提供最前沿的新闻资讯和知识内容

更多推荐