Sliding 1-hour periods aggregation query
Answer a question In Postgres 9.2, I have a table containing measures taken at specific points in time: CREATE TABLE measures (dt timestamptz, measure integer); INSERT INTO measures VALUES ('2015-01-1
Answer a question
In Postgres 9.2, I have a table containing measures taken at specific points in time:
CREATE TABLE measures (dt timestamptz, measure integer);
INSERT INTO measures VALUES
('2015-01-13 12:05', 10),
('2015-01-13 12:30', 8),
('2015-01-13 13:02', 16),
('2015-01-13 13:30', 12),
('2015-01-13 14:15', 7);
I want to calculate the average and count of lines for 1-hour periods, which I can do this way:
SELECT date_trunc('hour', dt) as d, max(measure), count(*)
FROM measures group by d order by d;
But instead of 1-hour periods starting at 12:00, 13:00 etc. I want 1-hour periods following an event. That's one period from 12:05 to 13:05 and the next from 13:30 to 14:30 in this case.
Is this possible in PostgreSQL?
Answers
Recursive CTE
Plain SQL with a recursive CTE works:
WITH RECURSIVE cte AS (
SELECT t.dt, m.measure
FROM (SELECT dt FROM measures ORDER BY 1 LIMIT 1) t -- no lower bound
JOIN measures m ON m.dt < t.dt + interval '1h' -- excl. upper bound
UNION ALL
SELECT t.dt, m.measure
FROM (
SELECT m.dt
FROM (SELECT dt FROM cte LIMIT 1) c
JOIN measures m ON m.dt >= c.dt + interval '1h'
ORDER BY 1
LIMIT 1
) t
JOIN measures m ON m.dt >= t.dt -- incl. lower bound
AND m.dt < t.dt + interval '1h' -- excl. upper bound
)
SELECT dt AS hour_start
, round(avg(measure), 2) AS avg_measure, count(*) AS ct
FROM cte
GROUP BY 1
ORDER BY 1;
Returns:
hour_start | avg_measure | ct
--------------------+-------------+----
2015-01-13 13:05:00 | 11.33 | 3
2015-01-13 14:30:00 | 9.50 | 2
db<>fiddle here (with added test on big table with index and selected time frame)
Old sqlfiddle
It performs decently with an index on dt
- or better a multicolumn index to allow index-only scans in Postgres 9.2+:
CREATE INDEX measures_foo_idx ON measures (dt, measure);
That's standard SQL including the recursive CTE except for LIMIT
. Postgres supports the standard keywords FETCH FIRST
as well, if you need it all standard SQL.
Window function?
Not possible with a single window function
While the result of a window function is an aggregate over the window frame, the frame definition itself cannot reference other rows. In your case, granularity is dynamically determined by considering all rows from first to last. This is not possible with a single window function.
But!
We can still get rolling hourly averages for every row using a window frame with the RANGE
clause bounded by an interval - requires Postgres 11 or later.
SELECT *, avg(measure) OVER (ORDER BY dt
RANGE BETWEEN CURRENT ROW AND '1 hour' FOLLOWING)
FROM measures;
That generates aggregates for every row cheaply. Then we need to filter every start of a new period dynamically. We can use the row count and skip ahead by the number of rows in each hourly period - a PL/pgSQL cursor lends itself to the task naturally:
CREATE OR REPLACE FUNCTION f_dynamic_hourly_avg()
RETURNS TABLE(hour_start timestamp, avg_measure numeric, ct int)
LANGUAGE plpgsql AS
$func$
DECLARE
_cursor CURSOR FOR
SELECT dt, round(avg(measure) OVER w, 2), count(*) OVER w
FROM measures
WINDOW w AS (ORDER BY dt RANGE BETWEEN CURRENT ROW AND '1 hour' FOLLOWING);
BEGIN
OPEN _cursor;
FETCH _cursor INTO hour_start, avg_measure, ct;
WHILE FOUND
LOOP
RETURN NEXT;
FETCH RELATIVE ct FROM _cursor INTO hour_start, avg_measure, ct;
END LOOP;
END
$func$;
Call:
SELECT * FROM f_dynamic_hourly_avg();
This turns out to be very efficient with only few rows per period. It will fall of with too many rows per period. Hard to pin down a number. Turns out to be 1000x faster in a quick benchmark with < 10 rows per period each.
db<>fiddle here
We could even work with a dynamic cursor and pass table and columns names to make it work for any table ...
Optimize performance
You basically need to iterate over all rows, which can be faster with a procedural solution: a FOR
loop in a plpgsql function. Which will be faster?
- The recursive query for few hours with many rows each.
- The function for many hours with few rows each.
- Update: the added function with a cursor over a query with window functions beats the rest by a long shot (while there are not too many rows per period?)
Related PL/pgSQL solutions:
- Group by repeating attribute
- GROUP BY and aggregate sequential numeric values
- SQL Query where I get most recent rows from timestamp from another table
更多推荐
所有评论(0)