Another reason to use generate_series for time series queries

2015-03-16

Whenever I have to write an aggregate SQL query over time-series data, I start with generate_series. Mostly that’s to avoid missing rows in the output. For instance, if I’m writing a query to show total sales in every hour, I don’t want to write this:

SELECT  extract(hour FROM order_placed_at) h,
        SUM(total_price)
FROM    sales
WHERE   order_placed_at BETWEEN ? AND ?
GROUP BY h
ORDER BY h

because then if there was an hour with no sales, that row will be simply missing from the result.

Instead I use generate_series to generate all the hours, and join my real table to that:

SELECT  h,
        COALESCE(SUM(total_price), 0) AS revenue_per_hour
FROM    generate_series(0, 23) s(h)
LEFT OUTER JOIN sales
ON      extract(hour FROM order_placed_at) = h
WHERE   order_placed_at BETWEEN ? AND ?
GROUP BY h
ORDER BY h

That way I get 24 rows every time.

But another interesting use case for generate_series appeared on the Postgres mailing list: Suppose you have a table of events with start_time and end_time, and you want to report how many events were “ongoing” for each hour of the day. That means a row of data can be counted in multiple output rows. Without generate_series this is hard to write, but if you use that as your “core” table then join to your data table, it just works:

SELECT  h,
        COUNT(events.id) events_per_hour,
        array_remove(array_agg(events.id), NULL) active_events
FROM    generate_series(0, 23) s(h)
LEFT OUTER JOIN events
ON      h BETWEEN extract(hour FROM events.start_time) AND extract(hour FROM events.end_time)
GROUP BY h
ORDER BY h

That gives you one output row per hour, and events can be counted multiple times, within each hour that includes them. I’ve thrown in an array_agg so you can see the IDs of each event, just to double-check my work.

But there are still problems. What if our query is supposed to include multiple days? How will it handle events that cross midnight? What about events that start on Monday and run until Wednesday? For these kind of things, we should avoid extract in our BETWEEN, and instead of generating a series of integers, we should generate a series of timestamps. This is also a great chance to use Postgres’s new(ish) range datatype, to avoid our own bug-prone interval comparisons:

SELECT  extract(hour FROM s.h) AS hour,
        COUNT(DISTINCT events.id) events_per_hour,
        array_remove(array_agg(events.id), NULL) active_events
FROM    generate_series(
          date_trunc('hour', '2015-03-01 00:00:00'::timestamp),
          date_trunc('hour', '2015-03-31 23:00:00'::timestamp),
          '1 hour') s(h)
LEFT OUTER JOIN events
ON      tsrange(events.start_time, events.end_time) && tsrange(s.h, s.h + interval '1 hour')
GROUP BY hour
ORDER BY hour

Here we are using the && (overlaps) operator to check whether each event overlaps with each window. If we didn’t want to use ranges, we could also use the SQL OVERLAPS operator, which accomplishes the same thing.

Now suppose we also want to report on the total time each event was “active” during each window. That is just a SUM() we can add to our query, using the * (intersection) range operator. But be careful about NULL values coming from our LEFT JOIN! In ranges, a null at either end represents unbounded, so tsrange(NULL, NULL) is not NULL, but all time from negative infinity to positive infinity! Here is a query that works around that:

SELECT  extract(hour FROM s.h) AS hour,
        COUNT(DISTINCT events.id) events_per_hour,
        array_remove(array_agg(events.id), NULL) active_events,
        SUM(CASE WHEN events.id IS NULL
            THEN NULL
            ELSE date_trunc('minute', range_to_interval(
                tsrange(s.h, s.h + interval '1 hour') *
                tsrange(events.start_time, events.end_time)
              ))
            END) active_hours
FROM    generate_series(
          date_trunc('hour', '2015-03-01 00:00:00'::timestamp),
          date_trunc('hour', '2015-03-31 23:00:00'::timestamp),
          '1 hour') s(h)
LEFT OUTER JOIN events
ON      tsrange(events.start_time, events.end_time) && tsrange(s.h, s.h + interval '1 hour')
GROUP BY hour
ORDER BY hour
;

Oh one more thing… . There is no such thing as range_to_interval. A surprising omission! But here is a function that defines it:

CREATE OR REPLACE FUNCTION range_to_interval(tsrange) 
RETURNS interval
AS
$$
SELECT upper($1) - lower($1)
$$
LANGUAGE sql
STABLE;

It might be even more proper to create a user-defined cast here, but I’ll leave that as an excerise for the reader. :-)

blog comments powered by Disqus Prev: Paperclip expiring_url and "Request has expired" Next: Lateral Join Alternatives