Another reason to use generate_series for time series queries

2015-03-16

Whenever I have to write an aggregate SQL query over time-series data, I start with generate_series. Mostly that’s to avoid missing rows in the output. For instance, if I’m writing a query to show total sales in every hour, I don’t want to write this:

SELECT  extract(hour FROM order_placed_at) h,
        SUM(total_price)
FROM    sales
WHERE   order_placed_at BETWEEN ? AND ?
GROUP BY h
ORDER BY h

because then, if there were an hour with no sales, that row would simply be missing from the result.
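For example, with some hypothetical sales data (made-up values, just to illustrate) and no sales during hour 2:

-- hypothetical data: three sales, none between 02:00 and 03:00
WITH sales (order_placed_at, total_price) AS (
  VALUES ('2015-03-16 00:15'::timestamp, 10.00),
         ('2015-03-16 01:30'::timestamp, 20.00),
         ('2015-03-16 03:45'::timestamp, 15.00)
)
SELECT  extract(hour FROM order_placed_at) h,
        SUM(total_price)
FROM    sales
GROUP BY h
ORDER BY h
;
--  h |  sum
-- ---+-------
--  0 | 10.00
--  1 | 20.00
--  3 | 15.00     <- hour 2 is simply missing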

Instead I use generate_series to generate all the hours, and join my real table to that:

SELECT  h,
        COALESCE(SUM(total_price), 0) AS revenue_per_hour
FROM    generate_series(0, 23) s(h)
LEFT OUTER JOIN sales
ON      extract(hour FROM order_placed_at) = h
AND     order_placed_at BETWEEN ? AND ?
GROUP BY h
ORDER BY h

That way I get 24 rows every time. (Note that the date filter lives in the ON clause rather than a WHERE clause: a WHERE condition on order_placed_at would throw away the NULL rows produced by the outer join, and the empty hours would go missing again.)

But another interesting use case for generate_series appeared on the Postgres mailing list: suppose you have a table of events with start_time and end_time, and you want to report how many events were “ongoing” during each hour of the day. That means a single row of data can be counted in multiple output rows. Without generate_series this is hard to write, but if you use generate_series as your “core” table and then join your data table to it, it just works:

SELECT  h,
        COUNT(events.id) events_per_hour,
        array_remove(array_agg(events.id), NULL) active_events
FROM    generate_series(0, 23) s(h)
LEFT OUTER JOIN events
ON      h BETWEEN extract(hour FROM events.start_time) AND extract(hour FROM events.end_time)
GROUP BY h
ORDER BY h

That gives you one output row per hour, and an event is counted once for every hour it spans. I’ve thrown in an array_agg so you can see the IDs of the events in each hour, just to double-check my work.

But there are still problems. What if our query is supposed to include multiple days? How will it handle events that cross midnight? What about events that start on Monday and run until Wednesday? For these kinds of things, we should avoid extract in our BETWEEN, and instead of generating a series of integers, we should generate a series of timestamps. This is also a great chance to use Postgres’s new(ish) range datatype, to avoid writing our own bug-prone interval comparisons:

SELECT  extract(hour FROM s.h) AS hour,
        COUNT(DISTINCT events.id) events_per_hour,
        array_remove(array_agg(events.id), NULL) active_events
FROM    generate_series(
          date_trunc('hour', '2015-03-01 00:00:00'::timestamp),
          date_trunc('hour', '2015-03-31 23:00:00'::timestamp),
          '1 hour') s(h)
LEFT OUTER JOIN events
ON      tsrange(events.start_time, events.end_time) && tsrange(s.h, s.h + interval '1 hour')
GROUP BY hour
ORDER BY hour

Here we are using the && (overlaps) operator to check whether each event overlaps with each window. If we didn’t want to use ranges, we could also use the SQL OVERLAPS operator, which accomplishes the same thing.
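If you went that route, the join condition might look like this instead (my sketch, not from the original query), along with a quick standalone check of how OVERLAPS behaves:

-- alternative join condition using OVERLAPS instead of ranges:
--   ON (events.start_time, events.end_time) OVERLAPS (s.h, s.h + interval '1 hour')
-- and a standalone check that OVERLAPS does what we expect:
SELECT ('2015-03-01 10:30'::timestamp, '2015-03-01 11:30'::timestamp)
       OVERLAPS
       ('2015-03-01 11:00'::timestamp, '2015-03-01 12:00'::timestamp);  -- true

Like the default '[)' bounds on tsrange, OVERLAPS treats each pair as a half-open period, so an event that merely touches a window boundary is not counted as overlapping it.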

Now suppose we also want to report the total time each event was “active” during each window. That is just a SUM() we can add to our query, using the * (intersection) range operator. But be careful about NULL values coming from our LEFT JOIN! In a range, a NULL bound means unbounded, so tsrange(NULL, NULL) is not NULL, but all time from negative infinity to positive infinity!
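You can check that surprise directly (a quick demo of my own, not from the original post):

SELECT tsrange(NULL, NULL) IS NULL;                           -- false: it is a real range
SELECT tsrange(NULL, NULL) @> '2015-03-16 12:00'::timestamp;  -- true: it contains every timestamp

Here is a query that works around that: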

SELECT  extract(hour FROM s.h) AS hour,
        COUNT(DISTINCT events.id) events_per_hour,
        array_remove(array_agg(events.id), NULL) active_events,
        SUM(CASE WHEN events.id IS NULL
            THEN NULL
            ELSE date_trunc('minute', range_to_interval(
                tsrange(s.h, s.h + interval '1 hour') *
                tsrange(events.start_time, events.end_time)
              ))
            END) active_hours
FROM    generate_series(
          date_trunc('hour', '2015-03-01 00:00:00'::timestamp),
          date_trunc('hour', '2015-03-31 23:00:00'::timestamp),
          '1 hour') s(h)
LEFT OUTER JOIN events
ON      tsrange(events.start_time, events.end_time) && tsrange(s.h, s.h + interval '1 hour')
GROUP BY hour
ORDER BY hour
;

Oh, one more thing… there is no such function as range_to_interval built into Postgres. A surprising omission! But here is a function that defines it:

CREATE OR REPLACE FUNCTION range_to_interval(tsrange) 
RETURNS interval
AS
$$
SELECT upper($1) - lower($1)
$$
LANGUAGE sql
STABLE;
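A quick sanity check (my own, not part of the original post):

SELECT range_to_interval(tsrange('2015-03-01 10:00', '2015-03-01 11:30'));  -- 01:30:00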

It might be even more proper to create a user-defined cast here, but I’ll leave that as an exercise for the reader. :-)
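If you do want to try the cast, a minimal sketch might look like this (untested on my part; note that creating a cast between built-in types generally requires superuser privileges):

CREATE CAST (tsrange AS interval)
WITH FUNCTION range_to_interval(tsrange);

-- then the explicit cast works:
SELECT tsrange('2015-03-01 10:00', '2015-03-01 11:30')::interval;  -- 01:30:00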

Lateral Join Alternatives

2015-02-23

In the comments to my post about Postgres lateral joins, someone asked about getting the same results with better performance using a GROUP BY query. Actually you can’t get the same results with a straight GROUP BY clause, but there are three more ways to write the query that are worth considering. So I’d like to talk first about why (top-level) GROUP BY doesn’t work, and then look at the other ways to do it (including a subquery GROUP BY).

Recall that the problem was: for each restaurant, show its info, plus the score of its most recent inspection (if any), plus the number of violations from that inspection. When you hear “most recent” it’s natural to think about GROUP BY. Maybe you want to write something like this:

SELECT  r.id, r.name, MAX(i.inspected_at)
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
GROUP BY r.id
;

The problem is that whenever you want to pick a row using one column but then select additional columns from that same row, GROUP BY lets you down. In the above query, you can’t add i.score to the SELECT list, because that’s an error. You have to do something fancier.
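Concretely, adding i.score to that SELECT list fails with an error along these lines:

ERROR:  column "i.score" must appear in the GROUP BY clause or be used in an aggregate function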

One way to get there is a lateral join, as we’ve seen already. Three other ways are:

  1. Use GROUP BY in a subquery to get the latest inspection date, then use that date to get the rest of the inspection info.

  2. Use a window function.

  3. Use DISTINCT ON.

A couple things to keep in mind:

  • We want to show all the restaurants, even if they have no inspections yet. That’s why I didn’t write the above query FROM inspections GROUP BY restaurant_id, and that’s why all our queries below are FROM restaurants with an outer join to something else.

  • We also want to include a count of the violations from whatever inspection we choose. I’ve left that out of all my queries because it mostly just adds noise, but remember that whatever solution we consider must be compatible with that requirement. It must be possible to add one more outer join to violations and get the count. One approach this rules out is putting a whole subquery into the SELECT list (sketched just below), because that won’t be available for joining to violations.
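For example, the ruled-out approach would look roughly like this (my sketch, not a query from the original post):

SELECT  r.id, r.name,
        (SELECT i.score
         FROM   inspections i
         WHERE  i.restaurant_id = r.id
         ORDER BY i.inspected_at DESC
         LIMIT 1) AS latest_score
FROM    restaurants r
;

That gets you one column from the most recent inspection, but there is no inspection row in the FROM list to outer join violations against, so counting the violations would mean burying yet another subquery in the SELECT list.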

Using GROUP BY in a subquery

For the first alternative, we might write the query like this:

SELECT  r.id, r.name, i.score, i.inspected_at
FROM    restaurants r
LEFT OUTER JOIN (
        SELECT  restaurant_id, MAX(inspected_at) AS max_inspection_date
        FROM    inspections
        GROUP BY restaurant_id) i2
ON      i2.restaurant_id = r.id
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
AND     i.inspected_at = i2.max_inspection_date
;

Here we’re using GROUP BY to get the most recent inspection of each restaurant, and then we add a separate join based on that inspection date to get the rest of the inspection info.

I think this is borderline incorrect. There’s no guarantee that inspected_at is unique. If we have millisecond precision it might be, but what if it’s just a date? Probably no restaurant gets inspected twice in the same day, but it’d be nice if we didn’t have to rely on that assumption. Also pedagogically it’s a bad approach because who knows if your application will have the same guarantees? I’d rather find a pattern that is valid always, not just for this restaurants example.

Also it’s hard to believe this approach is very fast. Indeed, here is the query plan:

Hash Right Join  (cost=1162.54..2645.04 rows=1000 width=20)
  Hash Cond: ((i.restaurant_id = r.id) AND (i.inspected_at = (max(inspections.inspected_at))))
  ->  Seq Scan on inspections i  (cost=0.00..1256.00 rows=30000 width=16)
  ->  Hash  (cost=1147.54..1147.54 rows=1000 width=16)
        ->  Hash Right Join  (cost=45.79..1147.54 rows=1000 width=16)
              Hash Cond: (inspections.restaurant_id = r.id)
              ->  GroupAggregate  (cost=0.29..1078.29 rows=1000 width=12)
                    ->  Index Only Scan using idx_inspections_rest_and_date on inspections  (cost=0.29..918.29 rows=30000 width=12)
              ->  Hash  (cost=33.00..33.00 rows=1000 width=8)
                    ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)

On my machine I get query times around 20ms, so it’s more than twice as slow as the lateral join. We already have an index on inspections for restaurant_id + inspected_at, so I don’t think there’s an easy way to speed things up here.
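For reference, that index (from the original lateral join post) is:

CREATE INDEX idx_inspections_rest_and_date ON inspections (restaurant_id ASC, inspected_at DESC);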

Using a window function

One great feature in Postgres is window functions. This lets you compute aggregates (and more!) without turning your whole query into a GROUP BY. I’ve written about window functions a bit here, but it’s a big topic, so if they’re new to you, the Postgres docs are the place to start. With window functions, we can get what we want like this:

SELECT  r.id, r.name, i.score, i.inspected_at
FROM    restaurants r
LEFT OUTER JOIN (
        SELECT  *,
                row_number() OVER (PARTITION BY restaurant_id ORDER BY inspected_at DESC) AS row_number
        FROM    inspections) i
ON      i.restaurant_id = r.id
AND     i.row_number = 1
;

This approach is more correct than the previous one. There are no risky assumptions about unique inspection dates. But it performs a little worse. Here is the query plan:

Merge Right Join  (cost=0.56..4096.38 rows=1000 width=20)
  Merge Cond: (i.restaurant_id = r.id)
  ->  Subquery Scan on i  (cost=0.29..4030.72 rows=150 width=16)
        Filter: (i.row_number = 1)
        ->  WindowAgg  (cost=0.29..3655.72 rows=30000 width=20)
              ->  Index Scan using idx_inspections_rest_and_date on inspections  (cost=0.29..3130.72 rows=30000 width=20)
  ->  Index Scan using restaurants_pkey on restaurants r  (cost=0.28..61.27 rows=1000 width=8)

Results for me take about 25ms. I’ve gotta say I love window functions, but it looks like in this case at least lateral join is a better choice.

Using DISTINCT ON

There is a special feature in Postgres called DISTINCT ON, which is another way to get grouping-like behavior without a GROUP BY clause. If you’ve heard of DISTINCT but not DISTINCT ON, you should check it out. Basically you name one or more columns whose combination should be unique in the result, and then you give an ORDER BY that includes those columns plus (usually) something else, and for each distinct combo, Postgres will include only the first row. That lets us write our query like so:

SELECT  r.id, r.name, i.score, i.inspected_at
FROM    restaurants r
LEFT OUTER JOIN (
        SELECT  DISTINCT ON (restaurant_id)
                *
        FROM    inspections i2
        ORDER BY restaurant_id, inspected_at DESC) i
ON      i.restaurant_id = r.id
;

Here is the query plan I get with that query:

Merge Right Join  (cost=0.56..3292.00 rows=1000 width=20)
  Merge Cond: (i2.restaurant_id = r.id)
  ->  Unique  (cost=0.29..3205.72 rows=1000 width=20)
        ->  Index Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..3130.72 rows=30000 width=20)
  ->  Index Scan using restaurants_pkey on restaurants r  (cost=0.28..61.27 rows=1000 width=8)

And results take about 13ms. That is pretty nice! Not quite as fast as the lateral join, but maybe close enough. It is arguably easier to read too. DISTINCT ON is really a nice feature that is insufficiently known. As far as I know it is a Postgres extension, not available in MS SQL Server, Oracle, or MySQL—which may or may not matter to you. Personally I’m fine committing to one database vendor and leveraging it to the hilt, especially when that vendor is free, open source, and as mature as Postgres.

So anyway, I think this is an interesting lineup of alternatives to a lateral join, at least for my specific example. It’s fascinating to me that the lateral join wound up the fastest of all, since that wasn’t why I originally chose it. But if you’re worried about performance, I wouldn’t depend too much on this blog post. Test the alternatives against your own real situation, because you might get different results. To me the worthwhile lesson is that all of these are options (except maybe a plain top-level GROUP BY).

What do you think? Could I have written any of these queries in a more performant way? Is there yet another approach that would work?

Btw, if you want a real hard-core example of a lateral join that is hard to write any other way, here is an example used in funnel analysis.

Too Many Outer Joins

2015-02-16

My last post about lateral joins reminded me of a rule I have when using LEFT OUTER JOIN. Recall that in an outer join, every row in the original table will yield one or more rows in the result, depending on how many rows match in the new table. So if you say this:

SELECT  *
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
;

then you’ll get one or more rows for each restaurant.

You could say this to show the average inspection score for each restaurant:

SELECT  r.name, AVG(i.score) avg_score
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
GROUP BY r.name
;

Now suppose you also had an employees table like this:

restaurants -< employees
-----------    ---------
id             id
name           restaurant_id
               name

You can count the employees in each restaurant like so:

SELECT  r.name, COUNT(e.id) employees
FROM    restaurants r
LEFT OUTER JOIN employees e
ON      e.restaurant_id = r.id
GROUP BY r.name
;

Easy! So can you combine those two queries? How about like this:

SELECT  r.name, AVG(i.score) avg_score, COUNT(e.id) AS employees
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
LEFT OUTER JOIN employees e
ON      e.restaurant_id = r.id
GROUP BY r.name
;

Unfortunately that is totally wrong! Whenever you have one central table and you want to outer join it to two separate tables, you are in danger. To see why, think through what happens step by step as these tables get joined. All the joins happen before any grouping, so first we make a bunch of rows combining restaurants and inspections. Then, for each of those rows, we join to the employees table. That means if we have a restaurant with two inspections, we’ll create rows for all its employees for both inspections. When we handle the grouping, our aggregates are computed over those duplicated rows: the COUNT is inflated, and a SUM would be too. (The AVG of the inspection scores only survives here because each inspection happens to be duplicated the same number of times within its group.) We can rescue the count with COUNT(DISTINCT e.id), but there is no such easy trick for SUM or AVG.
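Here is a tiny self-contained demonstration of the fan-out, using hypothetical inline data (not from the original post): one restaurant, two inspections, three employees.

SELECT  r.name, COUNT(i.id) AS inspections, COUNT(e.id) AS employees
FROM    (VALUES (1, 'Foo')) r(id, name)
LEFT OUTER JOIN (VALUES (1, 1), (2, 1)) i(id, restaurant_id)
ON      i.restaurant_id = r.id
LEFT OUTER JOIN (VALUES (10, 1), (11, 1), (12, 1)) e(id, restaurant_id)
ON      e.restaurant_id = r.id
GROUP BY r.name
;
-- name | inspections | employees
-- Foo  |           6 |         6
-- (the counts should be 2 and 3, but the 2 × 3 joined rows inflate both)

COUNT(DISTINCT i.id) and COUNT(DISTINCT e.id) would give 2 and 3 here, which is the DISTINCT fix mentioned above, but that only helps for counts.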

My rule is that a query can only have one outer join that produces more than one match. If two outer joins both produce multiple matches, then you’ll have trouble.

The only way I know to generate reports like this is to run a subquery for each outer-joined table, so that you can ensure one row per restaurant in the subquery results:

SELECT  r.name, x.avg_score, y.employees
FROM    restaurants r
LEFT OUTER JOIN (SELECT restaurant_id, AVG(score) avg_score
                 FROM   inspections
                 GROUP BY restaurant_id) x
ON      r.id = x.restaurant_id
LEFT OUTER JOIN (SELECT restaurant_id, COUNT(id) employees
                 FROM   employees
                 GROUP BY restaurant_id) y
ON      r.id = y.restaurant_id
;

That approach ensures that your aggregate functions will compute correct results, unpolluted by duplicate rows caused by outer joins to the sibling table(s).

Postgres LATERAL JOIN

2015-02-15

A question came up on the pdxruby mailing list that is a great example for Postgres’s new LATERAL join feature. Lateral joins can be incredibly useful when you need them, but it’s hard to grok their “shape” without a concrete example.

The original question is about an app that shows restaurant inspection results. It has three tables:

restaurants -< inspections -< violations
-----------    -----------    ----------
id             id             id
name           restaurant_id  inspection_id
               inspected_at   reason
               score

So each restaurant has a history of inspections, and each inspection can have zero or more violations.

First, you want to show a list of restaurants with the date and score of their most recent inspection. This is the kind of thing that is easy to express in English, and is a completely reasonable request for a report, but is hard to implement.

With no lateral joins, my thought process goes like this: Since we want one row per restaurant, I’m going to say FROM restaurants. Then I’ll join to inspections, but include only the inspection with no later inspection for the same restaurant (i.e. the most recent one). I can achieve that with a correlated sub-query:

SELECT  r.name, i.inspected_at, i.score
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
AND     NOT EXISTS (SELECT  1
                    FROM    inspections i2
                    WHERE   i2.restaurant_id = i.restaurant_id
                    AND     i2.inspected_at > i.inspected_at)
;

But that is kind of tricky, and lateral joins make it so much easier. In a lateral join, we can join to a subquery that is allowed to “reach out” and use rows from other tables in the query:

SELECT  r.name, i.inspected_at, i.score
FROM    restaurants r
LEFT OUTER JOIN LATERAL (SELECT *
                         FROM   inspections i2
                         WHERE  i2.restaurant_id = r.id
                         ORDER BY i2.inspected_at DESC
                         LIMIT 1) i
ON true
;

This is a lot like a correlated sub-query, except the sub-query is in a FROM or JOIN rather than a condition. I think it expresses what we want much more directly: for each restaurant, order the inspections by most recent first, and take the first one.

The second part of this report is to show a count of the violations from whatever inspection we’re using. That’s pretty easy to add:

SELECT  r.name, i.inspected_at, i.score, COUNT(v.id) violations
FROM    restaurants r
LEFT OUTER JOIN LATERAL (SELECT *
                         FROM   inspections i2
                         WHERE  i2.restaurant_id = r.id
                         ORDER BY i2.inspected_at DESC
                         LIMIT 1) i
ON true
LEFT OUTER JOIN violations v
ON  v.inspection_id = i.id
GROUP BY r.name, i.inspected_at, i.score
;

UPDATE:

So radical_3do asked in the comments about the performance of LATERAL vs the correlated sub-query. The answer will always depend on your specific situation, but I decided to test my example for “reasonable” numbers. To keep it simple I’ll leave out the violations table. First I filled up my tables with 1000 restaurants and 30 inspections each:

DELETE FROM inspections;
DELETE FROM restaurants;

INSERT INTO restaurants
(id, name)
SELECT  s, 'Foo'
FROM    generate_series(1, 1000) s
;                   
INSERT INTO inspections     
(id, restaurant_id, score, inspected_at)
SELECT  30 * r + m + 1,
        r + 1,
        (30 * r + m + 1) % 10 + 1,
        '2015-01-01'::date - concat(m, ' MONTHS')::interval
FROM    generate_series(0, 999) r
CROSS JOIN generate_series(0, 29) m
;

CREATE INDEX idx_inspections_rest_and_date ON inspections (restaurant_id ASC, inspected_at DESC);
ANALYZE restaurants;
ANALYZE inspections;

Now when I EXPLAIN my queries, I get this plan for the correlated sub-query:

Sort  (cost=5246.56..5296.56 rows=20000 width=20)
  Sort Key: i.score
  ->  Hash Right Join  (cost=1485.79..3817.79 rows=20000 width=20)
        Hash Cond: (i.restaurant_id = r.id)
        ->  Hash Anti Join  (cost=1440.29..3497.29 rows=20000 width=16)
              Hash Cond: (i.restaurant_id = i2.restaurant_id)
              Join Filter: (i.inspected_at < i2.inspected_at)
              ->  Seq Scan on inspections i  (cost=0.00..1256.00 rows=30000 width=16)
              ->  Hash  (cost=918.29..918.29 rows=30000 width=12)
                    ->  Index Only Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..918.29 rows=30000 width=12)
        ->  Hash  (cost=33.00..33.00 rows=1000 width=8)
              ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)

I get this plan for the LATERAL join:

Sort  (cost=2366.16..2368.66 rows=1000 width=20)
  Sort Key: i2.score
  ->  Nested Loop Left Join  (cost=0.29..2316.33 rows=1000 width=20)
        ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)
        ->  Limit  (cost=0.29..2.26 rows=1 width=20)
              ->  Index Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..59.56 rows=30 width=20)
                    Index Cond: (restaurant_id = r.id)

So it looks like the LATERAL version will be faster. (Note that these plans both include an ORDER BY i.score DESC that wasn’t in the original queries above. It doesn’t look like that makes much difference though.) If I run both queries with \timing on, I consistently get about 80ms for the first one and 8ms for the second one. That’s a 10x difference! But of course this could all change depending on your actual use case.

UPDATE 2: There are a few more ways to write this query without using lateral joins. Here is a performance comparison of lateral join alternatives.

Postgres \copy from stdin vs pstdin

2015-02-14

Sometimes in Postgres you want to initialize a table by reading a bunch of data from stdin, like this:

psql "$database" -f init.sql < /usr/share/dict/american-english

using an init.sql like this:

DROP TABLE IF EXISTS dictionary;
CREATE TABLE dictionary (id SERIAL PRIMARY KEY, word VARCHAR NOT NULL);
\copy dictionary (word) from stdin
CREATE UNIQUE INDEX idx_dictionary_word ON dictionary (word);

The problem is that with the -f flag, a \copy … from stdin doesn’t read from psql’s actual standard input; it reads its data from the script file itself, starting on the line right after the \copy command. So running the above would add just one “word” to your database, spelled “CREATE UNIQUE INDEX …”.

The solution to this is to use \copy dictionary (word) from pstdin instead. That tells psql to copy from its real stdin rather than from the -f file.
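So the working init.sql is the same file with that one word changed:

DROP TABLE IF EXISTS dictionary;
CREATE TABLE dictionary (id SERIAL PRIMARY KEY, word VARCHAR NOT NULL);
\copy dictionary (word) from pstdin
CREATE UNIQUE INDEX idx_dictionary_word ON dictionary (word);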

As far as I know there is no way to do this using COPY FROM rather than \copy from if you want COPY FROM to appear inside a larger .sql file. So this works:

psql "$database" -c "COPY dictionary (word) FROM STDIN" < /usr/share/dict/american-english

but this does not:

psql "$database" -f init.sql < /usr/share/dict/american-english

and neither does this:

psql "$database" -c '\i init.sql' < /usr/share/dict/american-english

But I’d love to be corrected if someone knows a way!

Rails acts_as_list with Soft Delete

2014-12-23

Two really useful gems for a Rails project are acts_as_list and permanent_records (or the somewhat more common acts_as_paranoid). The first one, acts_as_list, uses a position column on your model to provide natural ordering. Usually you don’t want global ordering, but only within an owning model, e.g. the Page objects inside a Book have position 1, 2, 3, and the Page objects inside a different Book also go 1, 2, 3. You can use the scope option to achieve that:

class Page < ActiveRecord::Base
  belongs_to :book
  acts_as_list scope: :book
end

The problem is that if you are soft-deleting pages using a deleted_at column, acts_as_list doesn’t know to exclude deleted pages from the list. This gets annoying if you want to move pages up and down: Imagine you have page 1, then 5 deleted pages, and now you add a new page p to the end. You want to move your new page to the beginning. You have a button that calls p.move_higher, but the page won’t appear to move until the user presses the button five times. That is pretty confusing!

Fortunately acts_as_list lets you pass a string to scope that it will evaluate when necessary, and you can even embed Ruby code in it. Look at this, and notice the single quotes (they keep the #{book_id} from being interpolated when the class is defined, so acts_as_list can evaluate it later for each record):

class Page < ActiveRecord::Base
  belongs_to :book
  acts_as_list scope: 'book_id = #{book_id} AND deleted_at IS NULL'
end

With that code, acts_as_list will only consider non-deleted pages, so moving a page works as it should.
