Lateral Join Alternatives

2015-02-23

In the comments to my post about Postgres lateral joins, someone asked about getting the same results with better performance using a GROUP BY query. Actually you can’t get the same results with a straight GROUP BY clause, but there are three more ways to write the query that are worth considering. So I’d like to talk first about why (top-level) GROUP BY doesn’t work, and then look at the other ways to do it (including a subquery GROUP BY).

Recall that the problem was: for each restaurant, show its info, plus the score of its most recent inspection (if any), plus the number of violations from that inspection. When you hear “most recent” it’s natural to think about GROUP BY. Maybe you want to write something like this:

SELECT  r.id, r.name, MAX(i.inspected_at)
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
GROUP BY r.id
;

The problem is that whenever you want to pick a row using one column but then select additional columns from that same row, GROUP BY lets you down. In the above query, you can’t add i.score to the SELECT list: it is neither grouped nor aggregated, so Postgres rejects it. You have to do something fancier.
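
For instance, this fails (the exact wording varies a bit by version, but the complaint is the same):

SELECT  r.id, r.name, MAX(i.inspected_at), i.score
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
GROUP BY r.id
;
-- ERROR:  column "i.score" must appear in the GROUP BY clause or be used in an aggregate function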

One way to get there is a lateral join, as we’ve seen already. Three other ways are:

  1. Use GROUP BY in a subquery to get the latest inspection date, then use that date to get the rest of the inspection info.

  2. Use a window function.

  3. Use DISTINCT ON.

A couple things to keep in mind:

  • We want to show all the restaurants, even if they have no inspections yet. That’s why I didn’t write the above query FROM inspections GROUP BY restaurant_id, and that’s why all our queries below are FROM restaurants with an outer join to something else.

  • We also want to include a count of the violations from whatever inspection we choose. I’ve left that out of all my queries because it mostly just adds noise, but remember that whatever solution we consider must be compatible with that requirement. It must be possible to add one more outer join to violations and get the count. One approach this rules out is putting a whole subquery into the SELECT list, because that won’t be available for joining to violations.

Using GROUP BY in a subquery

For the first alternative, we might write the query like this:

SELECT  r.id, r.name, i.score, i.inspected_at
FROM    restaurants r
LEFT OUTER JOIN (
        SELECT  restaurant_id, MAX(inspected_at) AS max_inspection_date
        FROM    inspections
        GROUP BY restaurant_id) i2
ON      i2.restaurant_id = r.id
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
AND     i.inspected_at = i2.max_inspection_date
;

Here we’re using GROUP BY to get the most recent inspection of each restaurant, and then we add a separate join based on that inspection date to get the rest of the inspection info.

I think this is borderline incorrect. There’s no guarantee that inspected_at is unique. If we have millisecond precision it might be, but what if it’s just a date? Probably no restaurant gets inspected twice in the same day, but it’d be nice if we didn’t have to rely on that assumption. It’s also a bad pattern to teach, because your own application may not offer the same guarantees. I’d rather find a pattern that is always valid, not just valid for this restaurants example.
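
If you want to see the hazard concretely, here’s a throwaway demo with inline fake data, no tables needed. Two inspections share the same date, so the date-based join hands back the restaurant twice:

WITH fake_inspections (restaurant_id, inspected_at, score) AS (
  VALUES (1, DATE '2015-01-01', 90),
         (1, DATE '2015-01-01', 95)
)
SELECT  i.restaurant_id, i.inspected_at, i.score
FROM    fake_inspections i
JOIN    (SELECT restaurant_id, MAX(inspected_at) AS max_inspection_date
         FROM   fake_inspections
         GROUP BY restaurant_id) i2
ON      i2.restaurant_id = i.restaurant_id
AND     i.inspected_at = i2.max_inspection_date
;
-- both rows come back for restaurant 1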

And it’s hard to believe this approach is very fast. Indeed, here is the query plan:

Hash Right Join  (cost=1162.54..2645.04 rows=1000 width=20)
  Hash Cond: ((i.restaurant_id = r.id) AND (i.inspected_at = (max(inspections.inspected_at))))
  ->  Seq Scan on inspections i  (cost=0.00..1256.00 rows=30000 width=16)
  ->  Hash  (cost=1147.54..1147.54 rows=1000 width=16)
        ->  Hash Right Join  (cost=45.79..1147.54 rows=1000 width=16)
              Hash Cond: (inspections.restaurant_id = r.id)
              ->  GroupAggregate  (cost=0.29..1078.29 rows=1000 width=12)
                    ->  Index Only Scan using idx_inspections_rest_and_date on inspections  (cost=0.29..918.29 rows=30000 width=12)
              ->  Hash  (cost=33.00..33.00 rows=1000 width=8)
                    ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)

On my machine I get query times around 20ms, so it’s more than twice as slow as the lateral join. We already have an index on inspections for restaurant_id + inspected_at, so I don’t think there’s an easy way to speed things up here.
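
For reference, here is that index, as created in the test setup of the original lateral join post:

CREATE INDEX idx_inspections_rest_and_date ON inspections (restaurant_id ASC, inspected_at DESC);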

Using a window function

One great feature in Postgres is window functions, which let you compute aggregates (and more!) without turning your whole query into a GROUP BY. I’ve written about window functions a bit here, but it’s a big topic, so if they’re new to you, the Postgres docs are the place to start. With window functions, we can get what we want like this:

SELECT  r.id, r.name, i.score, i.inspected_at
FROM    restaurants r
LEFT OUTER JOIN (
        SELECT  *,
                row_number() OVER (PARTITION BY restaurant_id ORDER BY inspected_at DESC) AS row_number
        FROM    inspections) i
ON      i.restaurant_id = r.id
AND     i.row_number = 1
;

This approach is more correct than the previous one. There are no risky assumptions about unique inspection dates. But it performs a little worse. Here is the query plan:

Merge Right Join  (cost=0.56..4096.38 rows=1000 width=20)
  Merge Cond: (i.restaurant_id = r.id)
  ->  Subquery Scan on i  (cost=0.29..4030.72 rows=150 width=16)
        Filter: (i.row_number = 1)
        ->  WindowAgg  (cost=0.29..3655.72 rows=30000 width=20)
              ->  Index Scan using idx_inspections_rest_and_date on inspections  (cost=0.29..3130.72 rows=30000 width=20)
  ->  Index Scan using restaurants_pkey on restaurants r  (cost=0.28..61.27 rows=1000 width=8)

Results for me take about 25ms. I’ve gotta say I love window functions, but it looks like, in this case at least, the lateral join is a better choice.

Using DISTINCT ON

There is a special feature in Postgres called DISTINCT ON, which is another way to get grouping-like behavior without a GROUP BY clause. If you’ve heard of DISTINCT but not DISTINCT ON, you should check it out. Basically you name one or more columns whose combination should be unique in the result, and you give an ORDER BY that starts with those columns and then (usually) something else; for each distinct combo, Postgres includes only the first row. That lets us write our query like so:

SELECT  r.id, r.name, i.score, i.inspected_at
FROM    restaurants r
LEFT OUTER JOIN (
        SELECT  DISTINCT ON (restaurant_id)
                *
        FROM    inspections i2
        ORDER BY restaurant_id, inspected_at DESC) i
ON      i.restaurant_id = r.id
;

Here is the query plan I get with that query:

Merge Right Join  (cost=0.56..3292.00 rows=1000 width=20)
  Merge Cond: (i2.restaurant_id = r.id)
  ->  Unique  (cost=0.29..3205.72 rows=1000 width=20)
        ->  Index Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..3130.72 rows=30000 width=20)
  ->  Index Scan using restaurants_pkey on restaurants r  (cost=0.28..61.27 rows=1000 width=8)

And results take about 13ms. That is pretty nice! Not quite as fast as the lateral join, but maybe close enough. It is arguably easier to read too. DISTINCT ON is really a nice feature that is insufficiently known. Note that it is a Postgres extension: you won’t find it in MS SQL Server, Oracle, or MySQL, which may or may not matter to you. Personally I’m fine committing to one database vendor and leveraging it to the hilt, especially when that vendor is free, open source, and as mature as Postgres.
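
Since any solution has to leave room for the violations count, here’s a quick sanity check: a sketch of the DISTINCT ON version extended with that extra outer join, mirroring the full query from the original lateral join post:

SELECT  r.id, r.name, i.score, i.inspected_at, COUNT(v.id) AS violations
FROM    restaurants r
LEFT OUTER JOIN (
        SELECT  DISTINCT ON (restaurant_id)
                *
        FROM    inspections i2
        ORDER BY restaurant_id, inspected_at DESC) i
ON      i.restaurant_id = r.id
LEFT OUTER JOIN violations v
ON      v.inspection_id = i.id
GROUP BY r.id, r.name, i.score, i.inspected_at
;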

So anyway, I think this is an interesting lineup of alternatives to a lateral join, at least for my specific example. It’s fascinating to me that the lateral join wound up the fastest of all, since performance wasn’t why I originally chose it. But if you’re worried about performance, I wouldn’t depend too much on this blog post: test the alternatives against your own real situation, because you might get different results. To me the worthwhile lesson is that all of these are options (except maybe the top-level GROUP BY).

What do you think? Could I have written any of these queries in a more performant way? Is there yet another approach that would work?

Btw, if you want a real hard-core example of a lateral join that is hard to write any other way, here is an example used in funnel analysis.

Too Many Outer Joins

2015-02-16

My last post about lateral joins reminded me of a rule I have when using LEFT OUTER JOIN. Recall that in an outer join, every row in the original table will yield one or more rows in the result, depending on how many rows match in the new table. So if you say this:

SELECT  *
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
;

then you’ll get one or more rows for each restaurant.

You could say this to show the average inspection score for each restaurant:

SELECT  r.name, AVG(i.score) avg_score
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
GROUP BY r.name
;

Now suppose you also had an employees table like this:

restaurants -< employees
-----------    ---------
id             id
name           restaurant_id
               name

You can count the employees in each restaurant like so:

SELECT  r.name, COUNT(e.id) employees
FROM    restaurants r
LEFT OUTER JOIN employees e
ON      e.restaurant_id = r.id
GROUP BY r.name
;

Easy! So can you combine those two queries? How about like this:

SELECT  r.name, AVG(i.score) avg_score, COUNT(e.id) AS employees
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
LEFT OUTER JOIN employees e
ON      e.restaurant_id = r.id
GROUP BY r.name
;

Unfortunately that is totally wrong! Whenever you have one central table, and you want to outer join it to two separate tables, you are in danger. To see why, think through what happens step-by-step as these tables get joined. All the joins happen before any grouping, so first we make a bunch of rows combining restaurants and inspections. Now for each of those rows, we join to the employees table. That means if we have a restaurant with two inspections, we’ll create rows for all its employees for both inspections. So when we handle the grouping, our aggregate functions run over an inflated cross-product. The COUNT we can patch with COUNT(DISTINCT e.id). The AVG happens to survive here, since each restaurant’s inspections are duplicated a uniform number of times, but that’s luck, not safety: a SUM would be multiplied by the employee count, and DISTINCT inside the aggregate would wrongly collapse values that are legitimately equal.
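
For the record, here’s what the COUNT patch looks like. It repairs the count but leaves the row explosion in place underneath, so it’s damage control rather than a fix:

SELECT  r.name, AVG(i.score) avg_score, COUNT(DISTINCT e.id) AS employees
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
LEFT OUTER JOIN employees e
ON      e.restaurant_id = r.id
GROUP BY r.name
-- employees is right again, but a SUM(i.score) here would still be inflated
;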

My rule is that a query can only have one outer join that produces more than one match. If two outer joins both produce multiple matches, then you’ll have trouble.

The only way I know to generate reports like this is to run a subquery for each outer-joined table, so that you can ensure one row per restaurant in the subquery results:

SELECT  r.name, x.avg_score, y.employees
FROM    restaurants r
LEFT OUTER JOIN (SELECT restaurant_id, AVG(score) avg_score
                 FROM   inspections
                 GROUP BY restaurant_id) x
ON      r.id = x.restaurant_id
LEFT OUTER JOIN (SELECT restaurant_id, COUNT(id) employees
                 FROM   employees
                 GROUP BY restaurant_id) y
ON      r.id = y.restaurant_id
;

That approach ensures that your aggregate functions will compute correct results, unpolluted by duplicate rows caused by outer joins to the sibling table(s).

Postgres LATERAL JOIN

2015-02-15

A question came up on the pdxruby mailing list that is a great example for Postgres’s new LATERAL join feature. Lateral joins can be incredibly useful when you need them, but it’s hard to grok their “shape” without a concrete example.

The original question is about an app that shows restaurant inspection results. It has three tables:

restaurants -< inspections -< violations
-----------    -----------    ----------
id             id             id
name           restaurant_id  inspection_id
               inspected_at   reason
               score

So each restaurant has a history of inspections, and each inspection can have zero or more violations.

First, you want to show a list of restaurants with the date and score of their most recent inspection. This is the kind of thing that is easy to express in English, and is a completely reasonable request for a report, but is hard to implement.

With no lateral joins, my thought process goes like this: Since we want one row per restaurant, I’m going to say FROM restaurants. Then I’ll join to inspections, but only keep an inspection if no later inspection exists for the same restaurant (i.e. it is the most recent one). I can achieve that with a correlated sub-query:

SELECT  r.name, i.inspected_at, i.score
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
AND     NOT EXISTS (SELECT  1
                    FROM    inspections i2
                    WHERE   i2.restaurant_id = i.restaurant_id
                    AND     i2.inspected_at > i.inspected_at)
;

But that is kind of tricky, and lateral joins make it so much easier. In a lateral join, we can join to a subquery that is allowed to “reach out” and use rows from other tables in the query:

SELECT  r.name, i.inspected_at, i.score
FROM    restaurants r
LEFT OUTER JOIN LATERAL (SELECT *
                         FROM   inspections i2
                         WHERE  i2.restaurant_id = r.id
                         ORDER BY i2.inspected_at DESC
                         LIMIT 1) i
ON true
;

This is a lot like a correlated sub-query, except the sub-query is in a FROM or JOIN rather than a condition. I think it expresses what we want much more directly: for each restaurant, order the inspections by most recent first, and take the first one.

The second part of this report is to show a count of the violations from whatever inspection we’re using. That’s pretty easy to add:

SELECT  r.name, i.inspected_at, i.score, COUNT(v.id) violations
FROM    restaurants r
LEFT OUTER JOIN LATERAL (SELECT *
                         FROM   inspections i2
                         WHERE  i2.restaurant_id = r.id
                         ORDER BY i2.inspected_at DESC
                         LIMIT 1) i
ON true
LEFT OUTER JOIN violations v
ON  v.inspection_id = i.id
GROUP BY r.name, i.inspected_at, i.score
;

UPDATE:

So radical_3do asked in the comments about the performance of LATERAL vs the correlated sub-query. The answer will always depend on your specific situation, but I decided to test my example for “reasonable” numbers. To keep it simple I’ll leave out the violations table. First I filled up my tables with 1000 restaurants and 30 inspections each:

DELETE FROM inspections;
DELETE FROM restaurants;

INSERT INTO restaurants
(id, name)
SELECT  s, 'Foo'
FROM    generate_series(1, 1000) s
;                   
INSERT INTO inspections     
(id, restaurant_id, score, inspected_at)
SELECT  30 * r + m + 1,
        r + 1,
        (30 * r + m + 1) % 10 + 1,
        '2015-01-01'::date - concat(m, ' MONTHS')::interval
FROM    generate_series(0, 999) r
CROSS JOIN generate_series(0, 29) m
;

CREATE INDEX idx_inspections_rest_and_date ON inspections (restaurant_id ASC, inspected_at DESC);
ANALYZE restaurants;
ANALYZE inspections;

Now when I EXPLAIN my queries, I get this plan for the correlated sub-query:

Sort  (cost=5246.56..5296.56 rows=20000 width=20)
  Sort Key: i.score
  ->  Hash Right Join  (cost=1485.79..3817.79 rows=20000 width=20)
        Hash Cond: (i.restaurant_id = r.id)
        ->  Hash Anti Join  (cost=1440.29..3497.29 rows=20000 width=16)
              Hash Cond: (i.restaurant_id = i2.restaurant_id)
              Join Filter: (i.inspected_at < i2.inspected_at)
              ->  Seq Scan on inspections i  (cost=0.00..1256.00 rows=30000 width=16)
              ->  Hash  (cost=918.29..918.29 rows=30000 width=12)
                    ->  Index Only Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..918.29 rows=30000 width=12)
        ->  Hash  (cost=33.00..33.00 rows=1000 width=8)
              ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)

I get this plan for the LATERAL join:

Sort  (cost=2366.16..2368.66 rows=1000 width=20)
  Sort Key: i2.score
  ->  Nested Loop Left Join  (cost=0.29..2316.33 rows=1000 width=20)
        ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)
        ->  Limit  (cost=0.29..2.26 rows=1 width=20)
              ->  Index Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..59.56 rows=30 width=20)
                    Index Cond: (restaurant_id = r.id)

So it looks like the LATERAL version will be faster. (Note that these plans both include an ORDER BY i.score DESC that wasn’t in the original queries above. It doesn’t look like that makes much difference though.) If I run both queries with \timing on, I consistently get about 80ms for the first one and 8ms for the second. That’s a 10x difference! But of course this could all change depending on your actual use case.
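
For anyone following along, the timings come from psql’s \timing meta-command, which makes psql print the elapsed time after every statement:

\timing on
-- run each query; psql then appends a line like:
-- Time: 80.123 ms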

UPDATE 2: There are a few more ways to write this query without using lateral joins. Here is a performance comparison of lateral join alternatives.

Postgres \copy from stdin vs pstdin

2015-02-14

Sometimes in Postgres you want to initialize a table by reading a bunch of data from stdin, like this:

psql "$database" -f init.sql < /usr/share/dict/american-english

using an init.sql like this:

DROP TABLE IF EXISTS dictionary;
CREATE TABLE dictionary (id SERIAL PRIMARY KEY, word VARCHAR NOT NULL);
\copy dictionary (word) from stdin
CREATE UNIQUE INDEX idx_dictionary_word ON dictionary (word);

The problem is that with the -f flag, psql ignores its stdin and treats the file you name with -f as if it were stdin. So running the above would add just one word to your database, spelled “CREATE UNIQUE INDEX …”.

The solution to this is to use \copy dictionary (word) from pstdin instead. That tells psql to copy from its real stdin rather than from the -f file.
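
In other words, the fixed init.sql is the same file with just that one word changed:

DROP TABLE IF EXISTS dictionary;
CREATE TABLE dictionary (id SERIAL PRIMARY KEY, word VARCHAR NOT NULL);
\copy dictionary (word) from pstdin
CREATE UNIQUE INDEX idx_dictionary_word ON dictionary (word);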

As far as I know there is no way to do this using COPY FROM rather than \copy from if you want COPY FROM to appear inside a larger .sql file. So this works:

psql "$database" -c "COPY dictionary (word) FROM STDIN" < /usr/share/dict/american-english

but this does not:

psql "$database" -f init.sql < /usr/share/dict/american-english

and neither does this:

psql "$database" -c '\i init.sql' < /usr/share/dict/american-english

But I’d love to be corrected if someone knows a way!

Rails acts_as_list with Soft Delete

2014-12-23

Two really useful gems for a Rails project are acts_as_list and permanent_records (or the somewhat more common acts_as_paranoid). The first one, acts_as_list, adds a position column to your model to provide natural ordering. Usually you don’t want global ordering, but only within an owning model, e.g. the Page objects inside a Book have position 1, 2, 3, and the Page objects inside a different Book also go 1, 2, 3. You can use the scope option to achieve that:

class Page
  belongs_to :book
  acts_as_list scope: :book
end

The problem is that if you are soft-deleting pages using a deleted_at column, acts_as_list doesn’t know to exclude deleted pages from the list. This gets annoying if you want to move pages up and down: Imagine you have page 1, then 5 deleted pages, and now you add a new page p to the end. You want to move your new page to the beginning. You have a button that calls p.move_higher, but the page won’t appear to move until the user presses the button five times. That is pretty confusing!

Fortunately acts_as_list lets you pass a string to scope that it will evaluate when necessary, and you can even embed Ruby code in it. Look at this, and notice the single quotes:

class Page
  belongs_to :book
  acts_as_list scope: 'book_id = #{book_id} AND deleted_at IS NULL'
end

With that code, acts_as_list will only consider non-deleted pages, so moving a page works as it should.

Postgres CTE for Threaded Comments

2014-09-17

Not long ago I answered a question on the Postgres mailing list I thought was pretty fun: How do you construct a recursive CTE to pull a whole tree of threaded comments, sorting sibling comments by votes?

Threaded, scored comments are what you see on sites like Reddit and Hacker News:

[Screenshots: threaded comments on Reddit and on Hacker News]

Threaded means the comments appear in a tree-like structure. Contrast this with Wordpress or Discourse, where every comment appears at the end:

[Screenshot: flat comments on Discourse]

Disqus (which uses threaded comments) has a nice article about implementing them in Postgres with a recursive CTE. They start with this data:

CREATE TABLE comments (
  id SERIAL PRIMARY KEY,
  message VARCHAR,
  author VARCHAR,
  parent_id INTEGER REFERENCES comments(id)
);
INSERT INTO comments (message, author, parent_id)
  VALUES
  ('This thread is really cool!', 'David', NULL),
  ('Ya David, we love it!', 'Jason', 1),
  ('I agree David!', 'Daniel', 1),
  ('gift Jason', 'Anton', 2),
  ('Very interesting post!', 'thedz', NULL),
  ('You sir, are wrong', 'Chris', 5),
  ('Agreed', 'G', 5),
  ('Fo sho, Yall', 'Mac', 5);

Then they recommend using this query:

WITH RECURSIVE cte (id, message, author, path, parent_id, depth)  AS (
    SELECT  id,
        message,
        author,
        array[id] AS path,
        parent_id,
        1 AS depth
    FROM    comments
    WHERE   parent_id IS NULL

    UNION ALL

    SELECT  comments.id,
        comments.message,
        comments.author,
        cte.path || comments.id,
        comments.parent_id,
        cte.depth + 1 AS depth
    FROM    comments
    JOIN cte ON comments.parent_id = cte.id
    )
    SELECT id, message, author, path, depth FROM cte
ORDER BY path;

Essentially what’s happening here is that as we recursively execute the CTE, we build up a “path” to each comment, with the IDs of all ancestors. Then we can sort by the path, so that comments with the same ancestors sort together. This technique relies on the fact that the array A will sort before all other arrays that begin with A, so parents will appear before their children.
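
You can verify that prefix behavior directly; this is just a sanity check, not part of the solution:

SELECT array[1] < array[1,2] AS parent_sorts_first,
       array[1,2] < array[1,3] AS siblings_stay_apart
;
-- both columns come back true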

(Note that the Disqus solution returns the depth of each comment. This is not really necessary, since it is just the length of path, but I’ve left it in the query because computing depth is often necessary with recursive CTEs, so it might be interesting if you haven’t seen it before.)

That gets you the comments sorted as a tree, but the question on the mailing list was how to also sort by votes, so that while preserving the tree structure, comments with higher votes would appear before their siblings.

To figure this out, let’s modify the Disqus sample data so each comment has a score:

CREATE TABLE comments (
  id SERIAL PRIMARY KEY,
  message VARCHAR,
  author VARCHAR,
  parent_id INTEGER REFERENCES comments(id),
  votes INTEGER
);
INSERT INTO comments (message, author, parent_id, votes)
  VALUES
  ('This thread is really cool!', 'David', NULL, 1),
  ('Ya David, we love it!', 'Jason', 1, 3),
  ('I agree David!', 'Daniel', 1, 4),
  ('gift Jason', 'Anton', 2, 15),
  ('Very interesting post!', 'thedz', NULL, 3),
  ('You sir, are wrong', 'Chris', 5, 3),
  ('Agreed', 'G', 5, 5),
  ('Fo sho, Yall', 'Mac', 5, 12);

One intuitive approach is to change path to include only ancestors, and then resolve ties by looking at votes:

. . .
ORDER BY path, votes DESC

The problem is that this approach fails to keep children with their parents. We need to keep the full chain of IDs.

The correct answer is to sort by path, like in the Disqus article, but at each step also sort by votes. “At each step” is the key. Conceptually what we want is for each step along the path to sort as if by votes DESC, id ASC. To make that happen, rather than constructing the path solely of IDs, we can build it out of (votes, id) tuples. Then we still get our tree, but siblings will differ in the second-to-last array element.

One wrinkle is that we want to sort by votes in reverse order. But it’s easy to just negate the votes to achieve this. Here is the SQL:

WITH RECURSIVE cte (id, message, author, path, parent_id, depth, votes)  AS (
    SELECT  id,
        message,
        author,
        array[-votes,id] AS path,
        parent_id,
        1 AS depth,
        votes
    FROM    comments
    WHERE   parent_id IS NULL
    UNION ALL
    SELECT  comments.id,
        comments.message,
        comments.author,
        cte.path || -comments.votes || comments.id,
        comments.parent_id,
        cte.depth + 1 AS depth,
        comments.votes
    FROM    comments
    JOIN cte ON comments.parent_id = cte.id
    )
    SELECT id, message, author, path, depth, votes FROM cte
ORDER BY path;

Running it gets us this:

 id |           message           | author |       path        | depth | votes 
----+-----------------------------+--------+-------------------+-------+-------
  5 | Very interesting post!      | thedz  | {-3,5}            |     1 |     3
  8 | Fo sho, Yall                | Mac    | {-3,5,-12,8}      |     2 |    12
  7 | Agreed                      | G      | {-3,5,-5,7}       |     2 |     5
  6 | You sir, are wrong          | Chris  | {-3,5,-3,6}       |     2 |     3
  1 | This thread is really cool! | David  | {-1,1}            |     1 |     1
  3 | I agree David!              | Daniel | {-1,1,-4,3}       |     2 |     4
  2 | Ya David, we love it!       | Jason  | {-1,1,-3,2}       |     2 |     3
  4 | gift Jason                  | Anton  | {-1,1,-3,2,-15,4} |     3 |    15

You can see that high-scoring comments are sorted higher than their siblings! I hope you enjoyed this walk through recursive CTEs for threaded, scored comments. I thought it was a pretty fun problem!
