Postgres LATERAL JOIN

2015-02-15

A question came up on the pdxruby mailing list that is a great example for Postgres’s new LATERAL join feature. Lateral joins can be incredibly useful when you need them, but it’s hard to grok their “shape” without a concrete example.

The original question is about an app that shows restaurant inspection results. It has three tables:

restaurants -< inspections -< violations

id             id             id
name           restaurant_id  inspection_id
               inspected_at   reason
               score

So each restaurant has a history of inspections, and each inspection can have zero or more violations.

First, you want to show a list of restaurants with the date and score of their most recent inspection. This is the kind of thing that is easy to express in English, and is a completely reasonable request for a report, but is hard to implement.

With no lateral joins, my thought process goes like this: Since we want one row per restaurant, I’m going to say FROM restaurants. Then I’ll join to inspections and include any inspections than which no inspection is later (i.e. the most recent one). I can achieve that with a correlated sub-query:

SELECT  r.name, i.inspected_at, i.score
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
AND     NOT EXISTS (SELECT  1
                    FROM    inspections i2
                    WHERE   i2.restaurant_id = i.restaurant_id
                    AND     i2.inspected_at > i.inspected_at)
;

But that is kind of tricky, and lateral joins make it so much easier. In a lateral join, we can join to a subquery that is allowed to “reach out” and use rows from other tables in the query:

SELECT  r.name, i.inspected_at, i.score
FROM    restaurants r
LEFT OUTER JOIN LATERAL (SELECT *
                         FROM   inspections i2
                         WHERE  i2.restaurant_id = r.id
                         ORDER BY i2.inspected_at DESC
                         LIMIT 1) i
ON true
;

This is a lot like a correlated sub-query, except the sub-query is in a FROM or JOIN rather than a condition. I think it expresses what we want much more directly: for each restaurant, order the inspections by most recent first, and take the first one.

The second part of this report is to show a count of the violations from whatever inspection we’re using. That’s pretty easy to add:

SELECT  r.name, i.inspected_at, i.score, COUNT(v.id) violations
FROM    restaurants r
LEFT OUTER JOIN LATERAL (SELECT *
                         FROM   inspections i2
                         WHERE  i2.restaurant_id = r.id
                         ORDER BY i2.inspected_at DESC
                         LIMIT 1) i
ON true
LEFT OUTER JOIN violations v
ON  v.inspection_id = i.id
GROUP BY r.name, i.inspected_at, i.score
;

UPDATE:

So radical_3do asked in the comments about the performance of LATERAL vs the correlated sub-query. The answer will always depend on your specific situation, but I decided to test my example for “reasonable” numbers. To keep it simple I’ll leave out the violations table. First I filled up my tables with 1000 restaurants and 30 inspections each:

DELETE FROM inspections;
DELETE FROM restaurants;

INSERT INTO restaurants
(id, name)
SELECT  s, 'Foo'
FROM    generate_series(1, 1000) s
;                   
INSERT INTO inspections     
(id, restaurant_id, score, inspected_at)
SELECT  30 * r + m + 1,
        r + 1,
        (30 * r + m + 1) % 10 + 1,
        '2015-01-01'::date - concat(m, ' MONTHS')::interval
FROM    generate_series(0, 999) r
CROSS JOIN generate_series(0, 29) m
;

CREATE INDEX idx_inspections_biz_and_date ON inspections (restaurant_id ASC, inspected_at DESC);
ANALYZE restaurants;
ANALYZE inspections;

Now when I EXPLAIN my queries, I get this plan for the correlated sub-query:

Sort  (cost=5246.56..5296.56 rows=20000 width=20)
  Sort Key: i.score
  ->  Hash Right Join  (cost=1485.79..3817.79 rows=20000 width=20)
        Hash Cond: (i.restaurant_id = r.id)
        ->  Hash Anti Join  (cost=1440.29..3497.29 rows=20000 width=16)
              Hash Cond: (i.restaurant_id = i2.restaurant_id)
              Join Filter: (i.inspected_at < i2.inspected_at)
              ->  Seq Scan on inspections i  (cost=0.00..1256.00 rows=30000 width=16)
              ->  Hash  (cost=918.29..918.29 rows=30000 width=12)
                    ->  Index Only Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..918.29 rows=30000 width=12)
        ->  Hash  (cost=33.00..33.00 rows=1000 width=8)
              ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)

I get this plan for the LATERAL join:

Sort  (cost=2366.16..2368.66 rows=1000 width=20)
  Sort Key: i2.score
  ->  Nested Loop Left Join  (cost=0.29..2316.33 rows=1000 width=20)
        ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)
        ->  Limit  (cost=0.29..2.26 rows=1 width=20)
              ->  Index Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..59.56 rows=30 width=20)
                    Index Cond: (restaurant_id = r.id)

So it looks like the LATERAL version will be faster. (Note that these plans both include an ORDER BY i.score DESC that wasn’t in the original queries above. It doesn’t look like that makes much difference though.) If I run both queries with \timing on, I consistently get about 80ms for the first one and 8 for the second one. That’s a 10x difference! But of course this could all change depending on your actual use case.

UPDATE 2: There are a few more ways to write this query without using lateral joins. Here is a performance comparison of lateral join alternatives.

blog comments powered by Disqus Prev: Too Many Outer Joins Next: Postgres \copy from stdin vs pstdin