A question came up on the pdxruby mailing list that is a great example for Postgres’s new
LATERAL join feature. Lateral joins can be incredibly useful when you need them, but it’s hard to grok their “shape” without a concrete example.
The original question is about an app that shows restaurant inspection results. It has three tables:
restaurants -< inspections -< violations ----------- ----------- ---------- id id id name restaurant_id inspection_id inspected_at reason score
So each restaurant has a history of inspections, and each inspection can have zero or more violations.
First, you want to show a list of restaurants with the date and score of their most recent inspection. This is the kind of thing that is easy to express in English, and is a completely reasonable request for a report, but is hard to implement.
With no lateral joins, my thought process goes like this: Since we want one row per restaurant, I’m going to say
FROM restaurants. Then I’ll join to
inspections and include any inspections than which no inspection is later (i.e. the most recent one). I can achieve that with a correlated sub-query:
SELECT r.name, i.inspected_at, i.score FROM restaurants r LEFT OUTER JOIN inspections i ON i.restaurant_id = r.id AND NOT EXISTS (SELECT 1 FROM inspections i2 WHERE i2.restaurant_id = i.restaurant_id AND i2.inspected_at > i.inspected_at) ;
But that is kind of tricky, and lateral joins make it so much easier. In a lateral join, we can join to a subquery that is allowed to “reach out” and use rows from other tables in the query:
SELECT r.name, i.inspected_at, i.score FROM restaurants r LEFT OUTER JOIN LATERAL (SELECT * FROM inspections i2 WHERE i2.restaurant_id = r.id ORDER BY i2.inspected_at DESC LIMIT 1) i ON true ;
This is a lot like a correlated sub-query, except the sub-query is in a FROM or JOIN rather than a condition. I think it expresses what we want much more directly: for each restaurant, order the inspections by most recent first, and take the first one.
The second part of this report is to show a count of the violations from whatever inspection we’re using. That’s pretty easy to add:
SELECT r.name, i.inspected_at, i.score, COUNT(v.id) violations FROM restaurants r LEFT OUTER JOIN LATERAL (SELECT * FROM inspections i2 WHERE i2.restaurant_id = r.id ORDER BY i2.inspected_at DESC LIMIT 1) i ON true LEFT OUTER JOIN violations v ON v.inspection_id = i.id GROUP BY r.name, i.inspected_at, i.score ;
So radical_3do asked in the comments about the performance of
LATERAL vs the correlated sub-query. The answer will always depend on your specific situation, but I decided to test my example for “reasonable” numbers. To keep it simple I’ll leave out the violations table. First I filled up my tables with 1000 restaurants and 30 inspections each:
DELETE FROM inspections; DELETE FROM restaurants; INSERT INTO restaurants (id, name) SELECT s, 'Foo' FROM generate_series(1, 1000) s ; INSERT INTO inspections (id, restaurant_id, score, inspected_at) SELECT 30 * r + m + 1, r + 1, (30 * r + m + 1) % 10 + 1, '2015-01-01'::date - concat(m, ' MONTHS')::interval FROM generate_series(0, 999) r CROSS JOIN generate_series(0, 29) m ; CREATE INDEX idx_inspections_biz_and_date ON inspections (restaurant_id ASC, inspected_at DESC); ANALYZE restaurants; ANALYZE inspections;
Now when I
EXPLAIN my queries, I get this plan for the correlated sub-query:
Sort (cost=5246.56..5296.56 rows=20000 width=20) Sort Key: i.score -> Hash Right Join (cost=1485.79..3817.79 rows=20000 width=20) Hash Cond: (i.restaurant_id = r.id) -> Hash Anti Join (cost=1440.29..3497.29 rows=20000 width=16) Hash Cond: (i.restaurant_id = i2.restaurant_id) Join Filter: (i.inspected_at < i2.inspected_at) -> Seq Scan on inspections i (cost=0.00..1256.00 rows=30000 width=16) -> Hash (cost=918.29..918.29 rows=30000 width=12) -> Index Only Scan using idx_inspections_rest_and_date on inspections i2 (cost=0.29..918.29 rows=30000 width=12) -> Hash (cost=33.00..33.00 rows=1000 width=8) -> Seq Scan on restaurants r (cost=0.00..33.00 rows=1000 width=8)
I get this plan for the
Sort (cost=2366.16..2368.66 rows=1000 width=20) Sort Key: i2.score -> Nested Loop Left Join (cost=0.29..2316.33 rows=1000 width=20) -> Seq Scan on restaurants r (cost=0.00..33.00 rows=1000 width=8) -> Limit (cost=0.29..2.26 rows=1 width=20) -> Index Scan using idx_inspections_rest_and_date on inspections i2 (cost=0.29..59.56 rows=30 width=20) Index Cond: (restaurant_id = r.id)
So it looks like the
LATERAL version will be faster. (Note that these plans both include an
ORDER BY i.score DESC that wasn’t in the original queries above. It doesn’t look like that makes much difference though.) If I run both queries with
\timing on, I consistently get about 80ms for the first one and 8 for the second one. That’s a 10x difference! But of course this could all change depending on your actual use case.
UPDATE 2: There are a few more ways to write this query without using lateral joins. Here is a performance comparison of lateral join alternatives.blog comments powered by Disqus Prev: Too Many Outer Joins Next: Postgres \copy from stdin vs pstdin