Too Many Outer Joins

2015-02-16

My last post about lateral joins reminded me of a rule I have when using LEFT OUTER JOIN. Recall that in an outer join, every row in the original table will yield one or more rows in the result, depending on how many rows match in the new table. So if you say this:

SELECT  *
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
;

then you’ll get one or more rows for each restaurant.

You could say this to show the average inspection score for each restaurant:

SELECT  r.name, AVG(i.score) avg_score
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
GROUP BY r.name
;

Now suppose you also had an employees table like this:

restaurants -< employees

id             id
name           restaurant_id
               name

You can count the employees in each restaurant like so:

SELECT  r.name, COUNT(e.id) employees
FROM    restaurants r
LEFT OUTER JOIN employees e
ON      e.restaurant_id = r.id
GROUP BY r.name
;

Easy! So can you combine those two queries? How about like this:

SELECT  r.name, AVG(i.score) avg_score, COUNT(e.id) AS employees
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
LEFT OUTER JOIN employees e
ON      e.restaurant_id = r.id
GROUP BY r.name
;

Unfortunately that is totally wrong! Whenever you have one central table, and you want to outer join it to two separate tables, you are in danger. To see why, think through what happens step-by-step as these tables get joined. All the joins happen before any grouping, so first we make a bunch of rows combining restaurants and inspections. Now for each of those rows, we join to the employees table. That means if we have a restaurant with two inspections, we’ll create rows for all its employees for both inspections. That means when we handle the grouping, our aggregate functions will compute wrong results. For the COUNT, we can solve the problem with COUNT(DISTINCT e.id), but it’s not so easy to solve for the AVG.

My rule is that a query can only have one outer join that produces more than one match. If two outer joins both produce multiple matches, then you’ll have trouble.

The only way I know to generate reports like this is to run a subquery for each outer-joined table, so that you can ensure one row per restaurant in the subquery results:

SELECT  r.name, x.avg_score, y.employees
FROM    restaurants r
LEFT OUTER JOIN (SELECT restaurant_id, AVG(score) avg_score
                 FROM   inspections
                 GROUP BY restaurant_id) x
ON      r.id = x.restaurant_id
LEFT OUTER JOIN (SELECT restaurant_id, COUNT(id) employees
                 FROM   employees
                 GROUP BY restaurant_id) y
ON      r.id = y.restaurant_id
;

That approach ensures that your aggregate functions will compute correct results, unpolluted by duplicate rows caused by outer joins to the sibling table(s).

blog comments powered by Disqus Prev: Lateral Join Alternatives Next: Postgres LATERAL JOIN

Illuminated Computing

Too Many Outer Joins

Paul A. Jungwirth

Code

Writing

Talks