I hit a weird error today. My project stores user uploads on S3 with
private permissions, and we use the Paperclip
expiring_url method to let people download them. This started failing with an error like this:
<Error>
  <Code>AccessDenied</Code>
  <Message>Request has expired</Message>
  <Expires>2015-04-28T18:30:51Z</Expires>
  <ServerTime>2015-04-28T18:32:52Z</ServerTime>
  <RequestId>BLAHBLAHBLAHBLAH</RequestId>
  <HostId>MOREBLAHMOREBLAHMOREBLAHMOREBLAHMOREBLAHMOREBLAHMOREBLAH</HostId>
</Error>
The problem was that this error occurred immediately, even though we were generating the URLs with a 10-minute expiration time. Well, it turns out S3 was living in the future by 7 minutes (see
ServerTime above), and our own app was living in the past by another 5. So 5 + 7 = 12, and our URLs were expired before we even generated them. Ten minutes is the example time in the Paperclip documentation linked above, but I guess it’s too short. Watch out for clock skew out there, folks!
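Just to illustrate the arithmetic, here is a quick Ruby sketch that computes how far past expiration the request already was, using the two timestamps from the error above (the 7- and 5-minute skews themselves came from comparing our servers’ clocks):

```ruby
require "time"

# The two timestamps from the S3 error response above.
expires     = Time.parse("2015-04-28T18:30:51Z")
server_time = Time.parse("2015-04-28T18:32:52Z")

# By S3's clock, the URL had already been expired this long when it rejected us:
overrun_seconds = (server_time - expires).to_i

puts "expired #{overrun_seconds} seconds ago by S3's clock"
```

The practical fix is just a longer expiration when generating the URL, e.g. something like `attachment.expiring_url(30.minutes.to_i)` (expiring_url takes seconds).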
Whenever I have to write an aggregate SQL query over time-series data, I start with
generate_series. Mostly that’s to avoid missing rows in the output. For instance, if I’m writing a query to show total sales in every hour, I don’t want to write this:
SELECT extract(hour FROM order_placed_at) h,
       SUM(total_price)
FROM sales
WHERE order_placed_at BETWEEN ? AND ?
GROUP BY h
ORDER BY h
because then if there was an hour with no sales, that row will be simply missing from the result.
Instead I use
generate_series to generate all the hours, and join my real table to that:
SELECT h, COALESCE(SUM(total_price), 0) AS revenue_per_hour
FROM generate_series(0, 23) s(h)
LEFT OUTER JOIN sales
  ON extract(hour FROM order_placed_at) = h
  AND order_placed_at BETWEEN ? AND ?
GROUP BY h
ORDER BY h
That way I get 24 rows every time.
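For a runnable illustration of why this pattern guarantees a row per hour, here is a sketch of the same idea using SQLite from Python (SQLite has no generate_series built in, so a recursive CTE plays that role; the table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_placed_at TEXT, total_price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2015-03-01 09:15:00", 10.0),
                  ("2015-03-01 09:45:00", 5.0),
                  ("2015-03-01 17:30:00", 20.0)])

# Generate hours 0..23, then LEFT JOIN the real data onto them.
rows = conn.execute("""
    WITH RECURSIVE hours(h) AS (
      SELECT 0 UNION ALL SELECT h + 1 FROM hours WHERE h < 23
    )
    SELECT h, COALESCE(SUM(total_price), 0)
    FROM hours
    LEFT OUTER JOIN sales
      ON CAST(strftime('%H', order_placed_at) AS INTEGER) = h
    GROUP BY h
    ORDER BY h
""").fetchall()

print(len(rows))  # 24 rows, even though only two hours had any sales
```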
But another interesting use case for
generate_series appeared on the Postgres mailing list: Suppose you have a table of events with
end_time, and you want to report how many events were “ongoing” for each hour of the day. That means a row of data can be counted in multiple output rows. Without
generate_series this is hard to write, but if you use that as your “core” table then join to your data table, it just works:
SELECT h,
       COUNT(events.id) events_per_hour,
       array_remove(array_agg(events.id), NULL) active_events
FROM generate_series(0, 23) s(h)
LEFT OUTER JOIN events
  ON h BETWEEN extract(hour FROM events.start_time)
           AND extract(hour FROM events.end_time)
GROUP BY h
ORDER BY h
That gives you one output row per hour, and events can be counted multiple times, within each hour that includes them. I’ve thrown in an
array_agg so you can see the IDs of each event, just to double-check my work.
But there are still problems. What if our query is supposed to include multiple days? How will it handle events that cross midnight? What about events that start on Monday and run until Wednesday? For these kinds of things, we should avoid
extract in our
BETWEEN, and instead of generating a series of integers, we should generate a series of timestamps. This is also a great chance to use Postgres’s new(ish) range datatype, to avoid our own bug-prone interval comparisons:
SELECT extract(hour FROM s.h) AS hour,
       COUNT(DISTINCT events.id) events_per_hour,
       array_remove(array_agg(events.id), NULL) active_events
FROM generate_series(
       date_trunc('hour', '2015-03-01 00:00:00'::timestamp),
       date_trunc('hour', '2015-03-31 23:00:00'::timestamp),
       '1 hour') s(h)
LEFT OUTER JOIN events
  ON tsrange(events.start_time, events.end_time) &&
     tsrange(s.h, s.h + interval '1 hour')
GROUP BY hour
ORDER BY hour
Here we are using the
&& (overlaps) operator to check whether each event overlaps with each window. If we didn’t want to use ranges, we could also use the SQL
OVERLAPS operator, which accomplishes the same thing.
Now suppose we also want to report on the total time each event was “active” during each window. That is just a
SUM() we can add to our query, using the
* (intersection) range operator. But be careful about
NULL values coming from our
LEFT JOIN! In ranges, a null at either end represents unbounded, so
tsrange(NULL, NULL) is not
NULL, but all time from negative infinity to positive infinity! Here is a query that works around that:
SELECT extract(hour FROM s.h) AS hour,
       COUNT(DISTINCT events.id) events_per_hour,
       array_remove(array_agg(events.id), NULL) active_events,
       SUM(CASE WHEN events.id IS NULL THEN NULL
                ELSE date_trunc('minute', range_to_interval(
                       tsrange(s.h, s.h + interval '1 hour') *
                       tsrange(events.start_time, events.end_time)))
           END) active_hours
FROM generate_series(
       date_trunc('hour', '2015-03-01 00:00:00'::timestamp),
       date_trunc('hour', '2015-03-31 23:00:00'::timestamp),
       '1 hour') s(h)
LEFT OUTER JOIN events
  ON tsrange(events.start_time, events.end_time) &&
     tsrange(s.h, s.h + interval '1 hour')
GROUP BY hour
ORDER BY hour
;
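By the way, the unbounded-range gotcha is easy to see in isolation (a quick psql check; the timestamp literal is arbitrary):

```sql
-- A range with NULL bounds is not NULL; it contains every timestamp:
SELECT tsrange(NULL, NULL) @> '2015-01-01'::timestamp;  -- returns t
```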
Oh, one more thing… There is no such thing as
range_to_interval. A surprising omission! But here is a function that defines it:
CREATE OR REPLACE FUNCTION range_to_interval(tsrange)
RETURNS interval
AS $$ SELECT upper($1) - lower($1) $$
LANGUAGE sql STABLE;
It might be even more proper to create a user-defined cast here, but I’ll leave that as an exercise for the reader. :-)
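For the curious, one possible shape for that cast (an untested sketch, reusing the function above) would be:

```sql
-- Sketch: lets you write some_tsrange::interval
-- instead of range_to_interval(some_tsrange).
CREATE CAST (tsrange AS interval)
  WITH FUNCTION range_to_interval(tsrange);
```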
In the comments to my post about Postgres lateral joins, someone asked about getting the same results with better performance using a
GROUP BY query. Actually you can’t get the same results with a straight
GROUP BY clause, but there are three more ways to write the query that are worth considering. So I’d like to talk first about why (top-level)
GROUP BY doesn’t work, and then look at the other ways to do it (including a subquery with GROUP BY, a window function, and DISTINCT ON).
Recall that the problem was: for each restaurant, show its info, plus the score of its most recent inspection (if any), plus the number of violations from that inspection. When you hear “most recent” it’s natural to think about
GROUP BY. Maybe you want to write something like this:
SELECT r.id, r.name, MAX(i.inspected_at)
FROM restaurants r
LEFT OUTER JOIN inspections i
  ON i.restaurant_id = r.id
GROUP BY r.id
;
The problem is that whenever you want to pick a row using one column but then select additional columns from that same row,
GROUP BY lets you down. In the above query, you can’t add
i.score to the
SELECT list, because that’s an error. You have to do something fancier.
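To make the failure concrete, here is roughly what Postgres says if you try it (exact wording varies by version):

```sql
SELECT r.id, r.name, MAX(i.inspected_at), i.score
FROM restaurants r
LEFT OUTER JOIN inspections i ON i.restaurant_id = r.id
GROUP BY r.id;
-- ERROR:  column "i.score" must appear in the GROUP BY clause
--         or be used in an aggregate function
```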
One way to get there is a lateral join, as we’ve seen already. Three other ways are:
Use GROUP BY in a subquery to get the latest inspection date, then use that date to get the rest of the inspection info.
Use a window function.
Use DISTINCT ON.
A couple things to keep in mind:
We want to show all the restaurants, even if they have no inspections yet. That’s why I didn’t write the above query
FROM inspections GROUP BY restaurant_id, and that’s why all our queries below are
FROM restaurants with an outer join to something else.
We also want to include a count of the violations from whatever inspection we choose. I’ve left that out of all my queries because it mostly just adds noise, but remember that whatever solution we consider must be compatible with that requirement. It must be possible to add one more outer join to
violations and get the count. One approach this rules out is putting a whole subquery into the
SELECT list, because that won’t be available for joining to violations.
For the first alternative, we might write the query like this:
SELECT r.id, r.name, i.score, i.inspected_at
FROM restaurants r
LEFT OUTER JOIN (
  SELECT restaurant_id, MAX(inspected_at) AS max_inspection_date
  FROM inspections
  GROUP BY restaurant_id) i2
  ON i2.restaurant_id = r.id
LEFT OUTER JOIN inspections i
  ON i.restaurant_id = r.id
  AND i.inspected_at = i2.max_inspection_date
;
Here we’re using
GROUP BY to get the most recent inspection of each restaurant, and then we add a separate join based on that inspection date to get the rest of the inspection info.
I think this is borderline incorrect. There’s no guarantee that
inspected_at is unique. If we have millisecond precision it might be, but what if it’s just a date? Probably no restaurant gets inspected twice in the same day, but it’d be nice if we didn’t have to rely on that assumption. Also pedagogically it’s a bad approach because who knows if your application will have the same guarantees? I’d rather find a pattern that is valid always, not just for this restaurants example.
Also it’s hard to believe this approach is very fast. Indeed, here is the query plan:
Hash Right Join  (cost=1162.54..2645.04 rows=1000 width=20)
  Hash Cond: ((i.restaurant_id = r.id) AND (i.inspected_at = (max(inspections.inspected_at))))
  ->  Seq Scan on inspections i  (cost=0.00..1256.00 rows=30000 width=16)
  ->  Hash  (cost=1147.54..1147.54 rows=1000 width=16)
        ->  Hash Right Join  (cost=45.79..1147.54 rows=1000 width=16)
              Hash Cond: (inspections.restaurant_id = r.id)
              ->  GroupAggregate  (cost=0.29..1078.29 rows=1000 width=12)
                    ->  Index Only Scan using idx_inspections_rest_and_date on inspections  (cost=0.29..918.29 rows=30000 width=12)
              ->  Hash  (cost=33.00..33.00 rows=1000 width=8)
                    ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)
On my machine I get query times around 20ms, so it’s more than twice as slow as the lateral join. We already have an index covering
inspected_at, so I don’t think there’s an easy way to speed things up here.
One great feature in Postgres is window functions. This lets you compute aggregates (and more!) without turning your whole query into a
GROUP BY. I’ve written about window functions a bit here, but it’s a big topic, so if they’re new to you, the Postgres docs are the place to start. With window functions, we can get what we want like this:
SELECT r.id, r.name, i.score, i.inspected_at
FROM restaurants r
LEFT OUTER JOIN (
  SELECT *, row_number() OVER (
    PARTITION BY restaurant_id ORDER BY inspected_at DESC) AS row_number
  FROM inspections) i
  ON i.restaurant_id = r.id
  AND i.row_number = 1
;
This approach is more correct than the previous one. There are no risky assumptions about unique inspection dates. But it performs a little worse. Here is the query plan:
Merge Right Join  (cost=0.56..4096.38 rows=1000 width=20)
  Merge Cond: (i.restaurant_id = r.id)
  ->  Subquery Scan on i  (cost=0.29..4030.72 rows=150 width=16)
        Filter: (i.row_number = 1)
        ->  WindowAgg  (cost=0.29..3655.72 rows=30000 width=20)
              ->  Index Scan using idx_inspections_rest_and_date on inspections  (cost=0.29..3130.72 rows=30000 width=20)
  ->  Index Scan using restaurants_pkey on restaurants r  (cost=0.28..61.27 rows=1000 width=8)
Results for me take about 25ms. I’ve gotta say I love window functions, but it looks like in this case at least lateral join is a better choice.
There is a special feature in Postgres called
DISTINCT ON, which is another way to get grouping-like behavior without a
GROUP BY clause. If you’ve heard of
DISTINCT but not
DISTINCT ON, you should check it out. Basically you name one or more columns whose combination should be unique in the result, and then you give an
ORDER BY that includes those columns plus (usually) something else, and for each distinct combo, Postgres will include only the first row. That lets us write our query like so:
SELECT r.id, r.name, i.score, i.inspected_at
FROM restaurants r
LEFT OUTER JOIN (
  SELECT DISTINCT ON (restaurant_id) *
  FROM inspections i2
  ORDER BY restaurant_id, inspected_at DESC) i
  ON i.restaurant_id = r.id
;
Here is the query plan I get with that query:
Merge Right Join  (cost=0.56..3292.00 rows=1000 width=20)
  Merge Cond: (i2.restaurant_id = r.id)
  ->  Unique  (cost=0.29..3205.72 rows=1000 width=20)
        ->  Index Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..3130.72 rows=30000 width=20)
  ->  Index Scan using restaurants_pkey on restaurants r  (cost=0.28..61.27 rows=1000 width=8)
And results take about 13ms. That is pretty nice! Not quite as fast as the lateral join, but maybe close enough. It is arguably easier to read too.
DISTINCT ON is really a nice feature that is insufficiently known. It is a Postgres extension, not available in MS SQL Server, Oracle, or MySQL—which may or may not matter to you. Personally I’m fine committing to one database vendor and leveraging it to the hilt, especially when that vendor is free, open source, and as mature as Postgres.
So anyway, I think this is an interesting lineup of alternatives to a lateral join, at least for my specific example. It’s fascinating to me that lateral join wound up the fastest of all, as that wasn’t why I originally selected it. But if you’re worried about performance, I wouldn’t depend too much on this blog post. I’d test the alternatives against your own real situation, because you might get different results. To me the worthwhile lesson is that these are all options (except maybe the GROUP BY subquery, given its shaky uniqueness assumption).
What do you think? Could I have written any of these queries in a more performant way? Is there yet another approach that would work?
Btw, if you want a real hard-core example of a lateral join that is hard to write any other way, here is an example used in funnel analysis.
My last post about lateral joins reminded me of a rule I have when using
LEFT OUTER JOIN. Recall that in an outer join, every row in the original table will yield one or more rows in the result, depending on how many rows match in the new table. So if you say this:
SELECT * FROM restaurants r LEFT OUTER JOIN inspections i ON i.restaurant_id = r.id ;
then you’ll get one or more rows for each restaurant.
You could say this to show the average inspection score for each restaurant:
SELECT r.name, AVG(i.score) avg_score
FROM restaurants r
LEFT OUTER JOIN inspections i
  ON i.restaurant_id = r.id
GROUP BY r.name
;
Now suppose you also had an
employees table like this:
restaurants -< employees
-----------    ---------
id             id
name           restaurant_id
               name
You can count the employees in each restaurant like so:
SELECT r.name, COUNT(e.id) employees
FROM restaurants r
LEFT OUTER JOIN employees e
  ON e.restaurant_id = r.id
GROUP BY r.name
;
Easy! So can you combine those two queries? How about like this:
SELECT r.name, AVG(i.score) avg_score, COUNT(e.id) AS employees
FROM restaurants r
LEFT OUTER JOIN inspections i
  ON i.restaurant_id = r.id
LEFT OUTER JOIN employees e
  ON e.restaurant_id = r.id
GROUP BY r.name
;
Unfortunately that is totally wrong! Whenever you have one central table and you want to outer join it to two separate tables, you are in danger. To see why, think through what happens step by step as these tables get joined. All the joins happen before any grouping, so first we make a bunch of rows combining restaurants and inspections. Then, for each of those rows, we join to the employees table. So if a restaurant has two inspections, we’ll create rows for all its employees for both inspections, and when we handle the grouping, our aggregate functions will compute wrong results. For the
COUNT, we can solve the problem with
COUNT(DISTINCT e.id), but it’s not so easy to solve for the AVG.
My rule is that a query can only have one outer join that produces more than one match. If two outer joins both produce multiple matches, then you’ll have trouble.
The only way I know to generate reports like this is to run a subquery for each outer-joined table, so that you can ensure one row per restaurant in the subquery results:
SELECT r.name, x.avg_score, y.employees
FROM restaurants r
LEFT OUTER JOIN (
  SELECT restaurant_id, AVG(score) avg_score
  FROM inspections
  GROUP BY restaurant_id) x
  ON r.id = x.restaurant_id
LEFT OUTER JOIN (
  SELECT restaurant_id, COUNT(id) employees
  FROM employees
  GROUP BY restaurant_id) y
  ON r.id = y.restaurant_id
;
That approach ensures that your aggregate functions will compute correct results, unpolluted by duplicate rows caused by outer joins to the sibling table(s).
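Here is a self-contained way to watch the fan-out happen, sketched with SQLite via Python (toy data I made up: one restaurant, two inspections, three employees):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT);
  CREATE TABLE inspections (id INTEGER PRIMARY KEY, restaurant_id INTEGER, score REAL);
  CREATE TABLE employees   (id INTEGER PRIMARY KEY, restaurant_id INTEGER, name TEXT);
  INSERT INTO restaurants VALUES (1, 'Foo');
  INSERT INTO inspections VALUES (1, 1, 80), (2, 1, 100);
  INSERT INTO employees   VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 1, 'c');
""")

# Naive double outer join: 2 inspections x 3 employees = 6 joined rows,
# so COUNT(e.id) is inflated to 6.
naive = conn.execute("""
  SELECT r.name, AVG(i.score), COUNT(e.id)
  FROM restaurants r
  LEFT OUTER JOIN inspections i ON i.restaurant_id = r.id
  LEFT OUTER JOIN employees   e ON e.restaurant_id = r.id
  GROUP BY r.name
""").fetchone()
print(naive)  # ('Foo', 90.0, 6) -- employee count is wrong, should be 3

# One aggregating subquery per outer-joined table gives the right answer.
fixed = conn.execute("""
  SELECT r.name, x.avg_score, y.employees
  FROM restaurants r
  LEFT OUTER JOIN (SELECT restaurant_id, AVG(score) avg_score
                   FROM inspections GROUP BY restaurant_id) x ON r.id = x.restaurant_id
  LEFT OUTER JOIN (SELECT restaurant_id, COUNT(id) employees
                   FROM employees GROUP BY restaurant_id) y ON r.id = y.restaurant_id
""").fetchone()
print(fixed)  # ('Foo', 90.0, 3)
```

Note that AVG happens to survive in this toy case only because every inspection row is duplicated the same number of times; SUM or COUNT breaks immediately.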
A question came up on the pdxruby mailing list that is a great example for Postgres’s new
LATERAL join feature. Lateral joins can be incredibly useful when you need them, but it’s hard to grok their “shape” without a concrete example.
The original question is about an app that shows restaurant inspection results. It has three tables:
restaurants -< inspections   -< violations
-----------    -----------      ----------
id             id               id
name           restaurant_id    inspection_id
               inspected_at     reason
               score
So each restaurant has a history of inspections, and each inspection can have zero or more violations.
First, you want to show a list of restaurants with the date and score of their most recent inspection. This is the kind of thing that is easy to express in English, and is a completely reasonable request for a report, but is hard to implement.
With no lateral joins, my thought process goes like this: Since we want one row per restaurant, I’m going to say
FROM restaurants. Then I’ll join to
inspections and include only those inspections with no later inspection (i.e. the most recent one). I can achieve that with a correlated sub-query:
SELECT r.name, i.inspected_at, i.score
FROM restaurants r
LEFT OUTER JOIN inspections i
  ON i.restaurant_id = r.id
  AND NOT EXISTS (
    SELECT 1 FROM inspections i2
    WHERE i2.restaurant_id = i.restaurant_id
      AND i2.inspected_at > i.inspected_at)
;
But that is kind of tricky, and lateral joins make it so much easier. In a lateral join, we can join to a subquery that is allowed to “reach out” and use rows from other tables in the query:
SELECT r.name, i.inspected_at, i.score
FROM restaurants r
LEFT OUTER JOIN LATERAL (
  SELECT *
  FROM inspections i2
  WHERE i2.restaurant_id = r.id
  ORDER BY i2.inspected_at DESC
  LIMIT 1) i
  ON true
;
This is a lot like a correlated sub-query, except the sub-query is in a FROM or JOIN rather than a condition. I think it expresses what we want much more directly: for each restaurant, order the inspections by most recent first, and take the first one.
The second part of this report is to show a count of the violations from whatever inspection we’re using. That’s pretty easy to add:
SELECT r.name, i.inspected_at, i.score, COUNT(v.id) violations
FROM restaurants r
LEFT OUTER JOIN LATERAL (
  SELECT *
  FROM inspections i2
  WHERE i2.restaurant_id = r.id
  ORDER BY i2.inspected_at DESC
  LIMIT 1) i
  ON true
LEFT OUTER JOIN violations v
  ON v.inspection_id = i.id
GROUP BY r.name, i.inspected_at, i.score
;
So radical_3do asked in the comments about the performance of
LATERAL vs the correlated sub-query. The answer will always depend on your specific situation, but I decided to test my example for “reasonable” numbers. To keep it simple I’ll leave out the violations table. First I filled up my tables with 1000 restaurants and 30 inspections each:
DELETE FROM inspections;
DELETE FROM restaurants;

INSERT INTO restaurants (id, name)
SELECT s, 'Foo'
FROM generate_series(1, 1000) s
;

INSERT INTO inspections (id, restaurant_id, score, inspected_at)
SELECT 30 * r + m + 1,
       r + 1,
       (30 * r + m + 1) % 10 + 1,
       '2015-01-01'::date - concat(m, ' MONTHS')::interval
FROM generate_series(0, 999) r
CROSS JOIN generate_series(0, 29) m
;

CREATE INDEX idx_inspections_rest_and_date
ON inspections (restaurant_id ASC, inspected_at DESC);

ANALYZE restaurants;
ANALYZE inspections;
Now when I
EXPLAIN my queries, I get this plan for the correlated sub-query:
Sort  (cost=5246.56..5296.56 rows=20000 width=20)
  Sort Key: i.score
  ->  Hash Right Join  (cost=1485.79..3817.79 rows=20000 width=20)
        Hash Cond: (i.restaurant_id = r.id)
        ->  Hash Anti Join  (cost=1440.29..3497.29 rows=20000 width=16)
              Hash Cond: (i.restaurant_id = i2.restaurant_id)
              Join Filter: (i.inspected_at < i2.inspected_at)
              ->  Seq Scan on inspections i  (cost=0.00..1256.00 rows=30000 width=16)
              ->  Hash  (cost=918.29..918.29 rows=30000 width=12)
                    ->  Index Only Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..918.29 rows=30000 width=12)
        ->  Hash  (cost=33.00..33.00 rows=1000 width=8)
              ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)
I get this plan for the LATERAL version:
Sort  (cost=2366.16..2368.66 rows=1000 width=20)
  Sort Key: i2.score
  ->  Nested Loop Left Join  (cost=0.29..2316.33 rows=1000 width=20)
        ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)
        ->  Limit  (cost=0.29..2.26 rows=1 width=20)
              ->  Index Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..59.56 rows=30 width=20)
                    Index Cond: (restaurant_id = r.id)
So it looks like the
LATERAL version will be faster. (Note that these plans both include an
ORDER BY i.score DESC that wasn’t in the original queries above. It doesn’t look like that makes much difference though.) If I run both queries with
\timing on, I consistently get about 80ms for the first one and 8ms for the second. That’s a 10x difference! But of course this could all change depending on your actual use case.
UPDATE 2: There are a few more ways to write this query without using lateral joins. Here is a performance comparison of lateral join alternatives.
Sometimes in Postgres you want to initialize a table by reading a bunch of data from stdin, like this:
psql "$database" -f init.sql < /usr/share/dict/american-english
with an init.sql like this:
DROP TABLE IF EXISTS dictionary;
CREATE TABLE dictionary (id SERIAL PRIMARY KEY, word VARCHAR NOT NULL);
\copy dictionary (word) from stdin
CREATE UNIQUE INDEX idx_dictionary_word ON dictionary (word);
The problem is that with the -f option,
psql ignores its stdin and treats the file you name with
-f as if it were stdin. So running the above would add just one word to your database, spelled “CREATE UNIQUE INDEX …”.
The solution to this is to use
\copy dictionary (word) from pstdin instead. That tells
psql to copy from its real stdin rather than from the file named with -f.
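So the working version of the init.sql above just swaps stdin for pstdin:

```sql
DROP TABLE IF EXISTS dictionary;
CREATE TABLE dictionary (id SERIAL PRIMARY KEY, word VARCHAR NOT NULL);
\copy dictionary (word) from pstdin
CREATE UNIQUE INDEX idx_dictionary_word ON dictionary (word);
```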
As far as I know there is no way to do this using
COPY FROM rather than
\copy from if you want
COPY FROM to appear inside a larger
.sql file. So this works:
psql "$database" -c "COPY dictionary (word) FROM STDIN" < /usr/share/dict/american-english
but this does not:
psql "$database" -f init.sql < /usr/share/dict/american-english
and neither does this:
psql "$database" -c '\i init.sql' < /usr/share/dict/american-english
But I’d love to be corrected if someone knows a way!