I seem to keep winding up doing projects that require analyzing data stored in Postgres. Often it is helpful to see your data as a histogram: for instance, how many users have placed n orders? R is the best tool for this, but occasionally you want the results as a table pulled out of Postgres. We can use window functions to make this easy. Here is a general-purpose template for a histogram-like SQL query:
SELECT c,
COUNT(*) AS items_count,
SUM(COUNT(*)) OVER (ORDER BY c) AS total_items_count,
(SUM(COUNT(*)) OVER (ORDER BY c))::float / SUM(COUNT(*)) OVER () AS total_items_perc
FROM (SELECT p.id, COUNT(ch.id) c
FROM parent p, child ch
WHERE p.id = ch.parent_id
GROUP BY p.id
) x
GROUP BY x.c
ORDER BY x.c
This gives results like the following:
c | items_count | total_items_count | total_items_perc
-----+-------------+-------------------+-------------------
1 | 1279 | 1279 | 0.161775866430559
2 | 1275 | 2554 | 0.323045788009107
3 | 1076 | 3630 | 0.459144953200101
4 | 886 | 4516 | 0.571211737920567
5 | 653 | 5169 | 0.653807235011384
. . .
Column 2 would be the y-value if you graphed this as a usual histogram. Column 3 or 4 would be the y-value if you graphed it as a cumulative histogram.
Note that the percentage column is wrong if you want to include values where c equals zero. Then you’ll need an outer join in the sub-query. I’ll leave that as an exercise for the reader.
blog comments powered by Disqus Prev: Indexing HTML with Solr/Lucene Next: Highlighting Lines in Github