Histograms in Postgres with Window Functions

2012-11-15

I seem to keep winding up doing projects that require analyzing data stored in Postgres. Often it is helpful to see your data as a histogram: for instance, how many users have placed n orders? R is the best tool for this, but occasionally you want the results as a table pulled out of Postgres. We can use window functions to make this easy. Here is a general-purpose template for a histogram-like SQL query:

SELECT  c,
        COUNT(*) AS items_count,
        SUM(COUNT(*)) OVER (ORDER BY c) AS total_items_count,
        (SUM(COUNT(*)) OVER (ORDER BY c))::float / SUM(COUNT(*)) OVER () AS total_items_perc
FROM    (SELECT p.id, COUNT(ch.id) c
         FROM   parent p, child ch
         WHERE  p.id = ch.parent_id
         GROUP BY p.id
        ) x
GROUP BY x.c
ORDER BY x.c

This gives results like the following:

  c  | items_count | total_items_count | total_items_perc
-----+-------------+-------------------+-------------------
   1 |        1279 |              1279 | 0.161775866430559
   2 |        1275 |              2554 | 0.323045788009107
   3 |        1076 |              3630 | 0.459144953200101
   4 |         886 |              4516 | 0.571211737920567
   5 |         653 |              5169 | 0.653807235011384
. . .

Column 2 would be the y-value if you graphed this as a usual histogram. Column 3 or 4 would be the y-value if you graphed it as a cumulative histogram.

Note that the percentage column is wrong if you want to include values where c equals zero. Then you’ll need an outer join in the sub-query. I’ll leave that as an exercise for the reader.

blog comments powered by Disqus Prev: Indexing HTML with Solr/Lucene Next: Highlighting Lines in Github