Postgres Permissions

2017-03-27

It’s taken me a long time to build up an understanding of the Postgres permissions system, so I thought I’d write it down to help others get up to speed faster.

First let’s talk about what things can be guarded by permissions. The big ones are databases and tables/views, but you can also set permissions on schemas, functions, sequences, types, languages, tablespaces, domains, foreign servers, foreign data wrappers, and large objects. (Phew!) There are also a few permissions that apply to the whole cluster. (In Postgres terminology, the “cluster” is the server process(es) keeping its files in /var/lib/postgresql/9.5/main and listening on 5432 (or whatever), and a “database” is a namespace in the cluster where you define tables. You can list the databases in your cluster with \l.) We’ll explore below what actions can be allowed or restricted for each of these subjects. Collectively, let’s call permissionable things “topics”.

(As of Postgres 9.5 there is also Row Level Security (RLS). That is complicated enough to be a separate topic, so here I want to cover just the traditional permissions system.)

Second let’s ask who has permissions. In Postgres permissions come first from your user, and second from any groups your user is a part of. Collectively, users and groups are called “roles”. Or to be more accurate, before Postgres 8.1 users and groups were separate things, but now everything is just a role. A role can even be both a “group” and a “user” at the same time: it can have members but also serve as a direct login.

Permission Mechanics

The key to understanding Postgres permissions is that the stored permission between any role and any topic is only ever a 0 or a 1. When Postgres does a permissions check, it takes all your roles’ permissions on the topic, and if any of them is a 1, you’re in. In other words your effective permission is the OR of all your roles’ permissions. (There is one exception to this, explained below, but it rarely matters.)

You can see what groups each role belongs to with \du (list users). You can add someone to a role with GRANT r TO u, or remove them with REVOKE r FROM u, where r is the group and u is the user. (I am using “group” and “user” loosely here for clarity, but remember that they are all roles.)
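
For example, with a hypothetical group role reporting and a hypothetical login role alice, the commands look like this:

-- Create a "group" role (no login) and a login role, then manage membership:
CREATE ROLE reporting;
CREATE ROLE alice LOGIN PASSWORD 'secret';
GRANT reporting TO alice;    -- alice now gets reporting's 1s
REVOKE reporting FROM alice; -- and now she doesn't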

So some consequences of this: If you REVOKE a permission from a role r1, it loses its direct 1, but it might still get the permission indirectly from other roles. The REVOKE isn’t remembered in any way. It just sets the 1 to a 0 for r1. It doesn’t override other possible sources of a 1.

Another consequence is that commands like GRANT SELECT ON ALL TABLES have an immediate effect of setting a lot of 1s, but they are not remembered either. So if you add some more tables, they will not automatically share the same grants as the old tables.

The complement of GRANT ON ALL is ALTER DEFAULT PRIVILEGES. This lets you say what the 1s and 0s should be for future objects. It does not change any existing objects; it just says how to set up new ones. You can combine GRANT ON ALL and ALTER DEFAULT to get 100% coverage: first for the old objects, then for the new. Actually it would probably be less racy to do the DEFAULT first, in case someone else is making more objects while you type.
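
For instance, to give a hypothetical reporting role read access to everything in the public schema, present and future, you might do the DEFAULT first and then the GRANT ON ALL:

-- Future tables created in public will grant SELECT to reporting:
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO reporting;
-- Existing tables get their 1s set right now:
GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting;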

The final trick in all this is a magic role named PUBLIC. Every other role is a member of the PUBLIC role, so any 1s it has apply to everyone. Note there is no way to override a 1 from the PUBLIC role, because of what we said above about how permission checks do an OR. You might REVOKE something from a troublemaker and give them a 0, but if the PUBLIC role has a 1, they can do the thing anyway. Of course you can REVOKE a permission from the PUBLIC role itself, but then you have to individually GRANT the permission to all the roles that should have it.

One wrinkle here is that a role can have either the INHERIT or NOINHERIT attribute. If it has NOINHERIT, then it does not benefit from permissions of roles it belongs to. In other words, its own direct 1s are the only ones it’s got (ha!). In practice this is not a significant block though, because if you belong to an admin role but have NOINHERIT, you can still SET ROLE admin and act with the permissions of that role. So NOINHERIT should not be used as a way of restricting what people can do. Also note that even with NOINHERIT, you still inherit permissions from the PUBLIC role. In general I don’t see much point to NOINHERIT, except for the psychological benefit of making people more explicitly prepare before running dangerous commands.
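
For example, if you belong to an admin role but have NOINHERIT, you can still do this:

SET ROLE admin;   -- act with admin's permissions despite NOINHERIT
-- ... run whatever needs admin ...
RESET ROLE;       -- back to your own permissions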

Cluster Permissions

Okay, so those are the mechanics. Now let’s look at some specific permissions. At the top level you have cluster-wide permissions. These are stored as attributes on your role. They are:

LOGIN: This means you can connect to the cluster. You must also have permission to CONNECT to an individual database (covered below), but if your role doesn’t have this basic LOGIN attribute, you can’t get in at all.

SUPERUSER: This means you can do anything. You’ve got root. Your other permissions don’t matter, because when Postgres sees you have this, it doesn’t ask any further questions.

CREATEDB: You can create more databases.

CREATEROLE: You can create more roles.

REPLICATION: You can start streaming replication. Normally you’d only give this to a special role used by slaves to connect to the master database.

You can set these permissions with CREATE ROLE or change them with ALTER ROLE. You can see them with \du.
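
For example (the role name and password are just placeholders):

CREATE ROLE deploy LOGIN CREATEDB PASSWORD 'secret';
ALTER ROLE deploy NOCREATEDB;  -- take CREATEDB away again
ALTER ROLE deploy SUPERUSER;   -- you must be a superuser yourself to run this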

Database Permissions

The next level is the database. If you type \l you can see the databases in your cluster, and the last column shows the permissions on each. For example:

perms=# \l
                                                   List of databases
            Name            |     Owner      | Encoding |   Collate   |    Ctype    |         Access privileges         
----------------------------+----------------+----------+-------------+-------------+-----------------------------------
 perms                      | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres                     +
                            |                |          |             |             | postgres=CTc/postgres            +
                            |                |          |             |             | admin=C/postgres
 postgres                   | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0                  | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres                      +
                            |                |          |             |             | postgres=CTc/postgres
 template1                  | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres                      +
                            |                |          |             |             | postgres=CTc/postgres            +

For each database, you’ll see several lines of permissions. Each line shows a role, an equals sign, and then the list of permissions for that role. Sometimes a line is missing the role and starts with just an equals sign; this is for the PUBLIC role. It shows the permissions that everyone gets for free. Whatever you see here is allowed for anyone.

There is also a forward slash followed by another role name. That shows who granted those permissions.

So in the above example you can see that for the perms database, the PUBLIC role has Tc permissions, and admin has C, granted by postgres.

Each permission is shown by a single letter. They mean:

T: Create temporary tables (they last only for your session).

c: CONNECT to the database. It has always worried me that by default databases let anyone connect. In theory this is okay since you can’t SELECT or change anything, but you can still list the tables (\dt), list their columns (\d tablename), etc. More precisely, connecting lets people query the tables in pg_catalog and information_schema. They can see the names of all the other database users, all the other databases, and quite a bit more. Personally I feel this is a significant shortcoming in Postgres’s system.

You can revoke access to these system tables, but at the cost of breaking important things. For instance pgadmin fails if it can’t list the databases. Another idea is to use RLS on the system tables, but that is not supported.

Of course, if you can connect to any database, you can do these things. I’m not aware of any information you can get about database foo by connecting there that you can’t also get while connected to database bar. So you could argue that granting everyone CONNECT is not a problem! I don’t have a good answer to that, even though it feels suspicious. I think the moral, then, is that all database roles require more trust than you might expect.

C: This allows you to CREATE a new schema within the database. As the owner of that schema, you’ll be able to do lots of things with it, like create tables, etc.

Finally, individual letters may be followed by an asterisk (*). That means that the user not only has the permission, but may also GRANT the permission to others. The object owner may always grant permission, so Postgres doesn’t bother to show asterisks for them.
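
For example, granting CONNECT along with the right to pass it on would look like this (just an illustration, not how the output above was produced):

GRANT CONNECT ON DATABASE perms TO admin WITH GRANT OPTION;
-- admin could then run: GRANT CONNECT ON DATABASE perms TO joe;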

Table/View Permissions

For permission purposes, Postgres treats tables and views as the same thing. So ALL TABLES will include views, too.

You can list the permissions for tables and views (and sequences too) using \dp or \z (they do the same thing):

perms=# \dp
                                       Access privileges
 Schema |      Name      |   Type   |     Access privileges     | Column privileges | Policies 
--------+----------------+----------+---------------------------+-------------------+----------
 public | sales          | table    | postgres=arwdDxt/postgres+|                   | 
        |                |          | admin=arwdDxt/postgres    |                   | 
 public | sales_id_seq   | sequence |                           |                   | 
 public | widgets        | table    | postgres=arwdDxt/postgres+|                   | 
        |                |          | admin=arwdDxt/postgres   +|                   | 
        |                |          | joe=r/postgres            |                   | 
 public | widgets_id_seq | sequence |                           |                   | 

What you see here is a lot like the output from \l. A blank means you have never made any permissions changes for this object, so its permissions still match the default. (Note that once you change something, even if you change it back, the column will never go back to blank again.)

Of course now there are more letters than before. That’s because tables have lots of permissions:

SELECT (r): You can read from the table.

UPDATE (w): You can write to the table.

INSERT (a): You can append to the table.

DELETE (d): You can delete from the table.

TRUNCATE (D): You can really delete from the table.

REFERENCES (x): This lets you create foreign key constraints. You need the permission on both tables. Also, you can grant this permission for just specific columns.

TRIGGER (t): You can add a trigger to the table.
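
The letters map directly onto GRANT names. For example, using joe and the widgets table from above (the id column is just an assumption):

GRANT SELECT, INSERT ON widgets TO joe;   -- joe gets r and a
GRANT REFERENCES (id) ON widgets TO joe;  -- a column-level x, shown under "Column privileges"
REVOKE INSERT ON widgets FROM joe;        -- set the a back to 0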

It is also useful to see the default privileges for each table (from ALTER DEFAULT PRIVILEGES):

perms=# \ddp
         Default access privileges
 Owner | Schema | Type | Access privileges 
-------+--------+------+-------------------
(0 rows)

perms=# ALTER DEFAULT PRIVILEGES GRANT SELECT ON TABLES TO joe;
ALTER DEFAULT PRIVILEGES
perms=# \ddp
               Default access privileges
  Owner   | Schema | Type  |     Access privileges     
----------+--------+-------+---------------------------
 postgres |        | table | postgres=arwdDxt/postgres+
          |        |       | joe=r/postgres
(1 row)

perms=# create table inventory (id serial, widget_id integer);
CREATE TABLE
perms=# \dp
                                        Access privileges
 Schema |       Name       |   Type   |     Access privileges     | Column privileges | Policies 
--------+------------------+----------+---------------------------+-------------------+----------
 public | inventory        | table    | postgres=arwdDxt/postgres+|                   | 
        |                  |          | joe=r/postgres            |                   | 
 ...

Here you can see that new tables will start out with whatever permissions we have prepared as the defaults.

Note that you can also say ALTER DEFAULT PRIVILEGES FOR ROLE r GRANT .... Normally when you alter default privileges, it only affects (future) objects you own. If you want to change the privileges for future objects created by someone else, you can use FOR ROLE. Naturally you can only do that if you belong to that role.
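
For example, to make future tables created by the admin role readable by joe:

ALTER DEFAULT PRIVILEGES FOR ROLE admin GRANT SELECT ON TABLES TO joe;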

With all these rules, it is natural to want to see the final calculated rights of each user in the system. Unfortunately Postgres doesn’t have a built-in way to show that. There is a function called has_table_privilege, but it does not consider inherited permissions. For instance has_table_privilege('joe', 'public.sales', 'INSERT') will return false, even though joe belongs to the admin role. (And even if we change admin to INHERIT, it will still return false.) You can also say SELECT * FROM information_schema.role_table_grants WHERE table_schema = 'public', but that also does not compute inherited permissions.

Sequence Permissions

If you squint, a sequence is pretty much just a tiny table: you can SELECT from it (call currval), or UPDATE it (call setval), or do both at once (call nextval). So it appears in \dp with the tables and views. But unlike tables and views, they also have the USAGE permission, which allows currval or nextval.

The USAGE permission is not just an abbreviation for SELECT plus UPDATE; it has its own bit. You can see it in \dp as U (vs. r and w). Yet if you have it, you don’t need r or w to use the sequence. Granting only USAGE would let someone use the sequence normally but not call setval, which might be useful.
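
So a typical grant for an application role that only needs to insert rows might be (the app role is hypothetical):

GRANT USAGE ON SEQUENCE widgets_id_seq TO app;
-- or for everything at once:
GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO app;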

I can’t think of any reason to give someone just SELECT on a sequence. Even currval won’t work, because currval only tells you the last time you called nextval. You will get an error like ERROR: currval of sequence "widgets_id_seq" is not yet defined in this session.

Function Permissions

Functions have just an EXECUTE permission. New functions grant execute permission to PUBLIC, so be careful if you are creating something sensitive. If you like, you can change this with ALTER DEFAULT PRIVILEGES.

You can see functions in psql with \df or \df+, but neither will show the permissions. Instead you should use a query like

SELECT proacl FROM pg_proc WHERE proname = 'f';

The format of proacl is like anything else:

perms=# select proacl from pg_proc where proname = 'f';
                proacl                 
---------------------------------------
 {postgres=X/postgres,joe=X/postgres}

Here we have already revoked execute from PUBLIC, then granted it to joe.
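
The commands to get there look something like this (assuming f takes no arguments):

REVOKE EXECUTE ON FUNCTION f() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION f() TO joe;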

Note that usually functions execute with the permissions of the invoker, but you can create “setuid” functions that execute with the permissions of the definer. These can be very useful, but they have similar dangers to Unix setuid programs, so you have to be careful with them.
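
In SQL the “setuid” behavior is spelled SECURITY DEFINER. Here is a minimal sketch, reusing the sales table from earlier (the function name is made up):

-- Runs with the permissions of whoever defined it, not the caller:
CREATE FUNCTION sales_count() RETURNS bigint AS $$
  SELECT count(*) FROM sales;
$$ LANGUAGE sql SECURITY DEFINER;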

Schema Permissions

Schemas have just CREATE and USAGE permissions. CREATE lets you add more stuff to the schema (tables, functions, sequences, etc.). USAGE is a prerequisite to using objects within the schema. For instance to SELECT from a table, you must first have USAGE permission on its schema. (You must also have SELECT permission on the table.) Note that even without USAGE, you can still list the schema contents, and even query for its tables’ definitions etc.
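
For example, to let joe use (but not add to) a hypothetical schema named reports:

GRANT USAGE ON SCHEMA reports TO joe;
-- and, if he should be able to create tables there too:
GRANT CREATE ON SCHEMA reports TO joe;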

To be honest I find the schema USAGE permission kind of confusing and superfluous, and I don’t think I’m the only one. I can’t think of anything you can do with this permission you can’t do with permissions on individual tables, functions, etc., and because you can say GRANT ... ON ALL whatevers IN SCHEMA foo, it is not even necessary as an abbreviation. I guess if you really want to lock people out, you might as well use it as an extra layer though.

Conclusion

There are permissions for more specialized database objects too, like types and foreign data wrappers. But if you have made it this far, you should have no problem understanding the documentation for those. You might also want to read about RLS to get really fine-grained access control. Here I’ve only tried to cover the basics, and give you an accurate mental model of how Postgres decides who can do what. If you have any questions, comments, or corrections, I’m happy to listen!

Where to put Postgres json_each?

2016-09-07

Postgres has several “set-returning” functions that essentially return more than one row. Probably the most well-known is generate_series, used to get a list of numbers/dates/etc. Another very handy one is json_each. You can use it to break up a JSON object into one row for each key/value pair. But that one always trips me up, because unlike generate_series it (1) returns more than one column per row and (2) usually takes a column as input.

Suppose you have a users table with a preferences column, like so:

t=# select * from users;
 id |                        preferences                         
----+------------------------------------------------------------
  1 | {"lang": "en", "time_zone": "America/Los_Angeles"}
  2 | {"lang": "es", "time_zone": "America/Cayman", "volume": 5}

If you do this you get tuples:

t=# select json_each(preferences) from users;
               json_each               
---------------------------------------
 (lang,"""en""")
 (time_zone,"""America/Los_Angeles""")
 (lang,"""es""")
 (time_zone,"""America/Cayman""")
 (volume,5)

It would be nicer to get separate key and value columns. That works great if json_each appears in the FROM clause:

t=# select * from json_each('{"lang": "en", "time_zone": "America/Los_Angeles"}');
    key    |         value         
-----------+-----------------------
 lang      | "en"
 time_zone | "America/Los_Angeles"

But if you replace the literal with a column, it breaks:

t=# select key, value from json_each(users.preferences);
ERROR:  missing FROM-clause entry for table "users"
LINE 1: select key, value from json_each(users.preferences);

This doesn’t work either:

t=# select key, value from (select json_each(preferences) from users) x(key, value);
ERROR:  table "x" has 1 columns available but 2 columns specified

So what to do? Well here is something that works, although I find it hard to remember:

t=# select id, (json_each(preferences)).* from users;
 id |    key    |         value         
----+-----------+-----------------------
  1 | lang      | "en"
  1 | time_zone | "America/Los_Angeles"
  2 | lang      | "es"
  2 | time_zone | "America/Cayman"
  2 | volume    | 5

Basically you are decomposing the tuple. You can pick your columns too:

t=# select id, (json_each(preferences)).key, (json_each(preferences)).value from users;
 id |    key    |         value         
----+-----------+-----------------------
  1 | lang      | "en"
  1 | time_zone | "America/Los_Angeles"
  2 | lang      | "es"
  2 | time_zone | "America/Cayman"
  2 | volume    | 5

Possibly nicer is this trick: a set-returning function in the FROM clause acts as an implicit lateral join:

t=# select id, key, value from users, json_each(preferences);
 id |    key    |         value         
----+-----------+-----------------------
  1 | lang      | "en"
  1 | time_zone | "America/Los_Angeles"
  2 | lang      | "es"
  2 | time_zone | "America/Cayman"
  2 | volume    | 5
(5 rows)
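
If you prefer to be explicit, the same query can be written with a LATERAL join, which should produce the same five rows:

select id, key, value from users cross join lateral json_each(preferences);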

So if you’re struggling to use json_each with a non-literal input, there you go!

When the Inner JSON Effect Works

2016-07-29

Hacker News has a fun discussion today about an article on The Daily WTF called The Inner JSON Effect. It’s about how a lead-programmer mad genius built an entire web application on top of svn, not by using svn to track history, but by using a separate revision number for each individual function, and having each class contain nothing but a list of the revision numbers of each function it wants as a method. It is a hilarious example of what not to do. Obviously it’s fun to laugh at something as crazy as JDSL, but it’s even more interesting to try linking it to things we might find more normal. So a few provocative thoughts:

  • jsonapi? I work on a lot of Ember projects, where jsonapi is the blessed format for frontend/backend communication. It seems to me awfully verbose without adding a lot of new information. Just as the WTF story adds extra indirection where you could just define functions, so jsonapi makes you list out the attributes and references—even though in regular JSON they are right there. It feels to me like the worst of the old XML excesses. I admit this is a bit trollish, but I’ve yet to understand what I get from jsonapi. Maybe someone can explain it to me.

  • Ansible? I’m a Chef guy and haven’t tried ansible, and I know a lot of developers-with-sysadmin-chops love it for its simplicity. Personally I’m fearful of using something so restrictive. I have similar doubts about Puppet, although at least there I know there are good “escape hatches”. I want to be able to throw in an if statement when I need to. Here I feel my sentiment is wrong though, since these are such successful projects. So I’m curious if anyone can articulate principles that make Ansible’s YAML “work” but not TDWTF webapp’s JSON.

And going the other direction, I think sometimes—not very often—you can get huge wins out of creatively building on top of an existing tool. But it is always a risk and requires a lot of good taste. A few examples:

  • I built a product with user-customizable questionnaires, and we knew up front the logic for branching/skipping/computations was going to be super complicated. I could tell it was heading toward what I call a “Turing-Complete Feature”. (Enterprise Workflow Engine anyone?) Instead of trying to model and store all the rules in a database, or writing a new language (you can see the looming Inner Platform Effect), we basically made each questionnaire a server-side underscore.js template, with a simple Javascript API to access information about the current context. The feature was done quickly and we had great flexibility building the actual surveys. Our surveys were built by an internal team of not-quite-programmers, so the developers could work on other things. We even had the potential of letting customers write their own surveys, since Javascript is the lingua franca of modern programming, with good security isolation. (After all your browser uses it to run untrusted code!) Many customers told us our advanced survey functionality is what set us apart from the competition.

  • Another project also has user-customizable forms, with something close to an EAV pattern in the database. We wanted to offer customers reporting capabilities, so we started talking about building an API. It would be for the existing “power users”, who were doing reports and analysis using R, SAS, Tableau, and similar tools. But the API was going to be high effort, slow, and hard to join data. So we decided our API would be . . . SQL! Of course you can’t share your real tables, for security but also because if those become a public API they can never change. Plus the EAV format made them too abstract to use easily. So instead we gave each customer a private database with one table for each of their forms. This lets them ask questions in a very direct way, using tools they already know, with great performance and flexibility. (Okay I am using the past tense but this one is not actually turned on for external customers yet. Still, using a derived, flattened database for SQL reporting is not so strange, even if young CRUD developers haven’t seen it before.)

  • Even the idea of building on top of a version control system is appealing. (Maybe not svn though! :-) I would love a self-hosted password manager with a Firefox plugin, where my passwords are stored (encrypted) in a git repository I keep on my own server. Not only would using git save the author time, but it would be like using a “standard protocol” rather than something proprietary. Same thing for a self-hosted bookmarking service.

I think all three of these “good” examples are a little audacious, but in my opinion they all work well. So are there any general guidelines that can help us make better decisions about when to make an unexpected swerve? So much of programming comes down to taste. That leads us into pointless vi-vs-emacs wars, but it also sets good programming apart from bad. How can we make that taste teachable? Or should it be entrusted only to the old and wise? Are there guidelines we can identify, so that we can discuss them, evaluate them, and make decisions about them? (Apparently one rule is don’t write your own TCP/IP stack. :-)

One principle of my own is a “risk budget”: you have to take risks—they are a fertile source of advantage—but you should have a list of them and not take too many. Risks can include: tight deadlines, unreasonable clients/management, new clients/management, unclear requirements, new tech, new-to-me tech, high scalability requirements, etc. One risk is “cutting out a layer” like in the Javascript survey API and the SQL reporting interface. I’m not saying it’s always a good idea, just that it can be. Building your own filesystem on top of NFS is another example of cutting out a layer. I can’t speak to how well that one worked out, but it’s an interesting idea.

What are some other guidelines people use? I’m especially interested in guidelines about “when to leave CRUD behind”, or when to make a choice that seems strange at first glance but pays off. If programming were chess, when do you sac a piece?

Postgres custom range types for Geoip

2016-07-05

I had to add a geoip lookup table to one of my projects recently, and it was a perfect application for the new(ish) range types in Postgres. A range is a column type with a start and end point. The endpoints can be “closed” (includes the endpoint) or “open” (includes everything up to the endpoint, but not the endpoint itself): for instance for integers [1,3] is 1, 2, and 3, but [1,3) is just 1 and 2. Postgres comes with a few built-in range types for numbers and timestamps, but you can also define your own. Since I wanted to use the Postgres inet datatype, that’s what I had to do.

My goal was to have a table like this:

db=> \d geoips
        Table "public.geoips"
    Column    |   Type    | Modifiers
--------------+-----------+-----------
 ips          | inetrange | not null
 country_code | text      | not null
 latitude     | real      | not null
 longitude    | real      | not null

Also I wanted to get fast lookups based on which row contained a given IP. I would have about 4 million rows, so I would need an effective index. In fact part of the reason for using a range type is that they support GiST indexes, where you can get fast lookups based on both the start and end values. A B-Tree index for something like ip BETWEEN start_ip AND end_ip can quickly get you to start_ip, but then it needs to scan forward for the end_ip check. So I expected a GiST index would do better.

Defining the range type can be as simple as this:

CREATE TYPE inetrange AS RANGE (
  subtype = inet
);

Then we can start using them:

db=> select '[1.2.3.4,1.2.3.8]'::inetrange;
     inetrange     
-------------------
 [1.2.3.4,1.2.3.8]

You also get a free constructor function:

db=> select inetrange('1.2.3.4', '1.2.3.8', '[]');
     inetrange     
-------------------
 [1.2.3.4,1.2.3.8]

And by the way, since the inet type supports IPv6 addresses, you can make ranges with those too. The only caveat is I haven’t found a way to prevent you from mixing them:

db=> select inetrange('1.2.3.4', '2001:0db8:0000:0042:0000:8a2e:0370:7334', '[]');
                inetrange                
-----------------------------------------
 [1.2.3.4,2001:db8:0:42:0:8a2e:370:7334]

So don’t do that! :-)

You can also use the <@ operator to see if an IP falls within a range:

db=> select '1.2.3.6'::inet <@ '[1.2.3.4,1.2.3.8]'::inetrange;
 ?column? 
----------
 t

Another nice thing about using a range is we can define an exclusion constraint to ensure that no two ranges overlap. That will guarantee that our lookups never return more than one row. We can include the exclusion constraint when we create the table:

CREATE TABLE geoips (
  ips inetrange NOT NULL,
  country_code TEXT NOT NULL,
  latitude REAL NOT NULL,
  longitude REAL NOT NULL,
  CONSTRAINT geoips_dont_overlap
    EXCLUDE USING gist (ips WITH &&)
    DEFERRABLE INITIALLY IMMEDIATE
);

That tells Postgres to forbid any new value x where x && ips is true, when checked against all old ips. It also automatically creates a GiST index for us—the same index we’ll use when we query the table.

Now we can load our data—and by the way if you use COPY or \copy with a CSV file, Postgres has no trouble with range literals:

"[1.0.0.0,1.0.0.255]","AU","-27.467940","153.028090"

But there is still one problem. Our GiST index isn’t so fast after all!:

db=> select * from geoips where '1.2.3.4'::inet <@ ips;
         ip          | country_code | latitude | longitude 
---------------------+--------------+--------------+-------
 [1.2.3.0,1.2.3.255] | AU           | -27.4679 |   153.028 
(1 row)

Time: 1.120 ms
db=> select * from geoips where '10.2.3.4'::inet <@ ips;
            ip             | country_code | latitude | longitude 
---------------------------+--------------+----------+-----------
 [10.0.0.0,10.255.255.255] | -            |        0 |         0 
(1 row)

Time: 54.669 ms
db=> select * from geoips where '100.2.3.4'::inet <@ ips;
           ip            | country_code | latitude | longitude 
-------------------------+--------------+----------+-----------
 [100.2.3.0,100.2.3.255] | US           |  40.8681 |  -73.4257 
(1 row)

Time: 322.719 ms

As you can see, it gets slower and slower for higher-numbered IPs. What is going on here?

The problem is that we didn’t implement a function to tell GiST indexes how “far apart” two inetranges are. That means we have to start all over! So drop the table, drop the type, and try again… . First, we’ll implement the diff function:

CREATE OR REPLACE FUNCTION inet_diff(x inet, y inet)
  RETURNS DOUBLE PRECISION AS
$$
DECLARE
BEGIN 
  RETURN x - y; 
END;
$$
LANGUAGE plpgsql STRICT IMMUTABLE;

As you can see, all this does is subtract one inet from the other inet. Pretty simple!

Now create the inetrange again:

CREATE TYPE inetrange AS RANGE (
  subtype = inet,
  subtype_diff = inet_diff
);

Now create the table, load it, and try querying again:

db=> select * from geoips where '1.2.3.4'::inet <@ ips;
         ip          | country_code | latitude | longitude 
---------------------+--------------+--------------+-------
 [1.2.3.0,1.2.3.255] | AU           | -27.4679 |   153.028 
(1 row)

Time: 1.004 ms
db=> select * from geoips where '10.2.3.4'::inet <@ ips;
            ip             | country_code | latitude | longitude 
---------------------------+--------------+----------+-----------
 [10.0.0.0,10.255.255.255] | -            |        0 |         0 
(1 row)

Time: 0.678 ms
db=> select * from geoips where '100.2.3.4'::inet <@ ips;
           ip            | country_code | latitude | longitude 
-------------------------+--------------+----------+-----------
 [100.2.3.0,100.2.3.255] | US           |  40.8681 |  -73.4257 
(1 row)

Time: 0.618 ms

Those times are much better! So that is how you can use custom Postgres ranges to implement a fast geoip lookup table.

Now there is one last thing: technically if your range elements are discrete rather than continuous (as ours are), you should define a canonical function for the type, so that Postgres can resolve ambiguities between ranges with open and closed endpoints. For instance, [1,3) is the same thing as [1,2], but Postgres doesn’t know that unless we tell it.

Unfortunately, because the function depends on the type and the type depends on the function, there are complications. You have to first create a “shell type”, which requires superuser privileges, and also your canonical function must be written in C, because plpgsql functions can’t return shell types. And loading a C function requires superuser privileges too. To make this all a lot simpler, I created a Postgres extension for inetrange. That will create the inetrange type as described here, but also with a canonical function. So for real work, it is probably easiest to just use that! Or you could use the non-superuser commands above and skip the extension.

Ledger with Autosync

2015-08-01

A while back I started tracking my finances with ledger-cli, a command-line tool for double-entry bookkeeping. It’s been a lot of fun, but one of the stumbling blocks is recording each transaction by hand. Fortunately almost every bank nowadays offers an API so you can download your recent account activity in a format like OFX (Open Financial Exchange) or QFX (Quicken’s extended OFX), and ledger has tools to import that data.

For this I use ledger-autosync, which is built on top of ofxclient. First you run ofxclient to create a ~/ofxclient.ini file with the details of all your accounts. Then you run ledger-autosync, which will first read your ledger file (whatever $LEDGER_FILE points to), then contact the banks and print out any new transactions. It tries to guess expense categories, but often you need to correct these. Still, that’s a lot easier than typing in everything!

Because of this step, I like to have ledger-autosync append to a review.ldg file, which I periodically edit to correct expense categories, then cut-and-paste into my regular $LEDGER_FILE. I also do some extra tweaking to format the entries how I like them. Here is the command I use:

ledger-autosync \
  -l ~/share/accounting/in-progress.ldg \
  -i 2 \
  | sed 's,^\([0-9]\{4\}\)/\([0-9]\{2\}\)/\([0-9]\{2\}\),\1-\2-\3,' \
  >> ~/share/accounting/review.ldg

That uses two spaces for indents (-i 2) and rewrites the dates.

Also note how I override $LEDGER_FILE with the -l option. That’s a trick to keep ledger-autosync from entering the same transactions twice into review.ldg. The contents of in-progress.ldg are just this:

# This file exists so that ledger-autosync will read review.ldg
# and not download entries it's already pulled down
# but we haven't yet reviewed.

include pj.ldg
include review.ldg

And that’s it! This setup makes it pretty painless to keep track of everything. I don’t have to do a lot of typing, and nothing slips through the cracks. I actually kind of like that categorizing forces me to look at expenses every week or so. If you are thinking about using ledger, hopefully this helps you out!

Paperclip expiring_url and "Request has expired"

2015-04-28

I hit a weird error today. My project stores user uploads on S3 with private permissions, and we use the Paperclip expiring_url method to let people download them. This started failing with an error like this:

<Error>
  <Code>AccessDenied</Code>
  <Message>Request has expired</Message>
  <Expires>2015-04-28T18:30:51Z</Expires>
  <ServerTime>2015-04-28T18:32:52Z</ServerTime>
  <RequestId>BLAHBLAHBLAHBLAH</RequestId>
  <HostId>
    MOREBLAHMOREBLAHMOREBLAHMOREBLAHMOREBLAHMOREBLAHMOREBLAH
  </HostId>
</Error>

The problem was that this error occurred immediately, even though we were generating the URLs with a 10 minute expiration time. Well, it turns out S3 was living in the future by 7 minutes (according to ServerTime above), and our own app was living in the past by another 5. So 5 + 7 = 12 and our URLs were expired before we even generated them. 10 minutes is the example time in the Paperclip documentation linked above, but I guess it’s too short. Watch out for clock skew out there folks!
