Postgres custom range types for Geoip

2016-07-05

I had to add a geoip lookup table to one of my projects recently, and it was a perfect application for the new(ish) range types in Postgres. A range is a column type with a start and end point. The endpoints can be “closed” (includes the endpoint) or “open” (includes everything up to the endpoint, but not the endpoint itself): for instance for integers [1,3] is 1, 2, and 3, but [1,3) is just 1 and 2. Postgres comes with a few built-in range types for numbers and timestamps, but you can also define your own. Since I wanted to use the Postgres inet datatype, that’s what I had to do.

My goal was to have a table like this:

db=> \d geoips
        Table "public.geoips"
    Column    |   Type    | Modifiers
--------------+-----------+-----------
 ips          | inetrange | not null
 country_code | text      | not null
 latitude     | real      | not null
 longitude    | real      | not null

Also I wanted to get fast lookups based on which row contained a given IP. I would have about 4 million rows, so I would need an effective index. In fact part of the reason for using a range type is that they support GiST indexes, where you can get fast lookups based on both the start and end values. A B-Tree index for something like ip BETWEEN start_ip AND end_ip can quickly get you to start_ip, but then it needs to scan forward for the end_ip check. So I expected a GiST index would do better.

Defining the range type can be as simple as this:

CREATE TYPE inetrange AS RANGE (
  subtype = inet
);

Then we can start using them:

db=> select '[1.2.3.4,1.2.3.8]'::inetrange;
     inetrange

 [1.2.3.4,1.2.3.8]

You also get a free constructor function:

db=> select inetrange('1.2.3.4', '1.2.3.8', '[]');
     inetrange

 [1.2.3.4,1.2.3.8]

And by the way, since the inet type supports IPv6 addresses, you can make ranges with those too. The only caveat is I haven’t found a way to prevent you from mixing them:

db=> select inetrange('1.2.3.4', '2001:0db8:0000:0042:0000:8a2e:0370:7334', '[]');
                inetrange

 [1.2.3.4,2001:db8:0:42:0:8a2e:370:7334]

So don’t do that! :-)

You can also use the <@ operator to see if an IP falls within a range:

db=> select '1.2.3.6'::inet <@ '[1.2.3.4,1.2.3.8]'::inetrange;
 ?column?

 t

Another nice thing about using a range is we can define an exclusion constraint to ensure that no two ranges overlap. That will guarantee that our lookups never return more than one row. We can include the exclusion constraint when we create the table:

CREATE TABLE geoips (
  ips inetrange NOT NULL,
  country_code TEXT NOT NULL,
  latitude REAL NOT NULL,
  longitude REAL NOT NULL,
  CONSTRAINT geoips_dont_overlap
    EXCLUDE USING gist (ips WITH &&)
    DEFERRABLE INITIALLY IMMEDIATE
);

That tells Postgres to forbid any new value x where x && ips is true, when checked against all old ips. It also automatically creates a GiST index for us—the same index we’ll use when we query the table.

Now we can load our data—and by the way if you use COPY or \copy with a CSV file, Postgres has no trouble with range literals:

"[1.0.0.0,1.0.0.255]","AU","-27.467940","153.028090"

But there is still one problem. Our GiST index isn’t so fast after all!:

db=> select * from geoips where '1.2.3.4'::inet <@ ips;
         ip          | country_code | latitude | longitude 
---------------------+--------------+--------------+-------
 [1.2.3.0,1.2.3.255] | AU           | -27.4679 |   153.028 
(1 row)

Time: 1.120 ms
db=> select * from geoips where '10.2.3.4'::inet <@ ips;
            ip             | country_code | latitude | longitude 
---------------------------+--------------+----------+-----------
 [10.0.0.0,10.255.255.255] | -            |        0 |         0 
(1 row)

Time: 54.669 ms
db=> select * from geoips where '100.2.3.4'::inet <@ ips;
           ip            | country_code | latitude | longitude 
-------------------------+--------------+----------+-----------
 [100.2.3.0,100.2.3.255] | US           |  40.8681 |  -73.4257 
(1 row)

Time: 322.719 ms

As you can see, it gets slower and slower for higher-numbered IPs. What is going on here?

The problem is that we didn’t implement a function to tell GiST indexes how “far apart” two inetranges are. That means we have to start all over! So drop the table, drop the type, and try again… . First, we’ll implement the diff function:

CREATE OR REPLACE FUNCTION inet_diff(x inet, y inet)
  RETURNS DOUBLE PRECISION AS
$$
DECLARE
BEGIN 
  RETURN x - y; 
END;
$$
LANGUAGE 'plpgsql' STRICT IMMUTABLE;

As you can see, all this does is subtract one inet from the other inet. Pretty simple!

Now create the inetrange again:

CREATE TYPE inetrange AS RANGE (
  subtype = inet,
  subtype_diff = inet_diff
);

Now create the table, load it, and try querying again:

db=> select * from geoips where '1.2.3.4'::inet <@ ips;
         ip          | country_code | latitude | longitude 
---------------------+--------------+--------------+-------
 [1.2.3.0,1.2.3.255] | AU           | -27.4679 |   153.028 
(1 row)

Time: 1.004 ms
db=> select * from geoips where '10.2.3.4'::inet <@ ips;
            ip             | country_code | latitude | longitude 
---------------------------+--------------+----------+-----------
 [10.0.0.0,10.255.255.255] | -            |        0 |         0 
(1 row)

Time: 0.678 ms
db=> select * from geoips where '100.2.3.4'::inet <@ ips;
           ip            | country_code | latitude | longitude 
-------------------------+--------------+----------+-----------
 [100.2.3.0,100.2.3.255] | US           |  40.8681 |  -73.4257 
(1 row)

Time: 0.618 ms

Those times are much better! So that is how you can use custom Postgres ranges to implement a fast geoip lookup table.

Now there is one last thing: technically if your range elements are discrete rather than continuous (as ours are), you should define a canonical function for the type, so that Postgres can resolve ambiguities between ranges with open and closed endpoints. For instance, [1,3) is the same thing as [1,2], but Postgres doesn’t know that unless we tell it.

Unfortunately, because the function depends on the type and the type depends on the function, there are complications. You have to first create a “shell type”, which requires superuser privileges, and also your canonical function must be written in C, because plpgsql functions can’t return shell types. And loading a C function requires superuser privileges too. To make this all a lot simpler, I created a Postgres extension for inetrange. That will create the inetrange type as described here, but also with a canonical function. So for real work, it is probably easiest to just use that! Or you could use the non-superuser commands above and skip the extension.

blog comments powered by Disqus Prev: When the Inner JSON Effect Works Next: Ledger with Autosync