Scaling Sidekiq

2017-04-12

Sidekiq is a great option for handling background jobs in Ruby projects. Here I’ll show you how to get the best utilization out of a box dedicated to running Sidekiq jobs. Whether you have one machine or ten, the goal is to work off as many jobs as possible from each machine.

CPU Utilization

To do that, we want to keep every core busy. You can monitor your CPU activity with a tool like vmstat(8). If you say vmstat 10, you’ll get a new row every ten seconds, like so:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0  92056 553620 174128 6268208    0    0     1    25 1323 1443 23 77  0  0  0
 8  0  92056 552628 174128 6268492    0    0    14    38 1307 1447 22 78  0  0  0
 8  0  92056 551760 174132 6268672    0    0     3    49 1317 1461 23 77  0  0  0
 8  0  92056 550544 174140 6268832    0    0     1    28 1353 1490 23 77  0  0  0
 8  0  92056 550296 174148 6268940    0    0     1    35 1240 1360 22 78  0  0  0

The last few columns show the percentage of CPU time spent on user work (us), system work (sy, which more or less means executing kernel system calls), sitting idle (id), and blocked on I/O (wa, “waiting”). Add the first two together to see how hard your CPU is working.

There is also st, which means “stolen time”! If you are running a VM, this is time your hypervisor gives to someone else. You will probably see all zeros in this column, so I’ll ignore it going forward. We want to get our idle and wait time as close to zero as possible, without drowning the machine in too much work. Above we have a machine that is doing a good job of keeping busy. You can see that its user and system time add up to 100, and the other columns are zero.

So how do we do this? Usually it means running lots of jobs at the same time, not one-after-another. Even a one-core machine can juggle many jobs. That’s because jobs typically have to block on I/O, for example when they talk to the network or save things to disk. While that job is waiting for an answer, your CPU can work on another one.

Threads

Sidekiq is great here because it supports multi-threading. Multiple threads let you do concurrent work in one Ruby process. Without threads you’d need a separate process for each concurrent job, and that can use up memory quickly, especially with something like Rails. It is always sad to have CPU left over that you can’t use because you’re out of RAM.

In practice, threads mostly help if you are using a concurrent Ruby implementation like JRuby. MRI Ruby has a Global Interpreter Lock (GIL), which prevents two threads from executing Ruby code at once. Even in MRI you will still see some benefit, though, because when one thread blocks on I/O, MRI can make progress on another. So despite the GIL, MRI can still keep at least one thread busy.

With Sidekiq, you can say how many threads to run with the concurrency setting. Normally you’d set this in your sidekiq.yml file. Note that each thread needs its own database connection! That means if your concurrency is 10, then in your database.yml you must have a pool of 10 also (or more). Otherwise the threads will halt each other waiting to check out a database connection, and what is the point of that? They might even get timeout errors.
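
As a rough sketch, the two settings might look something like this (the numbers are only illustrative; the point is that pool should be at least as big as concurrency):

# config/sidekiq.yml
:concurrency: 10

# config/database.yml (production section)
production:
  adapter: postgresql
  database: myapp_production
  pool: 10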

Processes

But wait, there’s more! We can push the concurrency up and up, and still see something like this from vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  92048 585840 174148 6269120    0    0     0    16  956  990  7 18 75  0  0
 1  0  92048 585212 174160 6269256    0    0     1    23 1012 1037  8 17 75  0  0
 1  0  92048 580456 174164 6269312    0    0     1    33  896  884  8 17 75  0  0

Why is the idle CPU time stuck at 75?

It turns out this is a four-core machine, and even with multiple threads, a single MRI process can only use one core, because of the GIL. That’s not what we want at all!

So the answer is to run more processes—at least one per core. You can see how many cores you’ve got by saying cat /proc/cpuinfo. If you are using something like god, it is easy to put several processes into a group and control them together, like so:

app_root = '/var/www/myapp/current'
4.times do |i|
  God.watch do |w|
    w.name     = "myapp-sidekiq-#{i}"
    w.group    = "myapp-sidekiq"
    w.log      = File.join(app_root, 'log', "#{w.name}.log")
    w.pid_file = File.join(app_root, 'tmp', 'pids', "#{w.name}.pid")
    w.start    = <<-EOS.gsub("\n", " ")
      cd '#{app_root}' &&
        bundle exec ./bin/sidekiq --environment production
                                  --pidfile '#{w.pid_file}'
                                  --logfile '#{w.log}'
                                  --daemon
    EOS
    # ...
  end
end

If you aren’t using god and don’t know how to do this with your own process manager, then I think Sidekiq Enterprise has a similar feature called Swarms.

Scheduling cores

That gets us almost there, but there is still a problem. You might still see idle time stuck somewhere, like 25 on a 4-core system. The problem is that while the kernel does its best to use your whole CPU, it can still wind up putting two Sidekiq processes on the same core, and since they are such long-lived processes, they are stuck that way, competing for their shared core while another one sits idle. The kernel doesn’t know ahead of time that they are going to run for days and keep so busy.

Fortunately we can still force each process onto its own core. For that we use the taskset(1) command. When you say taskset -c 2 date, you are telling Linux to run date on core 2. (Core numbers start from zero, as you can see in /proc/cpuinfo.) So our god config would become:

w.start    = <<-EOS.gsub("\n", " ")
  cd '#{app_root}' &&
    bundle exec taskset -c #{i} ./bin/sidekiq --environment production
                                              --pidfile '#{w.pid_file}'
                                              --logfile '#{w.log}'
                                              --daemon
EOS

After that, we’ll have one process on each core. At this point, you should start experimenting with your concurrency setting, to make each core fully utilized. The right setting there will depend on how much blocking a job does, but I have seen useful numbers up to 20. Just try some things out, and watch vmstat.

Note that you don’t have to increase the connection pool size in database.yml as you add cores (just concurrency). That’s because each process is a separate thing, each with its own pool. But you do have to increase the max connections on your database server. For instance with Postgres you’d want to set max_connections in postgresql.conf. Here you need to allow enough connections for concurrency times cores (times servers), plus some more for your actual Rails app serving web requests, plus some more for anything else you’ve got going on. That can be a lot of connections! Don’t be surprised if improving your job throughput exposes a bottleneck elsewhere in your system.
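
As a back-of-the-envelope sketch (all of these numbers are made up): two job servers with four cores each and a concurrency of 10 means 80 connections just for Sidekiq, so postgresql.conf might end up with something like:

# 2 servers x 4 cores x concurrency 10 = 80 connections for Sidekiq,
# plus (say) 30 for web processes and a few to spare:
max_connections = 120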

Conclusion

Tuning Sidekiq can be complicated, because of the double layers of threads plus processes. You need a confident understanding of how Ruby and your operating system handle concurrency, and it helps to use tools like vmstat to measure what’s going on and verify your understanding. If you have a box dedicated to just Sidekiq jobs, my recommendation is to run one process per core, using taskset to keep them separate, and then tune concurrency from there. Hopefully this will help with your own projects!

Postgres Permissions

2017-03-27

It’s taken me a long time to build up an understanding of the Postgres permissions system, so I thought I’d write it down to help others get up to speed faster.

First let’s talk about what things can be guarded by permissions. The big ones are databases and tables/views, but you can also set permissions on schemas, functions, sequences, types, languages, tablespaces, domains, foreign servers, foreign data wrappers, and large objects. (Phew!) There are also a few permissions that apply to the whole cluster. (In Postgres terminology, the “cluster” is the server process(es) keeping its files in /var/lib/postgresql/9.5/main and listening on 5432 (or whatever), and a “database” is a namespace in the cluster where you define tables. You can list the databases in your cluster with \l.) We’ll explore below what actions can be allowed or restricted for each of these subjects. Collectively, let’s call permissionable things “topics”.

(As of Postgres 9.5 there is also Row Level Security (RLS). That is complicated enough to be a separate topic, so here I want to cover just the traditional permissions system.)

Second let’s ask who has permissions. In Postgres permissions come first from your user, and second from any groups your user is a part of. Collectively, users and groups are called “roles”. Or to be more accurate, before Postgres 8.1 users and groups were separate things, but now everything is just a role. A role can even be both a “group” and a “user” at the same time: it can have members but also serve as a direct login.

Permission Mechanics

The key to understanding Postgres is that the stored permission between any role and any topic is only ever a 0 or a 1. When Postgres does a permissions check, it takes all your roles’ permissions on the topic, and if there are any with a 1, you’re in. In other words your effective permission is the OR of all your roles’ permissions. (There is one exception to this, explained below, but it rarely matters.)

You can see what groups each role belongs to with \du (list users). You can add someone to a role with GRANT r TO u, or remove them with REVOKE r FROM u, where r is the group and u is the user. (I am using “group” and “user” loosely here for clarity, but remember that they are all roles.)
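
For example, with the admin group and a user named joe (the same names I’ll use below):

GRANT admin TO joe;     -- joe is now a member of admin
REVOKE admin FROM joe;  -- and now he is not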

So some consequences of this: If you REVOKE a permission from a role r1, it loses its direct 1, but it might still get the permission indirectly from other roles. The REVOKE isn’t remembered in any way. It just sets the 1 to a 0 for r1. It doesn’t override other possible sources of a 1.

Another consequence is that commands like GRANT SELECT ON ALL TABLES have an immediate effect of setting a lot of 1s, but they are not remembered either. So if you add some more tables, they will not automatically share the same grants as the old tables.

The complement of GRANT ON ALL is ALTER DEFAULT PRIVILEGES. This lets you say what the 1s and 0s should be for future objects. It does not change any existing objects; it just says how to set up new ones. You can combine GRANT ON ALL and ALTER DEFAULT to get 100% coverage: first for the old objects, then for the new. Actually it would probably be less racy to do the DEFAULT first, in case someone else is making more objects while you type.
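
So a typical recipe for giving a (hypothetical) reporting role read access to every table, future and present, would be something like:

-- future tables first (less racy), then the existing ones:
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO reporting;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting;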

The final trick in all this is a magic role named PUBLIC. Every other role is a member of the PUBLIC role, so if it has any 1s, those 1s apply to everyone. Note there is no way to cancel a 1 from the PUBLIC role, because of what we said above about how permission checks do an OR. You might REVOKE something from a troublemaker and give them a 0, but if the PUBLIC role has a 1, they can do the thing anyway. Of course you can REVOKE a permission from the PUBLIC role itself, but then you have to individually GRANT the permission to all the roles that should have it.

One wrinkle here is that a role can have either the INHERIT or NOINHERIT attribute. If it has NOINHERIT, then it does not benefit from permissions of roles it belongs to. In other words, its own direct 1s are the only ones it’s got (ha!). In practice this is not a significant block though, because if you belong to an admin role but have NOINHERIT, you can still SET ROLE admin and act with the permissions of that role. So NOINHERIT should not be used as a way of restricting what people can do. Also note that even with NOINHERIT, you still inherit permissions from the PUBLIC role. In general I don’t see much point to NOINHERIT, except for the psychological benefit of making people more explicitly prepare before running dangerous commands.

So you have the PUBLIC role, and you have ALTER DEFAULT PRIVILEGES, and both are a kind of default, but in different ways. It is important to keep these separate! In fact I’m pretty sure I’ve seen places in the docs where PUBLIC permissions are referred to as “defaults”. If you are working with either of these, watch out that you are clear on how they differ!

Cluster Permissions

Okay, so those are the mechanics. Now let’s look at some specific permissions. At the top level you have cluster-wide permissions. These are stored as attributes on your role. They are:

LOGIN: This means you can connect to the cluster. You still must in addition have permission to CONNECT to an individual database (covered below), but if your role doesn’t have this basic LOGIN attribute, you can’t get in at all.

SUPERUSER: This means you can do anything. You’ve got root. Your other permissions don’t matter, because when Postgres sees you have this, it doesn’t ask any further questions.

CREATEDB: You can create more databases.

CREATEROLE: You can create more roles.

REPLICATION: You can start streaming replication. Normally you’d only give this to a special role used by slaves to connect to the master database.

You can set these permissions with CREATE ROLE or change them with ALTER ROLE. You can see them with \du.
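
For example (with a made-up role name):

CREATE ROLE deploy LOGIN CREATEDB PASSWORD 'secret';
ALTER ROLE deploy NOCREATEDB;   -- take CREATEDB away again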

Database Permissions

The next level is the database. If you type \l you can see the databases in your cluster, and the last column shows the permissions on each. For example:

perms=# \l
                                                   List of databases
            Name            |     Owner      | Encoding |   Collate   |    Ctype    |         Access privileges         
----------------------------+----------------+----------+-------------+-------------+-----------------------------------
 perms                      | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres                     +
                            |                |          |             |             | postgres=CTc/postgres            +
                            |                |          |             |             | admin=C/postgres
 postgres                   | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0                  | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres                      +
                            |                |          |             |             | postgres=CTc/postgres
 template1                  | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres                      +
                            |                |          |             |             | postgres=CTc/postgres            +

For each database, you’ll see several lines of permissions. Each line shows a role, an equals sign, and then the list of permissions for that role. Sometimes a line is missing the role and starts with just an equals sign; this is for the PUBLIC role. It shows the permissions that everyone gets for free. Whatever you see here is allowed for anyone.

There is also a forward slash followed by another role name. That shows who granted those permissions.

So in the above example you can see that for the perms database, the PUBLIC role has Tc permissions, and admin has C, granted by postgres.

Each permission is shown by a single letter. They mean:

T: Create temporary tables (tables private to your session, dropped automatically when the session ends).

c: CONNECT to the database. It has always worried me that by default databases let anyone connect. In theory this is okay since you can’t SELECT or change anything, but you can still list the tables (\dt), list their columns (\d tablename), etc. More precisely, connecting lets people query the tables in pg_catalog and information_schema. They can see the names of all the other database users, all the other databases, and quite a bit more. Personally I feel this is a significant shortcoming in Postgres’s system.

You can revoke access to these system tables, but at the cost of breaking important things. For instance pgadmin fails if it can’t list the databases. Another idea is to use RLS on the system tables, but that is not supported.

Of course, if you can connect to any database, you can do these things. I’m not aware of any information you can get about database foo by connecting there that you can’t also get while connected to database bar. So you could argue that granting everyone CONNECT is not a problem! I don’t have a good answer to that, even though it feels suspicious. I think the moral, then, is that all database roles require more trust than you might expect.

C: This allows you to CREATE a new schema within the database. As the owner of that schema, you’ll be able to do lots of things with it, like create tables, etc.

Finally, individual letters may be followed by an asterisk (*). That means that the user not only has the permission, but may also GRANT the permission to others. The object owner may always grant permission, so Postgres doesn’t bother to show asterisks for them.
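
These letters all correspond to ordinary GRANT/REVOKE commands. For instance, on the perms database from above you could say:

REVOKE CONNECT ON DATABASE perms FROM PUBLIC;        -- no more connecting for free
GRANT CONNECT, TEMPORARY ON DATABASE perms TO joe;
GRANT CREATE ON DATABASE perms TO admin;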

Table/View Permissions

For permission purposes, Postgres treats tables and views as the same thing. So ALL TABLES will include views, too.

You can list the permissions for tables and views (and sequences too) using \dp or \z (they do the same thing):

perms=# \dp
                                       Access privileges
 Schema |      Name      |   Type   |     Access privileges     | Column privileges | Policies 
--------+----------------+----------+---------------------------+-------------------+----------
 public | sales          | table    | postgres=arwdDxt/postgres+|                   | 
        |                |          | admin=arwdDxt/postgres    |                   | 
 public | sales_id_seq   | sequence |                           |                   | 
 public | widgets        | table    | postgres=arwdDxt/postgres+|                   | 
        |                |          | admin=arwdDxt/postgres   +|                   | 
        |                |          | joe=r/postgres            |                   | 
 public | widgets_id_seq | sequence |                           |                   | 

What you see here is a lot like the output from \l. A blank means you have never made any permissions changes for this object, so its permissions still match the default. (When I say “the default”, I don’t mean the defaults you’ve established with ALTER DEFAULT PRIVILEGES, but rather the default defaults that Postgres ships with.) Note that once you change something, even if you change it back, the column will never go back to blank again.

Of course now there are more letters than before. That’s because tables have lots of permissions:

SELECT (r): You can read from the table.

UPDATE (w): You can write to the table.

INSERT (a): You can append to the table.

DELETE (d): You can delete from the table.

TRUNCATE (D): You can really delete from the table.

REFERENCES (x): This lets you create foreign key constraints elsewhere that reference this table. You need the permission on both tables. Also, you can grant this permission for just specific columns. (I confess I don't really get this permission. What is the use of forbidding foreign keys? I mean I get why not anyone can create a foreign key, but what is the use of making it a separate permission from other DDL commands? Besides, if a foreign key makes sense, why not just define it yourself?)

In the first version of this post, I questioned the usefulness of this permission, but the folks from the Postgres mailing list were really helpful in making it click for me. As Peter Holzer explains, only a table’s owner may add constraints to it (or run ALTER TABLE at all). But suppose joe owns the table joes_widgets, and fred owns the table freds_frozzles. Joe wants to add a frozzle_id column pointing to Fred’s table. He would need the REFERENCES privilege on Fred’s table to do it.

And why can’t Postgres just check Joe’s SELECT privilege instead? It’s because making a foreign key does more than just read from the table. Joe’s foreign key could prevent Fred from changing/deleting his own records, as well as costing Fred time checking the constraint. I don’t know how often this permission will come into play for me, but I’m happy to figure out why it’s there!

TRIGGER (t): You can add a trigger to the table.
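
Again, these letters come from ordinary GRANTs, for example:

GRANT SELECT, INSERT, UPDATE ON sales TO joe;   -- shows up in \dp as joe=arw/postgres
GRANT ALL ON widgets TO admin;                  -- shows up as admin=arwdDxt/postgres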

It is also useful to see the default privileges for each table (from ALTER DEFAULT PRIVILEGES):

perms=# \ddp
         Default access privileges
 Owner | Schema | Type | Access privileges 
-------+--------+------+-------------------
(0 rows)

perms=# ALTER DEFAULT PRIVILEGES GRANT SELECT ON TABLES TO joe;
ALTER DEFAULT PRIVILEGES
perms=# \ddp
               Default access privileges
  Owner   | Schema | Type  |     Access privileges     
----------+--------+-------+---------------------------
 postgres |        | table | postgres=arwdDxt/postgres+
          |        |       | joe=r/postgres
(1 row)

perms=# create table inventory (id serial, widget_id integer);
CREATE TABLE
perms=# \dp
                                        Access privileges
 Schema |       Name       |   Type   |     Access privileges     | Column privileges | Policies 
--------+------------------+----------+---------------------------+-------------------+----------
 public | inventory        | table    | postgres=arwdDxt/postgres+|                   | 
        |                  |          | joe=r/postgres            |                   | 
 ...

Here you can see that new tables will start out with whatever permissions we have prepared as the defaults.

Note that you can also say ALTER DEFAULT PRIVILEGES FOR ROLE r GRANT .... Normally when you alter default privileges, it only affects (future) objects you own. If you want to change the privileges for future objects created by someone else, you can use FOR ROLE. Naturally you can only do that if you belong to that role.

With all these rules, it is natural to want to see the final calculated rights of each user in the system. Unfortunately Postgres doesn’t have a built-in way to show that. You can say SELECT * FROM information_schema.role_table_grants WHERE table_schema = 'public', but that does not compute inherited permissions.

Fortunately there is a function called has_table_privilege you can use to get most of the way there. It does consider role memberships: for instance if joe belongs to the admin role and has INHERIT, then has_table_privilege('joe', 'public.sales', 'INSERT') will return true. Change joe to NOINHERIT and it will return false. I do wish there were a way to run the computation as if all roles were INHERIT, but this is close enough for me! Others have put this together in a nice query to get the permissions for any table you like. Here is a version I like:

SELECT  t.tablename,
        u.usename,
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'select') AS select,
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'insert') AS insert,
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'update') AS update,
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'delete') AS delete, 
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'truncate') AS truncate, 
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'references') AS references,
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'trigger') AS trigger
FROM    pg_tables t,
        pg_user u 
WHERE   t.tablename = 'inventory';

Of course you could just as well filter on user to see one person’s permissions for all the tables. If you do a lot of permissions work, it might even be worth wrapping this up into a view so you can filter on whatever you like.

Incidentally, there are lots of other has_whatever_privilege functions to ask questions not covered by psql \ commands.

Sequence Permissions

If you squint, a sequence is pretty much just a tiny table: you can SELECT from it (call currval), or UPDATE it (call setval), or do both at once (call nextval). So sequences appear in \dp alongside the tables and views. But unlike tables and views, they also have the USAGE permission, which allows currval and nextval.

The USAGE permission is not just an abbreviation for SELECT plus UPDATE; it has its own bit. You can see it in \dp as U (vs. r and w). Yet if you have it, you don’t need r or w to use the sequence. Granting only USAGE would let someone use the sequence normally but not call setval, which might be useful.
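
For example, using the widgets_id_seq from above:

GRANT USAGE ON SEQUENCE widgets_id_seq TO joe;   -- nextval and currval, but not setval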

I can’t think of any reason to give someone just SELECT on a sequence. Even currval won’t work, because currval only tells you the last time you called nextval. You will get an error like ERROR: currval of sequence "widgets_id_seq" is not yet defined in this session.

Function Permissions

Functions have just an EXECUTE permission. New functions grant EXECUTE to PUBLIC by default, so be careful if you are creating something sensitive. If you like, you can change this with ALTER DEFAULT PRIVILEGES.
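
For instance, to lock down a sensitive function f (taking no arguments) so that only joe can run it:

REVOKE EXECUTE ON FUNCTION f() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION f() TO joe;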

You can see functions in psql with \df or \df+, but neither will show the permissions. Instead you should use a query like

SELECT proacl FROM pg_proc WHERE proname = 'f';

The format of proacl is like anything else:

perms=# select proacl from pg_proc where proname = 'f';
                proacl                 
---------------------------------------
 {postgres=X/postgres,joe=X/postgres}

Here we have already revoked execute from PUBLIC, then granted it to joe.

Note that usually functions execute with the permissions of the invoker, but you can create “setuid” functions that execute with the permissions of the definer. These can be very useful, but they have similar dangers to Unix setuid programs, so you have to be careful with them.
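
In Postgres you declare these with SECURITY DEFINER. A minimal sketch (the table and function names here are made up):

-- Callers can run this without having SELECT on audit_log themselves:
CREATE FUNCTION audit_log_count() RETURNS bigint AS $$
  SELECT count(*) FROM audit_log;
$$ LANGUAGE sql SECURITY DEFINER;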

Schema Permissions

Schemas have just CREATE and USAGE permissions. CREATE lets you add more stuff to the schema (tables, functions, sequences, etc.). USAGE is a prerequisite to using objects within the schema. For instance to SELECT from a table, you must first have USAGE permission on its schema. (You must also have SELECT permission on the table.) Note that even without USAGE, you can still list the schema contents, and even query for its tables’ definitions etc.

To be honest I find the schema USAGE permission kind of confusing and superfluous, and I don’t think I’m the only one. I can’t think of any security setup you could build with this permission that you couldn’t build with permissions on the individual tables, functions, etc., and because you can say GRANT ... ON ALL whatevers IN SCHEMA foo, it is not even necessary as an abbreviation. I guess if you really want to lock people out, you might as well use it as an extra layer though.
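
For completeness, here is what granting schema access looks like (with a made-up reporting schema):

GRANT USAGE ON SCHEMA reporting TO joe;                  -- lets joe use objects in the schema
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO joe;   -- still needed for the tables themselves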

Conclusion

There are permissions for more specialized database objects too, like types and foreign data wrappers. But if you have made it this far, you should have no problem understanding the documentation for those. You might also want to read about RLS to get really fine-grained access control. Here I’ve only tried to cover the basics, and give you an accurate mental model of how Postgres decides who can do what. If you have any questions, comments, or corrections, I’m happy to listen!

Where to put Postgres json_each?

2016-09-07

Postgres has several “set-returning” functions that essentially return more than one row. Probably the most well-known is generate_series, used to get a list of numbers/dates/etc. Another very handy one is json_each. You can use it to break up a JSON object into one row for each key/value pair. But that one always trips me up, because unlike generate_series it (1) returns more than one column per row and (2) usually takes a column as input.

Suppose you have a users table with a preferences column, like so:

t=# select * from users;
 id |                        preferences                         
----+------------------------------------------------------------
  1 | {"lang": "en", "time_zone": "America/Los_Angeles"}
  2 | {"lang": "es", "time_zone": "America/Cayman", "volume": 5}

If you do this you get tuples:

t=# select json_each(preferences) from users;
               json_each               
---------------------------------------
 (lang,"""en""")
 (time_zone,"""America/Los_Angeles""")
 (lang,"""es""")
 (time_zone,"""America/Cayman""")
 (volume,5)

It would be nicer to get separate key and value columns. That works great if json_each appears in the FROM clause:

t=# select * from json_each('{"lang": "en", "time_zone": "America/Los_Angeles"}');
    key    |         value         
-----------+-----------------------
 lang      | "en"
 time_zone | "America/Los_Angeles"

But if you replace the literal with a column, it breaks:

t=# select key, value from json_each(users.preferences);
ERROR:  missing FROM-clause entry for table "users"
LINE 1: select key, value from json_each(users.preferences);

This doesn’t work either:

t=# select key, value from (select json_each(preferences) from users) x(key, value);
ERROR:  table "x" has 1 columns available but 2 columns specified

So what to do? Well here is something that works, although I find it hard to remember:

t=# select id, (json_each(preferences)).* from users;
 id |    key    |         value         
----+-----------+-----------------------
  1 | lang      | "en"
  1 | time_zone | "America/Los_Angeles"
  2 | lang      | "es"
  2 | time_zone | "America/Cayman"
  2 | volume    | 5

Basically you are decomposing the tuple. You can pick your columns too:

t=# select id, (json_each(preferences)).key, (json_each(preferences)).value from users;
 id |    key    |         value         
----+-----------+-----------------------
  1 | lang      | "en"
  1 | time_zone | "America/Los_Angeles"
  2 | lang      | "es"
  2 | time_zone | "America/Cayman"
  2 | volume    | 5

Possibly nicer is this trick: a set-returning function in the FROM clause gives you an implicit lateral join:

t=# select id, key, value from users, json_each(preferences);
 id |    key    |         value         
----+-----------+-----------------------
  1 | lang      | "en"
  1 | time_zone | "America/Los_Angeles"
  2 | lang      | "es"
  2 | time_zone | "America/Cayman"
  2 | volume    | 5
(5 rows)
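
If you like being explicit, the same thing can be spelled with CROSS JOIN LATERAL, and it returns the same five rows:

t=# select id, key, value from users cross join lateral json_each(preferences);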

So if you’re struggling to use json_each with a non-literal input, there you go!

When the Inner JSON Effect Works

2016-07-29

Hacker News has a fun discussion today about an article on The Daily WTF called The Inner JSON Effect. It’s about how a lead-programmer mad genius built an entire web application on top of svn, not by using svn to track history, but by using a separate revision number for each individual function, and having each class contain nothing but a list of the revision numbers of each function it wants as a method. It is a hilarious example of what not to do. Obviously it’s fun to laugh at something as crazy as JDSL, but it’s even more interesting to try linking it to things we might find more normal. So a few provocative thoughts:

  • jsonapi? I work on a lot of Ember projects, where jsonapi is the blessed format for frontend/backend communication. It seems to me awfully verbose without adding a lot of new information. Just as the WTF story adds extra indirection where you could just define functions, so jsonapi makes you list out the attributes and references—even though in regular JSON they are right there. It feels to me like the worst of the old XML excesses. I admit this is a bit trollish, but I’ve yet to understand what I get from jsonapi. Maybe someone can explain it to me.

  • Ansible? I’m a Chef guy and haven’t tried ansible, and I know a lot of developers-with-sysadmin-chops love it for its simplicity. Personally I’m fearful of using something so restrictive. I have similar doubts about Puppet, although at least there I know there are good “escape hatches”. I want to be able to throw in an if statement when I need to. Here I feel my sentiment is wrong though, since these are such successful projects. So I’m curious if anyone can articulate principles that make Ansible’s YAML “work” but not TDWTF webapp’s JSON.

And going the other direction, I think sometimes—not very often—you can get huge wins out of creatively building on top of an existing tool. But it is always a risk and requires a lot of good taste. A few examples:

  • I built a product with user-customizable questionnaires, and we knew up front the logic for branching/skipping/computations was going to be super complicated. I could tell it was heading toward what I call a “Turing-Complete Feature”. (Enterprise Workflow Engine anyone?) Instead of trying to model and store all the rules in a database, or writing a new language (you can see the looming Inner Platform Effect), we basically made each questionnaire a server-side underscore.js template, with a simple Javascript API to access information about the current context. The feature was done quickly and we had great flexibility building the actual surveys. Our surveys were built by an internal team of not-quite-programmers, so the developers could work on other things. We even had the potential of letting customers write their own surveys, since Javascript is the lingua franca of modern programming, with good security isolation. (After all your browser uses it to run untrusted code!) Many customers told us our advanced survey functionality is what set us apart from the competition.

  • Another project also has user-customizable forms, with something close to an EAV pattern in the database. We wanted to offer customers reporting capabilities, so we started talking about building an API. It would be for the existing “power users”, who were doing reports and analysis using R, SAS, Tableau, and similar tools. But the API was going to be high effort, slow, and hard to join data. So we decided our API would be . . . SQL! Of course you can’t share your real tables, for security but also because if those become a public API they can never change. Plus the EAV format made them too abstract to use easily. So instead we gave each customer a private database with one table for each of their forms. This lets them ask questions in a very direct way, using tools they already know, with great performance and flexibility. (Okay I am using the past tense but this one is not actually turned on for external customers yet. Still, using a derived, flattened database for SQL reporting is not so strange, even if young CRUD developers haven’t seen it before.)

  • Even the idea of building on top of a version control system is appealing. (Maybe not svn though! :-) I would love a self-hosted password manager with a Firefox plugin, where my passwords are stored (encrypted) in a git repository I keep on my own server. Not only would using git save the author time, but it would be like using a “standard protocol” rather than something proprietary. Same thing for a self-hosted bookmarking service.

I think all three of these “good” examples are a little audacious, but in my opinion they all work well. So are there any general guidelines that can help us make better decisions about when to make an unexpected swerve? So much of programming comes down to taste. That leads us into pointless vi-vs-emacs wars, but it also sets good programming apart from bad. How can we make that taste teachable? Or should it be entrusted only to the old and wise? Are there guidelines we can identify, so that we can discuss them, evaluate them, and make decisions about them? (Apparently one rule is don’t write your own TCP/IP stack. :-)

One principle of my own is a “risk budget”: you have to take risks—they are a fertile source of advantage—but you should have a list of them and not take too many. Risks can include: tight deadlines, unreasonable clients/management, new clients/management, unclear requirements, new tech, new-to-me tech, high scalability requirements, etc. One risk is “cutting out a layer” like in the Javascript survey API and the SQL reporting interface. I’m not saying it’s always a good idea, just that it can be. Building your own filesystem on top of NFS is another example of cutting out a layer. I can’t speak to how well that one worked out, but it’s an interesting idea.

What are some other guidelines people use? I’m especially interested in guidelines about “when to leave CRUD behind”, or when to make a choice that seems strange at first glance but pays off. If programming were chess, when do you sac a piece?

Postgres custom range types for Geoip

2016-07-05

I had to add a geoip lookup table to one of my projects recently, and it was a perfect application for the new(ish) range types in Postgres. A range is a column type with a start and end point. The endpoints can be “closed” (includes the endpoint) or “open” (includes everything up to the endpoint, but not the endpoint itself): for instance for integers [1,3] is 1, 2, and 3, but [1,3) is just 1 and 2. Postgres comes with a few built-in range types for numbers and timestamps, but you can also define your own. Since I wanted to use the Postgres inet datatype, that’s what I had to do.

My goal was to have a table like this:

db=> \d geoips
        Table "public.geoips"
    Column    |   Type    | Modifiers
--------------+-----------+-----------
 ips          | inetrange | not null
 country_code | text      | not null
 latitude     | real      | not null
 longitude    | real      | not null

Also I wanted to get fast lookups based on which row contained a given IP. I would have about 4 million rows, so I would need an effective index. In fact part of the reason for using a range type is that ranges support GiST indexes, where you can get fast lookups based on both the start and end values. A B-Tree index for something like ip BETWEEN start_ip AND end_ip can quickly get you to start_ip, but then it needs to scan forward for the end_ip check. So I expected a GiST index would do better.

Defining the range type can be as simple as this:

CREATE TYPE inetrange AS RANGE (
  subtype = inet
);

Then we can start using them:

db=> select '[1.2.3.4,1.2.3.8]'::inetrange;
     inetrange     
-------------------
 [1.2.3.4,1.2.3.8]

You also get a free constructor function:

db=> select inetrange('1.2.3.4', '1.2.3.8', '[]');
     inetrange     
-------------------
 [1.2.3.4,1.2.3.8]

And by the way, since the inet type supports IPv6 addresses, you can make ranges with those too. The only caveat is I haven’t found a way to prevent you from mixing them:

db=> select inetrange('1.2.3.4', '2001:0db8:0000:0042:0000:8a2e:0370:7334', '[]');
                inetrange                
-----------------------------------------
 [1.2.3.4,2001:db8:0:42:0:8a2e:370:7334]

So don’t do that! :-)

You can also use the <@ operator to see if an IP falls within a range:

db=> select '1.2.3.6'::inet <@ '[1.2.3.4,1.2.3.8]'::inetrange;
 ?column? 
----------
 t

Another nice thing about using a range is we can define an exclusion constraint to ensure that no two ranges overlap. That will guarantee that our lookups never return more than one row. We can include the exclusion constraint when we create the table:

CREATE TABLE geoips (
  ips inetrange NOT NULL,
  country_code TEXT NOT NULL,
  latitude REAL NOT NULL,
  longitude REAL NOT NULL,
  CONSTRAINT geoips_dont_overlap
    EXCLUDE USING gist (ips WITH &&)
    DEFERRABLE INITIALLY IMMEDIATE
);

That tells Postgres to forbid any new value x where x && ips is true, when checked against all old ips. It also automatically creates a GiST index for us—the same index we’ll use when we query the table.

Now we can load our data—and by the way if you use COPY or \copy with a CSV file, Postgres has no trouble with range literals:

"[1.0.0.0,1.0.0.255]","AU","-27.467940","153.028090"

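The load itself can be as simple as this (assuming the data is in a file named geoip.csv, with columns in the same order as the table):

db=> \copy geoips from 'geoip.csv' with (format csv)
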
But there is still one problem. Our GiST index isn’t so fast after all:

db=> select * from geoips where '1.2.3.4'::inet <@ ips;
         ip          | country_code | latitude | longitude 
---------------------+--------------+----------+-----------
 [1.2.3.0,1.2.3.255] | AU           | -27.4679 |   153.028 
(1 row)

Time: 1.120 ms
db=> select * from geoips where '10.2.3.4'::inet <@ ips;
            ip             | country_code | latitude | longitude 
---------------------------+--------------+----------+-----------
 [10.0.0.0,10.255.255.255] | -            |        0 |         0 
(1 row)

Time: 54.669 ms
db=> select * from geoips where '100.2.3.4'::inet <@ ips;
           ip            | country_code | latitude | longitude 
-------------------------+--------------+----------+-----------
 [100.2.3.0,100.2.3.255] | US           |  40.8681 |  -73.4257 
(1 row)

Time: 322.719 ms

As you can see, it gets slower and slower for higher-numbered IPs. What is going on here?

The problem is that we didn’t implement a function to tell GiST indexes how “far apart” two inetranges are. That means we have to start all over! So drop the table, drop the type, and try again. First, we’ll implement the diff function:

CREATE OR REPLACE FUNCTION inet_diff(x inet, y inet)
  RETURNS DOUBLE PRECISION AS
$$
DECLARE
BEGIN 
  RETURN x - y; 
END;
$$
LANGUAGE 'plpgsql' STRICT IMMUTABLE;

As you can see, all this does is subtract one inet from the other inet. Pretty simple!

Now create the inetrange again:

CREATE TYPE inetrange AS RANGE (
  subtype = inet,
  subtype_diff = inet_diff
);

Now create the table, load it, and try querying again:

db=> select * from geoips where '1.2.3.4'::inet <@ ips;
         ip          | country_code | latitude | longitude 
---------------------+--------------+----------+-----------
 [1.2.3.0,1.2.3.255] | AU           | -27.4679 |   153.028 
(1 row)

Time: 1.004 ms
db=> select * from geoips where '10.2.3.4'::inet <@ ips;
            ip             | country_code | latitude | longitude 
---------------------------+--------------+----------+-----------
 [10.0.0.0,10.255.255.255] | -            |        0 |         0 
(1 row)

Time: 0.678 ms
db=> select * from geoips where '100.2.3.4'::inet <@ ips;
           ip            | country_code | latitude | longitude 
-------------------------+--------------+----------+-----------
 [100.2.3.0,100.2.3.255] | US           |  40.8681 |  -73.4257 
(1 row)

Time: 0.618 ms

Those times are much better! So that is how you can use custom Postgres ranges to implement a fast geoip lookup table.

Now there is one last thing: technically if your range elements are discrete rather than continuous (as ours are), you should define a canonical function for the type, so that Postgres can resolve ambiguities between ranges with open and closed endpoints. For instance, [1,3) is the same thing as [1,2], but Postgres doesn’t know that unless we tell it.

Unfortunately, because the function depends on the type and the type depends on the function, there are complications. You have to first create a “shell type”, which requires superuser privileges, and also your canonical function must be written in C, because plpgsql functions can’t return shell types. And loading a C function requires superuser privileges too. To make this all a lot simpler, I created a Postgres extension for inetrange. That will create the inetrange type as described here, but also with a canonical function. So for real work, it is probably easiest to just use that! Or you could use the non-superuser commands above and skip the extension.

Ledger with Autosync

2015-08-01

A while back I started tracking my finances with ledger-cli, a command-line tool for double-entry bookkeeping. It’s been a lot of fun, but one of the stumbling blocks is recording each transaction by hand. Fortunately almost every bank nowadays offers an API so you can download your recent account activity in a format like OFX (Open Financial Exchange) or QFX (Quicken’s extended OFX), and ledger has tools to import that data.

For this I use ledger-autosync, which is built on top of ofxclient. First you run ofxclient to create a ~/ofxclient.ini file with the details of all your accounts. Then you run ledger-autosync, which will first read your ledger file (whatever $LEDGER_FILE points to), then contact the banks and print out any new transactions. It tries to guess expense categories, but often you need to correct these. Still, that’s a lot easier than typing in everything!

Because I need to review the guesses anyway, I like to have ledger-autosync append to a review.ldg file, which I periodically edit to correct expense categories, then cut-and-paste into my regular $LEDGER_FILE. I also do some extra tweaking to format the entries how I like them. Here is the command I use:

ledger-autosync \
  -l ~/share/accounting/in-progress.ldg \
  -i 2 \
  | sed 's,^\([0-9]\{4\}\)/\([0-9]\{2\}\)/\([0-9]\{2\}\),\1-\2-\3,' \
  >> ~/share/accounting/review.ldg

That uses two spaces for indents (-i 2) and rewrites the dates.

Also note how I override $LEDGER_FILE with the -l option. That’s a trick to keep ledger-autosync from entering the same transactions twice into review.ldg. The contents of in-progress.ldg are just this:

# This file exists so that ledger-autosync will read review.ldg
# and not download entries it's already pulled down
# but we haven't yet reviewed.

include pj.ldg
include review.ldg

And that’s it! This setup makes it pretty painless to keep track of everything. I don’t have to do a lot of typing, and nothing slips through the cracks. I actually kind of like that categorizing forces me to look at my expenses every week or so. If you are thinking about using ledger, hopefully this helps you out!
