What to Learn

2017-08-02

I once heard a tech speaker say that in programming her job description was “learn new things,” and I’m happy to steal that way of putting it. In seventeen years of professional work I’ve never done a project that didn’t require me to learn something new on-the-job. It’s what I love about programming. But how do you decide what to learn?

Some things you learn because your project demands it, and those lessons are small and focused (hopefully most of the time): a new library here, some protocol detail there, today something about Linux, tomorrow something about Docker. This is stuff you do on the job, practically every day.

That’s not what I’m talking about here, but I’ll offer some advice in passing: when you learn a new tidbit, try to write it down. You don’t have to spend time polishing it, and if it helps then write it somewhere private. But write it down. I use personal man pages for this—and I’m not very good at it myself. If you’re bad at it too, then at the very least spend an extra 20 minutes making sure you actually understand, and come up with a few experiments to test that your take is correct. Try to put your understanding into words, at least in your own head.

But instead of that on-the-spot learning, I want to talk about things we learn that take more time and have a more long-term payoff. We typically do this off the job, without pay. I think most programmers love learning (or they would soon find a different job), so we can’t help ourselves. But also it pays to keep your skills current and sharp. Every professional has to do this. My favorite book about professional services work (which despite the title is about way more than managing) talks about developing your “asset”—you—by continuously learning. Lawyers, architects, accountants, doctors—all have to keep learning. Car mechanics too. With programmers the pace is different, but I expect the world doesn’t exactly stand still for anyone else either.

There is so much to learn! And the hype is everywhere, stealing your attention and diffusing your time. Prototype, jQuery, Backbone, Knockout, Angular, Ember, React, Vue, … Less, Sass, Uglify, Asset Pipeline, Npm, Bower, Babel, Gulp, Grunt, Ember-cli, Webpack, … Oracle, MySQL, Postgres, Memcached, Cassandra, Mongo, CouchDB, Redis, Riak, DynamoDB, … Aaah! You can’t learn it all, so you have to be deliberate.

I often hear advice to learn one new language a year, two new languages a year, whatever. The best versions of this advice say to learn a new “kind” of language, like Lisp or Haskell or Prolog. For a new programmer, that’s pretty good advice, and I still follow it myself. (For me the last few years it’s been Haskell, Rust, and Elixir.) But for several years I’ve tried to adopt a more strategic approach. One of the problems with learning another language is that either it’s something you won’t actually use, or you mostly leave behind the old one, so it’s like starting over from zero. (Not really, but a little bit.) After a couple dozen you start to wonder how to make the investment more worthwhile. Is there a way to make our learning build on itself, so we aren’t throwing away so much time? Here is my own approach to having some “continuity” in what I learn:

First, realize that there are so many more categories besides language! There are operating systems, cloud environments, back-end frameworks, front-end frameworks, databases, build tools, deployment tools, networking protocols, specialties like GIS or machine learning, industries like e-commerce or finance or health care, “soft” skills like writing, requirements gathering, design, management, financial planning, sales. Don’t get stuck in a rut of thinking in only one dimension.

Second, don’t be too focused. Go ahead and mix in some “useless” learning. I’ve had fun lately reading about the history of transistors, the integrated circuit, and the Internet. Or to take it to an extreme, you could learn some Greek or Latin or Chinese. :-) Whatever you like. One of those books (The Chip) actually talks about how Jack Kilby, the co-inventor of the IC, would read several newspapers and a bunch of magazines every day, plus every new patent granted by the government. Maybe that’s extreme, but it’s good to have some breadth, because you never know what will come in handy or inspire you. But more than that, recreation is important. Read some trashy science fiction or something.

But when you are being deliberate, I think there are three good alternatives to “learn another language”. The first is to learn something that complements your current skills. Suppose you are (or want to be) a “full-stack web developer.” Okay, learn some Rails, pick one Javascript framework and learn it, but then also learn some advanced Postgres, learn some details about HTTP or SSL or CORS, learn Wireshark and IP, learn HTML Canvas, learn a configuration management tool like Chef or Ansible. I think Chef is a great complement to Rails (or Ansible to Django). My own current “expansion of territory” is down the stack, learning some Rust and reading The Linux Programming Interface. Learn the things that border on what you do, so you are gradually expanding. This is “breadth”, but in a calculated, not desultory, way. You’ll probably put those skills to use right away, so they’ll sink in and make you better at your job.

Second is to dive deep somewhere. I’ve really enjoyed getting to know Postgres. I’ve been hired to scale it, to replicate it, to write C extensions for it. Maybe for you it is React or Datomic or AWS or reverse engineering or Ruby performance tuning. But get on the mailing list, follow what problems the community is trying to solve right now, write some blog posts, get to know someone in the community. Whatever it is, use it enough to find some friction points and maybe even fix one or two. You don’t have to make it your whole identity (though you could), but instead of learning a new thing, go in the opposite direction: go deep.

So far this is a lot like the classic “T-shaped person” advice, but I’m saying that for the breadth, pick things connected to your specialty, and for your specialty, have a “specialty within the specialty”. Keep trying to push a little further out, a little further down.

Third and finally is to learn something truly new, at the cutting edge of research. For the last couple years I’ve been reading about temporal databases, which have 20-30 years of academic study but few well-developed practical tools (especially open source ones). This isn’t something I’ve been able to use on a real project (yet), but it’s been great fun, and it feels like a way to find opportunities to build something before anyone else does. How you find your topic is by listening to your pain and seeing if there are other people trying to solve the same problems. Some other things I wish I could become an expert in: Bayesian statistics, operational transforms, HTTP/2, column-store databases, RDF, type theory, vector CPU instructions. There is so much happening! Pick something that people are writing papers about and learn a little.

So that’s what I’ve learned the last few years about learning. Instead of “learn another language”, try to be strategic. Try to build on what you have. Careers are long, so try to find some long-term problems you can grapple with. Maybe like Kilby you will even solve one!

Doing Many Things

2017-05-05

I often say that making code do one thing is easy, but making it do lots of things at once is hard. You especially see this in UI code, although it happens on the back end too. The worst of all is front-end web development, where so many things are outside your control.

For example, a customer asks to add a real-time “suggestions” feature to a search box, so that as users type they see possible things to search for. That’s pretty easy, until you think about all the details you have to support:

  • It works in IE/Firefox/Chrome/Safari, across lots of versions.
  • It works on Windows/OS X/iOS/Linux.
  • It works on desktops/tablets/phones.
  • It works when the user resizes the browser window.
  • It works when a user has configured text to appear larger than normal.
  • It has a hover effect that changes the background color and the cursor.
  • It has a hover effect that bolds the suggestion text, but doesn’t resize the row’s height/width.
  • It works when there are two search boxes on the page.
  • It works in the header/body text/left sidebar.
  • It works inside a dropdown menu.
  • It works inside a div that is floated.
  • It works inside a div with absolute positioning.
  • It works inside a div with relative positioning.
  • It works inside flowed text.
  • It works in an iframe.
  • It works without resizing the containing element.
  • It works when the containing element has overflow: hidden.
  • It works on lots of different pages across the website.
  • It works on all our different websites.
  • It works when the doctype is HTML5, HTML 4 Loose, HTML 4 Strict, XHTML, and missing.
  • It works in Bootstrap/Foundation/our custom CSS. (Hopefully you can just pick one here!)
  • It works with vanilla Javascript/jQuery/Angular/React/Ember. (Ditto.)
  • It works with or without Turbolinks.
  • It works with Ajax.
  • It works when the Javascript/CSS is minified and concatenated.
  • It works when the HTML and other static assets are served from a CDN on a different domain name from the back end.
  • The suggestions cover whatever is below the search box.
  • … but new modal dialogs cover the suggestions.
  • It works inside a modal dialog.
  • The suggestions don’t extend outside the modal dialog, but you can scroll to see more of them.
  • It works when a modal dialog opens another modal dialog.
  • Clicking a suggestion takes you to the search results.
  • Clicking elsewhere hides the suggestions.
  • Clicking a link elsewhere on the page still works when suggestions are visible.
  • When in a modal dialog and showing suggestions, clicking in the dialog hides just the suggestions.
  • When in a modal dialog and showing suggestions, clicking outside the dialog hides the suggestions and the dialog too.
  • When the search box is close to the bottom of the page, suggestions appear above it instead of below.
  • Clicking a suggestion doesn’t show that funny dashed line.
  • Clicking a suggestion row in the margin around the text still chooses the suggestion.
  • It works when suggestions include quotes and angle brackets (", ', <, >).
  • It works in English/Spanish/French/Mandarin Chinese/etc.
  • It works with right-to-left text (e.g. Arabic).
  • When a suggestion is too long to fit, it appears with an ellipsis, but hovering shows a tooltip with the whole suggestion.
  • The ellipsis works correctly with Arabic and Chinese text.
  • The ellipsis doesn’t change when the text becomes bold from a hover.
  • The back button works as expected.
  • It works when there are no suggestions.
  • It works when there are 1,000 suggestions, but it doesn’t show them all.
  • You can use a scroll bar to see more suggestions.
  • Scrolling down far enough loads more suggestions from the back end.
  • A suggestion is guaranteed to have at least some search results.
  • Suggestions show how many search results they have before you click.
  • It is fast.
  • It has tests.

I’m not saying it’s reasonable to do all these things every time. You may have customers for whom supporting Arabic and Chinese is a waste of money. I didn’t even list the things users will ask for that are just crazy! I did throw in “search box inside a dropdown menu” to give you a mild taste, and I may have included others that are … questionable. But you have to at least think about them.

In any case, there is a lot to get right, and these details are the difference between polished and unpolished. I learned early in my career that if I deliver something where a few details are still not quite right, I will surely get caught and have to fix it, and the customer will be a lot happier if they never see the broken version in the first place. Polish takes time, but it is still required.

The problem is that completing the list gets harder and harder as you approach the end. No item by itself is hard, but adding the next item without breaking the others is hard. In relativity, accelerating at high velocity takes more energy than accelerating the same amount at low velocity, and the energy to reach c is infinite. Web development has a similar law: requirement n + 1 costs more than requirement n. Adding an ellipsis isn’t hard. Adding it in right-to-left text with an Arabic font on IE 6 in quirks mode when the user has large text enabled is hard.

And now, just to draw one more connection: one thing you learn from experience is ruthless simplification, and skepticism toward adding “concepts” to your app. If you have n features, and they all interact, the complexity grows not as O(n) but as O(n^2). I’ve talked mostly about front-end work, but I think it is the same everywhere; we just see it more on the front end because there it is so unavoidable. Some of it is multiple browsers, some of it is features users expect from any UI, but there our n is already high before we even do anything. In our own feature design we can at least be ruthless. On the front end, we just have to keep polishing. It makes me pity people who sell Wordpress plugins for a living.

As much as you can, it pays off tremendously to manage the combinatorial explosion. You can limit scope, you can stop supporting old browsers. You can design components to be isolated, since if they don’t interact they don’t raise each other’s n. Somehow you have to keep that number under control. The good news is that the marginal benefit of cutting n is large.

You can also write tests, which let you tolerate a higher n by making it less expensive to test in an n^2 world. Sadly browser tests are the most expensive to write, the most brittle (i.e. the shortest-lived), and the longest to run, but still I am always glad to have them.

I guess I have written myself into two morals here. One is: don’t permit any more complexity than you have to. The second is: sometimes you just have to keep polishing, also known as “That’s why they call it work!” Painfully, these really tug you in different directions. Choosing when to apply them is not easy.

Good luck!

Scaling Sidekiq

2017-04-12

Sidekiq is a great option for handling background jobs in Ruby projects. Here I’ll show you how to get the best utilization out of a box dedicated to running Sidekiq jobs. Whether you have one machine or ten, the goal is to work off as many jobs as possible from each machine.

CPU Utilization

To do that, we want to keep every core busy. You can monitor your CPU activity with a tool like vmstat(8). If you say vmstat 10, you’ll get a new row every ten seconds, like so:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0  92056 553620 174128 6268208    0    0     1    25 1323 1443 23 77  0  0  0
 8  0  92056 552628 174128 6268492    0    0    14    38 1307 1447 22 78  0  0  0
 8  0  92056 551760 174132 6268672    0    0     3    49 1317 1461 23 77  0  0  0
 8  0  92056 550544 174140 6268832    0    0     1    28 1353 1490 23 77  0  0  0
 8  0  92056 550296 174148 6268940    0    0     1    35 1240 1360 22 78  0  0  0

The last few columns show the percentage of CPU time spent doing user work (us), system work (sy, which more or less means executing kernel system calls), sitting idle (id), and blocked on I/O (wa, “waiting”). Add the first two together to get how hard your CPU is working.

There is also st, which means “stolen time”! If you are running a VM, this is the time your hypervisor gives to someone else. You will probably see all zeros in this column, so I’ll ignore it going forward. We want to get our idle and wait time as close to zero as possible, without drowning the machine in too much work. Above we have a machine that is doing a good job keeping busy: its user and system time sum to 100, and the other columns are zero.

So how do we do this? Usually it means running lots of jobs at the same time, not one-after-another. Even a one-core machine can juggle many jobs. That’s because jobs typically have to block on I/O, for example when they talk to the network or save things to disk. While that job is waiting for an answer, your CPU can work on another one.

(This is not really relevant here, but disk I/O is a little different than other I/O. Technically, reading/writing with a regular file can’t “block” but only “sleep”. That completely messes up non-blocking I/O for regular files, and even the newer aio functions have many limitations and in fact are implemented by threads in userspace—but fortunately we are not talking about single-threaded non-blocking I/O; we are talking about multiple processes/threads. Whether you call it blocking or sleeping, the CPU will still schedule different work if something is stuck on a regular file read. And if this paragraph doesn’t make sense, feel free to dismiss it as a pedantic footnote. :-)

Threads

Sidekiq is great here because it supports multi-threading. Multiple threads let you do concurrent work in one Ruby process (at least as long as you are not still stuck on Ruby 1.8). Without threads you’d need a separate process for each concurrent job, and that can use up memory quickly, especially with something like Rails. It is always sad to have CPU you can’t use because you’re out of RAM.

In practice, threads mostly help if you are using a concurrent Ruby implementation like JRuby. MRI Ruby has a Global Interpreter Lock (GIL), which prevents two threads from executing at once. Even in MRI you will see some benefit, though, because when one thread blocks on I/O, MRI can make progress on another. So despite the GIL, MRI can still make sure at least some thread is running.

With Sidekiq, you can say how many threads to run with the concurrency setting. Normally you’d set this in your sidekiq.yml file. Note that each thread needs its own database connection! That means if your concurrency is 10, then in your database.yml you must have a pool of 10 also (or more). Otherwise the threads will halt each other waiting to check out a database connection, and what is the point of that? They might even get timeout errors.
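
For a concrete sketch, the two files might look like this (file contents and numbers are illustrative; note that sidekiq.yml traditionally uses symbol keys):

# config/sidekiq.yml
:concurrency: 10

# config/database.yml
production:
  adapter: postgresql
  database: myapp_production
  pool: 10  # at least as big as Sidekiq's concurrency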

Processes

But wait, there’s more! We can push the concurrency up and up, and still see something like this from vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  92048 585840 174148 6269120    0    0     0    16  956  990  7 18 75  0  0
 1  0  92048 585212 174160 6269256    0    0     1    23 1012 1037  8 17 75  0  0
 1  0  92048 580456 174164 6269312    0    0     1    33  896  884  8 17 75  0  0

Why is the idle CPU time stuck at 75?

It turns out this is a four-core machine, and even with multiple threads, a single MRI process can only use one core, because of the GIL. That’s not what we want at all!

So the answer is to run more processes—at least one per core. You can see how many cores you’ve got by saying cat /proc/cpuinfo. If you are using something like god, it is easy to put several processes into a group and control them together, like so:

app_root = '/var/www/myapp/current'
4.times do |i|
  God.watch do |w|
    w.name     = "myapp-sidekiq-#{i}"
    w.group    = "myapp-sidekiq"
    w.log      = File.join(app_root, 'log', "#{w.name}.log")
    w.pid_file = File.join(app_root, 'tmp', 'pids', "#{w.name}.pid")
    w.start    = <<-EOS.gsub("\n", " ")
      cd '#{app_root}' &&
        bundle exec ./bin/sidekiq --environment production
                                  --pidfile '#{w.pid_file}'
                                  --logfile '#{w.log}'
                                  --daemon
    EOS
    # ...
  end
end

If you aren’t using god and don’t know how to do this with your own process manager, then I think Sidekiq Enterprise has a similar feature called Swarms.

Scheduling cores

That gets us almost there, but there is still a problem. You might still see idle time stuck somewhere, like (on a 4-core system) 25. The problem is that while the kernel does its best to use your whole CPU, it can still wind up putting two Sidekiq processes on the same core, and since they are such long-lived processes, they are stuck that way, competing for their shared core while another one sits idle. The kernel doesn’t know ahead of time that they are going to run for days and keep so busy.

Fortunately we can still force each process onto its own core. For that we use the taskset(1) command. When you say taskset -c 2 date, you are telling Linux to run date on core 2. (Core numbers start from zero, as you can see in /proc/cpuinfo.) So our god config would become:

w.start    = <<-EOS.gsub("\n", " ")
  cd '#{app_root}' &&
    bundle exec taskset -c #{i} ./bin/sidekiq --environment production
                                              --pidfile '#{w.pid_file}'
                                              --logfile '#{w.log}'
                                              --daemon
EOS

After that, we’ll have one process on each core. At this point, you should start experimenting with your concurrency setting, to make each core fully utilized. The right setting there will depend on how much blocking a job does, but I have seen useful numbers up to 20. Just try some things out, and watch vmstat.

Note that you don’t have to increase the connection pool size in database.yml as you add cores (just concurrency). That’s because each process is a separate thing, each with its own pool. But you do have to increase the max connections on your database server. For instance with Postgres you’d want to set max_connections in postgresql.conf. Here you need to allow enough connections for concurrency times cores (times servers), plus some more for your actual Rails app serving web requests, plus some more for anything else you’ve got going on. That can be a lot of connections! Don’t be surprised if improving your job throughput exposes a bottleneck elsewhere in your system.
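
To make the arithmetic concrete, here is a hypothetical sizing, assuming two job servers with four cores each and a concurrency of 10 (all numbers illustrative):

-- 2 servers * 4 cores * 10 concurrency = 80 connections for Sidekiq alone,
-- plus, say, 30 for the web tier and some headroom:
ALTER SYSTEM SET max_connections = 130;  -- or edit postgresql.conf directly
-- Changing max_connections only takes effect after a server restart.
SHOW max_connections;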

Conclusion

Tuning Sidekiq can be complicated, because of the double layers of threads plus processes. You need a confident understanding of how Ruby and your operating system handle concurrency, and it helps to use tools like vmstat to measure what’s going on and verify your understanding. If you have a box dedicated to just Sidekiq jobs, my recommendation is to run one process per core, using taskset to keep them separate, and then tune concurrency from there. Hopefully this will help with your own projects!

Postgres Permissions

2017-03-27

It’s taken me a long time to build up an understanding of the Postgres permissions system, so I thought I’d write it down to help others get up to speed faster.

First let’s talk about what things can be guarded by permissions. The big ones are databases and tables/views, but you can also set permissions on schemas, functions, sequences, types, languages, tablespaces, domains, foreign servers, foreign data wrappers, and large objects. (Phew!) There are also a few permissions that apply to the whole cluster. (In Postgres terminology, the “cluster” is the server process(es) keeping its files in /var/lib/postgresql/9.5/main and listening on 5432 (or whatever), and a “database” is a namespace in the cluster where you define tables. You can list the databases in your cluster with \l.) We’ll explore below what actions can be allowed or restricted for each of these subjects. Collectively, let’s call permissionable things “topics”.

(As of Postgres 9.5 there is also Row Level Security (RLS). That is complicated enough to be a separate topic, so here I want to cover just the traditional permissions system.)

Second let’s ask who has permissions. In Postgres permissions come first from your user, and second from any groups your user is a part of. Collectively, users and groups are called “roles”. Or to be more accurate, before Postgres 8.1 users and groups were separate things, but now everything is just a role. A role can even be both a “group” and a “user” at the same time: it can have members but also serve as a direct login.

Permission Mechanics

The key to understanding Postgres is that the stored permission between any role and any topic is only ever a 0 or a 1. When Postgres does a permissions check, it takes all your roles’ permissions on the topic, and if there are any with a 1, you’re in. In other words your effective permission is the OR of all your roles’ permissions. (There is one exception to this, explained below, but it rarely matters.)

You can see what groups each role belongs to with \du (list users). You can add someone to a role with GRANT r TO u, or remove them with REVOKE r FROM u, where r is the group and u is the user. (I am using “group” and “user” loosely here for clarity, but remember that they are all roles.)
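
For example (role names made up):

CREATE ROLE admin;      -- a “group”: it has no login
CREATE ROLE joe LOGIN;  -- a “user”
GRANT admin TO joe;     -- joe picks up admin’s 1s
REVOKE admin FROM joe;  -- and loses them again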

So some consequences of this: If you REVOKE a permission from a role r1, it loses its direct 1, but it might still get the permission indirectly from other roles. The REVOKE isn’t remembered in any way. It just sets the 1 to a 0 for r1. It doesn’t override other possible sources of a 1.

Another consequence is that commands like GRANT SELECT ON ALL TABLES have an immediate effect of setting a lot of 1s, but they are not remembered either. So if you add some more tables, they will not automatically share the same grants as the old tables.

The complement of GRANT ON ALL is ALTER DEFAULT PRIVILEGES. This lets you say what the 1s and 0s should be for future objects. It does not change any existing objects; it just says how to set up new ones. You can combine GRANT ON ALL and ALTER DEFAULT to get 100% coverage: first for the old objects, then for the new. Actually it would probably be less racy to do the DEFAULT first, in case someone else is making more objects while you type.
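
In that order, a sketch granting read access to a hypothetical reporting role looks like:

-- First, future tables:
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO reporting;
-- Then, tables that already exist:
GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting;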

The final trick in all this is a magic role named PUBLIC. Every other role is a member of the PUBLIC role, so if it has any 1s, everyone gets those 1s. Note there is no way to cancel a 1 from the PUBLIC role, because of what we said above about how permission checks do an OR. You might REVOKE something from a troublemaker and give them a 0, but if the PUBLIC role has a 1, they can do the thing anyway. Of course you can REVOKE a permission from the PUBLIC role itself, but then you have to individually GRANT the permission to all the roles that should have it.

One wrinkle here is that a role can have either the INHERIT or NOINHERIT attribute. If it has NOINHERIT, then it does not benefit from permissions of roles it belongs to. In other words, its own direct 1s are the only ones it’s got (ha!). In practice this is not a significant block though, because if you belong to an admin role but have NOINHERIT, you can still SET ROLE admin and act with the permissions of that role. So NOINHERIT should not be used as a way of restricting what people can do. Also note that even with NOINHERIT, you still inherit permissions from the PUBLIC role. In general I don’t see much point to NOINHERIT, except for the psychological benefit of making people more explicitly prepare before running dangerous commands.
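
The SET ROLE dance looks like this:

SET ROLE admin;  -- take on admin’s permissions (you must be a member)
-- ... run the dangerous commands ...
RESET ROLE;      -- drop back to your own permissions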

So you have the PUBLIC role, and you have ALTER DEFAULT PRIVILEGES, and both are a kind of default, but in different ways. It is important to keep these separate! In fact I’m pretty sure I’ve seen places in the docs where PUBLIC permissions are referred to as “defaults”. If you are working with either of these, watch out that you are clear on how they differ!

Cluster Permissions

Okay, so those are the mechanics. Now let’s look at some specific permissions. At the top level you have cluster-wide permissions. These are stored as attributes on your role. They are:

LOGIN: This means you can connect to the cluster. You still must in addition have permission to CONNECT to an individual database (covered below), but if your role doesn’t have this basic LOGIN attribute, you can’t get in at all.

SUPERUSER: This means you can do anything. You’ve got root. Your other permissions don’t matter, because when Postgres sees you have this, it doesn’t ask any further questions.

CREATEDB: You can create more databases.

CREATEROLE: You can create more roles.

REPLICATION: You can start streaming replication. Normally you’d only give this to a special role used by slaves to connect to the master database.

You can set these permissions with CREATE ROLE or change them with ALTER ROLE. You can see them with \du.
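
For example, with a hypothetical deploy role:

CREATE ROLE deploy LOGIN CREATEDB;  -- can log in and create databases
ALTER ROLE deploy CREATEROLE;       -- now it can create roles too
ALTER ROLE deploy NOCREATEDB;       -- each attribute has a NO* form to take it away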

Database Permissions

The next level is the database. If you type \l you can see the databases in your cluster, and the last column shows the permissions on each. For example:

perms=# \l
                                                   List of databases
            Name            |     Owner      | Encoding |   Collate   |    Ctype    |         Access privileges         
----------------------------+----------------+----------+-------------+-------------+-----------------------------------
 perms                      | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres                     +
                            |                |          |             |             | postgres=CTc/postgres            +
                            |                |          |             |             | admin=C/postgres
 postgres                   | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0                  | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres                      +
                            |                |          |             |             | postgres=CTc/postgres
 template1                  | postgres       | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres                      +
                            |                |          |             |             | postgres=CTc/postgres            +

For each database, you’ll see several lines of permissions. Each line shows a role, an equals sign, and then the list of permissions for that role. Sometimes a line is missing the role and starts with just an equals sign; this is for the PUBLIC role. It shows the permissions that everyone gets for free. Whatever you see here is allowed for anyone.

There is also a forward slash followed by another role name. That shows who granted those permissions.

So in the above example you can see that for the perms database, the PUBLIC role has Tc permissions, and admin has C, granted by postgres.

Each permission is shown by a single letter. They mean:

T: Create temporary (in-memory) tables.

c: CONNECT to the database. It has always worried me that by default databases let anyone connect. In theory this is okay since you can’t SELECT or change anything, but you can still list the tables (\dt), list their columns (\d tablename), etc. More precisely, connecting lets people query the tables in pg_catalog and information_schema. They can see the names of all the other database users, all the other databases, and quite a bit more. Personally I feel this is a significant shortcoming in Postgres’s system.

You can revoke access to these system tables, but at the cost of breaking important things. For instance pgadmin fails if it can’t list the databases. Another idea is to use RLS on the system tables, but that is not supported.

Of course, if you can connect to any database, you can do these things. I’m not aware of any information you can get about database foo by connecting there that you can’t also get while connected to database bar. So you could argue that granting everyone CONNECT is not a problem! I don’t have a good answer to that, even though it feels suspicious. I think the moral, then, is that all database roles require more trust than you might expect.

C: This allows you to CREATE a new schema within the database. As the owner of that schema, you’ll be able to do lots of things with it, like create tables, etc.

Finally, individual letters may be followed by an asterisk (*). That means that the user not only has the permission, but may also GRANT the permission to others. The object owner may always grant permission, so Postgres doesn’t bother to show asterisks for them.
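
And if the wide-open CONNECT above bothers you, you can take it away from the PUBLIC role and grant it back selectively (names illustrative):

REVOKE CONNECT ON DATABASE perms FROM PUBLIC;
GRANT CONNECT, TEMPORARY ON DATABASE perms TO admin;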

Table/View Permissions

For permission purposes, Postgres treats tables and views as the same thing. So ALL TABLES will include views, too.

You can list the permissions for tables and views (and sequences too) using \dp or \z (they do the same thing):

perms=# \dp
                                       Access privileges
 Schema |      Name      |   Type   |     Access privileges     | Column privileges | Policies 
--------+----------------+----------+---------------------------+-------------------+----------
 public | sales          | table    | postgres=arwdDxt/postgres+|                   | 
        |                |          | admin=arwdDxt/postgres    |                   | 
 public | sales_id_seq   | sequence |                           |                   | 
 public | widgets        | table    | postgres=arwdDxt/postgres+|                   | 
        |                |          | admin=arwdDxt/postgres   +|                   | 
        |                |          | joe=r/postgres            |                   | 
 public | widgets_id_seq | sequence |                           |                   | 

What you see here is a lot like the output from \l. A blank means you have never made any permissions changes for this object, so its permissions still match the default. (When I say “the default”, I don’t mean the defaults you’ve established with ALTER DEFAULT PRIVILEGES, but rather the default defaults that Postgres ships with.) Note that once you change something, even if you change it back, the column will never go back to blank again.

Of course now there are more letters than before. That’s because tables have lots of permissions:

SELECT (r): You can read from the table.

UPDATE (w): You can write to the table.

INSERT (a): You can append to the table.

DELETE (d): You can delete from the table.

TRUNCATE (D): You can really delete from the table.

REFERENCES (x): This lets you create foreign key constraints elsewhere that reference this table. You need the permission on both tables. Also, you can grant this permission for just specific columns. (I confess I don't really get this permission. What is the use of forbidding foreign keys? I mean I get why not anyone can create a foreign key, but what is the use of making it a separate permission from other DDL commands? Besides, if a foreign key makes sense, why not just define it yourself?)

In the first version of this post, I questioned the usefulness of this permission, but the folks from the Postgres mailing list were really helpful in making it click for me. As Peter Holzer explains, only a table’s owner may add constraints to it (or run ALTER TABLE at all). But suppose joe owns the table joes_widgets, and fred owns the table freds_frozzles. Joe wants to add a frozzle_id column pointing to Fred’s table. He would need the REFERENCES privilege on Fred’s table to do it.

And why can’t Postgres just check Joe’s SELECT privilege instead? It’s because making a foreign key does more than just read from the table. Joe’s foreign key could prevent Fred from changing/deleting his own records, as well as costing Fred time checking the constraint. I don’t know how often this permission will come into play for me, but I’m happy to figure out why it’s there!

TRIGGER (t): You can add a trigger to the table.
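
Granting these works like you’d expect, and SELECT, INSERT, UPDATE, and REFERENCES can even be granted on individual columns (the name column here is hypothetical):

GRANT SELECT, INSERT ON sales TO joe;
GRANT UPDATE (name) ON widgets TO joe;  -- a column-level grant
REVOKE TRUNCATE ON sales FROM joe;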

It is also useful to see the default privileges for each table (from ALTER DEFAULT PRIVILEGES):

perms=# \ddp
         Default access privileges
 Owner | Schema | Type | Access privileges 
-------+--------+------+-------------------
(0 rows)

perms=# ALTER DEFAULT PRIVILEGES GRANT SELECT ON TABLES TO joe;
ALTER DEFAULT PRIVILEGES
perms=# \ddp
               Default access privileges
  Owner   | Schema | Type  |     Access privileges     
----------+--------+-------+---------------------------
 postgres |        | table | postgres=arwdDxt/postgres+
          |        |       | joe=r/postgres
(1 row)

perms=# create table inventory (id serial, widget_id integer);
CREATE TABLE
perms=# \dp
                                        Access privileges
 Schema |       Name       |   Type   |     Access privileges     | Column privileges | Policies 
--------+------------------+----------+---------------------------+-------------------+----------
 public | inventory        | table    | postgres=arwdDxt/postgres+|                   | 
        |                  |          | joe=r/postgres            |                   | 
 ...

Here you can see that new tables will start out with whatever permissions we have prepared as the defaults.

Note that you can also say ALTER DEFAULT PRIVILEGES FOR ROLE r GRANT .... Normally when you alter default privileges, it only affects (future) objects you own. If you want to change the privileges for future objects created by someone else, you can use FOR ROLE. Naturally you can only do that if you belong to that role.
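
For example:

-- Future tables created by admin will be readable by joe:
ALTER DEFAULT PRIVILEGES FOR ROLE admin GRANT SELECT ON TABLES TO joe;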

With all these rules, it is natural to want to see the final calculated rights of each user in the system. Unfortunately Postgres doesn’t have a built-in way to show that. You can say SELECT * FROM information_schema.role_table_grants WHERE table_schema = 'public', but that does not compute inherited permissions.

Fortunately there is a function called has_table_privilege you can use to get most of the way there. It does consider role memberships: for instance if joe belongs to the admin role and has INHERIT, then has_table_privilege('joe', 'public.sales', 'INSERT') will return true. Change joe to NOINHERIT and it will return false. I do wish there were a way to run the computation as if all roles were INHERIT, but this is close enough for me! Others have put this together in a nice query to get the permissions for any table you like. Here is a version I like:

SELECT  t.tablename,
        u.usename,
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'select') AS select,
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'insert') AS insert,
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'update') AS update,
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'delete') AS delete, 
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'truncate') AS truncate, 
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'references') AS references,
        has_table_privilege(usename, concat(schemaname, '.', tablename), 'trigger') AS trigger
FROM    pg_tables t,
        pg_user u 
WHERE   t.tablename = 'inventory';

Of course you could just as well filter on user to see one person’s permissions for all the tables. If you do a lot of permissions work, it might even be worth wrapping this up into a view so you can filter on whatever you like.

Incidentally, there are lots of other has_whatever_privilege functions to ask questions not covered by psql \ commands.

Sequence Permissions

If you squint, a sequence is pretty much just a tiny table: you can SELECT from it (call currval), or UPDATE it (call setval), or do both at once (call nextval). So sequences appear in \dp with the tables and views. But unlike tables and views, they also have the USAGE permission, which allows currval and nextval.

The USAGE permission is not just an abbreviation for SELECT plus UPDATE; it has its own bit. You can see it in \dp as U (vs. r and w). Yet if you have it, you don’t need r or w to use the sequence. Granting only USAGE would let someone use the sequence normally but not call setval, which might be useful.

I can’t think of any reason to give someone just SELECT on a sequence. Even currval won’t work, because currval only tells you the last time you called nextval. You will get an error like ERROR: currval of sequence "widgets_id_seq" is not yet defined in this session.
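
So in practice USAGE is usually the one to grant:

GRANT USAGE ON SEQUENCE widgets_id_seq TO joe;         -- nextval and currval, but not setval
GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO joe;  -- or all of them at once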

Function Permissions

Functions have just an EXECUTE permission. New functions will grant execute permission to PUBLIC, so be careful if you are creating something sensitive. If you like, you can change this with ALTER DEFAULT PRIVILEGES.

You can see functions in psql with \df or \df+, but neither will show the permissions. Instead you should use a query like

SELECT proacl FROM pg_proc WHERE proname = 'f';

The format of proacl is like anything else:

perms=# select proacl from pg_proc where proname = 'f';
                proacl                 
---------------------------------------
 {postgres=X/postgres,joe=X/postgres}

Here we have already revoked execute from PUBLIC, then granted it to joe.
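
For reference, the state above comes from commands like:

REVOKE EXECUTE ON FUNCTION f() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION f() TO joe;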

Note that usually functions execute with the permissions of the invoker, but you can create “setuid” functions that execute with the permissions of the definer. These can be very useful, but they have similar dangers to Unix setuid programs, so you have to be careful with them.
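
In Postgres the “setuid” flavor is spelled SECURITY DEFINER. A minimal sketch (the users table and its column are hypothetical):

CREATE FUNCTION deactivate_user(uid integer) RETURNS void AS $$
  UPDATE users SET active = false WHERE id = uid;
$$ LANGUAGE sql
SECURITY DEFINER
SET search_path = public;  -- pin search_path, a standard precaution for SECURITY DEFINER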

Schema Permissions

Schemas have just CREATE and USAGE permissions. CREATE lets you add more stuff to the schema (tables, functions, sequences, etc.). USAGE is a prerequisite to using objects within the schema. For instance to SELECT from a table, you must first have USAGE permission on its schema. (You must also have SELECT permission on the table.) Note that even without USAGE, you can still list the schema contents, and even query for its tables’ definitions etc.

To be honest I find the schema USAGE permission kind of confusing and superfluous, and I don’t think I’m the only one. I can’t think of any security system you can build with this permission you can’t build with permissions on individual tables, functions, etc., and because you can say GRANT ... ON ALL whatevers IN SCHEMA foo, it is not even necessary as an abbreviation. I guess if you really want to lock people out, you might as well use it as an extra layer though.
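
For completeness, the grants look like (schema name illustrative):

GRANT USAGE ON SCHEMA reporting TO joe;   -- prerequisite to using objects inside
GRANT CREATE ON SCHEMA reporting TO joe;  -- lets joe add new objects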

Conclusion

There are permissions for more specialized database objects too, like types and foreign data wrappers. But if you have made it this far, you should have no problem understanding the documentation for those. You might also want to read about RLS to get really fine-grained access control. Here I’ve only tried to cover the basics, and give you an accurate mental model of how Postgres decides who can do what. If you have any questions, comments, or corrections, I’m happy to listen!

Where to put Postgres json_each?

2016-09-07

Postgres has several “set-returning” functions that essentially return more than one row. Probably the most well-known is generate_series, used to get a list of numbers/dates/etc. Another very handy one is json_each. You can use it to break up a JSON object into one row for each key/value pair. But that one always trips me up, because unlike generate_series it (1) returns more than one column per row and (2) usually takes a column as input.

Suppose you have a users table with a preferences column, like so:

t=# select * from users;
 id |                        preferences                         
----+------------------------------------------------------------
  1 | {"lang": "en", "time_zone": "America/Los_Angeles"}
  2 | {"lang": "es", "time_zone": "America/Cayman", "volume": 5}

If you do this you get tuples:

t=# select json_each(preferences) from users;
               json_each               
---------------------------------------
 (lang,"""en""")
 (time_zone,"""America/Los_Angeles""")
 (lang,"""es""")
 (time_zone,"""America/Cayman""")
 (volume,5)

It would be nicer to get separate key and value columns. That works great if json_each appears in the FROM clause:

t=# select * from json_each('{"lang": "en", "time_zone": "America/Los_Angeles"}');
    key    |         value         
-----------+-----------------------
 lang      | "en"
 time_zone | "America/Los_Angeles"

But if you replace the literal with a column, it breaks:

t=# select key, value from json_each(users.preferences);
ERROR:  missing FROM-clause entry for table "users"
LINE 1: select key, value from json_each(users.preferences);

This doesn’t work either:

t=# select key, value from (select json_each(preferences) from users) x(key, value);
ERROR:  table "x" has 1 columns available but 2 columns specified

So what to do? Well here is something that works, although I find it hard to remember:

t=# select id, (json_each(preferences)).* from users;
 id |    key    |         value         
----+-----------+-----------------------
  1 | lang      | "en"
  1 | time_zone | "America/Los_Angeles"
  2 | lang      | "es"
  2 | time_zone | "America/Cayman"
  2 | volume    | 5

Basically you are decomposing the tuple. You can pick your columns too:

t=# select id, (json_each(preferences)).key, (json_each(preferences)).value from users;
 id |    key    |         value         
----+-----------+-----------------------
  1 | lang      | "en"
  1 | time_zone | "America/Los_Angeles"
  2 | lang      | "es"
  2 | time_zone | "America/Cayman"
  2 | volume    | 5

Possibly nicer is this trick: putting the set-returning function in the FROM clause gives you an implicit lateral join:

t=# select id, key, value from users, json_each(preferences);
 id |    key    |         value         
----+-----------+-----------------------
  1 | lang      | "en"
  1 | time_zone | "America/Los_Angeles"
  2 | lang      | "es"
  2 | time_zone | "America/Cayman"
  2 | volume    | 5
(5 rows)
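
Since Postgres 9.3 you can also write the lateral join explicitly, which makes the intent more obvious:

t=# select id, key, value from users cross join lateral json_each(preferences);

It returns the same five rows as above.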

So if you’re struggling to use json_each with a non-literal input, there you go!

When the Inner JSON Effect Works

2016-07-29

Hacker News has a fun discussion today about an article on The Daily WTF called The Inner JSON Effect. It’s about how a lead-programmer mad genius built an entire web application on top of svn, not by using svn to track history, but by using a separate revision number for each individual function, and having each class contain nothing but a list of the revision numbers of each function it wants as a method. It is a hilarious example of what not to do. Obviously it’s fun to laugh at something as crazy as JDSL, but it’s even more interesting to try linking it to things we might find more normal. So a few provocative thoughts:

  • jsonapi? I work on a lot of Ember projects, where jsonapi is the blessed format for frontend/backend communication. It seems to me awfully verbose without adding a lot of new information. Just as the WTF story adds extra indirection where you could just define functions, so jsonapi makes you list out the attributes and references—even though in regular JSON they are right there. It feels to me like the worst of the old XML excesses. I admit this is a bit trollish, but I’ve yet to understand what I get from jsonapi. Maybe someone can explain it to me.

  • Ansible? I’m a Chef guy and haven’t tried Ansible, and I know a lot of developers-with-sysadmin-chops love it for its simplicity. Personally I’m fearful of using something so restrictive. I have similar doubts about Puppet, although at least there I know there are good “escape hatches”. I want to be able to throw in an if statement when I need to. Here I feel my sentiment is wrong though, since these are such successful projects. So I’m curious if anyone can articulate principles that make Ansible’s YAML “work” but not the TDWTF webapp’s JSON.

And going the other direction, I think sometimes—not very often—you can get huge wins out of creatively building on top of an existing tool. But it is always a risk and requires a lot of good taste. A few examples:

  • I built a product with user-customizable questionnaires, and we knew up front the logic for branching/skipping/computations was going to be super complicated. I could tell it was heading toward what I call a “Turing-Complete Feature”. (Enterprise Workflow Engine anyone?) Instead of trying to model and store all the rules in a database, or writing a new language (you can see the looming Inner Platform Effect), we basically made each questionnaire a server-side underscore.js template, with a simple Javascript API to access information about the current context. The feature was done quickly and we had great flexibility building the actual surveys. Our surveys were built by an internal team of not-quite-programmers, so the developers could work on other things. We even had the potential of letting customers write their own surveys, since Javascript is the lingua franca of modern programming, with good security isolation. (After all, your browser uses it to run untrusted code!) Many customers told us our advanced survey functionality was what set us apart from the competition.

  • Another project also has user-customizable forms, with something close to an EAV pattern in the database. We wanted to offer customers reporting capabilities, so we started talking about building an API. It would be for the existing “power users”, who were doing reports and analysis using R, SAS, Tableau, and similar tools. But the API was going to be high effort, slow, and hard to join data. So we decided our API would be . . . SQL! Of course you can’t share your real tables, for security but also because if those become a public API they can never change. Plus the EAV format made them too abstract to use easily. So instead we gave each customer a private database with one table for each of their forms. This lets them ask questions in a very direct way, using tools they already know, with great performance and flexibility. (Okay I am using the past tense but this one is not actually turned on for external customers yet. Still, using a derived, flattened database for SQL reporting is not so strange, even if young CRUD developers haven’t seen it before.)

  • Even the idea of building on top of a version control system is appealing. (Maybe not svn though! :-) I would love a self-hosted password manager with a Firefox plugin, where my passwords are stored (encrypted) in a git repository I keep on my own server. Not only would using git save the author time, but it would be like using a “standard protocol” rather than something proprietary. Same thing for a self-hosted bookmarking service.

I think all three of these “good” examples are a little audacious, but in my opinion they all work well. So are there any general guidelines that can help us make better decisions about when to make an unexpected swerve? So much of programming comes down to taste. That leads us into pointless vi-vs-emacs wars, but it also sets good programming apart from bad. How can we make that taste teachable? Or should it be entrusted only to the old and wise? Are there guidelines we can identify, so that we can discuss them, evaluate them, and make decisions about them? (Apparently one rule is don’t write your own TCP/IP stack. :-)

One principle of my own is a “risk budget”: you have to take risks—they are a fertile source of advantage—but you should have a list of them and not take too many. Risks can include: tight deadlines, unreasonable clients/management, new clients/management, unclear requirements, new tech, new-to-me tech, high scalability requirements, etc. One risk is “cutting out a layer” like in the Javascript survey API and the SQL reporting interface. I’m not saying it’s always a good idea, just that it can be. Building your own filesystem on top of NFS is another example of cutting out a layer. I can’t speak to how well that one worked out, but it’s an interesting idea.

What are some other guidelines people use? I’m especially interested in guidelines about “when to leave CRUD behind”, or when to make a choice that seems strange at first glance but pays off. If programming were chess, when do you sac a piece?
