Postgres CTE for Threaded Comments

2014-09-17

Not long ago I answered a question on the Postgres mailing list I thought was pretty fun: How do you construct a recursive CTE to pull a whole tree of threaded comments, sorting sibling comments by votes?

Threaded, scored comments are what you see on sites like Reddit and Hacker News:

reddit comments

hacker news comments

Threaded means the comments appear in a tree-like structure. Contrast this with Wordpress or Discourse, where every comment appears at the end:

discourse comments

Disqus (which uses threaded comments) has a nice article about implementing them in Postgres with a recursive CTE. They start with this data:

CREATE TABLE comments (
  id SERIAL PRIMARY KEY,
  message VARCHAR,
  author VARCHAR,
  parent_id INTEGER REFERENCES comments(id)
);
INSERT INTO comments (message, author, parent_id)
  VALUES
  ('This thread is really cool!', 'David', NULL),
  ('Ya David, we love it!', 'Jason', 1),
  ('I agree David!', 'Daniel', 1),
  ('gift Jason', 'Anton', 2),
  ('Very interesting post!', 'thedz', NULL),
  ('You sir, are wrong', 'Chris', 5),
  ('Agreed', 'G', 5),
  ('Fo sho, Yall', 'Mac', 5);

Then they recommend using this query:

WITH RECURSIVE cte (id, message, author, path, parent_id, depth)  AS (
    SELECT  id,
        message,
        author,
        array[id] AS path,
        parent_id,
        1 AS depth
    FROM    comments
    WHERE   parent_id IS NULL

    UNION ALL

    SELECT  comments.id,
        comments.message,
        comments.author,
        cte.path || comments.id,
        comments.parent_id,
        cte.depth + 1 AS depth
    FROM    comments
    JOIN cte ON comments.parent_id = cte.id
    )
    SELECT id, message, author, path, depth FROM cte
ORDER BY path;

Essentially what’s happening here is that as we recursively execute the CTE, we build up a “path” to each comment, with the IDs of all ancestors. Then we can sort by the path, so that comments with the same ancestors sort together. This technique relies on the fact that the array A will sort before all other arrays that begin with A, so parents will appear before their children.

(Note that the Disqus solution returns the depth of each comment. This is not really necessary, since it is just the length of path, but I’ve left it in the query because computing depth is often necessary with recursive CTEs, so it might be interesting if you haven’t seen it before.)

That gets you the comments sorted as a tree, but the question on the mailing list was how to also sort by votes, so that while preserving the tree structure, comments with higher votes would appear before their siblings.

To figure this out, let’s modify the Disqus sample data so each comment has a score:

CREATE TABLE comments (
  id SERIAL PRIMARY KEY,
  message VARCHAR,
  author VARCHAR,
  parent_id INTEGER REFERENCES comments(id),
  votes INTEGER
);
INSERT INTO comments (message, author, parent_id, votes)
  VALUES
  ('This thread is really cool!', 'David', NULL, 1),
  ('Ya David, we love it!', 'Jason', 1, 3),
  ('I agree David!', 'Daniel', 1, 4),
  ('gift Jason', 'Anton', 2, 15),
  ('Very interesting post!', 'thedz', NULL, 3),
  ('You sir, are wrong', 'Chris', 5, 3),
  ('Agreed', 'G', 5, 5),
  ('Fo sho, Yall', 'Mac', 5, 12);

One intuitive approach is to change path to include only ancestors, and then resolve ties by looking at votes:

. . .
ORDER BY path, votes DESC

The problem is that this approach fails to keep children with their parents. We need to keep the full chain of IDs.

The correct answer is to sort by path, like in the Disqus article, but at each step also sort by votes. “At each step” is the key. Conceptually what we want is for each step along the path to sort as if by votes DESC, id ASC. To make that happen, rather than constructing the path soley of IDs, we can build it out of (votes, id) tuples. Then we still get our tree, but siblings will differ in the second-to-last array element.

One wrinkle is that we want to sort by votes in reverse order. But it’s easy to just negate the votes to achieve this. Here is the SQL:

WITH RECURSIVE cte (id, message, author, path, parent_id, depth, votes)  AS (
    SELECT  id,
        message,
        author,
        array[-votes,id] AS path,
        parent_id,
        1 AS depth,
        votes
    FROM    comments
    WHERE   parent_id IS NULL
    UNION ALL
    SELECT  comments.id,
        comments.message,
        comments.author,
        cte.path || -comments.votes || comments.id,
        comments.parent_id,
        cte.depth + 1 AS depth,
        comments.votes
    FROM    comments
    JOIN cte ON comments.parent_id = cte.id
    )
    SELECT id, message, author, path, depth, votes FROM cte
ORDER BY path;

Running it get us this:

 id |           message           | author |       path        | depth | votes 
----+-----------------------------+--------+-------------------+-------+-------
  5 | Very interesting post!      | thedz  | {-3,5}            |     1 |     3
  8 | Fo sho, Yall                | Mac    | {-3,5,-12,8}      |     2 |    12
  7 | Agreed                      | G      | {-3,5,-5,7}       |     2 |     5
  6 | You sir, are wrong          | Chris  | {-3,5,-3,6}       |     2 |     3
  1 | This thread is really cool! | David  | {-1,1}            |     1 |     1
  3 | I agree David!              | Daniel | {-1,1,-4,3}       |     2 |     4
  2 | Ya David, we love it!       | Jason  | {-1,1,-3,2}       |     2 |     3
  4 | gift Jason                  | Anton  | {-1,1,-3,2,-15,4} |     3 |    15

You can see that high-scoring comments are sorted higher than their siblings!

If you are clever, you may have noticed that negating votes is not really necessary. We did it because we wanted each level to sort votes DESC, id ASC. But really we don’t care which direction the IDs sort. We just care that parents and siblings stay together. So this SQL would work too:

WITH RECURSIVE cte (id, message, author, path, parent_id, depth, votes)  AS (
    SELECT  id,
        message,
        author,
        array[votes,id] AS path,
        parent_id,
        1 AS depth,
        votes
    FROM    comments
    WHERE   parent_id IS NULL
    UNION ALL
    SELECT  comments.id,
        comments.message,
        comments.author,
        cte.path || comments.votes || comments.id,
        comments.parent_id,
        cte.depth + 1 AS depth,
        comments.votes
    FROM    comments
    JOIN cte ON comments.parent_id = cte.id
    )
    SELECT id, message, author, path, depth, votes FROM cte
ORDER BY path DESC;

I hope you enjoyed this walk through recursive CTEs for threaded, scored comments. I thought it was a pretty fun problem!

Flash movie (.swf) won't load in Rails 4

2014-07-24

I work on a Rails project that serves Flash .swf files out of Rails controllers, like this:

send_file swf_filename, disposition: 'inline'

I know that’s a crazy approach, but the reason is that these movies should only be seen by logged-in users. (Also I didn’t set it up like this. :-)

Recently some customers complained that Flash movies wouldn’t load. I was pulling my hair out figuring out why. I couldn’t reproduce it. The Rails and nginx logs showed no errors. Their browsers were retrieving the file, but then they would just display a blank white page, sometimes with a thin border where the movie should be.

Finally I was able to reproduce it on a different machine, and from there I eventually figured out it only happened for people running on the latest version of Flash. I installed the Flash debugger, I turned on tracing, but that didn’t detect any errors.

Another clue was that the problem wasn’t happening on another site that shares this codebase. By comparing browser requests between the sites, I was able to figure out the cause was an X-Content-Type-Options HTTP header with a nosniff value.

It turns out that Rails 4 adds this header by default, and although the working site was still on Rails 3, my predecessor had upgraded the problem site to 4. So one fix is to turn off that header like this:

response.headers.except! 'X-Content-Type-Options'

But a better fix would be to add a header that indicates the correct Content Type. We were serving the file as application/octet-stream, so with nosniff the browser couldn’t figure out how to handle the file. We needed to serve the .swf file as application/x-shockwave-flash instead:

response.headers['Content-Type'] = 'application/x-shockwave-flash'

Hopefully this will help someone else out there!

Dates and Time Zones in Rails

2014-04-18

I work on a video education app that needs to report how many lessons were assigned each day. I have a timetracking app that needs to invoice based on when one month starts and another ends. And I’ve got another app for watching cronjobs, that counts how many failures happened in a given day. All these apps share a problem: if they get time zones wrong, they will give incorrect results. For instance, if a video lesson was assigned on Tuesday after 5pm PDT, its UTC time will be Wednesday. Even though all three apps only show dates, not times, time zones still matter. So here are some lessons I’ve learned dealing with Rails time zones, both programming myself and leading a team.

But first, here are two articles that lay some groundwork:

Both those posts describe some nice methods provided by ActiveSupport/Rails, as well as why you’d want to use them instead of built-in Ruby methods. For instance, saying Time.now will get you the time in your OS’s default time zone, whereas Time.zone.now will get you the time in whatever timezone you put in config/application.rb. It’s a good idea to use the Rails timezone, since you have more control over that and it’s part of your source control.

If you want to explore, take a look at ActiveSupport::TimeZone. You can get a specific time zone like this:

tz = ActiveSupport::TimeZone['Pacific Time (US & Canada)']

That tz will be the start of just about everything you do with time. An ActiveSupport::TimeZone is also what you get from Time.zone.

Those other articles are great, but this post talks about time zones from a higher level, aspiring to share best practices rather than helpful methods. In my examples I’ll assume a Postgres database, although there’s nothing here that won’t work on MySQL as well.

Scope Your Time Zones

The first thing to remember is that all times are relative to somebody. Probably that’s a user, but maybe it’s something else. For my video education app, sometimes it’s a teacher and sometimes it’s a student. For my cronjob app, it’s a job (and sometimes a user). But there is always a reference point. In that sense, the articles above that recommend Time.zone.now are wrong. Time.zone.now is no better than Time.now if you need many time zones—and who doesn’t? So it should really be user.time_zone.now. I’d recommend adding this method to your User class right from the beginning, so that even if you haven’t yet implemented user-specific time zones, you are still writing the rest of your app correctly:

def time_zone
  ActiveSupport::TimeZone['Pacific Time (US & Canada)']
end

Ignore Time Zones as Much as Possible

Now that you’ve got time zones everwhere, the next step is to get rid of them. Thinking about time zones is hard, and finding time zone bugs is hard. You want to avoid them as much as possible!

It’s been said that in C, if you start thinking about big endian vs little endian, you’re probably doing things wrong. Oftentimes when programmers half-understand something, they do more than they should, and that’s very true with time zones. Your code will be a lot easier to understand if you tackle time zones with a few well-placed strokes, rather than lots of fiddling all over the place.

One thing to remember is that 07:05:00 PDT and 14:05:00 UTC are the same instant. If you converted both to seconds since the epoch, you’d get the same number. So in that sense, changing the time zone doesn’t change anything: it’s just a bit of extra metadata hanging onto the time representing someone’s perspective. Knowing that x PDT and y UTC are the same instant is really helpful when you feel the urge to fiddle with timezones. Is your fiddling just a noop?

By default, Rails comes configured with its default time zone as UTC. Leave it that way! Since every time needs a reference point anyway, your code shouldn’t care about the Rails-wide setting. UTC is a good neutral choice. For one thing, it doesn’t have daylight savings time. And if you see it, you know you’re dealing with a time-zone-less value.

You should also leave your OS time zone as UTC, if possible. Keeping it consistent with Rails will remove one chance for abiguiuty. And again, it’s a good neutral.

You also want UTC in your database. If you use a migration to create columns with t.timestamps or t.datetime :foo, Rails will make a TIMESTAMP WITHOUT TIME ZONE column. You can think of this as a time in UTC if you like. Really it’s an int of (micro)seconds since the epoch. Whenever you give a time to ActiveRecord, it will convert it to UTC before it hits Postgres. Or more correctly, it will strip off the timezone part and give Postgres the int. Remember, it’s the same instant! But it’s nice to imagine the column as UTC. If you’re in psql and type SELECT created_at FROM lessons, that’s what you’re seeing.

When everything in your stack is UTC, it’s like looking through nice clear glass. You don’t have to think about conversions at each layer.

You should also strive to make your app code as time-zone-less as possible. The big principle here is to handle time zones once, hopefully at the beginning of the HTTP request, when you decide the correct reference point for all times. Usually that’s current_user.time_zone.

The second article above suggests you add this to your controller:

around_filter :user_time_zone, if: :current_user

def user_time_zone(&block)
  Time.use_zone(current_user.time_zone, &block)
end

That makes me pretty uncomfortable. The idea is that you can say Time.zone everywhere else in your app and always get the current user’s time zone. But I’d rather be explicit about where the time zone is coming from. Write code to take a tz argument if necessary (emphasis on the if necessary). This will make your code less surprising, less coupled, and easier to test.

Even better is to write your code to take a time. Almost always that’s really what you want. Remember that no matter the time zone, it’s all the same instant. Usually a single time is sufficient, because it can be a reference point for creating other times, using t + 1.day or t + 2.weeks or whatever. Use a (user|cronjob|foo)-scoped time zone to get your first time, and then forget about time zones for the rest of the stack. If you implemented the User#time_zone method above, your time zone code is already well-encapsulated, so there’s no need for an around_filter to further abbreviate things.

Here is another approach I don’t like. This blog post suggests you deal with time zones in Postgres like so:

created_at AT TIME ZONE 'UTC' AT TIME ZONE 'US/Pacific'

Does that look strange to you? What’s happening is that you start with a TIMESTAMP WITHOUT TIME ZONE, so first you tack on a time zone (just metadata), then you convert it to Pacific time. In other words, you’re doing this:

time without zone -> assumed to be UTC -> converted to Pacific Time

It’s good to know that this is how to get a Postgres timestamp converted to whatever time zone you want, but I wouldn’t recommend it in a Rails app (which is the article’s context). For one thing, Rails and Postgres don’t use the same names for all timezones, so you have to maintain your own mapping between the two. This is one of those chores that will keep nagging you for the life of your app, so I’d rather just avoid it. But more important, this approach means you have to pass your timezone all the way down to the database layer. I’d rather deal with time zones early, get to a TimeWithZone (or even an int), and then forget about time zones for the rest of the code.

Remember Daylight Savings Time

This is a small tip, but be careful about DST. You should avoid ever writing a fixed offset or a string like “PDT”. This is wrong:

Time.new(2014, 4, 8, 3, 15, 0, '-07:00')

If you were after PDT, your code is broken during PST. If you were after MST, your code is broken during MDT. Similarly in SQL if you say AT TIME ZONE 'PDT', you’ve broken PST.

If you stick with the ActiveSupport::TimeZone instances, you can forget about DST, and things will just work. That’s why there is no such thing as this:

ActiveSupport::TimeZone['PDT']

only this:

ActiveSupport::TimeZone['Pacific Time (US & Canada)']

Even Dates are Times

When you think you’re just dealing with dates, time zones probably still matter. Was the lesson assigned on Tuesday or Wednesday? It depends. From the teacher’s perspective? The student’s? The principal’s? Unless you’re really sure, I’d recommend always storing a full date+time in your database. Also, in Ruby avoid converting things to Date. Of course if you read the articles above you know you don’t want this:

Date.today

But this isn’t any better:

Time.zone.today

Or even this:

current_user.time_zone.today

Here is a query I’ve seen:

Lesson.where("date(created_at) >= ?", tz.today - 3.days)

But that’s wrong, because you’re stripping off all the time zone information. It’s going to report Tuesday’s lessons in Wednesday. (It also means you need a database index on the expression date(created_at), which is probably less often useful than a normal index on just the column.)

To improve this query, I’d write it like this:

Lesson.where("created_at >= ?", tz.now.midnight - 3.days)

If today is Wednesday (for you), that will give all the lessons created since the beginning of Sunday. Because we’re carrying a full time all the way down to the database query, we don’t have problems with the definition of “today.”

If you are ever tempted to use today or to_date, you probably want these methods instead:

tz.now.midnight     # the start of today, 00:00:00
tz.now.end_of_day   # 23:59:59

Displaying times

So far we’ve dealt with times as inputs. Times as outputs is a lot easier. Your rule should be to ignore time zones until the last minute, when you actually format the value for rendering. So it should go right into your view:

= lesson.created_at.in_time_zone(current_user.time_zone).strftime("%A, %B %-d, %Y")

You might want a helper for this though, something like:

def format_time(time, tz, format)
  time.in_time_zone(tz).strftime(format)
end

Then your Haml can be:

= format_time(lesson.created_at, current_user.time_zone, "%A, %B %-d, %Y")

Or de-parameterize that as much as you like:

def format_time(time)
  time.in_time_zone(current_user.time_zone).strftime("%A, %B %-d, %Y")
end

and:

= format_time(lesson.created_at)

Conclusion

So in general, my principles for handling times in Rails are:

  • Even dates are times.
  • Set “global” time zones to UTC everywhere you can.
  • Don’t ever use global time zones; scope it to a user or whatever is appropriate.
  • To most of your code, a time is just an instant.
  • Push time zone code to the very beginning of your request, to the top of your stack.
  • Push time zone code to the very end of your request, when rendering the view.

Good luck!

Basics of Web Architecture

2014-03-07

This post is adapted from an email I wrote several years ago and have since reused many times to help explain the “moving pieces” on the web to non-programmers. Someone on Hacker News asked about this kind of thing, so I thought I’d finally post it, with some images reused from my old talk at Wharton about the same thing. This article won’t make you a programmer, but hopefully it can give a big-picture overview of how the major parts fit together. It’s like Web 101. In some places it oversimplifies (lies a little as I like to put it), but in ways that I think is helpful for a beginner.

HTML

HTML hit the world in 1994. The way it works is your web browser, based on a URL, asks a web server somewhere out on the Internet for an HTML file, and the web server sends it back. So there are two actors: the web browser and the web server.

Basic web architecture

The web server is just a program running on someone else’s computer. The most popular web server is called Apache, but there are many others. A web server listens for people requesting URLs, and it sends back HTML files, images, etc. There are lots of companies out there that will host your website for you. Basically they provide a web server, and you upload whatever files you want the world to see.

The web browser is Firefox or Safari or IE or whatever. You type a URL into the address bar, and it finds the web server out on the Internet and asks it for the file in the URL. The server sends it some HTML, and the browser is responsible for rendering the HTML file. It may also request supporting files like images. Different web browsers render HTML slightly differently, making many headaches for web designers.

IE5

HTML stands for HyperText Markup Language. The HyperText means it has links you can click to visit new pages, and the Markup means you use angle-bracket tags to “mark up” certain parts of the text. In markup, there is an opening tag and (usually) a closing tag. The closing tag repeats the name of the opening tag, but with a forward slash at the beginning. This markup can indicate structure or display. For instance, <i>ibid</i> will render “ibid” in italics; this is an example of using markup for display. On the other hand, you might have this:

  • Walk the dog
  • Do the laundry
  • Pay the bills

This produces a bulleted list with three items (ul = unnumbered list, li = list item):

  • Walk the dog.
  • Do the laundry.
  • Pay the bills.

A tag may also have attributes, which look like name="value". This is how you write links, which are “anchor tags”: <a href="http://www.google.com/">Click here to visit Google</a>.

HTML may reference other files. When this happens, the browser downloads those files too and adds them to the display. The main example is images, which work like this: <img src="my-portrait.png">. (Notice there is no closing tag for images.) And here are some other cases of HTML referencing outside files:

CSS

CSS stands for Cascading StyleSheets. It’s an attempt to separate out display-style information from the HTML code, so the HTML just contains structural markup. A CSS file is a separate file, referenced by the HTML file, which tells how each type of element should be rendered (e.g. in big font, with a 2-pixel red border, or whatever). It’s considered good form to use CSS for display and HTML for structure, at least insofar as that’s possible. This is partly because it reduces typing and makes changes easier, it makes reading the HTML easier, and you can have different people working on different things. CSS also supports more advanced display options that just HTML, so some effects you can only achieve with CSS.

Again, each browser interprets CSS in its own slightly incompatible way. People who write really slick-looking web pages spend a ton of time discovering and applying tricks to get an acceptable display on each browser type. One way is to write separate HTML+CSS files for each browser type, detect what a user is using, and send the appropriate files. But no one really does this. At most people write a special CSS file for IE (which is the most incompatible with the others), and then use special codes to serve it only to IE browsers. There are other less drastic techniques as well.

Because this gets so expensive, it’s common to decide up front which browsers you’ll support, and which will get a slightly wonky-looking site. Your choice will depend on your expected audience. Nowadays some sites will just refuse to support older browsers or even not-so-old versions of IE. I like to aim for IE 8+ or even 7+. If you’re somebody like Amazon, then you add browsers with less market share like IE 5+, Safari, Opera, Konquerer, and maybe more. (Amazon is perhaps a bad example because most of their site uses a fairly loose, flexible design, so slight inconsistencies aren’t noticeable. A better example would be something with an artsy look like a winery.) Nowadays a big new challenge is supporting mobile devices like Androids and iPhones.

Javascript

Javascript is a full-fledged programming language (unlike HTML & CSS), with sequential execution of commands, variables, functions, etc. It runs on the user’s browser, which means a web server sends a bunch of Javascript code over the Internet, and the user runs it on his local machine. You can put Javascript straight into an HTML page like this:

<script language="javascript">
alert("Hello World!");
</script>

Or you can have a separate .js file referenced by the HTML page.

Either way, the point of Javascript is to cause dynamic effects in the user’s browser. You can’t accomplish this with HTML or CSS, because those just tell how to lay out the page. With Javascript you can make things change and move around. You can add popups like “Do you really want to delete that?” or “Call today for your free sample!”

Note that Javascript has nothing to do with Java. It was invented by Netscape when Java was a hip new programming language (invented by Sun), and as a marketing gimmick they called it Javascript. Really!

Again, different browsers treat Javascript in different ways. This is where the browser incompatibilities are the worst. It can be truly maddening, and to get good results you really need someone with a lot of experience.

jQuery

jQuery is an extension to Javascript. It’s pretty new, but it’s wonderful. It lets you easily get lots of trendy effects like fade-in or rollovers, that used to be more expensive to build. It also encapsulates a lot of the cross-browser boilerplate people used to write themselves to support all the different browsers out there. Basically instead of calling native Javascript functions, you call the jQuery functions, which automatically test for the browser’s functionality do the right thing based on the result. That sort of library has been around for many years, but jQuery is an exceptionally good implementation of it, and it’s quickly become the de facto standard. jQuery makes it much less expensive to get a pretty site and fancy effects.

In fact nowadays (2014) jQuery is old hat! People are doing more and more with Javascript. It’s no longer about rollovers and popups, but many people build their whole interface with Javascript. To manage this extra complexity, there are new frameworks out there like Backbone.js, Angular.js, and Ember.js. These frameworks provide lots of common functionality and you keep things organized.

Flash

Like Javascript, this is a programming language that gets run browser-side. But whereas Javascript can touch any element on the page, a Flash program is confined to a given square. Flash is particularly aimed at graphics, so the big ads are all written in Flash, and there are lots of Flash games, and YouTube videos are Flash. Unlike with Javascript, Flash doesn’t suffer from cross-browser incompatibilities because it is owned by Adobe (originally by Macromedia), who distributes browser “plugins” that execute the Flash code. Because the browser doesn’t run the Flash directly, it can’t be inconsistent about it. This is why you need a plugin to run Flash.

The are several disadvantages to Flash: It requires users to download the plugin (although most have it already), it is opaque to search engines, so Google can’t index what you put in Flash, and it is unsupported on iPhones. Sometimes it is used to create an intro movie on the front page of sites, but many people find that annoying, so I don’t recommend it. In general, Flash is being replaced with Javascript+HTML5, so learning it may not be a good investment, unless you want to make games or dynamic advertisements.

PHP

Let’s go back to the beginning: the way the web works is your web browser asks a web server for an HTML file, and the web server sends the file across the Internet. But what if the web page is something like a shopping cart, where the HTML is different for each user? In this case, the web server doesn’t just have a static HTML file, but it runs a program to generate an HTML file on the spot, and it sends the result to your web browser. So here the web server is basically a middle-man: it delegates the task of generating the HTML to a separate program, and it passes the result along to the web browser. The old term for this is CGI, and while it isn’t always correct any more to call this sort of thing a CGI, it’s still useful as a catch-all term for any server-side HTML generation. You can build a CGI in whatever language you like. Java and C# are common choices, as is PHP. Most modern techniques, PHP included, involve writing a file that is mostly HTML, with some special code embedded inside to output the variable parts. Generally people choose PHP for smaller sites and Java/C#/Python/Ruby for bigger sites, but that distinction seems less valid each year.

CGI (in any language) is quite different from browser-side programming like Javascript and Flash. First of all, it all happens on your own web server, so you don’t have to deal with browser incompatibilities. CGI is invisible to the user. By the time the browser sees anything, it’s just got plain old HTML. On the other hand, this means that CGI can’t give you effects like animations and popups, which require executing code on the browser. CGI is just a way to serve different HTML to each user. Note that CGI is not incompatible with Javascript and Flash. Any website of moderate complexity will employ CGI on the server side to generate custom HTML, plus Javascript/Flash on the browser side to get flashy effects. (You can even use CGI to dynamically generate Javascript, CSS, or images, but this is rare.)

Using CGI also imposes a larger burden on your web server than just serving static HTML files. Maybe the CGI has to access a database to find a user’s favorite genre of books, for example. If you have a lot of traffic, it’s important to consider the efficiency of your CGI code. The ability to support lots of traffic is called scalability. This isn’t something you really ought to worry too much about right away, but someone experienced will make fewer mistakes here.

An aside about performance

On the web, “performance” has two components: latency and scalability. Latency is how snappy your site is: how quickly you can serve a page and get all the images etc. loaded. Big players like Google and Facebook have done studies showing that engagement and conversions drop off if you have page load times over 600ms or so, hence the saying “speed is a feature.”

Scalability, on the other hand, refers to how well your site degrades as you add more users. So instead of snappiness, it’s about capacity. You can also think of scalability as how hard/expensive it is to add more capacity.

People talk about “vertical” vs. “horizontal” scalability. Vertical scalability means buying faster hardware. Horizontal scalability means buying more cheap hardware and distributing the load. Vertical scalability is expensive and has hard limits, but horizontal scalability requires careful up-front design. Rails and Django are nice frameworks because they enable or even in some ways force a more scalable architecture. One thing that is hard (by “hard” I mean it’s a current research topic for Ph.D.s) is horizontal scaling for databases. There are worthwhile techniques, but they are complex to implement. Often it’s better to just buy a bigger box.

Latency and scalability don’t always come together. You can be snappy but crash when a media hit causes traffic to spike, or you could have a sluggish site that keeps chugging no matter how many visitors arrive. Sometimes optimizations will help both categories, but not always.

Java+J2EE, Ruby+Rails, Python+Django

These are all server-side technologies like PHP. They all go beyond CGI in that they are “application servers”, so they do more work for you, like tracking sessions and pooling database connections. (Actually PHP can do that, too.) They encourage a pattern of web development called Model-View-Controller (MVC), which helps separate your code into different concerns. The Model defines the structure of your data and your “business logic.” It is the back end of your site. Its job is to talk to the database and provide a convenient, intelligible API for the other layers. The Controller is what handles incoming web requests. Each request goes to the Controller code first. The Controller decides what to do. It asks things like “Is this person logged in?” and “What page do I show for this URL?” The View is what renders the actual HTML. Usually you write your view in some kind of templating mini-language, like JSP or ERB or HAML. All those languages let you write a bunch of HTML and then embed Java/Ruby/etc code into it to inject things like user names, product images, or whatever. In general, the Controller sets up whatever data the View will need by loading Model objects, and then the View reads those Model objects to fill in its blanks.

To new coders, MVC probably seems like a lot of extraneous infrastructure. It does increase the learning curve, but it’s basically the standard in professional web development. It’s the pattern adopted by J2EE, Rails, and Django as the right way to build websites. By separating your code into layers and giving each layer its own responsibility, it becomes much easier to manage complexity. Otherwise it’s all but impossible to reason correctly about what your code will do. Without MVC, it’s easy to make a small change and inadvertently break something that you thought was unrelated. MVC protects you from this kind of thing. It also makes it easier to separate tasks among multiple developers.

Why not distribute public keys via SMTP?

2013-07-15

One of the hurdles to end-to-end email encryption is key distribution. Doing this manually ensures that only technical people will bother, so why not automate it via some protocol? A natural method would be to extend SMTP so the sender can ask if a user’s key exists, and if so then use that to send the message, e.g. like this:

HELO relay.example.org
250
MAIL FROM:<bob@example.org>
250
KEY FOR:<alice@example.com>
250
--- BEGIN PGP PUBLIC KEY BLOCK ---
...
--- END PGP PUBLIC KEY BLOCK ---
RCPT TO:<alice@example.com>
...

The KEY FOR capability could also be discoverable via the common EHLO command:

EHLO relay.example.org
250-smtp.example.org 
250-KEYS    
...

Why hasn’t this been suggested before? Googling I found only one link. Is there some flaw with it?

Paperclip with Server-Side Files

2013-06-07

Several times in the last few years I’ve built Rails sites that needed to store files via Paperclip that were not uploaded by a user, but generated programmatically on the server side, for instance PDF or Excel reports built from something in the database. There’s not much documentation on this, and my first effort led to this StackOverflow question. The first solution is not perfect, but it sometimes works:

user.photo = photo_bytes
user.photo.instance_write(:file_name, photo_file_name)
user.save!

But sometimes that won’t do, because Paperclip uses the file name to determine what kind of file you’ve got. For instance, suppose you’re generating a spreadsheet with the WriteExcel gem:

io = StringIO.new
xls = WriteExcel.new(io)
# ...
xls.close
report.spreadsheet = io.string

This will raise a NoHandlerError in the last line. The solution is to make sure you give Paperclip something with a filename, not just the raw binary string.

The easy way out would be to write your spreadsheet to a temp file, then point Paperclip at it. But writing and reading a temp file adds risk of failure, increases disk IO, and slows things down. I’m stubborn, so I really wanted to keep everything in memory.

My solution is based on the second answer to that StackOverflow question above. If we give Paperclip an IO object with a method called original_filename, it will do the right thing. So let’s define this class:

class NamedStringIO < StringIO

  def initialize(data, filename)
    super(data)
    @filename = filename
  end

  def original_filename
    @filename
  end

end

Now we can change our Paperclip code to this:

io = StringIO.new
xls = WriteExcel.new(io)
# ...
xls.close
report.spreadsheet = NamedStringIO.new(io.string, xls_filename)

That code makes Paperclip happy, and it lets us generate a named “file” that never hits the disk.

It would be even better to create our NamedStringIO at the top, so we don’t have that extra StringIO we give to WriteExcel. For some reason I couldn’t get WriteExcel to accept my NamedStringIO, so I had to copy things around a bit. Oh well, good enough for me!

Next: Rules for Rails Migrations