I see a lot of developers, even smart and experienced ones, get frustrated by Rails migrations—or cause frustration for others on their team. So here is a quick overview of how they work, plus a few rules for handling them without headaches. If you follow these rules, your migrations can be easy and smooth. Perhaps you’ll even stop wishing you could use Mongo.
A migration is a Ruby script with an up method containing the database changes you’d like applied, along with a down method to undo those changes and restore the database to its old structure. In newer versions of Rails, you can also write a single change method, and if you restrict yourself to the right methods, Rails will automatically understand how to undo your changes.
Migration scripts live in db/migrate, and they are named with a timestamp plus description, like this:
20121024000909_create_users.rb
or this:
20121102162841_add_address_to_profile.rb
You can create a new migration by saying rails g migration create_users, and Rails will automatically give it a timestamp and start the file. (Tip: If you’re using the command line, you can follow up that command with vi db/migrate/*!$* to open the migration in a text editor.)
By convention, migrations are named create_foo when creating a brand-new table, and add_foo_to_bar when adding the column foo to the table bar. But you can name them whatever you like. Note each migration needs a unique name, despite the timestamp, because each migration lives in a Ruby class whose name is the migration name camelized.
You can run the latest migrations by saying rake db:migrate, and you can undo the last-run migration with rake db:rollback. (You can run a limited number of migrations with rake db:migrate STEP=3). Rails automatically keeps track of which migrations you’ve applied in the schema_migrations table, which has just a single column listing the timestamps of all applied migrations.
down method.You might be tempted to leave out the down method, but it’s very useful to fill it in. Everyone makes mistakes, and the down method ensures you can recover. It also means that if your migration is not quite right the first time, you can db:rollback, make your fixes, and then db:migrate again. (But see below about pushed migrations.)
You may run into trouble if a migration fails halfway through. If this happens, the migration is not recorded in schema_migrations, but it might leave tables/columns in your database. Rather than dropping these, I prefer to simply comment out the lines that ran successfully, then restore them once the migration has succeeded. This works both when migrating and when rolling back.
A corollary to “implement the down method” is “test the down method.” I don’t mean write unit tests. I just mean when you think everything is correct, run rake db:rollback db:migrate and make sure it all works. It’s easy to have errors in your down methods if you never run them.
It’s fine to rollback, edit, and retry when you’re just working on new code that is private to your repo, but never edit a migration after sharing it with other developers, e.g. after doing a git push. Even safer would be never edit a migration after committing it. If you do this, you’re very likely to create out-of-sync databases for other developers and on production. This is because they may have already run the migration before your changes got pushed, and since they’ve already run it, they’ll never run it again to pick up your changes. Now your database looks different than everyone else’s.
I’ve seen this happen many times, and it’s probably the top cause of frustrations with Rails migrations. Recovering from it generally involves surgery, and unless the surgeon is patient and careful, he’s likely to make things even worse. To avoid the problem, if you’ve pushed a migration that you find is not quite right, always make your changes by adding a new migration, not by editing the old one.
Surgery that can help (if done right) is to delete rows from schema_migrations and manually add/remove/alter tables and columns to get back on track. Whatever the techniques, your goal should be to bring the databases into line with the accepted version history, so that running the migrations from scratch would produce the same database that comes out of your operating room. Otherwise you’re creating a time-bomb that will produce more out-of-sync databases somewhere down the line.
Opinions differ on how to create a new database from scratch, for instance when a new developer joins or you decide to add that CI server. Some people like to create it from the schema.rb, like Athena springing full-grown from Zeus’s head. Others like to start with an empty database then run all the migrations from the beginning of the project (which could be years old). Other just make a dump from somewhere and import it where needed. Whatever your approach, there is value in at least striving for a non-broken migration history, so that ideally you can run all the migrations against an empty database. The closer you are to this ideal, the easier it will be to go back in time or handle branches.
This goal means your migration scripts should not rely on the rest of your source code. It’s common for migrations to use model classes, but don’t! Here is a place you should defy the DRY imperative. The reason is that when you run a 6-month-old migration, your model classes have today’s code, so it’s very easy to have missing methods, changed methods, renamed methods, even missing classes.
The best way to avoid problems is to keep your migrations entirely self-contained. For this I recommend writing data changes (i.e. DML changes) in pure SQL. You can do it like this:
ActiveRecord::Base.connection.execute(<<-EOQ)
UPDATE foo
SET bar = baz
WHERE ick = ack
EOQ
If you don’t like SQL, you’ll just have to deal. It’s good for you. Or change jobs and write Javascript for a living. :-P
Follow these rules to get frustration-free Rails migrations. To sum up, they are:
down method.Defy them to your peril!
Someone posted a question recently to the Postgres mailing list that makes for a great puzzle. He had a table (we’ll call it m) like this:
d | v
---------------+---------------
geography | north
geography | south
industry type | retail
industry type | manufacturing
industry type | wholesale
(5 rows)
Basically this table was a list of “dimensions” and the possible values for each dimension. So there is a “geography” dimension with possible values “north” and “south”, and an “industry type” dimension with values “retail”, “manufacturing”, and “wholesale”. The table could hold more than two dimensions or have more possible values for each dimension.
His goal was to find a query that would give him all possible combinations, assuming that each combination has exactly one value along each dimension. This is roughly the idea behind a cartesian product, so it seems like the thing you’d use a CROSS JOIN for, but here the data is all in one table, with n possible partitions. So how to write a query that works regardless of how many unique dimensions the table holds?
If you want to stop reading here and go work on the puzzle yourself, I wouldn’t blame you. And I’d love to hear what people come up with. I think it’s a really hard puzzle. If you want to see my solution, keep reading.
You can recreate the table with these commands:
CREATE TABLE m (
d VARCHAR(255) NOT NULL,
v VARCHAR(255) NOT NULL,
PRIMARY KEY (d, v)
);
INSERT INTO m
(d, v)
VALUES
('geography', 'north'),
('geography', 'south'),
('industry type', 'retail'),
('industry type', 'manufacturing'),
('industry type', 'wholesale');
My solution may not be the only one, but I’m proud of doing it in pure SQL (no plpgsql), and I think it’s a neat demonstration of several special features in Postgres. Here is the query:
WITH RECURSIVE t(combo, n) AS (
WITH dims AS (SELECT DISTINCT d, row_number() OVER () AS n FROM m GROUP BY d)
SELECT '{}'::text[], 1
UNION ALL
SELECT array_append(t2.combo::text[], m.v::text), t2.n+1
FROM t t2, dims
CROSS JOIN m
WHERE m.d = dims.d AND dims.n = t2.n
)
SELECT combo
FROM t
WHERE n = (SELECT COUNT(DISTINCT d) + 1 FROM m);
The first thing to note is that we’re using a WITH expression. This is called a Common Table Expression (CTE), and it’s basically a way to create a sub-query and give it a name. In our case the “table” is named t, with columns combo and n.
But this CTE is special, because it uses the RECURSIVE keyword. With a recursive CTE, your definition is actually two queries, separated by UNION ALL. The query on top is the starting condition, in our case just a single row. The query on bottom is executed repeatedly until it returns no more rows, and it is allowed to access the results of the previous execution. The full results of the CTE are the starting condition plus all the rows ever returned by the recursive part.
To work through our CTE, let’s look more closely at the starting condition. Here it is:
SELECT '{}'::text[], 1
That first column is a special Postgres feature called an array. An array is pretty much what you’d expect, although unlike in most programming languages they are 1-indexed. They have similar syntax for accessing their elements (although we don’t use it here):
some_array[4]
An array literal is actually a string cast to an array. Here is a literal for the first three positive integers:
'{1, 2, 3}'::int[]
So you can see that our CTE’s initial SELECT contains an empty text array.
The idea is for each array to represent one combination, so we might have {wholesale,north} or {retail,south}. We will grow the array one dimension at a time until all the dimensions are included. We are using arrays because they can have as many elements as we want, whereas a SQL query must have a fixed number of columns.
The second half of our CTE is how we grow the array. Here is the code:
SELECT array_append(t2.combo::text[], m.v::text), t2.n+1
FROM t t2, dims
CROSS JOIN m
WHERE m.d = dims.d AND dims.n = t2.n
So t (aliased to t2) is whatever the previous iteration produced. For each row in t, we want to create several new rows, one for each possible value of the next dimension, tacking those values onto the end of the array. In other words, if t has the row {wholesale}, we want to produce the new rows {wholesale,north} and {wholesale,south}. The CROSS JOIN and array_append accomplish this for us.
We also need some way to stop the recursion. Postgres will quit re-executing our query if it ever produces zero rows. To make sure this happens, we include a second “counter” column, which also serves as a way to grab a different dimension on each iteration. This is what our nested CTE named dims is all about. It’s just a table with one row per dimension, and the rows numbered. Our WHERE clause makes sure that we process a different dimension on each iteration, and stop when there are none left.
The other thing of note is that in our nested CTE (dims), we are using a fancy Postgres feature called a window function. Our window function is the row_number() OVER () part, which, as window functions go, is actually a crashing uninteresting specimen. Rather than turning this post into a window function tutorial, I’ll just say that you should go read about them, because they are wonderful. Basically they let you get values into a non-GROUP-BY query that you’d normally need aggregate functions to compute. They are like a SQL superpower.
Finally, you can see that in the outermost query, we are filtering out the “intermediate” rows from our CTE, so that we only get combinations that include all dimensions.
Here are the results:
combo
-----------------------
{wholesale,north}
{wholesale,south}
{retail,north}
{retail,south}
{manufacturing,north}
{manufacturing,south}
(6 rows)
I hope you enjoyed this puzzle! I’m delighted how it does something impossible in ordinary SQL (return results with a dynamic number of “columns”), and combines three exotic Postgres features you may not have seen before.
A client asked me to provide database access for simple reporting, and we settled on using rails_admin. But I wanted to prevent any changes to the database. To create a completely read-only rails_admin installation, just add this to your config/initializers/rails_admin.rb:
config.actions do
# root actions
dashboard
# collection actions
index
export
# member actions
show
show_in_app
end
Also you should make sure only admin users have access. This will depend on how you identify admins, but it might look something like this (if you are using Devise for authentication):
config.authorize_with do
redirect_to main_app.root_path unless warden.user.admin?
end
I’m working on a Rails project with CKEditor (I wasn’t the one who chose it.), and after upgrading Rails to use the asset pipeline, none of the CKEditor text areas would render. Firefox reported TypeError: x is undefined on a line buried deep in the wastes of minified JS. Googling turned up nothing.
Eventually I noticed we were using a custom toolbar with its own name (StoriDescription), and on a hunch I tried switching to a standard one. Everything worked! So the problem was that I had lost our custom definition by not preserving ckeditor/config.js. Once I put that file into app/assets/javascripts/ckeditor/config.js everything was well. Just thought I’d publish in case it saves someone else a few hours out there!
Lots of people want to index HTML files with Solr or Lucene. One great tool that can extract text out of many document formats is Tika, but it is very heavy-weight. What if you just want to strip the text from html files and have Lucene index it? There is no out-of-the-box solution for this, but with a tiny customization we can get it working.
For the actual text extraction, we’re going to use the excellent Jerico HTML Parser. This is a loose HTML parser that can handle all kinds of malformed markup and still give reasonable results. It’s also got a class dedicated to ripping text out of html documents called TextExtractor.
To use this in Solr, we’re going to implement our own TokenizerFactory. Solr uses this class to get a Tokenizer, which will generate the tokens for your index. The important method here is Tokenizer create(Reader input). This method takes a Reader, a standard Java I/O object, opened on the text of whatever document you’re indexing. Your Tokenizer will use this Reader to get the document’s content.
Our strategy is to return the same Tokenizer we were before (StandardTokenizer for normal text), but to inject Jericho into the mix before passing it the input Reader. So intead of Reader -> StandardTokenizer we’re going to do Reader -> Jericho -> StandardTokenizer. Jericho’s TextExtractor makes this easy for us, because it takes a Reader of HTML and returns a Reader of plain text. So we write an HtmlTokenizerFactory with this implementation:
public Tokenizer create(Reader input) {
try {
return new StandardTokenizer(Version.LUCENE_36, convertReader(input));
} catch (IOException e) {
throw new RuntimeException(e);
}
}
private static Reader convertReader(Reader r) throws IOException {
Source s = new Source(r);
Element elem = s.getNextElement(0, "html");
TextExtractor te = new TextExtractor(elem);
return CharStreamSourceUtil.getReader(te);
}
That’s it! Our HtmlTokenizerFactory returns a StandardTokenizer, but it pre-processes the HTML to extract just the text.
The final step is to define a new Field Type in Solr’s schema.xml file, so we can index fields that have HTML. Just add this snippet to the file:
<fieldType name="text_html" class="solr.TextField">
<analyzer type="index">
<tokenizer class="your.package.name.HtmlTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Here we’re using the same Filters as Solr’s out-of-the-box text_general Field Type, but you could change them to whatever you like. By converting the HTML to plain text right at the beginning, you retain all the power and flexibility of Lucene’s indexing workflow.
To actually define a Field using our new Field Type, add something like this to your schema.xml:
<field name="html_content" type="text_html" indexed="true" stored="false"/>
There are many improvements you can make to this approach. For instance, you could have Jericho process just the body tag instead of the the whole page, perhaps also extracting certain meta tags and adding them to the token stream. Or you could follow Google’s lead and boost the weight of tokens that come from the <title>, <h1>, or similar tags. But hopefully this is a good starting point for your Solr/Lucene applications!
I seem to keep winding up doing projects that require analyzing data stored in Postgres. Often it is helpful to see your data as a histogram: for instance, how many users have placed n orders? R is the best tool for this, but occasionally you want the results as a table pulled out of Postgres. We can use window functions to make this easy. Here is a general-purpose template for a histogram-like SQL query:
SELECT c,
COUNT(*) AS items_count,
SUM(COUNT(*)) OVER (ORDER BY c) AS total_items_count,
(SUM(COUNT(*)) OVER (ORDER BY c))::float / SUM(COUNT(*)) OVER () AS total_items_perc
FROM (SELECT p.id, COUNT(ch.id) c
FROM parent p, child ch
WHERE p.id = ch.parent_id
GROUP BY p.id
) x
GROUP BY x.c
ORDER BY x.c
This gives results like the following:
c | items_count | total_items_count | total_items_perc
-----+-------------+-------------------+-------------------
1 | 1279 | 1279 | 0.161775866430559
2 | 1275 | 2554 | 0.323045788009107
3 | 1076 | 3630 | 0.459144953200101
4 | 886 | 4516 | 0.571211737920567
5 | 653 | 5169 | 0.653807235011384
. . .
Column 2 would be the y-value if you graphed this as a usual histogram. Column 3 or 4 would be the y-value if you graphed it as a cumulative histogram.
Note that the percentage column is wrong if you want to include values where c equals zero. Then you’ll need an outer join in the sub-query. I’ll leave that as an exercise for the reader.
Next: Highlighting Lines in Github