Piping in Ruby with popen3

2011-10-24

If you’re using Ruby to glue together shell commands, you may want to pass some values through a filter and read the results back into Ruby. This is trivial to do in a shell script, but from a language like Perl or Ruby it is surprisingly tricky, because you have to write to and read from the same child process.

For example, suppose you have a database full of names, and you want to parse each one into first name, last name, title, suffix, etc. This is a really complicated task, so you’re much better off using a specialized tool. Now it happens there is no such tool in Ruby, but there is a great Perl library called Lingua::EN::NameParse. So you (i.e. I) decide to write a filter in Perl that will read one name per line on stdin and print the name breakdown to stdout in YAML. Then you can read the YAML back into your Ruby script and do whatever you like with it. Voilà: structured data!

The problem is that Ruby doesn’t give you an easy way to run a command that you both write to and read from. There is the popen3 method in the open3 standard library, which you can call like this:

require 'open3'

# [cmd, cmd] is the [command, argv0] form: run cmd directly, without a shell.
Open3.popen3([cmd, cmd]) do |stdin, stdout, stderr|
  # ...
end

But if you try that approach, it will probably deadlock. This is a well-known problem; you’ll encounter the same thing in Perl or Python. You need to keep feeding lines to your filter and consuming its output at the same time. All the pipes (stdin, stdout, and stderr) have limited buffers, and when one fills up, whichever process is writing to it blocks until the other side reads. If Ruby is blocked writing to stdin while the filter is blocked writing to stdout, neither can make progress. Here is one page that gives a long description of the problem and attempts a (very complicated) solution using select(3). Here is another page that tackles the problem with threads.
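For reference, the thread-based idea can be sketched roughly as follows. This is my own minimal version, not the code from the linked page: filter_with_threads is a hypothetical helper, and the child Ruby process that upcases each line is just a stand-in for the real Perl filter.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: feed `lines` to `cmd` and collect its stdout and stderr.
def filter_with_threads(cmd, lines)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    # One reader thread per pipe, so neither buffer can fill up and stall the child.
    out_reader = Thread.new { stdout.readlines }
    err_reader = Thread.new { stderr.readlines }

    lines.each { |line| stdin.puts line }
    stdin.close # signals EOF to the child

    [out_reader.value, err_reader.value, wait_thr.value]
  end
end

# Stand-in filter (an assumption for this sketch): a child Ruby that upcases each line.
upcase_filter = [RbConfig.ruby, '-e', 'STDIN.each_line { |l| puts l.upcase }']
out, err, status = filter_with_threads(upcase_filter, %w[alice bob])
```

It works, but you end up managing two extra threads just to read two pipes.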

But I found an even easier solution. The io/wait standard library adds the ready? method to IO objects. ready? is sort of like a non-blocking read, except it doesn’t read anything. It returns true if it’s possible to read without blocking, false if not, and nil if it’s unknown. So you can write your code like this:

require 'open3'
require 'io/wait'

yaml = []
errors = []

stdin, stdout, stderr = Open3.popen3([cmd, cmd])

names.each do |name|
  stdin.puts name

  while stdout.ready?
    yaml << stdout.readline
  end
  while stderr.ready?
    errors << stderr.readline
  end
end

# Now get whatever else we still have to read:
stdin.close
stdout.each_line do |line|
  yaml << line
end
stderr.each_line do |line|
  errors << line
end

Don’t forget the require 'io/wait'!

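The only other thing you should do is make sure your Perl code flushes its output after every print, even when it is not writing to a tty. Setting $| to a true value turns on autoflush for the currently selected output handle (STDOUT by default). Just include this near the top of your file: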

$| = 1;

Now Perl will print each line as you ask it to, so your Ruby code will get data as you tell Perl to print it.

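The Ruby code above still isn’t perfect. If our Perl program writes an incomplete line, ready? will report that data is available, but readline will then block waiting for a newline that never comes, and the whole thing deadlocks. We could fix this by not using readline, but since we have complete control over the Perl program, that seems unnecessarily complicated.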
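If you did want to avoid readline, one option is to read raw chunks with read_nonblock and split them into lines later. Here is a rough sketch under my own assumptions: drain and run_filter are hypothetical helper names, and the child Ruby process that prints without newlines stands in for a misbehaving filter. Note that stdin.puts can still block if the child stops reading altogether.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: append whatever is currently available on `io` to `buffer`.
# Returns false once the pipe has reached EOF, true otherwise.
def drain(io, buffer)
  loop do
    chunk = io.read_nonblock(4096, exception: false)
    return true if chunk == :wait_readable # nothing more to read right now
    return false if chunk.nil?             # EOF
    buffer << chunk
  end
end

# Hypothetical helper: returns [stdout_text, stderr_text] as raw strings.
def run_filter(cmd, lines)
  out = String.new
  err = String.new
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    lines.each do |line|
      stdin.puts line
      drain(stdout, out)
      drain(stderr, err)
    end
    stdin.close

    # Wait on both pipes at once until each reaches EOF.
    pending = [stdout, stderr]
    until pending.empty?
      ready, = IO.select(pending)
      ready.each do |io|
        buffer = io.equal?(stdout) ? out : err
        pending.delete(io) unless drain(io, buffer)
      end
    end
    wait_thr.value # reap the child
  end
  [out, err]
end

# Stand-in filter (an assumption): prints fragments with no trailing newline.
no_newline_filter = [RbConfig.ruby, '-e',
                     'STDOUT.sync = true; STDIN.each_line { |l| print l.chomp.upcase, "|" }']
out, err = run_filter(no_newline_filter, %w[alice bob])
```

Because the final loop waits on stdout and stderr together, it also avoids stalling when the child writes a lot to stderr while we are still reading stdout.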

Another problem with my Ruby here is that it is hard to debug if Perl quits unexpectedly (e.g. with die). You’ll just see a “Broken pipe” error, probably when you try stdin.puts. The text you give to die will get lost, so it may be challenging to track down the source of the problem. For quick-and-dirty data munging, this isn’t such a problem, but it would be nice to solve somehow.
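One possible way to keep that message is to use the block form of popen3, rescue Errno::EPIPE, and check the child's exit status, including whatever it wrote to stderr in the error you raise. This is only a sketch: run_checked is a hypothetical helper, it assumes a Ruby new enough that popen3 yields a wait thread as the fourth block argument, and the child Ruby processes stand in for the Perl filter.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: returns the filter's stdout, or raises with its stderr text.
def run_checked(cmd, lines)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    # Read both pipes in the background so the child never blocks on output.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    begin
      lines.each { |line| stdin.puts line }
      stdin.close
    rescue Errno::EPIPE
      # The child went away before reading everything; its stderr should say why.
    end

    status = wait_thr.value
    unless status.success?
      raise "filter exited with #{status.exitstatus}: #{err_reader.value.strip}"
    end
    out_reader.value
  end
end

# Stand-in filters (assumptions): one that works, one that dies immediately.
good_filter = [RbConfig.ruby, '-e', 'STDIN.each_line { |l| puts l.upcase }']
bad_filter  = [RbConfig.ruby, '-e', 'abort "bad name"']

good_output = run_checked(good_filter, %w[alice])
failure = begin
  run_checked(bad_filter, %w[alice])
  nil
rescue RuntimeError => e
  e.message
end
```

Here failure ends up holding the text the child printed before exiting, which is the part that otherwise gets lost.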
