I thought a script to check our website for broken links might come in handy, so I decided to whip something up in ruby. The script takes a starting page, visits it, follows all its links, and keeps going as long as it’s still in the original domain. There is also an option to exclude certain portions of your site from being checked. I call the script like this:
$ linkcheck -e '^http://jungwirths\.com/gallery/.*' jungwirths.com
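The crawl described above can be sketched roughly as follows. This is not the actual linkcheck code, only a minimal illustration under some assumptions: the names `extract_links` and `check_links` are hypothetical, links are found with a simple regex rather than a real HTML parser, the start URL must include its scheme, redirects are not followed, and the keyword arguments require a much newer ruby than 1.8. Pages are fetched through a replaceable `fetch` lambda so the loop can be tried without touching the network.

```ruby
require 'net/http'
require 'set'
require 'uri'

# Pull href values out of <a> tags and resolve them against the page URL.
# Assumption: a regex is good enough for a sketch; real HTML needs a parser.
def extract_links(html, base_url)
  hrefs = html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']+)["']/i).flatten
  links = hrefs.map do |href|
    begin
      uri = URI.join(base_url, href)
      next nil unless uri.is_a?(URI::HTTP) # skips mailto:, javascript:, etc.
      uri.fragment = nil                   # "#top" is the same page
      uri.to_s
    rescue URI::Error
      nil                                  # ignore hrefs that cannot be parsed
    end
  end
  links.compact.uniq
end

# Default fetcher (hypothetical helper): returns [status_code, body].
DEFAULT_FETCH = lambda do |url|
  response = Net::HTTP.get_response(URI(url))
  [response.code.to_i, response.body.to_s]
end

# Breadth-first crawl from start_url. Every discovered link is fetched once,
# but only pages on the starting host have their own links followed.
# exclude is an optional Regexp; matching URLs are never visited.
# Returns the list of URLs that failed or answered with status >= 400.
def check_links(start_url, exclude: nil, fetch: DEFAULT_FETCH)
  host = URI(start_url).host
  queue = [start_url]
  seen = Set.new(queue)
  broken = []

  until queue.empty?
    url = queue.shift
    status, body = begin
      fetch.call(url)
    rescue StandardError
      [nil, '']                            # network errors count as broken
    end

    unless status && status < 400
      broken << url
      next
    end

    next unless URI(url).host == host      # external page: checked, not crawled

    extract_links(body, url).each do |link|
      next if exclude && exclude.match?(link)
      queue << link if seen.add?(link)     # add? returns nil if already seen
    end
  end

  broken
end
```

With this sketch, the command line above would correspond to something like `check_links('http://jungwirths.com/', exclude: %r{^http://jungwirths\.com/gallery/.*})`.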
It was fun trying out something a little bigger. I’ve written a few other scripts in ruby lately; perhaps I’ll post them later.
All my work so far has been with ruby 1.8. I know 1.9 is out now, so I’d like to get it installed eventually. The Unicode support in 1.8 is pathetic. I found this out by trying to write an od-like program for UTF-8 files that mix English with polytonic Greek. I’m not sure what Unicode is like in 1.9. I haven’t been able to find anything very revealing by googling. It looks like they did something, though it was controversial. There was a lot of argument on the ruby mailing list a couple of years ago. But for the life of me, I can’t find anything that explains what the final decisions were.
In my opinion Java gets Unicode just right: all strings are UTF-16, and encoding conversions happen on I/O. For instance, to read a UTF-8 file, you use an InputStreamReader initialized to convert from that encoding. There are also I/O classes to handle binary data. The programmer’s job is simple, because Strings always act the same, and they just do the right thing. It is a very effective use of modularization and separation of concerns. But I don’t think ruby went this route, because for various reasons Unicode is unpopular in Japan. I hope the ruby approach isn’t too painful. I enjoy ruby a lot, but I can’t imagine using it in production with such rudimentary capabilities. Unicode for me is a real deal-breaker.
Anyway, the code for the linkcheck program is below the fold:
