ruby linkcheck


I thought a script to check our website for broken links might come in handy, so I decided to whip something up in ruby. The script takes a starting page, visits it, follows all its links, and keeps going as long as it’s still in the original domain. There is also an option to exclude certain portions of your site from being checked. I call the script like this:

$ linkcheck -e '^http://jungwirths\.com/gallery/.*'

It was fun trying out something a little bigger. I’ve written a few other scripts in ruby lately; perhaps I’ll post them later.

All my work so far has been with ruby 1.8. I know 1.9 is out now, so I’d like to get it installed eventually. The Unicode support in 1.8 is pathetic. I found this out by trying to write an od-like program for UTF-8 files that mix English with polytonic Greek. I’m not sure what Unicode is like in 1.9; I haven’t been able to find anything very revealing by googling. It looks like they did something, though it was controversial: there was a lot of argument on the ruby mailing list a couple of years ago. But for the life of me, I can’t find anything that explains what the final decisions were.
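For what it’s worth, the 1.9 model (as far as I can tell) is that every String carries its own encoding, so character counts and byte counts finally diverge. A minimal sketch, assuming 1.9 or later (the sample string is just for illustration):

```ruby
# Under ruby 1.9+, String#length counts characters and String#bytesize
# counts bytes; under 1.8, length reported raw bytes (10 here).
s = "αβγδε"        # five Greek letters, two UTF-8 bytes each
puts s.encoding    # UTF-8
puts s.length      # 5
puts s.bytesize    # 10
```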

In my opinion Java gets Unicode just right: all strings are UTF-16 internally, and encoding conversions happen on I/O. For instance, to read a UTF-8 file, you use a FileReader initialized to convert from that encoding, and there are separate I/O classes for binary data. The programmer’s job is simple, because Strings always behave the same and just do the right thing. It’s a very effective use of modularization and separation of concerns. But I don’t think ruby went this route, because for various reasons Unicode is unpopular in Japan. I hope the ruby approach isn’t too painful. I enjoy ruby a lot, but I can’t imagine using it in production with such rudimentary Unicode support. For me, Unicode is a real deal-breaker.
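Ruby 1.9 does appear to have something in this spirit: you can declare an external encoding when opening a file, and the conversion happens at the I/O boundary, much like Java’s FileReader. A small sketch, assuming 1.9+ (the scratch-file name is made up for illustration):

```ruby
require 'tmpdir'

# Write UTF-8 bytes to a scratch file, then read them back with the
# external encoding declared at open time -- the conversion happens
# on I/O, so the String that comes out is already UTF-8-aware.
path = File.join(Dir.tmpdir, 'greek_demo.txt')
File.open(path, 'w:UTF-8') { |f| f.write("αβγδε\n") }
text = File.open(path, 'r:UTF-8') { |f| f.read }
puts text.encoding      # UTF-8
puts text.chomp.length  # 5 characters, not 10 bytes
File.delete(path)
```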

Here is the code:

#!/usr/bin/env ruby

# == Synopsis
# linkcheck: recursively look for broken links on a website
# == Usage
# linkcheck [options] start_url
# -h, --help
#   show help
# -e, --exclude [regex]
#   If any link matches the given regex, linkcheck will test it but not follow further links found there.
#   You can use this to mark off sections of your site that shouldn't be crawled.
#   Multiple -e flags may be used, and crawling will cease if the link matches any of them.

require 'rdoc/usage'
require 'getoptlong'
require 'net/http'
require 'uri'
require 'hpricot'

# TODO: allow a list of "irrelevant" url parameters, so we don't keep checking the same page.

def within_site?(url)
  $site_domain == url.host
end

def parse_links(html)
  doc = Hpricot(html)
  html_links = (doc/"a[@href]").select {|e| e['href'] !~ /^javascript:/ }.collect {|e| e['href']}
  # TODO: do something more intelligent with forms:
  #   - skip POST forms.
  #   - supply data to GET forms, or skip them too?
  html_links.concat((doc/"form[@action]").select {|e| e['action'] !~ /^javascript:/ }.collect {|e| e['action']})
  other_links = (doc/"img[@src]").collect {|e| e['src']}
  # TODO: get css stylesheets
  # TODO: get external javascript
  return [html_links, other_links]
end

def retrieve_page(url, limit=10)
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  res = Net::HTTP.get_response(url)
  case res
  when Net::HTTPSuccess          then res.body
  when Net::HTTPMethodNotAllowed then '<html></html>'  # good enough; this happens hitting POST forms
  when Net::HTTPRedirection      then retrieve_page(url + res['location'], limit - 1)
  else
    raise "Broken link: #{res.code}: #{res.message}"
  end
end

def confirmed_good(url)
  $good_links.has_key?(url.to_s)
end

def confirmed_bad(url)
  $bad_links.has_key?(url.to_s)
end

def record_link(src, url, linklist)
  u = url.to_s
  # We use a hash of hashes instead of a hash of arrays
  # because arrays could permit duplicates of src.
  linklist[u] = {} unless linklist.has_key?(u)
  linklist[u][src ? src : '[initial]'] = 1
end

def record_good_link(src, url)
  record_link(src, url, $good_links)
end

def record_bad_link(src, url)
  record_link(src, url, $bad_links)
end

def check_page(src, url, is_html)
  url.fragment = nil
  return if confirmed_good(url)  # don't bother recording the location of all good links
  if confirmed_bad(url)
    record_bad_link(src, url)
    return
  end
  # puts "checking #{url}: #{is_html}"
  begin
    data = retrieve_page(url)
    # p data
  rescue
    puts "error: #{$!} on #{url}"
    record_bad_link(src, url)
    return
  end

  record_good_link(src, url)
  if (is_html and within_site?(url) and not excluded(url))
    links = parse_links(data)
    # p links
    links[0].each {|u| check_page(url, url + u, true) }
    links[1].each {|u| check_page(url, url + u, false) }
  end
end

def excluded(url)
  u = url.to_s
  $excludes.each do |ex|
    return true if u =~ ex
  end
  return false
end

def print_links(hash)
  hash.each {|(k,v)|
    puts k
    v.each {|(k2,v2)| puts "\t#{k2}"}
  }
end

$site_domain = nil
$good_links = {}
$bad_links = {}
# TODO: allow some token to represent the hostname. (will $site_domain work? #{$site_domain}?):
# $excludes = [%r,http://jungwirths\.com/gallery/.*,]
$excludes = []

opts = GetoptLong.new(
            [ '--help', '-h', GetoptLong::NO_ARGUMENT ],
            [ '--exclude', '-e', GetoptLong::REQUIRED_ARGUMENT ]
          )
begin
  opts.each do |opt, arg|
    case opt
    when '--help'
      RDoc::usage 0
    when '--exclude'
      $excludes.push Regexp.compile(arg)
    end
  end
rescue Exception
  puts $!
  RDoc::usage 1
end

RDoc::usage 1 unless ARGV.length == 1
# p $excludes

starting_url = URI.parse ARGV[0]
starting_url = URI.parse("http://" + ARGV[0]) unless starting_url.scheme
starting_url.path = "/" unless starting_url.path.length > 0
$site_domain = starting_url.host

check_page(nil, starting_url, true)

# now print the results
puts "Good links:"
print_links($good_links)
puts "Bad links:"
print_links($bad_links)