ruby linkcheck

2009-03-28

I thought a script to check our website for broken links might come in handy, so I decided to whip something up in ruby. The script takes a starting page, visits it, follows all its links, and keeps going as long as it stays in the original domain. There is also an option to exclude portions of your site from being crawled: excluded links are still tested, but the pages behind them aren’t followed any further. I call the script like this:

$ linkcheck -e '^http://jungwirths\.com/gallery/.*' jungwirths.com

It was fun trying out something a little bigger. I’ve written a few other scripts in ruby lately; perhaps I’ll post them later.

All my work so far has been with ruby 1.8. I know 1.9 is out now, so I’d like to get it installed eventually. The Unicode support in 1.8 is pathetic. I found this out by trying to write an od-like program for UTF-8 files that mix English with polytonic Greek. I’m not sure what Unicode is like in 1.9; I haven’t been able to find anything very revealing by googling. It looks like they did something, though it was controversial. There was a lot of argument on the ruby mailing list a couple of years ago, but for the life of me, I can’t find anything that explains what the final decisions were.
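
To give a rough idea (a sketch only, not the od-like program itself): in 1.8 a String is a sequence of bytes, so length and indexing count bytes, and you have to decode the UTF-8 yourself, for example with unpack('U*'):

str = "abc λόγος"                # UTF-8 bytes: three ASCII letters, a space, five Greek letters
puts str.length                  # 14 in 1.8 -- the byte count, not the 9 characters
str.unpack('U*').each do |cp|    # unpack('U*') decodes the UTF-8 bytes into codepoints
  printf("U+%04X ", cp)
end
puts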

In my opinion Java gets Unicode just right: all strings are UTF-16, and encoding conversions happen on I/O. For instance, to read a UTF-8 file you wrap the file’s input stream in an InputStreamReader constructed with that encoding. There are also I/O classes for handling binary data. The programmer’s job is simple, because Strings always act the same and just do the right thing. It is a very effective use of modularization and separation of concerns. But I don’t think ruby went this route, because for various reasons Unicode is unpopular in Japan. I hope the ruby approach isn’t too painful. I enjoy ruby a lot, but I can’t imagine using it in production with such rudimentary capabilities. For me, weak Unicode support is a real deal-breaker.
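
For comparison, the closest 1.8 gets to Java’s convert-on-read is doing the conversion by hand with Iconv from the standard library; nothing happens automatically at the I/O boundary. A sketch (greek.txt is a made-up filename):

require 'iconv'

utf8  = File.read('greek.txt')                 # in 1.8 this is just a string of bytes
utf16 = Iconv.conv('UTF-16BE', 'UTF-8', utf8)  # explicit, manual conversion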

Here is the code:

#!/usr/bin/env ruby

# == Synopsis
#
# linkcheck: recursively look for broken links on a website
#
# == Usage
#
# linkcheck [options] start_url
#
# -h, --help
#   show help
#
# -e, --exclude [regex]
#   If any link matches the given regex, linkcheck will test it but not follow further links found there.
#   You can use this to mark off sections of your site that shouldn't be crawled.
#   Multiple -e flags may be given; a link that matches any of them will not be crawled further.

require 'rubygems'   # hpricot is installed as a gem
require 'rdoc/usage'
require 'getoptlong'
require 'net/http'
require 'uri'
require 'hpricot'

# TODO: allow a list of "irrelevant" url parameters, so we don't keep checking the same page.

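# Stay within the original site: is this URL on the host we started from?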
def within_site?(url)
  $site_domain == url.host
end

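# Extract links from an HTML document with Hpricot. Returns two lists:
# links to treat as HTML pages (anchors and form actions), and other
# resources (currently just images) to check but not parse.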
def parse_links(html)
  doc = Hpricot(html)
  html_links = (doc/"a[@href]").select {|e| e['href'] !~ /^javascript:/ }.collect {|e| e['href']}
  # TODO: do something more intelligent with forms:
  #   - skip POST forms.
  #   - supply data to GET forms, or skip them too?
  html_links.concat((doc/"form[@action]").select {|e| e['action'] !~ /^javascript:/ }.collect {|e| e['action']})
  other_links = (doc/"img[@src]").collect {|e| e['src']}
  # TODO: get css stylesheets
  # TODO: get external javascript
  return [html_links, other_links]
end

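# Fetch a URL, following redirects up to `limit` levels deep, and return the body.
# Any other failure raises an exception, which check_page records as a broken link.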
def retrieve_page(url, limit=10)
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  res = Net::HTTP.get_response(url)
  case res
  when Net::HTTPSuccess          then res.body
  when Net::HTTPMethodNotAllowed then '<html></html>'  # good enough; this happens hitting POST forms
  when Net::HTTPRedirection      then retrieve_page(url + res['location'], limit - 1)
  else
    raise "Broken link: #{res.code}: #{res.message}"
  end
end

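# Have we already checked this URL? Good and bad results are tracked separately.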
def confirmed_good(url)
  $good_links.has_key?(url.to_s)
end

def confirmed_bad(url)
  $bad_links.has_key?(url.to_s)
end

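# Remember that page `src` links to `url` in the given list (good or bad).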
def record_link(src, url, linklist)
  u = url.to_s
  # We use a hash of hashes instead of a hash of arrays
  # because arrays could permit duplicates of src.
  linklist[u] = {} unless linklist.has_key?(u)
  linklist[u][src ? src : '[initial]'] = 1
end

def record_good_link(src, url)
  record_link(src, url, $good_links)
end

def record_bad_link(src, url)
  record_link(src, url, $bad_links)
end

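# Check one URL: fetch it and record it as good or bad. If it's an HTML page
# on our own site (and not excluded), recursively check every link found on it.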
def check_page(src, url, is_html)
  url.fragment = nil
  return if confirmed_good(url)  # don't bother recording the location of all good links
  if confirmed_bad(url)
    record_bad_link(src, url)
    return
  end
  # puts "checking #{url}: #{is_html}"
  begin
    data = retrieve_page(url)
    # p data
  rescue
    puts "error: #{$!} on #{url}"
    record_bad_link(src, url)
    return
  end

  record_good_link(src, url)
  if (is_html and within_site?(url) and not excluded(url))
    links = parse_links(data)
    # p links
    links[0].each {|u| check_page(url, url + u, true) }
    links[1].each {|u| check_page(url, url + u, false) }
  end
end

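# True if the URL matches any of the --exclude patterns.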
def excluded(url)
  u = url.to_s
  $excludes.each do |ex|
    return true if u =~ ex
  end
  return false
end

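# Print each link followed by the pages that referenced it, tab-indented.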
def print_links(hash)
  hash.each do |url, sources|
    puts url
    sources.each_key {|src| puts "\t#{src}"}
  end
end

$site_domain = nil
$good_links = {}
$bad_links = {}
# TODO: allow some token to represent the hostname. (will $site_domain work? #{$site_domain}?):
# $excludes = [%r,http://jungwirths\.com/gallery/.*,]
$excludes = []

opts = GetoptLong.new(
            [ '--help', '-h', GetoptLong::NO_ARGUMENT ],
            [ '--exclude', '-e', GetoptLong::REQUIRED_ARGUMENT ]
           )
begin
  opts.each do |opt, arg|
    case opt
    when '--help'
      RDoc::usage 0
    when '--exclude'
      $excludes.push Regexp.compile(arg)
    end
  end
rescue StandardError  # not Exception, so RDoc::usage's SystemExit (from --help) isn't swallowed
  puts $!
  RDoc::usage 1
end
RDoc::usage 1 unless ARGV.length == 1
# p $excludes

starting_url = URI.parse ARGV[0]
starting_url = URI.parse("http://" + ARGV[0]) unless starting_url.scheme
starting_url.path = "/" unless starting_url.path.length > 0
$site_domain = starting_url.host

check_page(nil, starting_url, true)

# now print the results
puts "Good links:"
print_links($good_links)
puts "Bad links:"
print_links($bad_links)