I thought a script to check our website for broken links might come in handy, so I decided to whip something up in ruby. The script takes a starting page, visits it, follows all its links, and keeps going as long as it’s still in the original domain. There is also an option to exclude certain portions of your site from being checked. I call the script like this:
$ linkcheck -e '^http://jungwirths\.com/gallery/.*' jungwirths.com
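Multiple -e flags can be combined to wall off more than one section of the site; for instance (the second pattern here is just a hypothetical example):

$ linkcheck -e '^http://jungwirths\.com/gallery/.*' -e '^http://jungwirths\.com/private/.*' jungwirths.com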
It was fun trying out something a little bigger. I’ve written a few other scripts in ruby lately; perhaps I’ll post them later.
All my work so far has been with ruby 1.8. I know 1.9 is out now, so I’d like to get it installed eventually. The Unicode support in 1.8 is pathetic. I found this out by trying to write an od-like program for UTF-8 files that mix English with polytonic Greek. I’m not sure what Unicode support is like in 1.9; I haven’t been able to find anything very revealing by googling. It looks like they did something, though it was controversial: there was a lot of argument on the ruby mailing list a couple of years ago. But for the life of me, I can’t find anything that explains what the final decisions were.
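For example, here is the kind of thing that bites you in 1.8 (a quick sketch, not the od program itself, and it assumes the source file and terminal are both UTF-8):

s = "ἄλφα"              # polytonic Greek: 4 characters, 9 bytes in UTF-8
puts s.length           # => 9   -- in 1.8 String#length counts bytes
puts s.scan(/./u).size  # => 4   -- a /u regexp is the usual workaround
puts s[0]               # => 225 -- indexing a string gives back a byte value

Anything that wants to work character-by-character has to handle the encoding itself.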
In my opinion Java gets Unicode just right: all strings are UTF-16, and encoding conversions happen at the I/O boundary. For instance, to read a UTF-8 file, you wrap the file’s input stream in an InputStreamReader that converts from that encoding; there are also separate I/O classes for binary data. The programmer’s job is simple, because Strings always behave the same way and just do the right thing. It is a very effective use of modularization and separation of concerns. But I don’t think ruby went this route, because for various reasons Unicode is unpopular in Japan. I hope the ruby approach isn’t too painful. I enjoy ruby a lot, but I can’t imagine using it in production with such rudimentary Unicode handling; for me that’s a real deal-breaker.
Here is the code:
#!/usr/bin/env ruby
# == Synopsis
#
# linkcheck: recursively look for broken links on a website
#
# == Usage
#
#   linkcheck [options] start_url
#
# -h, --help
#    show help
#
# -e, --exclude [regex]
#    If any link matches the given regex, linkcheck will test it but not
#    follow further links found there. You can use this to mark off
#    sections of your site that shouldn't be crawled. Multiple -e flags
#    may be used; a link is excluded (tested but not followed) if it
#    matches any of them.
require 'rdoc/usage'
require 'getoptlong'
require 'net/http'
require 'uri'
require 'rubygems'  # hpricot is installed as a gem
require 'hpricot'
# TODO: allow a list of "irrelevant" url parameters, so we don't keep checking the same page.
def within_site?(url)
  $site_domain == url.host
end

# Extract the links from an HTML page. Returns two lists: links that should
# themselves be crawled (anchors and form actions), and links to other
# resources (images) that only need to be checked.
def parse_links(html)
  doc = Hpricot(html)
  html_links = (doc/"a[@href]").select {|e| e['href'] !~ /^javascript:/ }.collect {|e| e['href'] }
  # TODO: do something more intelligent with forms:
  #  - skip POST forms.
  #  - supply data to GET forms, or skip them too?
  html_links.concat((doc/"form[@action]").select {|e| e['action'] !~ /^javascript:/ }.collect {|e| e['action'] })
  other_links = (doc/"img[@src]").collect {|e| e['src'] }
  # TODO: get css stylesheets
  # TODO: get external javascript
  return [html_links, other_links]
end
# Fetch a URL, following up to `limit' redirects. Returns the page body
# (or a stub for 405 responses) and raises if the link is broken.
def retrieve_page(url, limit=10)
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0
  res = Net::HTTP.get_response(url)
  case res
  when Net::HTTPSuccess          then res.body
  when Net::HTTPMethodNotAllowed then '<html></html>' # good enough; this happens hitting POST forms
  when Net::HTTPRedirection      then retrieve_page(url + res['location'], limit - 1)
  else
    raise "Broken link: #{res.code}: #{res.message}"
  end
end
def confirmed_good(url)
  $good_links.has_key?(url.to_s)
end

def confirmed_bad(url)
  $bad_links.has_key?(url.to_s)
end

def record_link(src, url, linklist)
  u = url.to_s
  # We use a hash of hashes instead of a hash of arrays
  # because arrays could permit duplicates of src.
  linklist[u] = {} unless linklist.has_key?(u)
  linklist[u][src ? src : '[initial]'] = 1
end

def record_good_link(src, url)
  record_link(src, url, $good_links)
end

def record_bad_link(src, url)
  record_link(src, url, $bad_links)
end
# Check a single URL (linked to from `src'), record whether it is good or
# bad, and recurse into its own links if it is an HTML page on our site.
def check_page(src, url, is_html)
  url.fragment = nil
  return if confirmed_good(url) # don't bother recording the location of all good links
  if confirmed_bad(url)
    record_bad_link(src, url)
    return
  end
  # puts "checking #{url}: #{is_html}"
  begin
    data = retrieve_page(url)
    # p data
  rescue
    puts "error: #{$!} on #{url}"
    record_bad_link(src, url)
    return
  end
  record_good_link(src, url)
  if is_html and within_site?(url) and not excluded(url)
    links = parse_links(data)
    # p links
    links[0].each {|u| check_page(url, url + u, true) }
    links[1].each {|u| check_page(url, url + u, false) }
  end
end
def excluded(url)
  u = url.to_s
  $excludes.each do |ex|
    return true if u =~ ex
  end
  return false
end

def print_links(hash)
  hash.each do |k, v|
    puts k
    v.each {|k2, v2| puts "\t#{k2}" }
  end
end
$site_domain = nil
$good_links = {}
$bad_links = {}
# TODO: allow some token to represent the hostname. (will $site_domain work? #{$site_domain}?):
# $excludes = [%r,http://jungwirths\.com/gallery/.*,]
$excludes = []
opts = GetoptLong.new(
  [ '--help',    '-h', GetoptLong::NO_ARGUMENT ],
  [ '--exclude', '-e', GetoptLong::REQUIRED_ARGUMENT ]
)
begin
  opts.each do |opt, arg|
    case opt
    when '--help'
      RDoc::usage 0
    when '--exclude'
      $excludes.push Regexp.compile(arg)
    end
  end
rescue Exception
  puts $!
  RDoc::usage 1
end
RDoc::usage 1 unless ARGV.length == 1
# p $excludes
starting_url = URI.parse ARGV[0]
starting_url = URI.parse("http://" + ARGV[0]) unless starting_url.scheme
starting_url.path = "/" unless starting_url.path.length > 0
$site_domain = starting_url.host
check_page(nil, starting_url, true)
# now print the results
puts "Good links:"
print_links($good_links)
puts "Bad links:"
print_links($bad_links)