Archive for March, 2010

rsync and –exclude-from

Saturday, March 20th, 2010

Lately I’ve been looking for a better backup solution, and I settled on rsync. If you want a good tutorial, you can find one easily enough by googling. But I had a lot of trouble finding good documentation about the --exclude-from option. What I found was either too cursory or just plain wrong. So I did some tests, and here are my results. I’m pleased to say that rsync’s --exclude-from option has very full functionality!

This option names a file, as in --exclude-from=my-excludes.txt. That file contains a list of patterns, and if one of the patterns matches a given file or directory, rsync leaves that file/directory out of the backup.

These patterns may be full filenames, as in log, or they may use wildcards, as in *.bak. If a directory is excluded, then all the files within it are excluded, too. (This sounds obvious, but I’ve seen other programs with exclude files that just leave out the directory!)

You can even include bits of a path in your exclude file to further limit what gets excluded. For instance, if you say photos/thumb, then the thumb directory will be excluded whenever it appears inside a folder called photos. It will not be excluded if it appears in, say, body-parts. Also, rsync will still include your photos folders; just their thumbs folders will be left out.

Finally (and this is the part I was really interested in and found wrongly-documented elsewhere), you can include specific paths. You do this by starting the path with a slash: /root/tmp. This forward slash does not indicate a file at the root of your filesystem, like a normal unix path. Rather it tells rsync to anchor this path to the base of the top-level source folder.

Note that if you’re using this feature, you must use it consistently with how you specify your source folder. Remember that in rsync, if you name your source folder without a trailing slash, rsync copies that folder and everything in it, whereas if you give it a trailing slash, rsync copies just the folders’ contents. Well, anchored paths in the --exclude-from file must take account of this. Suppose your source folder is root. Then your excludes file would have a line like /root/tmp. If if your source folder is root/, then your excludes file needs a line like /tmp.

To put this all into an example, suppose you have this file structure and you want to exclude all the red files:


root/
     important.txt
     important.txt.bak
     mysql/
           ecom.db
           log/mysql.log
     photos/thumb/
     tmp/note.txt
     www/
         index.html
         log/www.log
         photos/thumb/
         tmp/index-ver2.html

In other words, you want to exclude:

  • All *.bak files.
  • All log directories.
  • All photo/thumb directories.
  • The top-level tmp directory.

Then you could either call rsync like this:


rsync -avz --exclude-from=excludes.txt root backups

with excludes.txt as follows:


*.bak
log
photos/thumb
/root/tmp

Or you could call rsync like this:


rsync -avz --exclude-from=excludes.txt root/ backups

with an excludes.txt like this:


*.bak
log
photos/thumb
/tmp

UPDATE: One more note: You might think that ending a line in the excludes file with a trailing slash would work analogously to naming a source directory with a trailing slash: “include this directory but not its contents.” That way if you were backing up your whole filesystem, you could have a line like /mnt/ that would create a /mnt directory but ignore all its contents. But in fact, rsync just seems to ignore trailing slashes in the excludes file. If you want to exclude everything in the /mnt directory, you need a line like this instead: /mnt/*.

A Linq Catalog

Tuesday, March 16th, 2010

I thought I’d review the special methods available in Linq. In the API reference, they can be found under the Enumerable<T> class. Here they are:


static IEnumerable<T> Empty<T>()
static IEnumerable<T> Repeat<T>(T element, int count)

T[] ToArray()
List<T> ToList()
Dictionary<K,T> ToDictionary(Func<T,K> keySelector)    // puts every element in a dictionary
                                                       // using keys generated by keySelector
Dictionary<K,T> ToDictionary(Func<T,K> keySelector, IEqualityComparer<K>)
Dictionary<K,R> ToDictionary(Func<T,K> keySelector, Func<T,R> elemSelector)
                               // transforms each T to an R and puts the Rs in the dictionary
Dictionary<K,R> ToDictionary(Func<T,K> keySelector, Func<T,R> elemSelector, IEqualityComparer<K>)
Lookup<K,T> ToLookup(Func<T,K> keySelector)           // a Lookup is like a Dictionary except
                                                      // each key points to a list of values.
                                                      // In Perl terms it's a hash of arrays.
Lookup<K,T> ToLookup(Func<T,K> keySelector, IEqualityComparer<K>)
Lookup<K,R> ToLookup(Func<T,K> keySelector, Func<T,R> elemSelector)
Lookup<K,R> ToLookup(Func<T,K> keySelector, Func<T,R> elemSelector, IEqualityComparer<K>)

T Aggregate(Func<T,T,T> adder)        // like python's reduce

T ElementAt(int index)
T ElementAtOrDefault(int index)       // returns this[i] or default(T) if i >= this.Count

T First()
T First(Func<T,Boolean>)              // the first element that satisfies the predicate
T FirstOrDefault()
T FirstOrDefault(Func<T,Boolean>)
T Last()
T Last(Func<T,Boolean>)               // the last element that satisfies the predicate
T LastOrDefault()
T LastOrDefault(Func<T,Boolean>)

T Single()                        // returns the only element of the sequence,
                                  // or throws an exception
T Single(Func<T,Boolean> f)       // returns the only element for which f returns true.
                                  // Exception if there is not one and only one.
T SingleOrDefault()               // default(T) if empty. Still exception if more than one T.
T SingleOrDefault(Func<T,Boolean> f)

bool Contains(T)
bool Contains(T, IEqualityComparer<T>)
bool All(Func<T, bool> predicate)       // true if predicate is always true
bool Any(Func<T, bool> predicate)       // true if predicate is ever true
bool Any()                              // true if the sequence contains any elements
bool SequenceEqual(IEnumerable<T>)      // true if all the elements of the two sequences
                                        // are equal, using T's default equality comparer.
bool SequenceEqual(IEnumerable<T>, IEqualityComparer<T>)

int Count()
int Count(Func<T,Boolean>)    // counts how many elements satisfy the predicate.
long LongCount()
long LongCount(Func<T,Boolean>)

int Average()
int Average(Func<T,int> transform)    // calls transform to alter each element before
                                      // including it in the average.
                                      // You could use this to get a weighted average, e.g.
int Max()
int Max(Func<T,int> transform)        // uses transform to turn each T into an int
int Min()
int Min(Func<T,int> transform)
int Sum()
int Sum(Func<T,int> transform)
// ... lots of overloads ...

IEnumerable<T> DefaultIfEmpty()          // returns [default(T)] if empty
IEnumerable<T> DefaultIfEmpty(T elem)    // returns [elem] if empty.

IEnumerable<T> AsEnumerable()    // forces the compiler to call IEnumerable<T> methods
IEnumerable<R> Cast<R>()         // casts each element to an R
IEnumerable<R> OfType<R>()       // filters out any elements not of type R

IEnumerable<T> Intersect(IEnumerable<T> compare)    // the standard set operation
IEnumerable<T> Intersect(IEnumerable<T> compare, IEqualityComparer<T>)
IEnumerable<T> Union(IEnumerable<T> compare)        // the standard set operation
IEnumerable<T> Union(IEnumerable<T> compare, IEqualityComparer<T>)
IEnumerable<T> Except(IEnumerable<T> compare)       // returns this - compare
IEnumerable<T> Except(IEnumerable<T> compare, IEqualiterComperer<T>)

IEnumerable<T> Distinct()                      // like SQL distinct keyword
IEnumerable<T> Distinct(IEqualityComperer<T>)
IEnumerable<T> Concat(IEnumerable<T>)          // like Union, but with list semantics
                                               // instead of set
IEnumerable<T> Range(int start, int count)

IEnumerable<T> Skip(int n)                       // skips n elements; returns the rest
IEnumerable<T> SkipWhile(Func<T,Boolean> f)      // skips while f is true; returns the rest
IEnumerable<T> SkipWhile(Func<T,int,Boolean> f)  // f also gets the 0-based index
IEnumerable<T> Take(int n)                       // returns Range(0,n)
IEnumerable<T> TakeWhile(Func<T,Boolean> f)      // takes while f is true; skips the rest
IEnumerable<T> TakeWhile(Func<T,int,Boolean> f)

IEnumerable<T> While(Func<T,Boolean> f)     // gets the elements where f is true.
IEnumerable<T> While(Func<T,int,Boolean> f) // f also knows the 0-based index.

IEnumerable<IGrouping<K,T>> GroupBy(Func<T,K> f)    // f generates a key for each element
IEnumerable<IGrouping<K,R>> GroupBy(Func<T,K> f, Func<K, IEnumerable<T>, R> resultSelector)
                       // resultSelector gets an enumeration of values with a given key,
                       // and it returns a single value of type R, e.g. representing
                       // their min value. Note that you could use anonymous classes
                       // to stick several values into R, e.g. a min, max, and average.
                       // There is a nice example here:
                       // http://msdn.microsoft.com/en-us/library/bb549393.aspx.
... lots more overloads ...

IOrderedEnumerable<T> OrderBy(Func<T,K> keySelector)    // orders the elements based on keys
                                                        // generated by keySelector
IOrderedEnumerable<T> OrderBy(Func<T,K> keySelector, IComparer<K> comp)
                                // orders the elements based on keys generated by keySelector
                                // and compared by comp
IOrderedEnumerable<T> OrderByDescending(Func<T,K> keySelector)
IOrderedEnumerable<T> OrderByDescending(Func<T,K> keySelector, IComparer<K> comp)
IEnumerable<T> Reverse()
IOrderedEnumerable<T> ThenBy(Func<T,K> keySelector)     // sub-sorts elements deemed equal
                                                        // by a previous sort
IOrderedEnumerable<T> ThenBy(Func<T,K> keySelector, IComparer<K> comp)
IOrderedEnumerable<T> ThenByDescending(Func<T,K> keySelector)
IOrderedEnumerable<T> ThenByDescending(Func<T,K> keySelector, IComparer<K> comp)

IEnumerable<T> Join(IEnumerable<I> inner, Func<O,K> outerKeySelector,
        Func<I,K> innerKeySelector, Func<O,I,R> resultSelector)
                // joins two sequences as an inner equijoin.
                // There is no method for outer joins,
                // but GroupJoin can get used to get that effect.
IEnumerable<O,I,K,R> GroupJoin(IEnumerable<I> inner, Func<O,K> outerKeySelector,
        Func<I,K> innerKeySelector, Func<O, IEnumerable<I>, R> resultSelector)
                // joins the two sequences (see Join) and groups the results (see GroupBy).
                // Unlike with Join, this function can be used to do a left outer join,
                // because even when there is no value in inner matching a value in outer,
                // the resultSelector gets called for each outer value,
                // just with an empty IEnuemrable<I>.

IEnumerable<R> Select(Func<T,R> f)                  // uses f to transform each T to an R
IEnumerable<R> Select(Func<T,int,R> f)              // f also knows the 0-based index of each T
IEnumerable<R> SelectMany(Func<T,IEnumerable<R> f)  // f can return many Rs. All the Rs
                                                    // are flattened into one sequence.
IEnumerable<R> SelectMany(Func<T,int,IEnumerable<R> f)
IEnumerable<R> SelectMany(Func<T,IEnumerable<U> f, Func<T,U,R> g)
                // f generates an intermediate value of type U; g uses both a T and U
                // to get an R. This is really only useful if U doesn't have access to T.
                // See the example code here:
                // http://msdn.microsoft.com/en-us/library/bb534631.aspx.
IEnumerable<R> SelectMany(Func<T,int,IEnumerable<U> f, Func<T,U,R> g)

Equals and compareTo in Subclasses

Monday, March 15th, 2010

The other day I read this interesting paper on contructing a correct equals method when subclassing. It is about Java, but it applies equally well to C#. They cite Josh Bloch’s book Effective Java, who writes:

There is no way to extend an instantiable class and add a value component while preserving the equals contract, unless you are willing to forgo the benefits of object-oriented abstraction.

I read this book a while back—maybe six or seven years ago now. At the time I thought it was invaluable. It seemed like the Java version of Expert C Programming: Deep C Secrets (the fish book). But when I think back on it now, all I can remember is the author’s ongoing struggle to overcome Java’s limitations and contradictions. It seemed he was constantly recommending more and more verbose code to get around problems in the underlying language. I guess that’s not Bloch’s fault, but Java just seems to be like that. It’s the reason my resume has a crowded half-page of Java TLAs.

Anyway, contrary to Bloch’s commonly-accepted denial, the authors of the paper present a way to write an equals method that preserves the contract of equals even when extending from another non-abstract class and adding more state. Their solution isn’t even that verbose or twisted. It’s worth a read. Basically they recommend this (some parts changed for brevity):


public class Point {
    public int x;
    public int y;

    public boolean equals(Object other) {
        boolean result = false;
        if (other instanceof Point) {
            Point that = (Point)other;
            result = that.canEqual(this) && this.x == that.x && this.y == that.y;
        }
        return result;
    }

    public boolean canEqual(Object other) {
        return (other instanceof Point);
    }

    public int hashCode() {
        return 41 * (41 + x) + y;
    }
}

public class ColoredPoint extends Point {
    public Color color;

    public boolean equals(Object other) {
        boolean result = false;
        if (other instanceof ColoredPoint) {
            ColoredPoint that = (ColoredPoint)other;
            result = that.canEqual(this) && color.equals(that.color) && super.equals(that);
        }
        return result;
    }

    public boolean canEqual(Object other) {
        return (other instanceof ColoredPoint);
    }

    public int hashCode() {
        return 41 * super.hashCode() + color.hashCode();
    }
}

The trick here is that canEquals method. It is not really a public method; it is only called from within the equals method. But note that an object doesn’t call it’s own canEquals method; it calls it on the other object. This lets the two objects agree that they are really equal, and it solves the problem of non-symmetric implementations of equals (where a.equals(b) != b.equals(a)). This is a common problem, because a Point might think it equals a ColoredPoint, whereas the ColoredPoint knows it doesn’t equal the Point.

The naive way of ensuring symmetry would be to replace instanceof with a comparison of the object’s actual class. But this is too crude, because it means you can’t use anonymous classes. For instance, a Point should still equal an anonymous class instance like this one:


Point pAnon = new Point() {
    public void overrideSomeMethod() {
        // ...
    }
}

With canEquals, the anonymous class simple inherits canEquals from Point, and the two objects will still agree on their equality. I think this is a really nice solution to a thorny problem.

The forum discussion about the paper (which is almost as good as the paper itself) argues that Java ought to support an Equalator interface as a parallel to Comparator<T>. The idea is that just as you can override the “natural ordering” of a class, you should be able to override its “natural equivalence.” This would let you instantiate a Set, HashMap, etc. with an Equalator to get a different notion of equals than usual. Just as objects may sort differently in different contexts, so they may “be equal” differently in different contexts, depending on what you care about. Who hasn’t run into the need for a Set based on reference identity, for example? Apache Collections provides just such a class.

The need for an Equalator seems most pressing in classes like TreeSet that use compareTo rather than equals to test for set duplicates. If you use a TreeSet with a Comparator that is not consistent with equals, that the TreeSet will appear to violate the set contract, because you could have a.equals(b) but s.contains(a) != s.contains(b). I went to bed thinking it’s a shame Sun hasn’t added this Equalator concept.

But as I was thinking about it over the night, I started to believe Sun is right to leave it out, at least in so far as it pertains to classes like TreeSet that use compareTo instead of equals. Basically, the TreeSet’s Comparator is already operating as an Equalator here. Why do you need an Equalator, too? What problem would it solve that isn’t already solved by the Comparator? If you passed an Equalator to a TreeSet, it wouldn’t change this code problem: a.equals(b) && (s.contains(a) != s.contains(b)). The whole point of an Equalator is to impose a different notion of equality on a limited context, and with TreeSet a Comparator is sufficient for that.

Of course, that’s not to say an Equalator wouldn’t be useful in a regular Set or Map. It turns out that C# does have the Equalator idea, but it’s called IEqualityComparer<T>. It doesn’t seem to be used much, but Dictionary<K,V> and HashSet<T> both support it.

I actually came across this paper while thinking about how C#’s CompareTo<T> can work in a class heirarchy. As in Java, this method must be “consistent with Equals.” That is, whenever Equals returns true, CompareTo must return 0, and whenever CompareTo returns 0, Equals must return true. Put into code, (a.CompareTo(b) == 0) == a.Equals(b). The CompareTo method is tricky because it’s parameterized: its signature is bool CompareTo(T o), where T comes from IComparable<T>.

So what happens if you have Base : IComparable<Base> and Subclass : Base, IComparable<Subclass>? My instinct is you’re asking for trouble, although when I think it through it seems that the compiler will choose the method based on the current static type, not the instance’s actual type, so maybe you’d be okay. If your code is interested in comparing Bases, you’ll call that method; you’ll only call CompareTo(Subclass o) if you’re explicitly comparing Subclasses. So maybe everything will work out, but I’m still uneasy.

I also see that C# 4.0 is supporting new keywords for co- and contra-variance in generic parameters. So we get IEnumerable<out T> and IComparable<in T>. This means that if you implement IEnumerator<Subclass> GetEnumerator, you also fulfill the contract for IEnumerable<Base>, and if you implement CompareTo(Base o), your subclass doesn’t have to implement CompareTo(Subclass o) in order to fulfill the contract for IComparable<Subclass>. I hope I’ve got that right!

The first part—covariant return types—seems like the bigger deal here. (I only wish it were full covariant return types as in C++!) But the part about IComparable seems nice, too. It should save a bit of code, because it means that if you have a full-featured base class and you want to write a quick subclass on top of it, you can still use your subclass in things that require an IComparable<Subclass> (like List<Subclass>) without writing another CompareTo implementation.

Viewing Unit Test Output in Visual Studio

Thursday, March 11th, 2010

I’ve been using Visual Studio 2008 to write unit tests. I had a couple failing in very strange ways, and I wanted to add some debugging statements to watch the state of the code. I usually find this more useful than running a debugger. (If that seems weak, then I’ll just say that K & R agrees with me!) I can print whatever state I want and I don’t have to deal with breakpoints. Furthermore, once those debugging messages are in there, I can leave them there so they help out next time.

The problem was that I couldn’t figure out how to see Console output for unit tests. I kept trying to open various Output windows: Build output, Debug output, Refactor output. No Test output! Googling didn’t turn up anything. Today I was hitting the same problem, so I thought I’d try Googling one more time. I must have hit the magic search term combination, because this time all my top results were relevant.

It turns out to see a test’s output, you just double-click on the test summary line, and all the output is down at the bottom of that window. You get Console.Out messages and (more importantly) {Trace,Debug}.WriteLine(). Wonderful!

Sort by File Size with du

Tuesday, March 2nd, 2010

Here is a handy Perl script I wrote a while back to sort the output from du -sh by file size. The standard sort command can’t do this because it doesn’t know how to compare values like “488M” and “5.0K.” My code will sort any lines where this values appear in the first field. I’m sure the Perl could be more compressed, but keeping it easy to read like this is more my style:


#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my @lines;

while (<>) {
        chomp;
        push @lines, [unabbrev($_), $_];
        # print "$_: ", unabbrev($_), "\n";
}

# print Dumper \@lines;

for my $line (reverse sort { return $a->[0] <=> $b->[0] } @lines) {
        print $line->[1], "\n";
}

sub unabbrev {
        my $val = shift;
        if ($val =~ m/^\s*(\d+(\.\d+)?)([KMGB]?)/) {
                if ($3 eq 'K') {
                        $val = $1 * 1000;
                } elsif ($3 eq 'M') {
                        $val = $1 * 1000000;
                } elsif ($3 eq 'G') {
                        $val = $1 * 1000000000;
                } else { # B or nothing
                        $val = $1;
                }
        }
        return $val;
}

It’d be fun to re-write this in ruby—or even better, add it as a feature to GNU sort.

UPDATE: Reading the documentation for GNU coreutils (which contains sort), I see that sort does have an -h option (for –human-numeric-sort). Strangely, this option is not documented in the man page on Ubuntu 9.10 and is unrecognized by /usr/bin/sort. I guess I’ve got an old version.

If anyone is looking for a small open source project, sort’s implementation of this option could still be improved. Right now, according to the online docs, “values with different precisions like 6000K and 5M will be sorted incorrectly.” It’d be great if it fully implemented the rules for block size used by other coreutils programs.