Archive for February, 2010

C# XmlTextReader Tutorial

Saturday, February 27th, 2010

I needed to read a big XML file into an object structure. I wanted it to be fast and use a low memory footprint. I also wanted the XML to stay pretty clean to make future support easy. Because of the speed and memory requirements, DOM and XPath were out. I toyed around with XmlSerializer, but it didn’t quite give me the XML I wanted—too ugly—and I didn’t like cluttering my classes with xml serialization attributes. That doesn’t belong in the objects, does it? And then the code to structure the XML is scattered around everywhere. And finally, what if I need to serialize the objects in different ways at different times?

So I thought I’d try my hand at XmlTextReader, which is a bit like SAX but more “push” than “pull.” It’s not based on callbacks, and you don’t have to manage state so attentively. Coming from the Java world, I’m used to SAX. I actually like it. The state machine thing isn’t so hard, really. So I was pretty excited about XmlTextReader. It looked like it would have the advantages of SAX but be easier to use.

Having now written some code with XmlTextReader, I’m still pretty happy with it, but I’m a bit disappointed that Microsoft junked up the API so much. It seems gappy and needlessly complicated. But having learned it, I thought I’d set it all down in writing.

One approach, which is fairly SAX-like, is to put everything into a read loop and switch on element names. That might look like this:


XmlTextReader r = new XmlTextReader(stream);
r.WhiteSpaceHandling = WhiteSpaceHandling.None;
while (r.Read()) {
  if (r.NodeType.Equals(XmlNodeType.Element)) {
    switch (r.LocalName) {
	  case "this":
	    // processing ...
		break;
      case "that":
	    // processing ...
		break;
	  // more cases ...
	}
  }
}

If you wanted, you could stop there. The Read() method gives you one node at a time, and you handle each element. You could add code to handle endElements also, or comments, or whatever, just as in SAX. But there are lots of other methods to make things easier.

If you’ve never done this sort of thing before, you should know that an XML document consists of nodes. A node can be a start element, an end element, a run of text, a comment, a processing instruction, even whitespace. By setting WhitespaceHandling to None above, we told the XmlTextReader to skip whitespace, so Read() doesn’t report it. Attributes are nodes too, but the Read loop doesn’t emit them, either. Instead you use special methods to get at attributes when you’re positioned on their containing element. As you can see, some nodes have children (e.g. some elements), whereas other nodes are leaves (text, attributes, other elements).

One tricky thing is to keep track of your current position in the document. In describing the various methods below, I’ll try to pay attention to how they affect the document position. The Read() element advances one node.

First let’s talk about some methods that look useful but you should probably avoid. One is ReadElementString(). This advances one element and returns the contents of the next element as a string. A variant is ReadElementString(elemName), which verifies that the current element matches the given name. I didn’t find these methods too useful. The first gives you no checking to see if you’re actually reading the right thing. The second checks the wrong thing. I don’t want to test the current element and then read the contents of the next one. The test needs to be the name of the next element. Both of these methods read the element blindly; the latter just looks backwards a bit.

Another method to avoid is ReadString(). This returns a string when positioned on either an element or a text node. That’s handy, but it doesn’t skip comments and processing instructions very well. For that we’ll need a different method.

One method that looks great is MoveToElement(elemName). But if you think this will scan through the document until it finds your desired element, you’re wrong. Instead, this is used in attribute processing to move up from the attributes back to their owning element. Alas!

One method that really is handy is MoveToContent. This skips over comments, whitespace, processing instructions, and documentType nodes. In all your processing, you should be aware that users may throw comments into weird places. Robust XML parsing doesn’t get tripped up by comments. So this call is quite useful.

To get the contents of an element, ignoring any comments, you want the ReadContentAsXXX() methods and the ReadElementContentAsXXX() methods. These methods skip comments and processing instructions and automatically convert entities. This is just the sort of friendly assistance you want from your XML parser. As for the XXX, you have a lot of options. It could be String, Int, DateTime, even Object.

The difference between ReadConentAs and ReadElementContentAs is this: the first must be positioned on a text node, whereas the latter can be positioned on either a text node or the text’s containing element. If you call ReadContentAs on an element, you get an exception. In practice, I think ReadElementContentAs is the more useful family of methods. Also, when it returns, the reader is positioned at whatever follows the endElement node of what you just read. If you’re ignoring whitespace, this could be the next element. (Or it could be a comment, etc.)

Then there are a few methods for moving around in the document. I think this is where the API really falls short, but here’s what we’ve got: ReadToFollowing, ReadToNextSibling, and ReadToDescendant. All these methods take an element name and return true if it is found, leaving you positioned on the element node (and ready to call ReadElementContentAs). If they can’t find what you want, they return false, and then your new position in the document is however far they’ve searched. You’ll be at EOF for ReadToFollowing, or the end tag of the current/parent element for the other two. This is a bit of a disappointment. If you’re looking for a required element, you could just throw an exception, but what about finding optional elements? It’s no big deal the element wasn’t there, but you’re way off course.

One other useful method is Skip(). This skips the children of the current node.

Here is some code making use of these elements to parse a fairly simply document. The document describes a family, with elements for headOfHousehold, spouse, and children. Each person has a name, birthdate, and sex. Let’s treat the last two of these as optional. The spouse node can also have an optional marriageDate element. We could indicate the XML structure like this, where wildcards have their usual meaning (? = 0 or 1, * = 0 or more, + = 1 or more):


family
  headOfHousehold
    name
      birthdate?
      sex?
  spouse?
    name
    birthdate?
    sex?
    marriageDate?
  children?
    child+
      name
      birthdate?
      sex?

Here is some code that processes such a file by building up an object structure, then prints the results:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;

namespace XmlTextReaderTest {

    public enum Sex { MALE, FEMALE }

    public class Person {
        public string Name { get; set; }
        public DateTime Birthdate { get; set; }
        public Sex Sex { get; set; }
        public DateTime? MarriageDate { get; set; }

        public override string ToString() {
            return String.Format("{0}: {1}, {2:yyyy-MM-dd}, {3:yyyy-MM-dd}", Name, Sex, Birthdate, MarriageDate);
        }
    }

    public class Family {
        public Person Head { get; set; }
        public Person Spouse { get; set; }
        public List<Person> Children { get; private set; }

        public Family() {
            Children = new List<Person>();
        }

        public override String ToString() {
            String ret = "Head:\n\t" + Head + "\nSpouse:\n\t" + Spouse + "\n";
            foreach (Person p in Children) {
                ret += "Child:\n\t" + p + "\n";
            }
            return ret;
        }
    }

    public class Program {
        static void Main(string[] args) {
            string xmlDoc = @"<?xml version='1.0' encoding='utf-8'?>
<family>
    <headOfHousehold>
      <name id='asdf'>Paul Jungwirth</name>
      <!-- not my real birthday of course: -->
      <birthdate>1975-02-08</birthdate>
      <sex>male</sex>
    </headOfHousehold>
    <spouse>
      <name>Arielle Jungwirth</name>
      <birthdate>1979-11-11</birthdate>
      <sex>female</sex>
      <marriageDate>2006-09-09</marriageDate>
    </spouse>
    <children>
      <child>
        <name>James Jungwirth</name>
        <birthdate>2007-12-31</birthdate>
        <sex>male</sex>
      </child>
      <child>
        <name>Miriam Jungwirth</name>
        <birthdate>2010-01-20</birthdate>
        <sex>female</sex>
      </child>
    </children>
</family>";
            Family f = ParseFamily(new MemoryStream(Encoding.Default.GetBytes(xmlDoc)));
            Console.Write(f);
            Console.ReadLine();
        }

        public static Family ParseFamily(Stream stream) {
            Family f = new Family();
            XmlTextReader r = new XmlTextReader(stream);
            r.WhitespaceHandling = WhitespaceHandling.None;
            r.MoveToContent();

            while (r.Read()) {
                // Console.WriteLine(r.NodeType + ": " + r.LocalName + "\n");
                if (r.NodeType.Equals(XmlNodeType.Element)) {
                    switch (r.LocalName) {
                        case "headOfHousehold":
                            f.Head = ParsePerson(r);
                            break;
                        case "spouse":
                            f.Spouse = ParsePerson(r);
                            f.Head.MarriageDate = f.Spouse.MarriageDate;
                            break;
                        case "child":
                            f.Children.Add(ParsePerson(r));
                            break;
                        default:
                            // ignore other nodes
                            break;
                    }
                }
            }

            return f;
        }

        public static Person ParsePerson(XmlTextReader r) {
            Person p = new Person();

            // Right now we're pointing to the person's containing element, e.g. headOfHousehold.
            // Read past that, then read until we get to a new start element.
            r.Read();

            r.MoveToContent();
            if (r.LocalName.Equals("name")) p.Name = r.ReadElementContentAsString();
            else throw new InvalidDataException("no name for person");

            r.MoveToContent();
            if (r.LocalName.Equals("birthdate")) p.Birthdate = r.ReadElementContentAsDateTime();

            r.MoveToContent();
            if (r.LocalName.Equals("sex")) p.Sex = (Sex)Enum.Parse(typeof(Sex), r.ReadElementContentAsString(), true);

            r.MoveToContent();
            if (r.LocalName.Equals("marriageDate")) p.MarriageDate = r.ReadElementContentAsDateTime();

            return p;
        }
    }
}

This code demonstrates on a small scale the pattern I use to parse XML documents and keep the code manageable. I read through the whole document using our Read/switch loop, and I call out to helper functions to build objects representing significant chunks. These methods may call other helper functions or (as here) just navigate through the XML to pick out primitives. Each chunk of code is self-contained, and you’re never looking at more than a page or so.

You can see that in building each Person, I use MoveToContent and then test the name of the next element. Calling ReadElementContentAs takes me past the endElement, so afterwards I’m ready to read some more. If I’m already on an element, MoveToContent won’t advance at all, so it’s safe to call twice in the case when an optional element is missing.

You could also implement ParsePerson as a second Read/switch loop. That would mean the child elements can come in any order, but you’d have to verify at the end that you got data for all the required ones. You also may not know when to exit, if the name of your endElement can vary as in this example (e.g. “spouse” vs. “child”).

Sorting Out the Confusion: 32- vs. 64-Bit, CLR vs. Native, C# vs. C++

Saturday, February 13th, 2010

I’ve been trying to learn how things work on Windows based on whether you write code in C# or C++, target a 32- or 64-bit platform, and produce files with either native code or one of the CLR options. One of my focuses is the interaction between exes and dlls. I think I’ve got things mostly straightened out, so this is what I’ve learned.

First, the basics: a 32-bit platform can run 32-bit apps, but not 64-bit apps. A 64-bit platform can run either, but 32-bit apps run in an emulation environment called WOW64 (Windows on Windows 64). When Windows starts your app, it decides whether WOW64 is necessary. You can tell whether your app is running in WOW64 using this C++ code:


#include "stdafx.h"
#include <windows.h>

typedef BOOL (WINAPI *LPFN_ISWOW64PROCESS) (HANDLE, PBOOL);

LPFN_ISWOW64PROCESS fnIsWow64Process;

BOOL isWow64() {
    BOOL ret = FALSE;
    fnIsWow64Process = (LPFN_ISWOW64PROCESS)GetProcAddress(
        GetModuleHandle(TEXT("kernel32")), "IsWow64Process");

    if (NULL != fnIsWow64Process) {
        if (!fnIsWow64Process(GetCurrentProcess(), &ret)) {
            printf("Got some error\n");
        }
    }
    return ret;
}

int _tmain(int argc, _TCHAR* argv[]) {
    if (isWow64()) {
        printf("Running under WOW64.\n");
    } else {
        printf("NOT running under WOW64.\n");
    }
    scanf("press return");
    return 0;
}

It’s easy enough to call isWow64 from C#, like so:


[DllImport]("IsWow64Dll.dll")]
static extern bool isWow64();

static void Main(String[] args) {
    Console.WriteLine(isWow64().ToString());
    Console.ReadLine();
}

Visual Studio lets you build files for either 32- or 64-bit platforms. I’ve already written how to build for 32 or 64 bits in C++. C# actually provides three options: 32-bits, 64-bits, or “Any CPU.” We can use a tool called corflags to see what results we get depending on which option we choose. Corflags comes with Visual Studio and can be run by choosing the special DOS prompt command under Visual Studio in the Start menu. This is a little different from the regular DOS prompt: it has a specially-tailored environment for running Visual Studio’s command-line utilities. From there, you can ask corflags to report information about any exe or dll, like this:


C:\> corflags myapp.exe
Microsoft (R) .NET Framework CorFlags Conversion Tool.  Version  3.5.21022.8
Copyright (c) Microsoft Corporation.  All rights reserved.

Version   : v2.0.50727
CLR Header: 2.5
PE        : PE32
CorFlags  : 3
ILONLY    : 1
32BIT     : 1
Signed    : 0

We’re mostly interested in three values: PE, 32BIT, and ILONLY. There is also a line labelled “Signed,” which I’m not interested in right now. Finally, the “CorFlags” line appears to be a combination of the four other values.

PE specifies whether or not the file can run on 32-bit platforms. It is either PE32 or PE32+. A PE32+ file cannot run on a 32-bit machine.

Next there is the 32BIT flag. This is a little different from PE. If PE indicates whether your app can run as 32 bits, then 32BIT indicates whether it must run as 32 bits. If this flag is 0, your app can run on a 64-bit machine without WOW64. But if the flag is 1, then your app has to run under WOW64. Here is a table showing how the bits are set depending on your compiler’s /platform setting:

Compiler Option PE 32BIT
x86 PE32 1
Any CPU PE32 0
x64 PE32+ 0

From this table, you can see that the corflags example above is inspecting a C# app built for the x86 platform. Note that you could never have a file that is PE32+ with the 32BIT flag set, because then one flag would require 32 bits and the other 64.

To put all this together, a 32-bit machine can run anything with a PE set to PE32, but nothing with a PE of PE32+. A 64-bit machine can run your file in 64-bit mode as long as 32BIT is 0, but if 32BIT is 1 then it must use WOW64.

The ILONLY flag indicates that your file contains only MSIL opcodes (recently renamed to CIL), with no native assembly instructions. A C# app will always have this flag set (unless you use something like ngen to compile down to machine language—an approach with some distribution problems), but a C++ app’s setting depends on your compiler options (described below).

When it comes to loading dlls, these flags control whether your app loads the dll successfully or gets a BadImageFormatException. Basically, a 32-bit app can only load 32-bit dlls, and a 64-bit app can only load 64-bit dlls. But what about apps compiled as “Any CPU”? In that case, you can only load dlls matching whatever bitness you’re currently running as. Of course, if you’re running on a 32-bit machine, there is no complication, because everything is 32-bit already.

But on a 64-bit machine, you may have problems. Windows will not use WOW64 for your app, because it claims to support 64-bit operation. But if your app has a dependency on a 32-bit dll, then you’ll get a BadImageFormatException, because the 32-bit dll only works in WOW64. The choice to use WOW64 happens only when starting your app. You can’t run an app natively and load just the dlls in WOW64. So you get the exception.

The solution is to tell Windows that your app must start in WOW64 from the beginning. You should probably do this by building your app for x86, not Any CPU, but if that is somehow a problem (e.g. you don’t have the code), then you can use corflags to set the 32BIT flag. You just type something like this:


corflags /32BIT+ myapp.exe

For C++ applications, you can do something similar with the linker’s /clrimagetype flag.

Another choice, at least when writing in C++, is how to support the CLR. You can choose among four options: native (the default), /clr, /clr:pure, and /clr:safe. The first one is simple enough: you get a file with machine language instructions. The other three give you a file that is partially or entirely composed of MSIL. Using /clr will produce a CLR header and mostly MSIL code, but with some native code mixed in. Specifically, you get native data types but MSIL functions, unless the function uses something unsupported like function pointers. (Everyday pointers to data are supported.) You can also use #pragma unmanaged to force native code. Because these files have some native code, they must be built for a specific platform, either x86 or x64.

The /clr:pure option does what it sounds like: it gives you a file of entirely MSIL. Nonetheless, it must be built for either x86 or x64. This option is said to be equivalent to a C# project with unsafe code.

Microsoft’s documentation on the /clr and /clr:pure flags says that they can only produce x86 files, but my tests prove this to be false. If I build the C++ version of the WOW64-tester, using x64 and /clr compilation options, then it reports that it is not running in WOW64. So apparently you can in fact produce x64 applications with these options.

The last one, /clr:safe, enforces code that is verifiably type-safe—but I’m not sure what all that means. I’ve read that if you use this option, your file can run on any platform, like building as Any CPU in C#. This option requires that you use Microsoft’s C++/CLI language, formerly known as Managed C++. I know nothing about this, but people say it’s virtually a new language. I tried to build a Hello World app with printf and got innumerable compile errors, so I wasn’t able to run any tests on what this option produces.

There is also a /clr:oldSyntax option, which is like /clr:safe but with the old Managed C++ syntax rather than C++/CLI. Since Managed C++ is deprecated, I’m not sure why you’d use this for new code.

I don’t know what the /clr* options mean for P/Invoke. If I build a dll with /clr or /clr:pure, does that mean I can call its exported functions from C# without a DllImport statement? I haven’t tried. Using DllImport on these dlls doesn’t cause problems, though.

You can use a tool called dumpbin to see which /clr options were used to produce a given file. Dumpbin comes with Visual Studio and runs from the command-line:


dumpbin /CLRHEADER myapp.exe

This will print (among other things) a Flags value, which is 0 if the file was build with /clr, 1 with /clr:safe, and 3 with /clr:pure.

I’m also curious about the interaction between WOW64 and the CLR. If I run a 32-bit C# app on x64, then which comes first: WOW64 or the CLR? Is there a 64-bit CLR that can JIT-compile to either 32- or 64-bit code? Or do I have two CLRs, one for 32 bits and one for 64, and the former runs under WOW64? I suspect the answer is the latter, but I’m not sure how to tell. Either way my code is running in WOW64, so the check I described above won’t tell me anything.

I created a table to keep track of the data from all my tests. Here it is:

exe machine type binary contents managed? contains assembly? can run on x86? can run on x64? can use x86 C# dll? can use x64 C# dll? can use “Any CPU” C# dll? can use x86 C++ native dll? can use x64 C++ native dll? can use x86 C++ /clr dll? can use x64 C++ /clr dll? can use x86 C++ /clr:pure dll? can use x64 C++ /clr:pure dll?
x86 C# exe i386 MSIL yes yes yes, in CLR yes, in WOW64+CLR yes no yes yes no yes no yes no
x64 C# exe x64 MSIL yes yes no yes, in CLR no yes yes no2 yes no yes no yes
“Any CPU” C# exe i386 MSIL yes yes yes, in CLR yes, in CLR only on x861 only on x64 yes only on x861 only on x64 only on x86 only on x64 only on x86 only on x64
x86 C++ exe i386 asm no no yes, natively yes, in WOW64 ? ? ? ? ? ? ? ? ?
x64 C++ exe x64 asm no no no yes, natively ? ? ? ? ? ? ? ? ?
x86 C++ /clr exe i386 MSIL, mostly ? ? yes, in CLR yes, in WOW64+CLR ? ? ? ? ? ? ? ? ?
x64 C++ /clr exe x64 MSIL, mostly ? ? no yes, in CLR ? ? ? ? ? ? ? ? ?
x86 C++ /clr:pure exe i386 MSIL yes ? yes, in CLR yes, in WOW64+CLR ? ? ? ? ? ? ? ? ?
x64 C++ /clr:pure exe x64 MSIL yes ? no yes, in CLR ? ? ? ? ? ? ? ? ?

1There may be some option like the linker’s /CLRHEADER for C# apps.
2Can’t run on x86, and won’t be in WOW64 on x64. But see note 1.

There are still some gaps in this table. I’m not that concerned about the interactions between C++ exes and C++ dlls, so I’ve left those cells blank. I’ve also left some cell blanks regarding when C++ files are managed/unmanaged and when they contain an assembly. If I figure any of this out, I’ll update the table.

One final notable tool is ildasm (IL-disassembler), which also comes with Visual Studio. This lets you inspect the IL of an exe or dll. Most of it is over my head, but it’s intetesting to see what your code becomes.

Decline in Inverse and Leveraged ETFs

Friday, February 5th, 2010

So this post isn’t about programming, but it has a lot of math, and it’s about understanding how something works. Lately my investing has used some ETFs, but I’ve read that if your ETF is inverse or leverage, then it gradually loses money over time—simply because of the mathematics. I wanted to investigate just how this worked.

First, a bit of background: an ETF is an Exchange-Traded Fund, a pretty new thing. It’s not a stock, but a type of derivative used to track something else, such as a sector of stocks. In this regard it’s a bit like an index, but more flexible. So you could have an ETF tracking the performance of pharmaceutical stocks, or financial stocks, or the S&P 500. They are a little different from index-based mutual funds. You can trade them any time during the day, and they don’t have the high minimum purchases required by some mutual funds. On the other hand, you pay a commission, unlike a non-load mutual fund. (In some cases this may not be true.)

But the really interesting thing about them is that they can be leveraged or inverse. A leveraged fund aims to achieve some multiple of the daily change of the underlying index. So while the SPY fund tracks the S&P 500, you can get 2x the S&P 500 (both up and down) with the SSO fund. This greater volatility gives you the potential for more profit with each trade, diminishing the relative size of the commission. It’s a bit like you invested twice as many dollars as you actually have.

An inverse fund is a leveraged fund with a negative multiplier. You can get funds at -1x, -2x, or I think even -3x. This essentially lets you short the market, something difficult for retail investors to manage. Also, unlike with shorting, you don’t risk an unlimited amount of money. You only risk the money you use to buy the ETF.

The fly in the ointment is that ETFs, when leveraged or inverse, gradually decline in value. Let’s say we have an index that start at 100, declines to 80 (-20%), then goes back up to 100 (+25%). Now let’s say we were tracking it with four ETFs, a 1x, a -1x, a 2x, and a -2x. They all start at 40. Here are the results:

Fund t0 t1 t2
Index 100 80 100
1x ETF 40 32 40
-1x ETF 40 48 36
2x ETF 40 24 36
-2x ETF 40 56 28

As you can see, all our ETFs lost money, except the 1x version. Conspicuously, the -1x and 2x ETF lost the same amount, but the -2x ETF lost more. If we had reversed the order of changes (100 to 125 to 100), we would see the same results.

So I set out to analyze the mathematics behind all this. Let’s start with some equations describing our targeted fund, the index. I’ll call its initial value T. Our intermediate value will be T’. Our final value will be T”—but of course this is equal to T. Let’s call the first percent change d, and the second percent change d’. That gives us these equations:

T’ = T + dT
T = T” = T’ + d’T’

If we substitute T for T’ and solve for d’, we get:

T = (T + dT) + d’(T + dT)
0 = dT + d’(T + dt)
-dT / (T + dT) = d’
d’ = -d / (1 + d)

Now let’s write out the equations for our derived fund, D (the ETF). We’ll use D, D’, D” to correspond with T, T’, T”. The multiplier of our ETF will be f. So in a 1x ETF, f = 1, but in a -2x ETF, f = -2. That gives us these equations:

D’ = D + fdD
D” = D’ + fd’D’

Substituting, we get:

D” = (D + fdD) + f * (-d/(1+d)) * (D + fdD)

So far so good, but I’m really interested in the percent lost for each up-and-down cycle. The absolute loss is D – D”, so the percent loss is (D – D”) / D. Let’s call this L. That gives us:

L = (D – D”) / D
L = (D – ((D + fdD) + f * (-d/(1+d)) * (D + fdD))) / D
L = (D – D – fdD – f * (-d/(1+d)) * (D + fdD)) / D
L = -fd – f * (-d/(1+d) * (1 + fd)
L = fd/(1+d) * (1 + fd) – fd
L = fd(1+fd) / (1+d) – fd(1+d) / (1+d)
L = (fd(1+fd) – fd(1+d)) / (1+d)
L = fd(1+fd-1-d) / (1+d)
L = fd(fd-d) / (1+d)
L = fd2(f-1) / (1+d)
L = f(f-1) * d2 / (1+d)

That final equation tells us that the loss is proportional to f and d. Well, d is kind of obvious: the more our underlying index changes, the more the ETF will change. But f is quite interesting. The more leveraged we are, the more we lose on each cycle. Although neither factor is as simple to write out as we might like, consider this. f(f-1) is almost f*f, or f2:

f*(f-1) < f*f
f*(f-1) < f2

The effect of f is almost a square. We could imagine this as f1.9, although that isn’t quite right.

At first glance, it seems we could make a similar simplification with d, since d2/(d+1) is a little less than d2/d:

d2/(d+1) < d2/d
d2/(d+1) < d

So d’s effect would be a little less than d: sort of like d0.9. But this is misleading, because d2/(d+1) only approximates d for large values, and our d will (probably) never go above 1. In fact it’s only interesting for values like 0.01 or 0.1 or maybe (gulp) 0.5. At that range, we get values like:

d d2/(d+1)
0.01 0.00009
0.02 0.00039
0.03 0.00087
0.04 0.00153
0.05 0.00238
0.10 0.00909
0.20 0.03333
0.30 0.06923
0.40 0.11428
0.50 0.16666

The graph looks like this:

Or at close range, like this:

All this means that comparatively speaking, f has a greater effect than d.

We could chart f’s effect for different levels of leverage:

f f(f-1)
1 0
2 2
3 6
4 12
5 20
6 30
-1 2
-2 6
-3 12
-4 20
-5 30

In other words, each “step” up in leverage increases your loss, and going inverse counts as one additional “step.”

By multiplying these values with those from d’s table, you can see your loss for each cycle. For a 2x or -1x ETF, you lose about 2% of your investment for each 10% cycle, or 0.02% for each 1% cycle. For a -2x ETF, you lose about 6% for a 10% cycle or 0.06% for a 1% cycle. Best not to stay in these investments for too long!

The next question is: how long is right? I guess you could look at the last few years to count the number of cycles, and try to figure out an ETF’s theoretical decline if the market had no net change. But the conventional wisdom answer seems to be about a week, and this sounds right to me. You’re really betting on the direction of the next move, and if you get that wrong, it’s going to cost you. So to make money with these ETFs, you need to get not just the direction right, but the timing, too. It’s hard enough just to get the direction! This need to be right about timing makes them risky for the same reason options are risky: your bet only has so long to play out.

Now here is another thought. Take a look at a graph of f(f-1):

As you can see, between 0 and 1, this graph dips below 0. That means that if you had a x0.5 ETF, say, you would actually make money over time. I wonder if there are practical reasons to prevent this, or if the return is just too small, so you’d do better putting your money in a bank. Anyway, it’s an interesting thought. . . .

Variation on C Password Dictionary Variator

Wednesday, February 3rd, 2010

So I was thinking, since my password app generates too many possibilities, what if we limited its output so that any given letter was translated into the same variant each time it appears in the word? In other words, if your word was “aa” and your leet.txt file had “a A @,” then you’d get just “aa,” “AA,” and “@@,” not “aA,” “a@,” “Aa,” etc. So I made that change, and it seems so reasonable, I made it the default. To get all combinations, use the “-a” option. Here is the code:


/**
 * pw-vary.c - implements pw-vary, a program to obfuscate passwords by replacing letters with leet.
 *
 * Makes a lot of assumptions about limits. Also assumes ASCII strings. Pretty lazy!
 *
 * Copyright (c) 2009 by Paul A Jungwirth
 */
#include
#include
#include

#define LINE_LEN		1024
#define MAX_VARIANTS	20
#define LONGEST_RESULT	4092

#define is_whitespace(c) (c == ' ' || c == '\t' || c == '\n' || c == '\r')
#define set_or_die(v, s) if (!v) { perror(s); exit(EXIT_FAILURE); }
#define unset_or_die(v, s) if (v) { perror(s); exit(EXIT_FAILURE); }

#ifdef DEBUG
#define DEBUGGING(s) s;
#else
#define DEBUGGING(s)
#endif

typedef struct t_variant {
	char letter;
	char *variants[MAX_VARIANTS];
	int count;
	char line[LINE_LEN];
	int maxlen;
} t_variant;

static t_variant *init_variant(char letter) {
	t_variant* ret;

	ret = (t_variant*)malloc(sizeof(t_variant));
	set_or_die(ret, NULL);

	ret->letter = letter;
	ret->count = 1;
	ret->maxlen = 1;
	ret->variants[0] = ret->line;

	return ret;
}

static char *find_line_start(char *line) {
	int i = 0;
	char c;

	c = line[i];
	while (c) {
		if (!is_whitespace(c)) return &line[i];
		c = line[i];
		i++;
	}

	return NULL;
}

static void read_leet(t_variant **variants, FILE *f) {
	char line[LINE_LEN];
	char *line_start;
	int len;
	int i, j;
	char first_letter, c;
	int in_whitespace;
	t_variant *vs;

	while (fgets(line, LINE_LEN, f)) {
		// strip leading white space and check for contents
		line_start = find_line_start(line);
		if (!line_start) continue;

		// Create a new entry in variants for the letter.
		first_letter = line_start[0];
		variants[first_letter] = init_variant(first_letter);
		vs = variants[first_letter];

		// List all the possible variants.
		strncpy(vs->line, line_start, LINE_LEN);
		line_start = vs->line;
		len = strlen(line_start);
		i = j = 1;
		in_whitespace = 1;
		while (i < len && j variants[j] = &line_start[i];
					vs->count++;
					in_whitespace = 0;
					j++;
				}
			}
			i++;
		}

		// find the longest variant for this letter:
		for (i = 0; vs->variants[i]; i++) {
			len = strlen(vs->variants[i]);
			if (len > vs->maxlen) vs->maxlen = len;
		}
	}
	unset_or_die(ferror(f), "reading leet file");
}

/*@unused@*/
static void print_variants(t_variant **variants) {
	int i, j;

	for (i = 0; i variants[j]) {
				puts(variants[i]->variants[j]);
				j++;
			}
			puts("====");
		}
	}
}

/**
 * do_word_all and do_word_selective:
 * Each iterates through all the variant combinations for the given word.
 *
 * do_word_all does this with two arrays, each the same length as the
 * original string:
 *   - maxes holds the number of variants for each letter in the string.
 *   - indices holds the current variant number to be printed. We keep
 *     incrementing this array, "carrying" when necessary, just like
 *     counting.
 * This algorithm tests all possible combinations.
 *
 * do_word_selective causes a slightly more complex variation:
 * instead of testing all combinations, we assume that all instances of a
 * given letter use written with the same variant, cutting down on the
 * number of combinations to print. In this version, we rely on three arrays:
 *   - maxes & indices are only as long as the number of *unique* letters.
 *     We use them for "counting" just as before.
 *   - poses has one entry for each char in the string, and it stores
 *     an int for indexing into the two other arrays.
 */

static void do_word_selective(char *buffer, t_variant **variants, char *orig) {
	char c;
	char *end;
	int i, j, k;
	int len, newlen, len2, maxes_len;
	int *maxes;
	int *indices;
	int *poses;
	t_variant *vs;

	orig = find_line_start(orig);
	if (orig) {
		end = strpbrk(orig, "\n\r");
		if (end) end[0] = '';
		// puts(orig);

		// initialize maxes & indices
		// TODO: pass these as buffers to speed things up
		len = strlen(orig);
		maxes = (int*)malloc(sizeof(int) * len);
		set_or_die(maxes, NULL);
		indices = (int*)calloc(len, sizeof(int));	// start with all zeroes.
		set_or_die(indices, NULL);
		poses = (int*)calloc(len, sizeof(int));
		set_or_die(poses, NULL);

		newlen = 1; // start with 1 for the .
		j = 0;
		for (i = 0; i count;
				} else {
					maxes[j] = 1;
				}
				poses[i] = j;
				j++;
			} else {
				// point wherever the first letter points
				poses[i] = poses[k];
			}
			newlen += maxes[poses[i]];
		}
		maxes_len = j;

		// don't proceed if newlen is greater than LONGEST_RESULT (improbable)
		if (newlen >= LONGEST_RESULT) {
			fprintf(stderr, "Result for \"%s\" too long: %d characters\n", orig, newlen);
		} else {
			j = 0;
			while (j >= 0) {
				// Construct and print the string using our current position, represented by indices.
				buffer[0] = '';
				for (i = 0; i variants[indices[poses[i]]]);
					} else {
						len2 = strlen(buffer);
						buffer[len2] = orig[i];
						buffer[len2 + 1] = '';
					}
				}
				puts(buffer);

				// Now add one to our position.
				j = maxes_len - 1;
				do {
					indices[j] = (indices[j] + 1) % maxes[j];
				} while (indices[j] == 0 && j-- > 0); // short-circuit: j only decremented when we carry

				// when j is -1, we've overflowed; hence we've printed all combinations.
			}
		}

		free(maxes);
		free(indices);
		free(poses);
	}
}

static void do_word_all(char *buffer, t_variant **variants, char *orig) {
	char *end;
	int i, j;
	int len, newlen, len2;
	int *maxes;
	int *indices;
	t_variant *vs;

	orig = find_line_start(orig);
	if (orig) {
		end = strpbrk(orig, "\n\r");
		if (end) end[0] = '';
		// puts(orig);

		// initialize maxes & indices
		// TODO: pass these as buffers to speed things up
		len = strlen(orig);
		maxes = (int*)malloc(sizeof(int) * len);
		set_or_die(maxes, NULL);
		indices = (int*)calloc(len, sizeof(int));	// start with all zeroes.
		set_or_die(indices, NULL);

		newlen = 1; // start with 1 for the .
		for (i = 0; i count;
				newlen += vs->maxlen;
			} else {
				maxes[i] = 1;
				newlen++;
			}
		}

		// don't proceed if newlen is greater than LONGEST_RESULT (improbable)
		if (newlen >= LONGEST_RESULT) {
			fprintf(stderr, "Result for \"%s\" too long: %d characters\n", orig, newlen);
		} else {
			j = 0;
			while (j >= 0) {
				// Construct and print the string using our current position, represented by indices.
				buffer[0] = '';
				for (i = 0; i variants[indices[i]]);
					} else {
						len2 = strlen(buffer);
						buffer[len2] = orig[i];
						buffer[len2 + 1] = '';
					}
				}
				puts(buffer);

				// Now add one to our position.
				j = len - 1;
				do {
					indices[j] = (indices[j] + 1) % maxes[j];
				} while (indices[j] == 0 && j-- > 0); // short-circuit: j only decremented when we carry

				// when j is -1, we've overflowed; hence we've printed all combinations.
			}
		}

		free(maxes);
		free(indices);
	}
}

static void do_all_words(t_variant **variants, FILE *f, int do_all) {
	char line[1024];
	char buffer[LONGEST_RESULT];

	while (fgets(line, 1024, f)) {
		if (do_all) do_word_all(buffer, variants, line);
		else do_word_selective(buffer, variants, line);
	}
	unset_or_die(ferror(f), "reading file");
}

int main(int argc, char **argv) {
	t_variant *variants[255];		/* 255 pointers, each to an array of 20 pointers to strings. */
	FILE *f;
	int i;
	int arg_start, do_all;

	// Argument processing:
	if (argc > 1 && (strcmp(argv[1], "-a") == 0)) {
		do_all = 1;
		arg_start = 2;
	} else {
		do_all = 0;
		arg_start = 1;
	}

	bzero(variants, sizeof(variants));

	f = fopen("leet.txt", "r");
	set_or_die(f, "opening leet file");
	read_leet(variants, f);
	unset_or_die(fclose(f), "closing leet file");
	DEBUGGING(print_variants(variants));

	// read words from argv or, if no args, from stdin
	if (argc > arg_start) {
		for (i = arg_start; i < argc; i++) {
			f = fopen(argv[i], "r");
			set_or_die(f, argv[i]);
			do_all_words(variants, f, do_all);
			unset_or_die(fclose(f), argv[i]);
		}
	} else {
		do_all_words(variants, stdin, do_all);
	}

	return EXIT_SUCCESS;
}

C Version of Password Dictionary Variator

Wednesday, February 3rd, 2010

I wrote a program to expand password dictionaries the other day, but due to its recursive algorithm it choked on words greater than 11 characters long, such as “electroencephalograph’s,” the longest word in my /usr/share/dict/words. So now I’ve tweaked the algorithm and also reimplemented the whole thing in C. It’s not real pretty C, but I tried to aim more for performance. It doesn’t crash on “electroencephalograph’s” anymore, but generating just the combinations for that one word (all 61,917,364,224 of them) took 43,499 seconds of CPU time (42653.61 user, 846.95 system). And that’s not even the time-consuming part! Just imagine running that many combinations through aircrack, all to test a single word.

Anyway, here is the code:


/**
 * pw-vary.c - implements pw-vary, a program to obfuscate passwords by replacing letters with leet.
 *
 * Makes a lot of assumptions about limits. Also assumes ASCII strings. Pretty lazy!
 *
 * Copyright (c) 2009 by Paul A Jungwirth
 */
#include
#include
#include

#define LINE_LEN	1024
#define MAX_VARIANTS	20
#define LONGEST_RESULT	4092

#define is_whitespace(c) (c == ' ' || c == '\t' || c == '\n' || c == '\r')
#define set_or_die(v, s) if (!v) { perror(s); exit(EXIT_FAILURE); }
#define unset_or_die(v, s) if (v) { perror(s); exit(EXIT_FAILURE); }

#ifdef DEBUG
#define DEBUGGING(s) s;
#else
#define DEBUGGING(s)
#endif

typedef struct t_variant {
	char letter;
	char *variants[MAX_VARIANTS];
	int count;
	char line[LINE_LEN];
	int maxlen;
} t_variant;

static t_variant *init_variant(char letter) {
	t_variant* ret;

	ret = (t_variant*)malloc(sizeof(t_variant));
	set_or_die(ret, NULL);

	ret->letter = letter;
	ret->count = 1;
	ret->maxlen = 1;
	ret->variants[0] = ret->line;

	return ret;
}

static char *find_line_start(char *line) {
	int i = 0;
	char c;

	c = line[i];
	while (c) {
		if (!is_whitespace(c)) return &line[i];
		c = line[i];
		i++;
	}

	return NULL;
}

static void read_leet(t_variant **variants, FILE *f) {
	char line[LINE_LEN];
	char *line_start;
	int len;
	int i, j;
	char first_letter, c;
	int in_whitespace;
	t_variant *vs;

	while (fgets(line, LINE_LEN, f)) {
		// strip leading white space and check for contents
		line_start = find_line_start(line);
		if (!line_start) continue;

		// Create a new entry in variants for the letter.
		first_letter = line_start[0];
		variants[first_letter] = init_variant(first_letter);
		vs = variants[first_letter];

		// List all the possible variants.
		strncpy(vs->line, line_start, LINE_LEN);
		line_start = vs->line;
		len = strlen(line_start);
		i = j = 1;
		in_whitespace = 1;
		while (i < len && j variants[j] = &line_start[i];
					vs->count++;
					in_whitespace = 0;
					j++;
				}
			}
			i++;
		}

		// find the longest variant for this letter:
		for (i = 0; vs->variants[i]; i++) {
			len = strlen(vs->variants[i]);
			if (len > vs->maxlen) vs->maxlen = len;
		}
	}
	unset_or_die(ferror(f), "reading leet file");
}

/*@unused@*/
static void print_variants(t_variant **variants) {
	int i, j;

	for (i = 0; i variants[j]) {
				puts(variants[i]->variants[j]);
				j++;
			}
			puts("====");
		}
	}
}

static void do_word(char *buffer, t_variant **variants, char *orig) {
	char *end;
	int i, j;
	int len, newlen, len2;
	int *maxes;
	int *indices;
	t_variant *vs;

	orig = find_line_start(orig);
	if (orig) {
		end = strpbrk(orig, "\n\r");
		if (end) end[0] = '';
		// puts(orig);

		// initialize maxes & indices
		// TODO: pass these as buffers to speed things up
		len = strlen(orig);
		maxes = (int*)malloc(sizeof(int) * len);
		set_or_die(maxes, NULL);
		indices = (int*)calloc(len, sizeof(int));	// start with all zeroes.
		set_or_die(indices, NULL);

		newlen = 1; // start with 1 for the .
		for (i = 0; i count;
				newlen += vs->maxlen;
			} else {
				maxes[i] = 1;
				newlen++;
			}
		}

		// don't proceed if newlen is greater than LONGEST_RESULT (improbable)
		if (newlen >= LONGEST_RESULT) {
			fprintf(stderr, "Result for \"%s\" too long: %d characters\n", orig, newlen);
		} else {
			j = 0;
			while (j >= 0) {
				// Construct and print the string using our current position, represented by indices.
				buffer[0] = '';
				for (i = 0; i variants[indices[i]]);
					} else {
						len2 = strlen(buffer);
						buffer[len2] = orig[i];
						buffer[len2 + 1] = '';
					}
				}
				puts(buffer);

				// Now add one to our position.
				j = len - 1;
				do {
					indices[j] = (indices[j] + 1) % maxes[j];
				} while (indices[j] == 0 && j-- > 0); // short-circuit: j only decremented when we carry

				// when j is -1, we've overflowed; hence we've printed all combinations.
			}
		}

		free(maxes);
		free(indices);
	}
}

static void do_all_words(t_variant **variants, FILE *f) {
	char line[1024];
	char buffer[LONGEST_RESULT];

	while (fgets(line, 1024, f)) {
		do_word(buffer, variants, line);
	}
	unset_or_die(ferror(f), "reading file");
}

int main(int argc, char **argv) {
	t_variant *variants[255];		/* 255 pointers, each to an array of 20 pointers to strings. */
	FILE *f;
	int i;

	bzero(variants, sizeof(variants));

	f = fopen("leet.txt", "r");
	set_or_die(f, "opening leet file");
	read_leet(variants, f);
	unset_or_die(fclose(f), "closing leet file");
	DEBUGGING(print_variants(variants));

	// read words from argv or, if no args, from stdin
	if (argc > 1) {
		for (i = 1; i < argc; i++) {
			f = fopen(argv[i], "r");
			set_or_die(f, argv[i]);
			do_all_words(variants, f);
			unset_or_die(fclose(f), argv[i]);
		}
	} else {
		do_all_words(variants, stdin);
	}

	return EXIT_SUCCESS;
}