I needed to read a big XML file into an object structure. I wanted it to be fast and use a low memory footprint. I also wanted the XML to stay pretty clean to make future support easy. Because of the speed and memory requirements, DOM and XPath were out. I toyed around with XmlSerializer, but it didn’t quite give me the XML I wanted—too ugly—and I didn’t like cluttering my classes with xml serialization attributes. That doesn’t belong in the objects, does it? And then the code to structure the XML is scattered around everywhere. And finally, what if I need to serialize the objects in different ways at different times?
So I thought I’d try my hand at XmlTextReader, which is a bit like SAX but more “pull” than “push.” It’s not based on callbacks, and you don’t have to manage state so attentively. Coming from the Java world, I’m used to SAX. I actually like it. The state machine thing isn’t so hard, really. So I was pretty excited about XmlTextReader. It looked like it would have the advantages of SAX but be easier to use.
Having now written some code with XmlTextReader, I’m still pretty happy with it, but I’m a bit disappointed that Microsoft junked up the API so much. It seems gappy and needlessly complicated. But having learned it, I thought I’d set it all down in writing.
One approach, which is fairly SAX-like, is to put everything into a read loop and switch on element names. That might look like this:
XmlTextReader r = new XmlTextReader(stream);
r.WhitespaceHandling = WhitespaceHandling.None;
while (r.Read()) {
    if (r.NodeType.Equals(XmlNodeType.Element)) {
        switch (r.LocalName) {
            case "this":
                // processing ...
                break;
            case "that":
                // processing ...
                break;
            // more cases ...
        }
    }
}
If you wanted, you could stop there. The Read() method gives you one node at a time, and you handle each element. You could add code to handle endElements also, or comments, or whatever, just as in SAX. But there are lots of other methods to make things easier.
If you’ve never done this sort of thing before, you should know that an XML document consists of nodes. A node can be a start element, an end element, a run of text, a comment, a processing instruction, even whitespace. By setting WhitespaceHandling to None above, we told the XmlTextReader to skip whitespace, so Read() doesn’t report it. Attributes are nodes too, but the Read loop doesn’t emit them, either. Instead you use special methods to get at attributes when you’re positioned on their containing element. As you can see, some nodes have children (e.g. some elements), whereas other nodes are leaves (text, attributes, other elements).
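For instance, here’s a quick way to watch the node stream go by, using a little throwaway document (made up just for illustration); notice that the attribute only shows up because we ask for it explicitly:

XmlTextReader r = new XmlTextReader(new StringReader("<a id='1'><!-- hi --><b>some text</b></a>"));
while (r.Read()) {
    // NodeType will be Element, Comment, Text, EndElement, and so on.
    Console.WriteLine("{0}: {1}", r.NodeType, r.Name);
    if (r.NodeType == XmlNodeType.Element && r.HasAttributes) {
        // Attributes never come out of the Read() loop; you have to ask for them.
        Console.WriteLine("  id = {0}", r.GetAttribute("id"));
    }
}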
One tricky thing is to keep track of your current position in the document. In describing the various methods below, I’ll try to pay attention to how they affect the document position. The Read() method advances one node.
First let’s talk about some methods that look useful but you should probably avoid. One is ReadElementString(). This advances one element and returns the contents of the next element as a string. A variant is ReadElementString(elemName), which verifies that the current element matches the given name. I didn’t find these methods too useful. The first gives you no checking to see if you’re actually reading the right thing. The second checks the wrong thing. I don’t want to test the current element and then read the contents of the next one. The test needs to be the name of the next element. Both of these methods read the element blindly; the latter just looks backwards a bit.
Another method to avoid is ReadString(). This returns a string when positioned on either an element or a text node. That’s handy, but it stops cold at comments and processing instructions instead of skipping past them. For that we’ll need a different method.
One method whose name looks great is MoveToElement(). But if you think this will scan through the document until it finds the element you want, you’re wrong. Instead, this is used in attribute processing to move up from an attribute back to its owning element. Alas!
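Where MoveToElement() does earn its keep is after you’ve wandered into an element’s attributes and want to get back. Roughly like this:

if (r.NodeType == XmlNodeType.Element && r.HasAttributes) {
    // Walk the attributes of the current element ...
    while (r.MoveToNextAttribute()) {
        Console.WriteLine("{0} = {1}", r.Name, r.Value);
    }
    // ... then hop back up to the element itself before reading on.
    r.MoveToElement();
}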
One method that really is handy is MoveToContent(). This skips over comments, whitespace, processing instructions, and DocumentType nodes. In all your processing, you should be aware that users may throw comments into weird places. Robust XML parsing doesn’t get tripped up by comments. So this call is quite useful.
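For example, if someone has dropped a comment right in front of the element you expect, MoveToContent hops over it. A little sketch:

// The reader is sitting on a comment node, with <birthdate>1975-02-08</birthdate> right after it.
r.MoveToContent();   // skips the comment and lands on <birthdate>
if (r.LocalName == "birthdate") {
    DateTime birthdate = r.ReadElementContentAsDateTime();
}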
To get the contents of an element, ignoring any comments, you want the ReadContentAsXXX() methods and the ReadElementContentAsXXX() methods. These methods skip comments and processing instructions and automatically convert entities. This is just the sort of friendly assistance you want from your XML parser. As for the XXX, you have a lot of options. It could be String, Int, DateTime, even Object.
The difference between ReadContentAs and ReadElementContentAs is this: the first must be positioned on a text node, whereas the latter can be positioned on either a text node or the text’s containing element. If you call ReadContentAs on an element, you get an exception. In practice, I think ReadElementContentAs is the more useful family of methods. Also, when it returns, the reader is positioned at whatever follows the endElement node of what you just read. If you’re ignoring whitespace, this could be the next element. (Or it could be a comment, etc.)
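For instance, if the reader is sitting on an element like <birthdate>1975-02-08</birthdate>, you can pull it straight into a typed value:

// On the <birthdate> start tag:
DateTime birthdate = r.ReadElementContentAsDateTime();
// The reader is now past </birthdate>. With whitespace ignored, we're sitting
// on whatever comes next: maybe the following element, maybe a comment.
r.MoveToContent();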
Then there are a few methods for moving around in the document. I think this is where the API really falls short, but here’s what we’ve got: ReadToFollowing, ReadToNextSibling, and ReadToDescendant. All these methods take an element name and return true if it is found, leaving you positioned on the element node (and ready to call ReadElementContentAs). If they can’t find what you want, they return false, and then your new position in the document is however far they’ve searched. You’ll be at EOF for ReadToFollowing, or the end tag of the current/parent element for the other two. This is a bit of a disappointment. If you’re looking for a required element, you could just throw an exception, but what about finding optional elements? It’s no big deal the element wasn’t there, but you’re way off course.
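Here’s the trap in miniature. Say you go hunting for an optional marriageDate with ReadToNextSibling:

// Looking for an optional <marriageDate> among a person's child elements:
if (r.ReadToNextSibling("marriageDate")) {
    DateTime married = r.ReadElementContentAsDateTime();
} else {
    // Not found -- but the reader has already run ahead to the parent's end tag,
    // right past any siblings we still meant to read.
}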
One other useful method is Skip(). This skips the children of the current node.
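So if you hit a subtree you don’t care about (the element name here is made up), you can hop over the whole thing:

if (r.NodeType == XmlNodeType.Element && r.LocalName == "boringStuff") {
    r.Skip();   // jumps past the whole <boringStuff> subtree, end tag included
}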
Here is some code making use of these methods to parse a fairly simple document. The document describes a family, with elements for headOfHousehold, spouse, and children. Each person has a name, birthdate, and sex. Let’s treat the last two of these as optional. The spouse node can also have an optional marriageDate element. We could indicate the XML structure like this, where wildcards have their usual meaning (? = 0 or 1, * = 0 or more, + = 1 or more):
family
  headOfHousehold
    name
    birthdate?
    sex?
  spouse?
    name
    birthdate?
    sex?
    marriageDate?
  children?
    child+
      name
      birthdate?
      sex?
Here is some code that processes such a file by building up an object structure, then prints the results:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;

namespace XmlTextReaderTest {
    public enum Sex { MALE, FEMALE }

    public class Person {
        public string Name { get; set; }
        public DateTime Birthdate { get; set; }
        public Sex Sex { get; set; }
        public DateTime? MarriageDate { get; set; }

        public override string ToString() {
            return String.Format("{0}: {1}, {2:yyyy-MM-dd}, {3:yyyy-MM-dd}", Name, Sex, Birthdate, MarriageDate);
        }
    }
    public class Family {
        public Person Head { get; set; }
        public Person Spouse { get; set; }
        public List<Person> Children { get; private set; }

        public Family() {
            Children = new List<Person>();
        }

        public override String ToString() {
            String ret = "Head:\n\t" + Head + "\nSpouse:\n\t" + Spouse + "\n";
            foreach (Person p in Children) {
                ret += "Child:\n\t" + p + "\n";
            }
            return ret;
        }
    }
    public class Program {
        static void Main(string[] args) {
            string xmlDoc = @"<?xml version='1.0' encoding='utf-8'?>
<family>
  <headOfHousehold>
    <name id='asdf'>Paul Jungwirth</name>
    <!-- not my real birthday of course: -->
    <birthdate>1975-02-08</birthdate>
    <sex>male</sex>
  </headOfHousehold>
  <spouse>
    <name>Arielle Jungwirth</name>
    <birthdate>1979-11-11</birthdate>
    <sex>female</sex>
    <marriageDate>2006-09-09</marriageDate>
  </spouse>
  <children>
    <child>
      <name>James Jungwirth</name>
      <birthdate>2007-12-31</birthdate>
      <sex>male</sex>
    </child>
    <child>
      <name>Miriam Jungwirth</name>
      <birthdate>2010-01-20</birthdate>
      <sex>female</sex>
    </child>
  </children>
</family>";
            // Encode the string as UTF-8 to match the encoding declared in the document.
            Family f = ParseFamily(new MemoryStream(Encoding.UTF8.GetBytes(xmlDoc)));
            Console.Write(f);
            Console.ReadLine();
        }
        public static Family ParseFamily(Stream stream) {
            Family f = new Family();
            XmlTextReader r = new XmlTextReader(stream);
            r.WhitespaceHandling = WhitespaceHandling.None;
            r.MoveToContent();
            while (r.Read()) {
                // Console.WriteLine(r.NodeType + ": " + r.LocalName + "\n");
                if (r.NodeType.Equals(XmlNodeType.Element)) {
                    switch (r.LocalName) {
                        case "headOfHousehold":
                            f.Head = ParsePerson(r);
                            break;
                        case "spouse":
                            f.Spouse = ParsePerson(r);
                            f.Head.MarriageDate = f.Spouse.MarriageDate;
                            break;
                        case "child":
                            f.Children.Add(ParsePerson(r));
                            break;
                        default:
                            // ignore other nodes
                            break;
                    }
                }
            }
            return f;
        }
        public static Person ParsePerson(XmlTextReader r) {
            Person p = new Person();
            // Right now we're pointing to the person's containing element, e.g. headOfHousehold.
            // Read past that, then read until we get to a new start element.
            r.Read();
            r.MoveToContent();
            if (r.LocalName.Equals("name")) p.Name = r.ReadElementContentAsString();
            else throw new InvalidDataException("no name for person");
            r.MoveToContent();
            if (r.LocalName.Equals("birthdate")) p.Birthdate = r.ReadElementContentAsDateTime();
            r.MoveToContent();
            if (r.LocalName.Equals("sex")) p.Sex = (Sex)Enum.Parse(typeof(Sex), r.ReadElementContentAsString(), true);
            r.MoveToContent();
            if (r.LocalName.Equals("marriageDate")) p.MarriageDate = r.ReadElementContentAsDateTime();
            return p;
        }
    }
}
This code demonstrates on a small scale the pattern I use to parse XML documents and keep the code manageable. I read through the whole document using our Read/switch loop, and I call out to helper functions to build objects representing significant chunks. These methods may call other helper functions or (as here) just navigate through the XML to pick out primitives. Each chunk of code is self-contained, and you’re never looking at more than a page or so.
You can see that in building each Person, I use MoveToContent and then test the name of the next element. Calling ReadElementContentAs takes me past the endElement, so afterwards I’m ready to read some more. If I’m already on an element, MoveToContent won’t advance at all, so it’s safe to call twice in the case when an optional element is missing.
You could also implement ParsePerson as a second Read/switch loop. That would mean the child elements can come in any order, but you’d have to verify at the end that you got data for all the required ones. You also may not know when to exit, if the name of your endElement can vary as in this example (e.g. “spouse” vs. “child”).
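For what it’s worth, here is roughly how that alternative could look (call it ParsePersonLoop; the name is mine). It’s only a sketch of the idea: remember the containing element’s name so you know which endElement to stop at, and check for the required name element at the end:

public static Person ParsePersonLoop(XmlTextReader r) {
    Person p = new Person();
    string container = r.LocalName;   // "headOfHousehold", "spouse", or "child"
    r.Read();                         // step inside the containing element
    while (r.MoveToContent() != XmlNodeType.EndElement || r.LocalName != container) {
        if (r.NodeType == XmlNodeType.Element) {
            switch (r.LocalName) {
                case "name":
                    p.Name = r.ReadElementContentAsString();
                    break;
                case "birthdate":
                    p.Birthdate = r.ReadElementContentAsDateTime();
                    break;
                case "sex":
                    p.Sex = (Sex)Enum.Parse(typeof(Sex), r.ReadElementContentAsString(), true);
                    break;
                case "marriageDate":
                    p.MarriageDate = r.ReadElementContentAsDateTime();
                    break;
                default:
                    r.Skip();         // unknown element: jump past it
                    break;
            }
        } else {
            r.Read();                 // some other node type: just move along
        }
    }
    // We exit positioned on the container's end tag, same as ParsePerson above.
    if (p.Name == null) throw new InvalidDataException("no name for person");
    return p;
}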