July 01, 2003
Parsing XML Files in .NET Using C#

Five different parsing techniques are available in .NET, and each has its own advantages

James McCaffrey
The .NET Framework provides several ways to extract data from an XML file into memory. We'll demonstrate the best uses of five fundamentally different techniques.

Parsing XML files is an unglamorous task that can be time-consuming and tricky. In the days before .NET, programmers often had to read an XML file as text, line by line, and then pick it apart with string functions and possibly regular expressions. That process was slow, error-prone, and just not very much fun.

While I was writing .NET test automation that had test case data stored in XML files, I discovered that the .NET Framework provides powerful new ways of parsing XML. But in conversations with colleagues, I also discovered that there are a variety of opinions on which way of parsing XML files is the best.

I set out to determine how many different ways there are to parse XML using .NET and to understand the pros and cons of each technique. After some experimentation, I learned that there are five fundamentally different ways to parse XML, and that the "best" method depends both on the particular development situation you are in and on the style of programming you prefer.

In the sections that follow, I will demonstrate how to parse a testCases.xml file using five different techniques. Each technique is based on a different .NET Framework class and its associated methods:

  • XmlTextReader

  • XmlDocument

  • XPathDocument

  • XmlSerializer

  • DataSet

After I explain each technique so that you can modify my examples to suit your needs, I will give you guidance on which technique to use in which situation. Knowing these five methods for parsing XML files will be a valuable addition to your .NET skill set. I assume that you are familiar with C#, VS.NET, and the creation and use of class libraries, and that you have a working knowledge of XML files.

The XML File to Parse and the Goal

Let's examine the testCases.xml file that we will use for all five parsing examples. The file contents are shown in Listing 1.

Note that each of the three test cases has five data items: id, kind, arg1, arg2, and expected. Some of the data is stored as XML attributes (id and kind), and arg1 and arg2 are stored as XML elements two levels deep relative to the root node (suite). Extracting attribute data and dealing with nested elements are key tasks regardless of which parsing strategy we use.

The goal is to parse our XML test cases file and extract the data into memory in a form that we can use easily. The memory structure we will use for four of the five parsing methods is shown in Listing 2. (The method that employs an XmlSerializer object requires a slightly different memory structure and will be presented later.)

Because four of the five techniques will use these definitions, for convenience we can put the code in a .NET class library named "CommonLib." A TestCase object will hold the five data parts of each test case, and a Suite object will hold a collection of TestCase objects and provide a way to display it.
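
Independent of Listing 2, a minimal sketch of what such a pair of classes might look like is given below. It assumes that all five data items are kept as strings. The namespace CommonLibSketch, the items field, and the Format() helper are names invented for this illustration, and List<T> requires a later version of the .NET Framework than the one this article targets.

```csharp
using System;
using System.Collections.Generic;

namespace CommonLibSketch
{
    // Holds the five data items of one test case, all as strings.
    public class TestCase
    {
        public string id, kind, arg1, arg2, expected;
    }

    // Holds a collection of TestCase objects and can display them.
    public class Suite
    {
        public List<TestCase> items = new List<TestCase>();

        // Formats one test case as a single space-separated line.
        public static string Format(TestCase tc)
        {
            return tc.id + " " + tc.kind + " " + tc.arg1 + " " + tc.arg2 + " " + tc.expected;
        }

        // Writes every stored test case to the console.
        public void Display()
        {
            foreach (TestCase tc in items)
                Console.WriteLine(Format(tc));
        }
    }

    public static class SuiteDemo
    {
        public static void Main()
        {
            var s = new Suite();
            s.items.Add(new TestCase { id = "001", kind = "bvt", arg1 = "4", arg2 = "7", expected = "11.00" });
            s.Display();
        }
    }
}
```

Keeping every value as a string mirrors the way the data arrives from the XML file; converting to numeric types is left to the code that consumes the test cases.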

Once the XML data is parsed and stored, the result can be represented as shown in Figure 1. The data can now be easily accessed and manipulated.

Parsing XML with XmlTextReader

Of the five ways to parse an XML file, the most traditional technique is to use the XmlTextReader class. The example code is shown in Listing 3.

After creating a new C# Console Application Project in Visual Studio .NET, we add a Project Reference to the CommonLib.dll file that contains definitions for TestCase and Suite classes. We start by creating a Suite object to hold the XML data and an XmlTextReader object to parse the XML file.

The key to understanding this technique is to understand the Read() and ReadElementString() methods of XmlTextReader. To an XmlTextReader object, an XML file is a sequence of nodes. For example,

<?xml version="1.0" ?>
<foo>
  <bar>99</bar>
</foo>
has 6 nodes: the XML declaration, <foo>, <bar>, 99, </bar>, and </foo>.

The Read() method advances one node at a time. Unlike many Read() methods in other classes, System.Xml.XmlTextReader.Read() does not return the data itself; it returns only a bool that tells you whether another node was read. The ReadElementString() method, on the other hand, returns the text between the begin and end tags of the element named by its argument, and advances to the next node after the end tag. Because Read() does not stop on XML attributes, we have to extract attribute data using the GetAttribute() method.
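
To make the node sequence concrete, the following sketch walks the small <foo> document with Read() and records the type of each node it visits. The class name NodeWalkDemo and the Walk() helper are invented for this illustration, and the XML is read from a string through a StringReader rather than from a file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeWalkDemo
{
    // Returns the node types that Read() visits, in document order.
    public static List<XmlNodeType> Walk(string xml)
    {
        var seen = new List<XmlNodeType>();
        var xtr = new XmlTextReader(new StringReader(xml));
        xtr.WhitespaceHandling = WhitespaceHandling.None; // skip whitespace-only nodes
        while (xtr.Read())                                // false once the input is exhausted
            seen.Add(xtr.NodeType);
        xtr.Close();
        return seen;
    }

    public static void Main()
    {
        string xml = "<?xml version=\"1.0\" ?>\n<foo>\n  <bar>99</bar>\n</foo>";
        foreach (XmlNodeType t in Walk(xml))
            Console.WriteLine(t);
    }
}
```

With whitespace handling turned off, the walk visits six nodes: XmlDeclaration, Element, Element, Text, EndElement, and EndElement.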

Figure 2 shows the output of running this program. You can see that we have successfully parsed the data from testCases.xml into memory.

The statement xtr.WhitespaceHandling = WhitespaceHandling.None; is important because without it you would have to Read() over newline characters and blank lines.

The main loop control structure that I used is not elegant but is more readable than the alternatives:

while (!xtr.EOF) // load loop
{
  if (xtr.Name == "suite" && !xtr.IsStartElement()) break;
The loop exits when we reach EOF or a </suite> tag.

When marching through the XML file, you can either Read() your way one node at a time or get a bit more sophisticated with code like the following:

while (xtr.Name != "testcase" || !xtr.IsStartElement() ) 
          xtr.Read();          // advance to <testcase> tag
The choice of technique you use is purely a matter of style.

Parsing an XML file with XmlTextReader has a traditional, pre-.NET feel. You walk sequentially through the file using Read(), and extract data with ReadElementString() and GetAttribute(). Using XmlTextReader is straightforward and effective and is appropriate when the structure of your XML file is relatively simple and consistent. Compared to other techniques we will see in this article, XmlTextReader operates at a lower level of abstraction, meaning it is up to you as a programmer to keep track of where you are in the XML file and Read() correctly.
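
As a compact sketch of the whole technique, the code below parses a shortened, hypothetical stand-in for testCases.xml whose element and attribute names follow the description given earlier. It is not Listing 3: the class name ReaderParseDemo is invented, each test case is returned as a plain string array instead of a TestCase object, and the loop is driven by the return value of Read() rather than by the EOF test.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ReaderParseDemo
{
    // Shortened, hypothetical stand-in for testCases.xml.
    public const string Xml = @"<suite>
  <testcase id='001' kind='bvt'>
    <inputs><arg1>4</arg1><arg2>7</arg2></inputs>
    <expected>11.00</expected>
  </testcase>
  <testcase id='002' kind='drt'>
    <inputs><arg1>9</arg1><arg2>6</arg2></inputs>
    <expected>15.00</expected>
  </testcase>
</suite>";

    // Returns one string[] { id, kind, arg1, arg2, expected } per test case.
    public static List<string[]> Parse(string xml)
    {
        var result = new List<string[]>();
        var xtr = new XmlTextReader(new StringReader(xml));
        xtr.WhitespaceHandling = WhitespaceHandling.None;
        while (xtr.Read())
        {
            if (xtr.NodeType != XmlNodeType.Element || xtr.Name != "testcase")
                continue;                                  // keep walking until a <testcase> start tag
            string id = xtr.GetAttribute("id");            // attributes come from the start tag
            string kind = xtr.GetAttribute("kind");
            xtr.Read();                                    // advance to <inputs>
            xtr.Read();                                    // advance to <arg1>
            string arg1 = xtr.ReadElementString("arg1");   // leaves the reader on <arg2>
            string arg2 = xtr.ReadElementString("arg2");   // leaves the reader on </inputs>
            xtr.Read();                                    // advance to <expected>
            string expected = xtr.ReadElementString("expected"); // leaves the reader on </testcase>
            result.Add(new[] { id, kind, arg1, arg2, expected });
        }
        xtr.Close();
        return result;
    }

    public static void Main()
    {
        foreach (string[] tc in Parse(Xml))
            Console.WriteLine(string.Join(" ", tc));
    }
}
```

The comments track the reader's position after each call, which is exactly the bookkeeping this technique leaves to the programmer.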

Parsing XML with XmlDocument

The second of five ways to parse an XML file is to use the XmlDocument class. The example code is shown in Listing 4.

XmlDocument objects are based on the notion of XML nodes and child nodes. Instead of sequentially navigating through a file, we select sets of nodes with the SelectNodes() method or individual nodes with the SelectSingleNode() method. Notice that because XML attributes are not child nodes of an element, we must get their data with the Attributes.GetNamedItem() method applied to a node.

After loading the XmlDocument, we fetch all the test case nodes at once with:

XmlNodeList nodelist = xd.SelectNodes("/suite/testcase");
Then we iterate through this list of nodes and fetch each <inputs> node with:

XmlNode n = node.SelectSingleNode("inputs");
and then extract the arg1 (and similarly arg2) value using:

tc.arg1 = n.ChildNodes.Item(0).InnerText;
In this statement, n is the <inputs> node; ChildNodes.Item(0) is the first child element of <inputs>, i.e., <arg1>; and InnerText is the value between <arg1> and </arg1>.
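
The statements above can be combined into a short, self-contained sketch. It is not Listing 4: the class name DocumentParseDemo is invented, the XML is a shortened, hypothetical stand-in for testCases.xml loaded from a string with LoadXml(), and each test case is returned as a string array.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class DocumentParseDemo
{
    // Shortened, hypothetical stand-in for testCases.xml.
    public const string Xml = @"<suite>
  <testcase id='001' kind='bvt'>
    <inputs><arg1>4</arg1><arg2>7</arg2></inputs>
    <expected>11.00</expected>
  </testcase>
  <testcase id='002' kind='drt'>
    <inputs><arg1>9</arg1><arg2>6</arg2></inputs>
    <expected>15.00</expected>
  </testcase>
</suite>";

    // Returns one string[] { id, kind, arg1, arg2, expected } per test case.
    public static List<string[]> Parse(string xml)
    {
        var result = new List<string[]>();
        var xd = new XmlDocument();
        xd.LoadXml(xml);                                       // whole document is now in memory
        XmlNodeList nodelist = xd.SelectNodes("/suite/testcase");
        foreach (XmlNode node in nodelist)
        {
            string id = node.Attributes.GetNamedItem("id").Value;
            string kind = node.Attributes.GetNamedItem("kind").Value;
            XmlNode n = node.SelectSingleNode("inputs");       // path is relative to the test case
            string arg1 = n.ChildNodes.Item(0).InnerText;      // first child of <inputs>
            string arg2 = n.ChildNodes.Item(1).InnerText;      // second child of <inputs>
            string expected = node.SelectSingleNode("expected").InnerText;
            result.Add(new[] { id, kind, arg1, arg2, expected });
        }
        return result;
    }

    public static void Main()
    {
        foreach (string[] tc in Parse(Xml))
            Console.WriteLine(string.Join(" ", tc));
    }
}
```

Indexing ChildNodes by position relies on arg1 always preceding arg2; selecting each child by name with SelectSingleNode() would be the safer choice if the element order can vary.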

The output from running this program is shown in Figure 3. Notice it is identical to the output from running the XmlTextReader technique and, in fact, all the other techniques presented in this article.

The XmlDocument class is modeled on the W3C XML Document Object Model and has a different feel to it than many .NET Framework classes that you are familiar with. Using the XmlDocument class is appropriate if you need to extract data in a nonsequential manner, or if you are already using XmlDocument objects and want to maintain a consistent look and feel to your application's code.

Let me note that in discussions with my colleagues, there was often some confusion about the role of the XmlDataDocument class. It is derived from the XmlDocument class and is intended for use in conjunction with DataSet objects. So, in this example, you could use the XmlDataDocument class but would not gain anything.

Parsing XML with XPathDocument

The third technique to parse an XML file is to use the XPathDocument class. The example code is shown in Listing 5.

Using an XPathDocument object to parse XML has a hybrid feel that is part procedural (as in XmlTextReader) and part functional (as in XmlDocument). You can select parts of the document using the Select() method of an XPathNavigator object and also move through the document using the MoveNext() method of an XPathNodeIterator object.

After loading the XPathDocument object, we create an XPathNavigator and use it to get an XPathNodeIterator object that is, in essence, a cursor over the <testcase> nodes:

XPathNavigator xpn = xpd.CreateNavigator();
XPathNodeIterator xpi = xpn.Select("/suite/testcase");
Because XPathDocument does not maintain "node identity," we must iterate through each <testcase> node with this loop:

while (xpi.MoveNext())
Similarly, we have to iterate through the children with:

while (tcChild.MoveNext())
The XPathDocument class is optimized for XPath data model queries. So using it is particularly appropriate when the XML file to parse is deeply nested or has a complex structure. You might also consider using XPathDocument if other parts of your application code use that class so that you maintain a consistent coding look and feel.
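
A minimal sketch of the select-then-iterate pattern follows. It is not Listing 5: the class name XPathParseDemo is invented, the XML is a shortened, hypothetical stand-in for testCases.xml read from a string, and child elements are matched by name, which is only one of several ways the inner loops could be written.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathParseDemo
{
    // Shortened, hypothetical stand-in for testCases.xml.
    public const string Xml = @"<suite>
  <testcase id='001' kind='bvt'>
    <inputs><arg1>4</arg1><arg2>7</arg2></inputs>
    <expected>11.00</expected>
  </testcase>
  <testcase id='002' kind='drt'>
    <inputs><arg1>9</arg1><arg2>6</arg2></inputs>
    <expected>15.00</expected>
  </testcase>
</suite>";

    // Returns one string[] { id, kind, arg1, arg2, expected } per test case.
    public static List<string[]> Parse(string xml)
    {
        var result = new List<string[]>();
        var xpd = new XPathDocument(new StringReader(xml));
        XPathNavigator xpn = xpd.CreateNavigator();
        XPathNodeIterator xpi = xpn.Select("/suite/testcase");
        while (xpi.MoveNext())                        // one pass per <testcase>
        {
            XPathNavigator tc = xpi.Current;
            string id = tc.GetAttribute("id", "");    // second argument is the namespace URI
            string kind = tc.GetAttribute("kind", "");
            string arg1 = "", arg2 = "", expected = "";
            XPathNodeIterator tcChild = tc.SelectChildren(XPathNodeType.Element);
            while (tcChild.MoveNext())                // <inputs>, then <expected>
            {
                XPathNavigator child = tcChild.Current;
                if (child.Name == "expected")
                    expected = child.Value;
                else if (child.Name == "inputs")
                {
                    XPathNodeIterator args = child.SelectChildren(XPathNodeType.Element);
                    while (args.MoveNext())           // <arg1>, then <arg2>
                    {
                        if (args.Current.Name == "arg1") arg1 = args.Current.Value;
                        if (args.Current.Name == "arg2") arg2 = args.Current.Value;
                    }
                }
            }
            result.Add(new[] { id, kind, arg1, arg2, expected });
        }
        return result;
    }

    public static void Main()
    {
        foreach (string[] tc in Parse(Xml))
            Console.WriteLine(string.Join(" ", tc));
    }
}
```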

Parsing XML with XmlSerializer

The fourth technique we will use to parse an XML file is the XmlSerializer object. The example code is shown in Listing 6.

Using the XmlSerializer class is significantly different from using any of the other classes because the in-memory data store is different from the CommonLib.Suite we used for all other examples. In fact, observe that pulling the XML data into memory is accomplished in a single statement:

SerializerLib.Suite s = (SerializerLib.Suite)xs.Deserialize(sr);
I created a class library named "SerializerLib" to hold the definition for a Suite class that corresponds to the testCases.xml file so that the XmlSerializer object can store the XML data into it. The trick, of course, is to set up this Suite class.

Creating the Suite class is done with the help of the xsd.exe command-line tool. You will find it in your Program Files\Microsoft Visual Studio .NET\FrameworkSDK\bin folder. I used xsd.exe to generate a Suite class and then modified it slightly by changing some names and adding a Display() method.

The screen shot in Figure 4 shows how I generated the file testCases.cs, which contains a Suite definition that you can use directly or modify as I did. Listings 7 and 8 show the classes generated by XSD and my modified classes in the SerializerLib library.
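
For readers who want to see the shape of such classes without running xsd.exe, here is a hand-written sketch. It is not the xsd.exe output and not the SerializerLib code: the class names SuiteSketch, TestCaseSketch, and InputsSketch, their field names, and the shortened XML are all assumptions made for illustration. Only the XmlRoot, XmlElement, and XmlAttribute mappings correspond to the element and attribute names described earlier.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

[XmlRoot("suite")]
public class SuiteSketch
{
    [XmlElement("testcase")] public TestCaseSketch[] Items;   // one entry per <testcase>
}

public class TestCaseSketch
{
    [XmlAttribute("id")] public string Id;
    [XmlAttribute("kind")] public string Kind;
    [XmlElement("inputs")] public InputsSketch Inputs;
    [XmlElement("expected")] public string Expected;
}

public class InputsSketch
{
    [XmlElement("arg1")] public string Arg1;
    [XmlElement("arg2")] public string Arg2;
}

public static class SerializerParseDemo
{
    // Shortened, hypothetical stand-in for testCases.xml.
    public const string Xml = @"<suite>
  <testcase id='001' kind='bvt'>
    <inputs><arg1>4</arg1><arg2>7</arg2></inputs>
    <expected>11.00</expected>
  </testcase>
  <testcase id='002' kind='drt'>
    <inputs><arg1>9</arg1><arg2>6</arg2></inputs>
    <expected>15.00</expected>
  </testcase>
</suite>";

    // Deserializes the XML text into the sketch classes in a single call.
    public static SuiteSketch Parse(string xml)
    {
        var xs = new XmlSerializer(typeof(SuiteSketch));
        using (var sr = new StringReader(xml))
            return (SuiteSketch)xs.Deserialize(sr);
    }

    public static void Main()
    {
        foreach (TestCaseSketch tc in Parse(Xml).Items)
            Console.WriteLine(tc.Id + " " + tc.Kind + " " + tc.Inputs.Arg1 + " " + tc.Inputs.Arg2 + " " + tc.Expected);
    }
}
```

All of the mapping lives in the attributes on the classes, so the parsing code itself stays a single Deserialize() call.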

Using the XmlSerializer class gives a very elegant solution to the problem of parsing an XML file. Compared with the other four techniques in this article, XmlSerializer operates at the highest level of abstraction, meaning that the algorithmic details are largely hidden from you. But this gives you less control over the parsing and lends an air of magic to the process.

Most of the code I write is test automation, and using XmlSerializer is my default technique for parsing XML. XmlSerializer is most appropriate for situations not covered by the other four techniques in this article: fine-grained control is not required, the application does not use XmlDocument objects elsewhere, the XML file is not deeply nested, and the application is not primarily an ADO.NET application (as we will see in our next example).

Parsing XML with DataSet

The fifth and final method we will use to parse an XML file into memory uses the DataSet class. The example code is shown in Listing 9.

We start by reading the XML file directly into a System.Data.DataSet object using the ReadXml() method. A DataSet object can be thought of as an in-memory relational database. The XML data ends up in two tables, "testcase" and "inputs," that are related through a relation "testcase_inputs." The key to using this DataSet technique is knowing how to find out how the XML data is stored in the DataSet object.

Although we could create a custom DataSet object with completely known characteristics, it is much quicker to let the ReadXml() method do the work and then examine the result. I wrote a helper function DisplayInfo() that accepts a DataSet as an argument and displays the information we need to extract the data from the DataSet's tables.

To keep the main parse program uncluttered, I put DisplayInfo() into a class library named "InfoLib." The code is shown in Listing 10. The output from running the parse program is shown in Figure 5.
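
A helper of this kind might look roughly like the sketch below. It is not Listing 10: the class name DataSetInfoDemo, the Describe() method, and the output format are invented for illustration, and the XML is a shortened, hypothetical stand-in for testCases.xml. The helper lists each table with its columns, followed by each relation with its parent and child tables.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;

public static class DataSetInfoDemo
{
    // Shortened, hypothetical stand-in for testCases.xml.
    public const string Xml = @"<suite>
  <testcase id='001' kind='bvt'>
    <inputs><arg1>4</arg1><arg2>7</arg2></inputs>
    <expected>11.00</expected>
  </testcase>
  <testcase id='002' kind='drt'>
    <inputs><arg1>9</arg1><arg2>6</arg2></inputs>
    <expected>15.00</expected>
  </testcase>
</suite>";

    // Lets ReadXml() infer the tables and relations from the XML text.
    public static DataSet Load(string xml)
    {
        var ds = new DataSet();
        ds.ReadXml(new StringReader(xml));
        return ds;
    }

    // Describes every table (with its column names) and every relation.
    public static List<string> Describe(DataSet ds)
    {
        var lines = new List<string>();
        foreach (DataTable t in ds.Tables)
        {
            var cols = new List<string>();
            foreach (DataColumn c in t.Columns)
                cols.Add(c.ColumnName);
            lines.Add("table " + t.TableName + ": " + string.Join(", ", cols));
        }
        foreach (DataRelation r in ds.Relations)
            lines.Add("relation " + r.RelationName + ": "
                + r.ParentTable.TableName + " -> " + r.ChildTable.TableName);
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe(Load(Xml)))
            Console.WriteLine(line);
    }
}
```

The column lists also reveal the key columns that ReadXml() adds on its own to link the parent and child tables.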

The first table, "testcase," holds the data that is one level deep from the XML root: id, kind, and expected. The second table, "inputs," holds data that is two levels deep: arg1 and arg2. In general, if your XML file is n levels deep, ReadXml() will generate n tables.

Extracting the data from the parent test case table is easy. We just iterate through each row of the table and access by column name. To get the data from the child table inputs, we get an array of rows using the GetChildRows method:

DataRow[] children = row.GetChildRows("testcase_inputs");  // relation name
Because each <testcase> node has only one <inputs> child node, the children array will only have one row.

The trickiest aspect of this technique is to extract the child data:

tc.arg1 = (children[0]["arg1"]).ToString();  // there is only 1 row in children
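
Put together, the parent and child lookups can be sketched as follows. This is not Listing 9: the class name DataSetParseDemo is invented, the XML is a shortened, hypothetical stand-in for testCases.xml read from a string, and each test case is returned as a string array rather than a TestCase object.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;

public static class DataSetParseDemo
{
    // Shortened, hypothetical stand-in for testCases.xml.
    public const string Xml = @"<suite>
  <testcase id='001' kind='bvt'>
    <inputs><arg1>4</arg1><arg2>7</arg2></inputs>
    <expected>11.00</expected>
  </testcase>
  <testcase id='002' kind='drt'>
    <inputs><arg1>9</arg1><arg2>6</arg2></inputs>
    <expected>15.00</expected>
  </testcase>
</suite>";

    // Returns one string[] { id, kind, arg1, arg2, expected } per test case.
    public static List<string[]> Parse(string xml)
    {
        var result = new List<string[]>();
        var ds = new DataSet();
        ds.ReadXml(new StringReader(xml));                  // tables and relation are inferred
        foreach (DataRow row in ds.Tables["testcase"].Rows)
        {
            string id = row["id"].ToString();               // attributes become columns
            string kind = row["kind"].ToString();
            string expected = row["expected"].ToString();   // so do simple child elements
            DataRow[] children = row.GetChildRows("testcase_inputs"); // relation name
            string arg1 = children[0]["arg1"].ToString();   // single <inputs> row per test case
            string arg2 = children[0]["arg2"].ToString();
            result.Add(new[] { id, kind, arg1, arg2, expected });
        }
        return result;
    }

    public static void Main()
    {
        foreach (string[] tc in Parse(Xml))
            Console.WriteLine(string.Join(" ", tc));
    }
}
```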

Using the DataSet class to parse an XML file has a very relational database feel. Compared with other techniques in this article, it operates at a middle level of abstraction. The ReadXml() method hides a lot of details but you must traverse through relational tables.

Using DataSet to parse XML files is particularly appropriate when your application program is using ADO.NET classes so that you maintain a consistent look and feel. Using a DataSet object has high overhead and would not be a good choice if performance is an issue. Because each level of an XML file generates a table, if your XML file is deeply nested then using DataSet would not be a good choice.

Further Discussion

There are several related issues not yet covered: namespaces, generalization, error handling, validation, filtering, and performance. In the context of parsing XML data files, XML namespaces are a mechanism to prevent name clashes. Each of the techniques we've used can deal with namespaces. The MSDN Library will give you all the information you need to handle XML files with namespaces.

The techniques we have seen were not written to be particularly general. If you have a different XML structure, you will have to write different code. There is always a trade-off between writing code for a specific situation and making the code more generalized.

The code in this article does not have any error handling. Parsing XML files is quite error-prone, and in a production scenario you would need to add lots of try-catch blocks to create a robust parser.
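
As one small illustration of the kind of guard a production parser would need, the sketch below wraps an XmlDocument load in a try-catch block and reports where the parser gave up. The class name SafeLoadDemo and the TryLoad() helper are invented for this example, and a real parser would also have to handle file I/O errors and missing nodes.

```csharp
using System;
using System.Xml;

public static class SafeLoadDemo
{
    // Returns true if the text is well-formed XML; otherwise returns false
    // and describes the first problem the parser found.
    public static bool TryLoad(string xml, out string error)
    {
        error = null;
        try
        {
            var xd = new XmlDocument();
            xd.LoadXml(xml);
            return true;
        }
        catch (XmlException ex)
        {
            // XmlException carries the position at which parsing failed.
            error = "line " + ex.LineNumber + ", position " + ex.LinePosition + ": " + ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        string error;
        // The <testcase> tag is never closed, so the load fails.
        if (!TryLoad("<suite><testcase></suite>", out error))
            Console.WriteLine("Parse failed at " + error);
    }
}
```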

Additionally, I didn't address XML validation with schema files, but once again, in a production environment you would need to generate XML schema files and validate your XML data files against them before attempting to parse. It is possible to add validation to your parsing code, but I recommend validating before parsing.
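
A rough sketch of a separate validation pass is shown below. It uses XmlReaderSettings and XmlSchemaSet, which were added to the .NET Framework after the version this article targets, so treat it as an assumption about how you might approach the task today. The one-element schema, the class name ValidateDemo, and the Validate() helper are hypothetical and exist only to show the mechanics.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ValidateDemo
{
    // Hypothetical schema: a single <expected> element holding a decimal.
    public const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='expected' type='xs:decimal'/>
</xs:schema>";

    // Reads the XML with schema validation switched on and returns
    // the validation messages (an empty list means the document is valid).
    public static List<string> Validate(string xml)
    {
        var problems = new List<string>();
        var schemas = new XmlSchemaSet();
        schemas.Add(null, XmlReader.Create(new StringReader(Xsd)));

        var settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas = schemas;
        settings.ValidationEventHandler += (sender, e) => problems.Add(e.Message);

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { }   // validation happens as the reader advances
        }
        return problems;
    }

    public static void Main()
    {
        Console.WriteLine(Validate("<expected>11.00</expected>").Count);   // valid input
        foreach (string p in Validate("<expected>eleven</expected>"))      // not a decimal
            Console.WriteLine(p);
    }
}
```

Running the check as its own pass keeps the parsing code free of validation logic, in line with the recommendation to validate before parsing.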

In every example, we have read all the XML data into memory. In many cases, you will want to filter and just read in some data. All the techniques in this article can be modified to provide front-end filtering. The XPathDocument class has especially nice filtering capabilities by way of XPath syntax.
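
As a small sketch of XPath-based filtering, the code below selects only the test cases whose kind attribute is "bvt" and returns their ids. The class name XPathFilterDemo, the kind values, and the shortened XML are hypothetical stand-ins for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathFilterDemo
{
    // Shortened, hypothetical stand-in for testCases.xml.
    public const string Xml = @"<suite>
  <testcase id='001' kind='bvt'>
    <inputs><arg1>4</arg1><arg2>7</arg2></inputs>
    <expected>11.00</expected>
  </testcase>
  <testcase id='002' kind='drt'>
    <inputs><arg1>9</arg1><arg2>6</arg2></inputs>
    <expected>15.00</expected>
  </testcase>
</suite>";

    // Returns the ids of the test cases whose kind attribute equals the given value.
    public static List<string> IdsOfKind(string xml, string kind)
    {
        var ids = new List<string>();
        var xpd = new XPathDocument(new StringReader(xml));
        XPathNavigator xpn = xpd.CreateNavigator();
        // The predicate in square brackets does the filtering.
        // Assumes kind contains no single quote characters.
        XPathNodeIterator xpi = xpn.Select("/suite/testcase[@kind='" + kind + "']");
        while (xpi.MoveNext())
            ids.Add(xpi.Current.GetAttribute("id", ""));
        return ids;
    }

    public static void Main()
    {
        foreach (string id in IdsOfKind(Xml, "bvt"))
            Console.WriteLine(id);
    }
}
```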

If performance is an issue (usually when you are parsing many small XML files), you will have to run some timing measurements to determine whether your chosen technique is fast enough. Performance is too tricky a topic for many general statements, and the only way to know whether your performance is acceptable is to try your code. As a guideline, however, XmlTextReader has the best performance characteristics.

A Key Skill

XML data files are a key component of Microsoft's .NET developer environment. The ability to parse data from XML files into memory is a key skill in a .NET setting. Each of the five techniques, based on the XmlTextReader, XmlDocument, XPathDocument, XmlSerializer, and DataSet classes, is significantly different in terms of coding mechanics, coding mindset, and scenarios for usage. The .NET Framework gives you great flexibility in parsing XML data files and makes this essential task much easier and less error-prone than using non-.NET techniques.

References

XML in .NET Overview, http://msdn.microsoft.com/msdnmag/issues/01/01/xml/xml.asp

Consume XML C# app, http://msdn.microsoft.com/library/en-us/vcedit/html/vcwlkVisualCApplicationsConsumingXMLData.asp

XML Schema, http://msdn.microsoft.com/msdnmag/issues/02/04/xml/xml0204.asp

XML Namespaces, http://msdn.microsoft.com/msdnmag/issues/01/07/xml/default.aspx


Dr. James McCaffrey works for Volt Information Sciences Inc. where he manages technical training for software engineers working at Microsoft's Redmond, WA campus. He has worked on several Microsoft products, including Internet Explorer and MSN Search. James can be reached at jmccaffrey@volt.com or v-jammc@microsoft.com.
