Dr. Dobb's is part of the Informa Tech Division of Informa PLC




A C++ Template Wrapper for the XML SAX API


March 04:

Two techniques are commonly used to parse XML documents: the Simple API for XML (SAX) and the Document Object Model (DOM). When used to manipulate XML files, the DOM reads a file, breaks it into individual objects (such as elements, attributes, and comments), and creates a tree structure of the document in memory. You can reference and manipulate each object or node easily. However, because it reads the entire file into memory, DOM consumes significant memory when the document is large.

Unlike DOM, SAX parsing is event based. It reads a section of an XML document, generates events as it finds specific symbols in the XML document, and then moves on to the next section. By processing documents in this serial fashion, SAX uses much less memory than DOM and is better for processing large documents. The straightforward way to use SAX is to create different functions to respond to those events. Those functions generally contain multiple if-else-if blocks to process different elements found in the XML document. This leads to extensive code duplication in multischema environments and presents a high degree of coupling between parsing code and business logic. In this article, I present a mechanism to decouple processing, eliminate duplication, and further simplify the development interface by hiding the details of SAX. The complete code for this is available at http://www.cuj.com/code/.

SAX models the information set through a set of abstract programmatic interfaces including ContentHandler, DTDHandler, LexicalHandler, ErrorHandler, and the like. Each interface models the information set as an ordered sequence of method calls. For example, consider this simple document:

    <?xml version="1.0"?>
    <BookedOrder>
        <OrderID>1000</OrderID>
    </BookedOrder>

As the SAX parser processes this document, it generates a sequence of events such as this:

    StartDocument()
    StartElement("BookedOrder")
    StartElement("OrderID")
    Characters("1000")
    EndElement("OrderID")
    EndElement("BookedOrder")
    EndDocument()

These methods are the main components of the ContentHandler interface, which is also the primary interface of SAX. Here I focus on ContentHandler's implementation and use. ISAXContentHandler is Microsoft's implementation of this interface. All code presented in this article centers around this interface. For details on ISAXContentHandler and Microsoft's XML Parser (MSXML), refer to the MSDN library [1].
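To make the event sequence concrete, here is a minimal sketch that replays it against a simplified handler interface. SimpleContentHandler, EventLogger, and replayBookedOrder are hypothetical names invented for this illustration. The real ISAXContentHandler methods take wide-character buffers, namespace URIs, and attribute lists, and return HRESULT, all of which are left out here.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a SAX content handler.
// It is NOT ISAXContentHandler: std::string replaces the wide-character
// buffers, and namespaces, attributes, and HRESULT returns are omitted.
struct SimpleContentHandler {
    virtual ~SimpleContentHandler() = default;
    virtual void startDocument() = 0;
    virtual void startElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void endDocument() = 0;
};

// Records every event it receives, in arrival order.
struct EventLogger : SimpleContentHandler {
    std::vector<std::string> log;
    void startDocument() override { log.push_back("StartDocument"); }
    void startElement(const std::string& name) override {
        log.push_back("StartElement(" + name + ")");
    }
    void characters(const std::string& text) override {
        log.push_back("Characters(" + text + ")");
    }
    void endElement(const std::string& name) override {
        log.push_back("EndElement(" + name + ")");
    }
    void endDocument() override { log.push_back("EndDocument"); }
};

// Plays the role of the parser: raises the events for the BookedOrder
// document in the same order a SAX parser would report them.
inline void replayBookedOrder(SimpleContentHandler& handler) {
    handler.startDocument();
    handler.startElement("BookedOrder");
    handler.startElement("OrderID");
    handler.characters("1000");
    handler.endElement("OrderID");
    handler.endElement("BookedOrder");
    handler.endDocument();
}
```

Passing an EventLogger to replayBookedOrder() leaves seven entries in its log, one per event, which is all a handler ever sees of the document.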

The Brute-Force Way

Typical examples of using the SAX API rely on brute force: You create different functions to respond to those events and do the relevant processing in each function after comparing element names. Figure 1 is a sample XML document with some simplified customer orders. Listing 1 is an implementation of ISAXContentHandler named CustomerOrderHandler. In the member function startElement(), it simply records the name of the element currently being parsed into the member variable m_CurrentTag. This variable is later used in the characters() function to decide the proper conversion for each element by comparing it with pwchLocalName, which is the tag name associated with the element content. In the endElement() event, the entire customer order is processed once the CustomerOrder element has ended. In this example, I just print it out.
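The brute-force style can be sketched in a few lines. This is not Listing 1: it uses std::string instead of the MSXML wide-character signatures, and the OrderRecord type and the element names Customer and Quantity are assumptions made for the illustration.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical business data for the sketch.
struct OrderRecord {
    long id = 0;
    std::string customer;
    double quantity = 0.0;
};

// Brute-force handler: one if-else-if chain per schema.
class BruteForceOrderHandler {
    std::string m_CurrentTag;   // element currently being parsed
    OrderRecord m_Order;        // order being assembled
public:
    std::vector<OrderRecord> processed;

    void startElement(const std::string& name) { m_CurrentTag = name; }

    void characters(const std::string& text) {
        // Every new element means another branch here.
        if (m_CurrentTag == "OrderID")
            m_Order.id = std::strtol(text.c_str(), nullptr, 10);
        else if (m_CurrentTag == "Customer")
            m_Order.customer = text;
        else if (m_CurrentTag == "Quantity")
            m_Order.quantity = std::strtod(text.c_str(), nullptr);
    }

    void endElement(const std::string& name) {
        m_CurrentTag.clear();
        if (name == "CustomerOrder") {
            // Business processing is hard-wired into the parser callback.
            processed.push_back(m_Order);
            m_Order = OrderRecord();
        }
    }
};
```

Parsing and processing share one class here, so supporting a second schema means writing a second class with the same shape.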

While this approach is straightforward, it has two disadvantages. First, not only is the parsing of different elements mixed together, but the business-processing code is also tightly coupled with the parsing code. This violates a basic object-oriented principle, the separation of concerns, and leads to potential maintenance headaches. Second, most enterprise applications need to parse multiple XML documents with different schemas. For example, besides customer orders, you may also need to parse XML documents for replenishment orders, purchase orders, and so on. As with CustomerOrderHandler, you would need to write a ReplenishmentOrderHandler for replenishment orders, a PurchaseOrderHandler for purchase orders, and so on. The list could be long, depending on the number of XML schemas you need to deal with. Furthermore, these handlers are similar, so you end up writing nearly identical if-else-if blocks for the ContentHandler events over and over again. As the number of XML schemas grows, this work becomes tedious and, because copy-and-paste is the likely shortcut, can introduce subtle bugs that are hard to locate.

The Template Wrapper

Design patterns are proven solutions for recurring problems. When facing a design issue, I always look first for an applicable design pattern. To address the design concerns of the brute-force technique, first take a look at some characteristics of XML documents. An XML document can be represented as a tree structure (remember, DOM always does this). In this tree, the root is a document-level element composed of multiple composite elements. Those composite elements can in turn be composed of other composite elements and/or atomic elements. These elements are the nodes and leaves of the tree. Applying divide and conquer, parsing an XML document can be accomplished by fully parsing all its component elements. Likewise, parsing a composite element can be accomplished by fully parsing all of its child elements. At the same time, you'd like to treat all elements uniformly so that you don't need to write similar code multiple times. These characteristics make the problem a good candidate for the composite pattern.

Using the terms defined in the book Design Patterns [2], it is fairly easy to identify the major participants for this instance of the composite pattern: the Document, Composite, and Leaf elements. Since my main concern is the ContentHandler interface, I arrived at DocumentContentHandler, CompositeContentHandler, and LeafContentHandler, which all implement the ISAXContentHandler interface. Figure 2 shows a typical mapping from an XML document to this composite structure. Clients use ISAXContentHandler to interact with elements in this composite structure. If the element is a Leaf, its content is converted directly. If the element is a Composite, it delegates the processing to its child components, with the possibility of performing additional processing before and/or after the delegation. The processing for each element is encapsulated in its own content handler. This limits the ripple effect when an XML schema changes: You only need to modify, add, or remove the relevant handler without touching other elements. Decoupling the parsing for different elements helps to alleviate some design concerns, but still does not reduce the code duplication that may occur when implementing Document/Composite/Leaf content handlers.
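As a rough illustration of the Leaf and Composite roles, the sketch below uses a hypothetical Handler interface with std::string arguments in place of ISAXContentHandler. The class names and the add() helper are assumptions for this example and are not taken from the article's listings.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical common interface shared by leaves and composites.
struct Handler {
    virtual ~Handler() = default;
    virtual void startElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
    virtual void endElement(const std::string& name) = 0;
};

// Leaf: keeps the text content of one atomic element.
struct LeafHandler : Handler {
    std::string value;
    void startElement(const std::string&) override { value.clear(); }
    void characters(const std::string& text) override { value += text; }
    void endElement(const std::string&) override {}
};

// Composite: forwards events to whichever child element is open.
// Limitation of the sketch: a nested element with the same name as
// the active child would close that child too early.
class CompositeHandler : public Handler {
    std::map<std::string, std::unique_ptr<Handler>> m_Children;
    Handler* m_Active = nullptr;   // child currently receiving events
    std::string m_ActiveName;
public:
    // Creates a child handler of type T for the given element name.
    template <typename T>
    T& add(const std::string& name) {
        auto child = std::make_unique<T>();
        T& ref = *child;
        m_Children[name] = std::move(child);
        return ref;
    }

    void startElement(const std::string& name) override {
        if (m_Active) { m_Active->startElement(name); return; }
        auto it = m_Children.find(name);
        if (it != m_Children.end()) {
            m_Active = it->second.get();
            m_ActiveName = name;
            m_Active->startElement(name);
        }
    }
    void characters(const std::string& text) override {
        if (m_Active) m_Active->characters(text);
    }
    void endElement(const std::string& name) override {
        if (!m_Active) return;
        m_Active->endElement(name);
        if (name == m_ActiveName) m_Active = nullptr;
    }
};
```

Because a CompositeHandler is itself a Handler, it can be added as a child of another CompositeHandler, which is what lets the client treat every element uniformly.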

A C++ template is a natural choice for making the code generic. The question is what the template type should be. Typically, the data contained in an XML document represents some abstract business information set that can be mapped to a business class. The Leaf elements can usually be mapped to variables of primitive types, such as an integer or a string, which become the member variables of the relevant business class. By using the business class mapped from the XML data as the template type, parsing different XML documents reduces to defining different business classes. Application developers no longer need to worry about the details of XML parsing, since it has been wrapped in the template handler classes. It is much easier to deal with a business class than with the angle brackets in XML documents. Furthermore, since most business logic centers around the business class, you can define a functor that processes the relevant business objects and use this functor as another template type. Now the business logic is nicely decoupled from the parsing. If you need additional business logic for this business data, you just define another functor. In this way, application programmers can focus on the business logic.

Listing 2 shows the major functions of the DocumentContentHandler<Data, Functor> class. The template type Data is the business class related to the composite elements described in the XML document, and Functor is a class that provides the following function:

    class Functor
    {
        //......
        void operator() (const Data& data);
        // .....
    };

Of course, Functor could be a template class as well.
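A small sketch shows how two different functors can be plugged into the same template without touching the parsing side. OrderRecord, OrderCounter, IdCollector, and RecordSink are hypothetical names chosen for this example. RecordSink stands in for whatever handler hands a finished Data object to the functor.

```cpp
#include <string>
#include <vector>

// Hypothetical business data.
struct OrderRecord {
    long id;
    std::string customer;
};

// Functor 1: counts the orders it is given.
struct OrderCounter {
    int count = 0;
    void operator()(const OrderRecord&) { ++count; }
};

// Functor 2: remembers the order IDs it is given.
struct IdCollector {
    std::vector<long> ids;
    void operator()(const OrderRecord& order) { ids.push_back(order.id); }
};

// Stand-in for a handler parameterized on business class and functor.
// It only needs Functor to be callable with a const Data&.
template <typename Data, typename Functor>
class RecordSink {
    Functor* m_pFunctor;   // not owned
public:
    explicit RecordSink(Functor* pFunctor) : m_pFunctor(pFunctor) {}
    // Called when one Data object has been fully assembled.
    void completed(const Data& data) { (*m_pFunctor)(data); }
};
```

Swapping OrderCounter for IdCollector changes what happens to each order, while RecordSink itself stays the same.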

The main responsibility of DocumentContentHandler is to provide the entry point for parsing, the member function Parse(), and to maintain two member variables defined as follows:

    map<XmlString, BaseContentHandler*>  m_AllHandlers;
    stack<BaseContentHandler*>           m_ActiveHandlers;

All relevant content handlers are stored in the map m_AllHandlers. Instead of being used in multibranch if-else-if statements, the element names are now used as search keys for m_AllHandlers. Whenever an element is being parsed, the content handler defined for that element is looked up in m_AllHandlers by name, set as the active handler, and used for the appropriate processing. Since the parsing of a composite element is not finished until the parsing of all its child components is finished, there can be multiple active handlers. To be able to return to the composite element handler for possible post-processing, the stack variable m_ActiveHandlers keeps track of all handlers currently in use. The content handler is found and pushed onto this stack by the startElement() function and popped off by the endElement() function. The first-in-last-out (FILO) behavior of this stack-based mechanism [3] matches the way the elements need to be parsed: The parsing of a composite element starts before its child elements and ends after them. Jim Beveridge [3] presents a similar stack-based SAX parsing mechanism that only decouples the parsing among different XML elements.
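The lookup-and-stack mechanism can be sketched as follows. The member names m_AllHandlers and m_ActiveHandlers come from the article. BaseHandler, RecordingHandler, Dispatcher, and registerHandler() are hypothetical, and std::string replaces XmlString to keep the example portable.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stack>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified base class for per-element handlers.
struct BaseHandler {
    virtual ~BaseHandler() = default;
    virtual void characters(const std::string&) {}
    virtual void endElement() {}
};

// Test handler: keeps its text and logs its tag when it is closed.
struct RecordingHandler : BaseHandler {
    std::string tag;
    std::string text;
    std::vector<std::string>* closed;   // shared log, not owned
    RecordingHandler(std::string t, std::vector<std::string>* log)
        : tag(std::move(t)), closed(log) {}
    void characters(const std::string& s) override { text += s; }
    void endElement() override { closed->push_back(tag); }
};

class Dispatcher {
    std::map<std::string, BaseHandler*> m_AllHandlers;   // name -> handler
    std::stack<BaseHandler*> m_ActiveHandlers;           // open elements
public:
    void registerHandler(const std::string& tag, BaseHandler* handler) {
        m_AllHandlers[tag] = handler;
    }
    void startElement(const std::string& name) {
        auto it = m_AllHandlers.find(name);
        // Unknown elements push a null entry so that every endElement()
        // still pops exactly what its startElement() pushed.
        m_ActiveHandlers.push(it != m_AllHandlers.end() ? it->second : nullptr);
    }
    void characters(const std::string& text) {
        if (!m_ActiveHandlers.empty() && m_ActiveHandlers.top())
            m_ActiveHandlers.top()->characters(text);
    }
    void endElement(const std::string&) {
        assert(!m_ActiveHandlers.empty());
        if (BaseHandler* handler = m_ActiveHandlers.top())
            handler->endElement();
        m_ActiveHandlers.pop();
    }
    std::size_t depth() const { return m_ActiveHandlers.size(); }
};
```

When a composite element and one of its children are both registered, the child's handler is closed first and the composite's handler last, which gives the composite its chance to post-process.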

The content of the map m_AllHandlers is created in the constructor of DocumentContentHandler by using the binding information provided by the instantiating Data class.

Listing 3 presents the major functions of class CompositeContentHandler<Data, Functor>. The template types Data and Functor are the same types passed from the DocumentContentHandler class. Take a look at the following three member variables first:

    vector<ContentHandlerPtr>  m_ChildHandlers;
    Functor*                   m_pFunctor;
    Data                       m_Data;

The container m_ChildHandlers stores the LeafContentHandlers for all child elements. The business logic is performed in the function endElement() by the functor that m_pFunctor points to, operating on the data m_Data. As you can see in endElement(), after the parsing of all child elements is completed, the CompositeContentHandler collects the data from its child handlers, transfers it into the embedded business object m_Data, and then dispatches the object to the prehooked functor m_pFunctor. Another important effect is that m_pFunctor receives each business object as soon as its element has been parsed, and keeps receiving them until the entire document is parsed, so the client needs no loop over the results. This lets the business processing work in step with the parsing logic while remaining decoupled from it.
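The collect-then-dispatch step in endElement() might look roughly like the sketch below. It is not Listing 3: the child handlers are reduced to a tag, a text buffer, and a std::function that stores the text into the business object, and bind() is a hypothetical replacement for the binding information the article reads from the Data class. OrderRecord and CollectOrders are likewise invented for the example.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

template <typename Data, typename Functor>
class CompositeSketch {
    struct Child {
        std::string tag;    // element name this child handles
        std::string text;   // content collected for the current record
        std::function<void(Data&, const std::string&)> store;
    };
    std::string m_ElementTag;            // name of the composite element
    Functor* m_pFunctor;                 // business logic, not owned
    std::vector<Child> m_ChildHandlers;
    Data m_Data{};
    Child* m_Current = nullptr;          // child whose element is open
public:
    CompositeSketch(std::string elementTag, Functor* pFunctor)
        : m_ElementTag(std::move(elementTag)), m_pFunctor(pFunctor) {}

    // Call before parsing starts: registers how one child is stored.
    void bind(std::string tag,
              std::function<void(Data&, const std::string&)> store) {
        m_ChildHandlers.push_back(
            Child{std::move(tag), std::string(), std::move(store)});
    }
    void startElement(const std::string& name) {
        m_Current = nullptr;
        for (Child& child : m_ChildHandlers)
            if (child.tag == name) { child.text.clear(); m_Current = &child; }
    }
    void characters(const std::string& text) {
        if (m_Current) m_Current->text += text;
    }
    void endElement(const std::string& name) {
        m_Current = nullptr;
        if (name != m_ElementTag) return;
        // Composite finished: gather child data, then hand it over.
        for (Child& child : m_ChildHandlers) {
            child.store(m_Data, child.text);
            child.text.clear();
        }
        (*m_pFunctor)(m_Data);
        m_Data = Data();
    }
};

// Hypothetical business class and functor used to exercise the sketch.
struct OrderRecord { std::string id; std::string customer; };
struct CollectOrders {
    std::vector<std::string> seen;
    void operator()(const OrderRecord& r) { seen.push_back(r.id + ":" + r.customer); }
};
```

The functor is invoked once per closing composite element, so a document with many orders delivers them one at a time while parsing is still under way.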

Like m_AllHandlers in DocumentContentHandler, the content of the vector m_ChildHandlers is created in the constructor.

The following code shows the main functionality of class LeafContentHandler<T>, where T is expected to be some primitive type such as an integer, double, or string.

    template <typename T>
    class LeafContentHandler : public BaseContentHandler
    {
        // ....
        T      m_Data;

    public:
        // ....
        virtual HRESULT STDMETHODCALLTYPE characters(
            /* [in] */ wchar_t __RPC_FAR *pwchChars,
            /* [in] */ int cchChars)
        {
            XmlString strTemp(pwchChars, cchChars);
            m_Data = xmlstring_cast<T>(strTemp);
            return S_OK;
        }
    };

The major responsibility of LeafContentHandler is to convert the element content from string format to the expected data type and store it in the variable m_Data, which is later retrieved by its CompositeContentHandler. In the function characters(), the template function xmlstring_cast() converts the content into a value of template type T.
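The article does not show xmlstring_cast() itself. One plausible way to write such a conversion is with a wide string stream, as in the sketch below. This is an assumption about how it could be done, not the author's implementation, and XmlString is assumed here to be std::wstring.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Assumption for this sketch: XmlString is a wide standard string.
using XmlString = std::wstring;

// Generic conversion: let the stream extraction operator parse the text.
template <typename T>
T xmlstring_cast(const XmlString& text) {
    std::wistringstream in(text);
    T value{};
    in >> value;
    if (in.fail())
        throw std::invalid_argument("xmlstring_cast: conversion failed");
    return value;
}

// Strings are returned unchanged so that embedded spaces survive
// (stream extraction would stop at the first whitespace).
template <>
inline XmlString xmlstring_cast<XmlString>(const XmlString& text) {
    return text;
}
```

Two caveats apply to this sketch: trailing characters after a valid number are ignored, and a SAX parser may deliver the content of one element in several characters() calls, so a production handler would probably accumulate the text and convert it in endElement().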

Although these template classes hide all SAX details, it is still tedious to define every handler manually and to make sure the template classes are instantiated with the correct types. Modeled on the Recordset binding entries and macros such as BEGIN_ADO_BINDING in the Visual C++ Extensions for ADO, a similar data structure, XmlBindingEntry, and accompanying macros are defined as:

    struct XmlBindingEntry
    {
        XmlDataType  eDataType;
        unsigned long ulBufferOffset;
    };

    #define BEGIN_XML_BINDING(ElementTag, Class) public: \
        typedef Class XMLElementClass; \
        static XmlString GetClassName() { return XmlString(ElementTag); } \
        const XmlBindingEntry* GetXmlBindingEntries() { \
        static const XmlBindingEntry rgXMLBindingEntries[] = {

    #define XML_ELEMENT(TagName, DataType, Buffer)\
        { TagName, \
          DataType, \
          offsetof(XMLElementClass, Buffer) \
        },

    #define END_XML_BINDING()   {L"", eXML_EMPTY, 0}};\
        return rgXMLBindingEntries;}

The macros BEGIN_XML_BINDING, XML_ELEMENT, and END_XML_BINDING are used to describe the XML schema in the format of XmlBindingEntry and build the binding entry array rgXMLBindingEntries for each associated business class. The DocumentContentHandler and CompositeContentHandler read the binding information from their instantiating business class, create the relevant handlers, and insert them into the map m_AllHandlers and the vector m_ChildHandlers, respectively.
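To show what such a binding table buys you, the following sketch writes parsed text into a business object using only the table: the tag selects an entry, and the stored offset locates the member. It is written without macros for readability. BindingEntry, OrderRecord, storeValue(), the element name Quantity, and the enumerators eXML_LONG and eXML_DOUBLE are assumptions for this example; only eXML_EMPTY appears in the article. The business class is kept to plain data members so that offsetof is well defined.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical type tags; only eXML_EMPTY is named in the article.
enum XmlDataType { eXML_EMPTY, eXML_LONG, eXML_DOUBLE };

// Hypothetical entry layout: tag name, data type, member offset.
struct BindingEntry {
    const wchar_t* tag;
    XmlDataType    type;
    std::size_t    offset;
};

// Business class with a hand-written binding table. The last entry
// is a sentinel, as in END_XML_BINDING.
struct OrderRecord {
    long   orderId;
    double quantity;

    static const BindingEntry* bindings() {
        static const BindingEntry entries[] = {
            { L"OrderID",  eXML_LONG,   offsetof(OrderRecord, orderId)  },
            { L"Quantity", eXML_DOUBLE, offsetof(OrderRecord, quantity) },
            { L"",         eXML_EMPTY,  0 }
        };
        return entries;
    }
};

// Finds the entry for 'tag' and stores the converted text into the
// member at the recorded offset. Returns false for unknown tags.
inline bool storeValue(void* object, const BindingEntry* entries,
                       const std::wstring& tag, const std::wstring& text) {
    for (const BindingEntry* e = entries; e->type != eXML_EMPTY; ++e) {
        if (tag != e->tag) continue;
        char* field = static_cast<char*>(object) + e->offset;
        switch (e->type) {
        case eXML_LONG:
            *reinterpret_cast<long*>(field) = std::wcstol(text.c_str(), nullptr, 10);
            return true;
        case eXML_DOUBLE:
            *reinterpret_cast<double*>(field) = std::wcstod(text.c_str(), nullptr);
            return true;
        default:
            return false;
        }
    }
    return false;
}
```

The generic code never names a member of OrderRecord. Everything schema-specific lives in the table, which is the part the macros generate.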

An Example

Here's how the sample XML file in Figure 1 could be parsed through this wrapper step-by-step:

Step 1. Create the business class that maps to the data defined in the XML data and use the macros to describe its schema. Listing 4 shows the class CustomerOrder.

Step 2. Create the functor to process the business object defined in Step 1. Listing 5 shows the functor CustomerOrderProcessor. For simplicity, the functor only prints the content of the object to the screen. In practice, you can put the desired processing in this class, such as transforming the object into another format or writing it to a database. You can also define different functors for different processing and hook the right one to the wrapper classes either statically or dynamically.

Step 3. Instantiate the wrapper class with the class and functor defined in the former steps and start to parse the specified XML document, as in Listing 6, which results in this output:

    Order 1000 from XYZ Inc.: Steel 2.8 Tons has been processed.
    Order 1001 from ABC Inc.: Plastic 30.5 Rolls has been processed.
    All Customer Order Processing Has Been Finished.

No multibranch if-else-if statements, no complicated-looking SAX API, and no duplicate code. More importantly, you can focus on the business class and its logic.

Conclusion

Applying design patterns and the power of C++ templates not only hides the details of the SAX API, but also decouples the parsing logic from the business logic. As a result, you can spend more time on business logic than on the XML parsing API. An obvious extension of this idea would be to move beyond simply structured data and handle other events of ContentHandler as well as other SAX interfaces.

Acknowledgments

Thanks to Jeffrey Burch of AOL and Mike Wayne of InterEnable Corp. for providing technical review.

References

[1] Microsoft XML 3.0: SAX2 Developer's Guide, Microsoft MSDN Library, October 2001.

[2] Gamma, Erich, et al. Design Patterns, Addison-Wesley, 1995.

[3] Beveridge, Jim. "Transporting Data with XML: The C++ Problem." XML Magazine, Summer 2000.


Yingjun Zhang is a lead software engineer at Preston Aviation Solutions, a Boeing company. His main interests are software architecture and generic and object-oriented programming in C++. He can be reached at zhang_yingjun@hotmail.com.


