Friday, January 1, 2010

A Recursive XML Parser in C#

Introduction:

In recent weeks I was asked to write an XML parser that can parse XML files without knowing the schema on which the files are based. It is a basic problem, and the natural solution is a recursive one.

A quick Google search didn't turn up anything interesting, so I decided to use brute force (i.e. write my own parser), and 45 minutes later I had the solution. Two weeks ago a different client asked me for an almost identical solution, and I had to spend another 45 minutes reproducing it. It looks like I will be expected to reproduce it again soon, so I might as well post the solution here to save myself 45 minutes and maybe help other folks who need to solve the same problem.

The Basic Problem:

You are given an XML file and asked to parse the content and display it in the form of a tree structure. For instance if Fig 1 below is the content of the XML file, the solution should display an output like Fig 2.

Fig 1: The Content of the XML File



Fig 2: The Output



The Solution:

Because the XML schema is unknown, the solution is naturally a recursive one. The objective is to visit every node, extract its name and attributes, and work out how many tabs to prepend to each node name so that the output imitates a tree structure.

To let the parser keep track of the information for each node, a class called VisualNode is created, as shown in Fig 3. A VisualNode is a lightweight representation of an XML node. It is called a "Visual" node simply because this is what the user will see on the output screen.

Fig 3: The VisualNode Class



To make life easier for ourselves, we have overridden the ToString() method of this class.
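Since the figure is not reproduced here, the following is only a minimal sketch of what such a class could look like. The member names (Name, Indent, Attributes) and the exact output format of ToString() are assumptions made for illustration, not the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch of a VisualNode: a lightweight, printable
// stand-in for one XML element. Member names are assumptions.
public class VisualNode
{
    public string Name { get; private set; }
    public int Indent { get; private set; }
    public IList<KeyValuePair<string, string>> Attributes { get; private set; }

    public VisualNode(string name, int indent,
                      IList<KeyValuePair<string, string>> attributes)
    {
        Name = name;
        Indent = indent;
        Attributes = attributes;
    }

    // One tab per indent level, then the node name, then name="value" pairs.
    public override string ToString()
    {
        var sb = new StringBuilder();
        sb.Append('\t', Indent);
        sb.Append(Name);
        foreach (var attribute in Attributes)
        {
            sb.Append(' ').Append(attribute.Key)
              .Append("=\"").Append(attribute.Value).Append('"');
        }
        return sb.ToString();
    }
}
```

With ToString() overridden like this, printing a node is just a matter of passing it to Console.WriteLine.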

The Parser:

This is the heart of the program. The class constructor accepts the fully qualified name of the XML file and loads it into an XmlDocument object called _XMLDoc. This object is used in the GetXMLNodeList() method to extract the XML node tree.
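As a rough illustration of the loading step only, here is a sketch built on the standard System.Xml.XmlDocument class. The class name XmlParserSketch is hypothetical, and the body of GetXMLNodeList() is an assumption about how the node tree might be exposed.

```csharp
using System;
using System.Xml;

// Hypothetical sketch of the loading part of the parser.
public class XmlParserSketch
{
    private readonly XmlDocument _XMLDoc = new XmlDocument();

    // Hidden so that callers must always supply a file path.
    private XmlParserSketch() { }

    public XmlParserSketch(string xmlFilePath)
    {
        // Load throws FileNotFoundException when the file is missing
        // and XmlException when the content is not well-formed XML.
        _XMLDoc.Load(xmlFilePath);
    }

    // Assumption: the "node tree" is the document's top-level node list,
    // e.g. the XML declaration followed by the root element.
    public XmlNodeList GetXMLNodeList()
    {
        return _XMLDoc.ChildNodes;
    }
}
```

Because loading happens in the constructor, any problem with the file surfaces as an exception at construction time, which is what lets the caller deal with it in a single try-catch.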

Fig 4: The Parser Class



GetVisualNodeList() is where the real work starts. It first declares and instantiates a generic list called visualNodesList. The node parser needs this list to keep track of the parsed nodes, and we will later iterate through it to print the final output.

Next, we iterate through the top-level nodes of the node tree. At this stage there may be 0, 1, 2 or more nodes. If there are 0 nodes, 3 or more nodes, or just the XML Declaration node, the parser treats the XML file as invalid. (Strictly speaking, a well-formed document may also have comments, processing instructions or a DOCTYPE at the top level; this simple check rejects those files.) If there is only one node, it must be the root node of the document. If there are 2 nodes, they should be the XML Declaration and the root node. If the file is missing or invalid, an exception will be thrown. If the file is a valid XML file and has an XML Declaration node, we ignore that node as it is not required in the output.
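The rules above can be sketched roughly as follows. RootFinder and FindRoot are hypothetical names, and the choice of XmlException for the failure case is an assumption; the post does not say which exception type is thrown.

```csharp
using System;
using System.Xml;

// Hypothetical helper that applies the 1-node / 2-node rule.
public static class RootFinder
{
    public static XmlNode FindRoot(XmlDocument doc)
    {
        XmlNodeList nodes = doc.ChildNodes;

        // A single top-level node must be the root element.
        if (nodes.Count == 1 && nodes[0].NodeType == XmlNodeType.Element)
        {
            return nodes[0];
        }

        // Two nodes: the XML declaration (ignored) followed by the root.
        if (nodes.Count == 2
            && nodes[0].NodeType == XmlNodeType.XmlDeclaration
            && nodes[1].NodeType == XmlNodeType.Element)
        {
            return nodes[1];
        }

        // Anything else is rejected by this simple check.
        throw new XmlException("Expected an optional XML declaration followed by one root element.");
    }
}
```

If all that is needed is the root element, XmlDocument also exposes it directly through its DocumentElement property, which skips the declaration and any top-level comments.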

Once the root node of the XML file (e.g. the people node) is read, it is passed to the ParseNode(...) method. ParseNode is a recursive function (i.e. it calls itself) and is called to parse every single XML node in the document. It takes 3 parameters: the XML node, the VisualNodeList and the number of tabs. The interesting parameter is the number of tabs (i.e. the indent parameter). We only know that for the first node indent is equal to zero (hence Fig 4 line 22) and that for each child the indent should be one more than the parent's indent value (hence Fig 4 line 49). Apart from these 2 facts, we leave it entirely to recursion to handle the number of tabs, and it does a good job of it too. (Actually, this is one of the few places where ++indent will not work. ++indent changes the caller's own copy of indent, so every following sibling would be pushed one tab deeper than the previous one. You must use indent + 1, which leaves the caller's value untouched and lets the recursion control the value of the indent.)
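To make the indent handling concrete, here is a small self-contained sketch of the recursion. To keep it short it collects plain strings rather than VisualNode objects, and it skips non-element nodes such as text and comments; both are simplifications assumed for this example, and RecursiveWalk is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

// Hypothetical sketch of the recursive walk.
public static class RecursiveWalk
{
    public static void ParseNode(XmlNode node, List<string> lines, int indent)
    {
        // Only elements are shown in the tree (assumption).
        if (node.NodeType != XmlNodeType.Element)
        {
            return;
        }

        var sb = new StringBuilder();
        sb.Append('\t', indent).Append(node.Name);
        foreach (XmlAttribute attribute in node.Attributes)
        {
            sb.Append(' ').Append(attribute.Name)
              .Append("=\"").Append(attribute.Value).Append('"');
        }
        lines.Add(sb.ToString());

        foreach (XmlNode child in node.ChildNodes)
        {
            // indent + 1, not ++indent: the local indent must stay the
            // same for every sibling visited by this loop.
            ParseNode(child, lines, indent + 1);
        }
    }
}
```

Calling ParseNode(root, lines, 0) on the root element fills the list in document order, with each line already carrying the right number of leading tabs.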

The Driver:

The final piece of the puzzle is of course the driver (the main program). As shown in Fig 5, it's pretty straightforward.

Fig 5: The Driver



We have created an object of type XmlParser and used it to get a reference to its VisualNodeList object. Then we iterate through this list and use the overridden ToString() method of each VisualNode to display the node on the console screen. We have wrapped this in a try-catch block to ensure that file access and XML validity problems are handled appropriately.
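The shape of such a driver can be sketched as below. This version is self-contained, so it walks the document directly instead of going through the parser class, and the names DriverSketch and Run, the return codes and the messages are all assumptions made for this example.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical driver: load, print the tree, report failures.
public static class DriverSketch
{
    // Returns 0 on success, 1 on file access errors, 2 on invalid XML.
    public static int Run(string xmlFilePath, TextWriter output)
    {
        try
        {
            var doc = new XmlDocument();
            doc.Load(xmlFilePath);
            Print(doc.DocumentElement, 0, output);
            return 0;
        }
        catch (IOException ex)
        {
            // Covers FileNotFoundException and DirectoryNotFoundException.
            output.WriteLine("Could not read the file: " + ex.Message);
            return 1;
        }
        catch (XmlException ex)
        {
            output.WriteLine("The file is not valid XML: " + ex.Message);
            return 2;
        }
    }

    // Prints element names only, one tab per level of depth.
    private static void Print(XmlNode node, int indent, TextWriter output)
    {
        if (node.NodeType != XmlNodeType.Element)
        {
            return;
        }
        output.WriteLine(new string('\t', indent) + node.Name);
        foreach (XmlNode child in node.ChildNodes)
        {
            Print(child, indent + 1, output);
        }
    }
}
```

A console Main method would then only need to call DriverSketch.Run(args[0], Console.Out). Catching the two exception types separately lets the program tell the user whether the file could not be read or could not be parsed.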

With season's greetings!
01/01/10

3 comments:

rabaumann said...

I took your recursive XML parser and duplicated it within MS VS2010 as a Application console C# program. It gave me 2 errors for the XMLParser.cs. "Method must have a return type" on lines 13 and 14. Which are these two lines:

private XmlParser() { }
public XmlParser(string xmlFilePath)

Did you run across this ??? Do you have a solution ???

NasserO said...

Hi Rabaumann;

These are the constructors of the class. They can't have return types. Have you changed the name of the class by any chance? The class name MUST be the same as the constructor name.

Please check that, if you have changed the name of class, change it back to XmlParser (case sensitive).

If you still had problems let me know.

Thanks
Nassero

roni schuetz said...

great post - thanks for sharing