Thursday, March 4, 2010

Working with large XML files in C# .NET

Working with large (huge) XML files is always a pain in the … The reason? These files can’t be loaded into memory. On my desktop, which has 2 GB of memory, I can’t even open such a file in Notepad. I was recently presented with the challenge of manipulating one such file, more than 550 MB in size. I know many will say they have seen bigger XML files than this. But the heart of the matter is that if I can’t open a 550+ MB file in Notepad or in XmlDocument in C#, then I can’t open anything bigger either, so the logic for handling such files stays the same.

The scenario: We have an XML file from which we want to remove a single node without removing its children. In the sample XML fragment below, the Features node has to be removed. Its child nodes must then be attached to the parent of the Features node.








One
Two

100.22
GoodDay



3
4
Five

200.09
CrackJack






Proposed Solution: To start with, I tried to work with XmlDocument, because it is the easiest way to manipulate XML data. But as I mentioned earlier, the XML file is too large (675 MB). When I pass the file to XmlDocument, the system stops responding for a couple of minutes and then throws an out-of-memory exception. So I switched to the XmlReader and XmlWriter classes provided by .NET. The XMLReaderRealScan1 method below reads the XML file with XmlReader and writes a new XML file with XmlWriter. XmlReader reads the file one node at a time. Along the way I drop the Features node and write its child nodes under its parent.
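To make the "one node at a time" point concrete, here is a minimal sketch of forward-only reading. It is only an illustration, not part of the original solution: the class name StreamingSketch and the method CountElements are made up for this example. It counts elements by name while the reader holds just the current node, which is why memory use stays flat however large the input is.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class StreamingSketch
{
    // Counts the elements named 'elementName' in the XML supplied by 'input'.
    // Only the current node is held in memory; no document tree is built.
    public static int CountElements(TextReader input, string elementName)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

For a file on disk, the same loop can be fed with File.OpenText(path), or the reader can be created directly from the path with XmlReader.Create(path).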



// Requires: using System.Xml;

/// <summary>
/// Creates an XmlReader over the large XML file and an XmlWriter for the new file that receives
/// the processed XML data, then loops through the reader and copies each node to the writer.
/// </summary>
private void XMLReaderRealScan1()
{
    // Use XmlReaderSettings to skip insignificant white space.
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.IgnoreWhitespace = true;
    //XmlWriterSettings xws = new XmlWriterSettings();
    //xws.Indent = true; // Indenting would increase the file size. It went beyond 2 gigs.

    // The using blocks close both files even if an exception is thrown.
    using (XmlReader xR = XmlReader.Create("D:\\xmlfile\\large.xml", settings))
    using (XmlWriter xW = XmlWriter.Create("D:\\xmlfile\\large_New.xml"))
    {
        // Read the XML file one node at a time.
        while (xR.Read())
        {
            // XmlReader does not hand out node objects, so we decide what to do based on
            // NodeType. The cases below are the node types that may occur in our XML file.
            switch (xR.NodeType)
            {
                case XmlNodeType.Element:
                    if (xR.Name == "Features")
                    {
                        // Do not write the Features node itself. Instead call another
                        // method that writes all its child nodes under the parent node.
                        RemoveFeaturesNode(xR.ReadSubtree(), xW);
                    }
                    else
                    {
                        // An empty element (<a/>) produces no EndElement node, so remember
                        // that now and close the element ourselves.
                        bool isEmpty = xR.IsEmptyElement;
                        // Write the start element to the new XML file.
                        xW.WriteStartElement(xR.Name);
                        // Copy all attributes; the reader moves back to the element afterwards.
                        xW.WriteAttributes(xR, false);
                        if (isEmpty)
                        {
                            xW.WriteEndElement();
                        }
                    }
                    break;
                case XmlNodeType.Text:
                    // Write the text inside the node.
                    xW.WriteString(xR.Value);
                    break;
                case XmlNodeType.CDATA:
                    // Copy CDATA sections instead of silently dropping them.
                    xW.WriteCData(xR.Value);
                    break;
                case XmlNodeType.ProcessingInstruction:
                    xW.WriteProcessingInstruction(xR.Name, xR.Value);
                    break;
                case XmlNodeType.Comment:
                    xW.WriteComment(xR.Value);
                    break;
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    xW.WriteWhitespace(xR.Value);
                    break;
                case XmlNodeType.EndElement:
                    // Do not write the end tag of the Features node to the processed file.
                    if (xR.Name != "Features")
                    {
                        xW.WriteEndElement();
                    }
                    break;
            }
        }
    }
}

/// <summary>
/// Writes all the child nodes of the Features node to the XML file using the XmlWriter.
/// </summary>
/// <param name="xmlrd">The reader returned by xR.ReadSubtree() for the Features node.</param>
/// <param name="xW">The XmlWriter that writes the new XML file.</param>
private void RemoveFeaturesNode(XmlReader xmlrd, XmlWriter xW)
{
    // Dispose the subtree reader once its nodes have been copied.
    using (xmlrd)
    {
        while (xmlrd.Read())
        {
            // The reader from xR.ReadSubtree() returns the Features node itself as well as
            // all of its child nodes, so the Features start and end tags must be skipped.
            if (xmlrd.Name != "Features")
            {
                switch (xmlrd.NodeType)
                {
                    case XmlNodeType.Element:
                        // An empty element (<a/>) produces no EndElement node, so close it here.
                        bool isEmpty = xmlrd.IsEmptyElement;
                        xW.WriteStartElement(xmlrd.Name);
                        // Copy all attributes; the reader moves back to the element afterwards.
                        xW.WriteAttributes(xmlrd, false);
                        if (isEmpty)
                        {
                            xW.WriteEndElement();
                        }
                        break;
                    case XmlNodeType.Text:
                        xW.WriteString(xmlrd.Value);
                        break;
                    case XmlNodeType.EndElement:
                        xW.WriteEndElement();
                        break;
                }
            }
        }
    }
}
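For experimenting without a 500+ MB file, the same idea can be tried on a small in-memory document. The following is a sketch under stated assumptions, not the code that was run on the original file: the class name UnwrapSketch and the method Unwrap are invented for this example, and instead of ReadSubtree it uses a simpler variation that skips the start and end tags of the target element and copies everything else. White space, comments and processing instructions are not copied in this sketch.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class UnwrapSketch
{
    // Copies XML from 'input' to 'output', leaving out the start and end tags of every
    // element named 'nodeName' so that its children end up under its parent.
    public static void Unwrap(TextReader input, TextWriter output, string nodeName)
    {
        // Assumption for the demo: no XML declaration, so the result is easy to compare.
        XmlWriterSettings writerSettings = new XmlWriterSettings();
        writerSettings.OmitXmlDeclaration = true;

        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output, writerSettings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.Name == nodeName)
                        {
                            break; // skip the wrapper's start tag, keep reading its children
                        }
                        // Empty elements (<a/>) produce no EndElement node.
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        if (isEmpty)
                        {
                            writer.WriteEndElement();
                        }
                        break;
                    case XmlNodeType.EndElement:
                        if (reader.Name != nodeName)
                        {
                            writer.WriteFullEndElement();
                        }
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                }
            }
        }
    }
}
```

Calling UnwrapSketch.Unwrap(new StringReader(xml), stringWriter, "Features") on a small test string is a quick way to check the behaviour before pointing the same loop at a large file through File.OpenText and File.CreateText.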


The Output: The output looks something like this. With this approach, we have made a one-time change to the large XML file without loading it into memory.







One
Two
100.22
GoodDay


3
4
Five
200.09
CrackJack








Feel free to contact me in case you need help.
-Vighnesh Bendre

3 comments:

Orlando Junior said...

Hi, Vighnesh.

I'm trying to process a large XML file with XPath.
Can you help me? How can I proceed with this?

Regards

Anonymous said...

Hi Vighnesh,

I have a large XML file and I want to write it into another XML file with some updates, i.e. by adding extra nodes. How can I proceed? My email is rekhaingulkar@gmail.com

Mark said...

Hello Vighnesh,
I have one small question, if u can help me...
I wonder why it is necessary to use the function 'RemoveFeaturesNode()'. I mean, when I find the node I want to remove, instead of calling RemoveFeaturesNode(), just do nothing, just don't write the current node to the XmlWriter.

Thanks.