
Working with large XML files in C# .NET

Working with large (huge) XML files is always a pain in the … The reason? These files can't be loaded into memory. On my desktop, which has 2 GB of memory, I can't even open the file in Notepad. I was recently presented with the challenge of manipulating one such large XML file, which was 550+ MB. I know many will say they have seen bigger XML files than this. But the heart of the matter is that if I can't open a 550+ MB file in Notepad or in XmlDocument in C#, then I can't open any bigger file either, so the logic for working with such files stays the same.

The scenario: We have an XML file from which we want to remove a single node without removing its children. In the sample XML fragment below, the Features node has to be removed. Its child nodes must then be attached to the parent of the Features node.








One
Two

100.22
GoodDay



3
4
Five

200.09
CrackJack






Proposed Solution: To start with, I tried to work with XmlDocument, because it is the easiest way to manipulate XML data. But as I mentioned earlier, the XML file is too large (675 MB). When I tried to load the file into an XmlDocument object, the system stopped responding for a couple of minutes and then threw an out-of-memory exception. So I switched to the XmlReader and XmlWriter classes provided by .NET. The code below reads the XML file with XmlReader and writes a new XML file with XmlWriter. XmlReader reads the file one node at a time, so the whole document is never held in memory. Along the way I drop the Features node and write its child nodes under its parent.
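To make the "one node at a time" behaviour concrete, here is a minimal sketch that walks a tiny in-memory document and lists what XmlReader reports on each Read() call. The class name NodeWalkDemo, the method DescribeNodes and the Product and Name elements are hypothetical and used only for illustration; only Features comes from the code in this post.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeWalkDemo
{
    // Returns one entry per node the reader reports, e.g. "Element:Product".
    public static List<string> DescribeNodes(string xml)
    {
        var lines = new List<string>();
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            // Each Read() advances to the next node; earlier nodes are not kept.
            while (reader.Read())
            {
                // Text nodes have an empty Name, so show their Value instead.
                string label = reader.NodeType == XmlNodeType.Text ? reader.Value : reader.Name;
                lines.Add(reader.NodeType + ":" + label);
            }
        }
        return lines;
    }

    public static void Main()
    {
        // Hypothetical sample; the real file's element names may differ.
        string xml = "<Product><Features><Name>One</Name></Features></Product>";
        foreach (string line in DescribeNodes(xml))
            Console.WriteLine(line);
    }
}
```

For the sample above the reader reports the start of Product, Features and Name, then the text One, then the three end elements in reverse order. Start tags and end tags arrive as separate nodes, which is what makes it possible to drop a wrapper element while keeping its children.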



/// <summary>
/// Creates an XmlReader over the large XML file and an XmlWriter for the processed output,
/// then loops through the reader node by node and copies each node to the writer.
/// </summary>
private void XMLReaderRealScan1()
{
    // Use XmlReaderSettings to skip insignificant white space.
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.IgnoreWhitespace = true;
    //XmlWriterSettings xws = new XmlWriterSettings();
    //xws.Indent = true; //indenting would increase the file size. It went beyond 2 gigs.

    // The using blocks close both files even if an exception is thrown.
    using (XmlReader xR = XmlReader.Create("D:\\xmlfile\\large.xml", settings))
    using (XmlWriter xW = XmlWriter.Create("D:\\xmlfile\\large_New.xml"))
    {
        // Read the XML file one node at a time.
        while (xR.Read())
        {
            // XmlReader does not hand back a node object, so decide what to do
            // from NodeType. The cases below are the node types used in our file.
            switch (xR.NodeType)
            {
                case XmlNodeType.Element:
                    if (xR.Name == "Features")
                    {
                        // Do not write the Features node itself. Instead call another
                        // method that writes all of its children under the parent node.
                        RemoveFeaturesNode(xR.ReadSubtree(), xW);
                    }
                    else
                    {
                        // An empty element such as <a/> never produces an EndElement
                        // node, so remember that and close it right away.
                        bool isEmpty = xR.IsEmptyElement;
                        // Write the start element, keeping its prefix and namespace.
                        xW.WriteStartElement(xR.Prefix, xR.LocalName, xR.NamespaceURI);
                        // Copy all attributes; the reader ends up back on the element.
                        xW.WriteAttributes(xR, false);
                        if (isEmpty) xW.WriteEndElement();
                    }
                    break;
                case XmlNodeType.Text:
                    // Write the text in the node.
                    xW.WriteString(xR.Value);
                    break;
                case XmlNodeType.CDATA:
                    // Copy CDATA sections instead of silently dropping them.
                    xW.WriteCData(xR.Value);
                    break;
                case XmlNodeType.ProcessingInstruction:
                    xW.WriteProcessingInstruction(xR.Name, xR.Value);
                    break;
                case XmlNodeType.Comment:
                    xW.WriteComment(xR.Value);
                    break;
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    xW.WriteWhitespace(xR.Value);
                    break;
                case XmlNodeType.EndElement:
                    // Do not write the end tag of the Features node.
                    if (xR.Name != "Features")
                    {
                        xW.WriteEndElement();
                    }
                    break;
            }
        }
    }
}

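/// <summary>
/// Writes all the child nodes of the Features node to the new XML file using the XmlWriter.
/// </summary>
/// <param name="xmlrd">The reader returned by XmlReader.ReadSubtree() for the Features node.</param>
/// <param name="xW">The XmlWriter that writes the new XML file.</param>
private void RemoveFeaturesNode(XmlReader xmlrd, XmlWriter xW)
{
    while (xmlrd.Read())
    {
        // xR.ReadSubtree() (passed to this method as xmlrd) returns the Features node
        // as well as all of its children, so the Features node itself has to be skipped.
        if (xmlrd.Name != "Features")
        {
            switch (xmlrd.NodeType)
            {
                case XmlNodeType.Element:
                    // An empty element such as <a/> never produces an EndElement
                    // node, so remember that and close it right away.
                    bool isEmpty = xmlrd.IsEmptyElement;
                    xW.WriteStartElement(xmlrd.Prefix, xmlrd.LocalName, xmlrd.NamespaceURI);
                    // Copy all attributes; the reader ends up back on the element.
                    xW.WriteAttributes(xmlrd, false);
                    if (isEmpty) xW.WriteEndElement();
                    break;
                case XmlNodeType.Text:
                    xW.WriteString(xmlrd.Value);
                    break;
                case XmlNodeType.EndElement:
                    xW.WriteEndElement();
                    break;
            }
        }
    }
    // Close the subtree reader before the caller continues with the original reader.
    xmlrd.Close();
}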
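
A possible simplification, offered as a sketch and not as the method used above: because the start tag and the end tag of an element arrive as separate nodes, a single loop that skips both tags of the unwanted element also keeps its children, without a separate subtree reader. The class name UnwrapDemo, the method Unwrap and the element names other than Features are hypothetical, and the sketch works on an in-memory string only to stay self-contained. It handles only elements, attributes and text.

```csharp
using System;
using System.IO;
using System.Xml;

public static class UnwrapDemo
{
    // Copies xml to a string, dropping the start and end tags of every element
    // called elementName while keeping everything nested inside it.
    public static string Unwrap(string xml, string elementName)
    {
        var output = new StringWriter();
        var readerSettings = new XmlReaderSettings { IgnoreWhitespace = true };
        var writerSettings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), readerSettings))
        using (XmlWriter writer = XmlWriter.Create(output, writerSettings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.Name == elementName) break; // skip the wrapper's start tag
                        bool isEmpty = reader.IsEmptyElement;  // capture before copying attributes
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        if (isEmpty) writer.WriteEndElement(); // <a/> has no EndElement node
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        // Skip the wrapper's end tag, write every other end tag.
                        if (reader.Name != elementName) writer.WriteEndElement();
                        break;
                }
            }
        }
        return output.ToString();
    }

    public static void Main()
    {
        // Hypothetical sample; only the name Features comes from the post.
        string xml = "<Product><Features><Name>One</Name><Price>100.22</Price></Features></Product>";
        Console.WriteLine(Unwrap(xml, "Features"));
    }
}
```

For a file on disk, the same loop can be used with XmlReader.Create and XmlWriter.Create given file paths in place of the StringReader and StringWriter.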


The Output: The output looks something like this. With this approach, the one-time change to the large XML file is made without loading the whole file into memory.







One
Two
100.22
GoodDay


3
4
Five
200.09
CrackJack








Feel free to contact me in case you need help.
-Vighnesh Bendre

Comments

Anonymous said…
Hi, Vighnesh.

I'm trying to process a large XML file with XPath.
Can you help me? How can I proceed with this?

Regards
Anonymous said…
Hi Vighnesh,

I have a large XML file and I want to write it into another XML file with some updates, i.e. by adding extra nodes. How can I proceed? My email is rekhaingulkar@gmail.com
Mark said…
Hello Vighnesh,
I have one small question, if you can help me...
I wonder why it is necessary to use the function 'RemoveFeaturesNode()'. I mean, when I find the node I want to remove, instead of calling RemoveFeaturesNode(), I could just do nothing and simply not write the current node to the XmlWriter.

Thanks.
