Working with large XML files in C# .NET

Working with large (huge) XML files is always a pain in the … The reason? These files can't be loaded into memory in one piece. On my desktop, which has 2 GB of memory, I can't even open such a file in Notepad. I was recently challenged to manipulate one such file, 550+ MB in size. I know many would say they have seen bigger XML files than this. But the heart of the matter is that if I can't open a 550+ MB file in Notepad or in XmlDocument in C#, then I can't open anything bigger either, and so the logic for handling such files stays the same.

The scenario: We have an XML file from which we want to remove a single node without removing its children. Here the node to remove is Features (the name used in the code below). Its child nodes must then be attached to its parent node.

Proposed Solution: To start with, I tried XmlDocument, because it is the easiest way to manipulate XML data. But as I mentioned earlier, the file is too large (675 MB). When I try to load it into an XmlDocument object, the system stops responding for a couple of minutes and then throws an out-of-memory exception. So I switched to the XmlReader and XmlWriter classes provided by .NET. The code below reads the XML file with XmlReader and writes a new XML file with XmlWriter. XmlReader reads the file one node at a time, so only a small part of it is in memory at any moment. In the process I drop the Features node and write its child nodes directly under its parent.


        /// <summary>
        /// Creates an XmlReader over the large XML file and an XmlWriter for the processed output,
        /// then loops through the reader and copies each node to the new file.
        /// </summary>
        private void XMLReaderRealScan1()
        {
            //use XmlReaderSettings to drop insignificant white space.
            XmlReaderSettings settings = new XmlReaderSettings();
            settings.IgnoreWhitespace = true;
            //XmlWriterSettings xws = new XmlWriterSettings();
            //xws.Indent = true; //indenting would increase the file size. It went beyond 2 gigs.

            //the using blocks close both files even if an exception is thrown part-way through.
            using (XmlReader xR = XmlReader.Create("D:\\xmlfile\\large.xml", settings))
            using (XmlWriter xW = XmlWriter.Create("D:\\xmlfile\\large_New.xml"))
            {
                //read the xml file one node at a time...
                while (xR.Read())
                {
                    //XmlReader only exposes the current node, so we decide what to do
                    //based on NodeType. These are the node types used in our xml file.
                    switch (xR.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (xR.Name == "Features")
                            {
                                //don't write the Features node itself. Instead call another
                                //method that writes all its children under the parent node.
                                using (XmlReader subtree = xR.ReadSubtree())
                                    RemoveFeaturesNode(subtree, xW);
                            }
                            else
                            {
                                //remember whether this is <x/>: no EndElement node follows it.
                                bool isEmpty = xR.IsEmptyElement;
                                //write the start element, keeping any namespace prefix.
                                xW.WriteStartElement(xR.Prefix, xR.LocalName, xR.NamespaceURI);
                                //copy all attributes; the reader returns to the element afterwards.
                                xW.WriteAttributes(xR, false);
                                if (isEmpty)
                                    xW.WriteEndElement();
                            }
                            break;
                        case XmlNodeType.Text:
                            //write the text in the node
                            xW.WriteString(xR.Value);
                            break;
                        case XmlNodeType.CDATA:
                            //keep CDATA sections instead of silently dropping them
                            xW.WriteCData(xR.Value);
                            break;
                        case XmlNodeType.ProcessingInstruction:
                            xW.WriteProcessingInstruction(xR.Name, xR.Value);
                            break;
                        case XmlNodeType.Comment:
                            xW.WriteComment(xR.Value);
                            break;
                        case XmlNodeType.Whitespace:
                        case XmlNodeType.SignificantWhitespace:
                            xW.WriteWhitespace(xR.Value);
                            break;
                        case XmlNodeType.EndElement:
                            //never write an end tag for Features, since its start tag was skipped.
                            if (xR.Name != "Features")
                            {
                                xW.WriteEndElement();
                            }
                            break;
                    }
                }
            }
        }

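        /// <summary>
        /// Writes all the child nodes of the Features node to the new xml file using XmlWriter,
        /// leaving out the Features node itself.
        /// </summary>
        /// <param name="xmlrd">The reader returned by XmlReader.ReadSubtree for the Features node.</param>
        /// <param name="xW">The XmlWriter that writes the new xml file.</param>
        private void RemoveFeaturesNode(XmlReader xmlrd, XmlWriter xW)
        {
            while (xmlrd.Read())
            {
                //xR.ReadSubtree (passed to this method as xmlrd) returns the Features node as well
                //as all its children, so we have to skip the Features start and end tags.
                if (xmlrd.Name != "Features")
                {
                    switch (xmlrd.NodeType)
                    {
                        case XmlNodeType.Element:
                            //remember whether this is <x/>: no EndElement node follows it.
                            bool isEmpty = xmlrd.IsEmptyElement;
                            xW.WriteStartElement(xmlrd.Prefix, xmlrd.LocalName, xmlrd.NamespaceURI);
                            //copy all attributes; the reader returns to the element afterwards.
                            xW.WriteAttributes(xmlrd, false);
                            if (isEmpty)
                                xW.WriteEndElement();
                            break;
                        case XmlNodeType.Text:
                            xW.WriteString(xmlrd.Value);
                            break;
                        case XmlNodeType.CDATA:
                            xW.WriteCData(xmlrd.Value);
                            break;
                        case XmlNodeType.EndElement:
                            xW.WriteEndElement();
                            break;
                    }
                }
            }
        }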

The Output: The new file is the same as the original, except that the Features node is gone and its child nodes sit directly under its parent. With this approach, a one-time change to the large XML file is made without loading the whole file into memory.
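To make the effect easier to see, here is a minimal sketch that runs the same streaming idea on a tiny in-memory string instead of a file. The class name UnwrapDemo, the method names, and the Product/Feature/Name element names are made up for illustration and are not from my real file. This version simply skips the start and end tags of the wrapper element rather than calling a separate helper, and it only handles elements, attributes, text and CDATA, so treat it as a starting point rather than a complete copier.

```csharp
using System;
using System.IO;
using System.Xml;

public static class UnwrapDemo
{
    // Copies nodes from reader to writer, dropping the start and end tags of
    // every element named elementName while keeping everything inside it.
    public static void CopyWithoutWrapper(XmlReader reader, XmlWriter writer, string elementName)
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (reader.Name == elementName)
                        break; // skip the wrapper's start tag, keep reading its children
                    // An empty element (<x/>) produces no EndElement node, so remember it now.
                    bool isEmpty = reader.IsEmptyElement;
                    writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                    writer.WriteAttributes(reader, false);
                    if (isEmpty)
                        writer.WriteEndElement();
                    break;
                case XmlNodeType.Text:
                    writer.WriteString(reader.Value);
                    break;
                case XmlNodeType.CDATA:
                    writer.WriteCData(reader.Value);
                    break;
                case XmlNodeType.EndElement:
                    if (reader.Name != elementName)
                        writer.WriteEndElement(); // skip the wrapper's end tag
                    break;
            }
        }
    }

    // Convenience wrapper for small strings; for a large file, create the
    // reader and writer over file paths instead, as in the code above.
    public static string Unwrap(string xml, string elementName)
    {
        StringWriter output = new StringWriter();
        XmlWriterSettings writerSettings = new XmlWriterSettings();
        writerSettings.OmitXmlDeclaration = true; // keeps the demo output short

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(output, writerSettings))
        {
            CopyWithoutWrapper(reader, writer, elementName);
        }
        return output.ToString();
    }
}
```

For example, calling UnwrapDemo.Unwrap with the input <Product><Features><Feature id="1">a</Feature><Feature id="2">b</Feature></Features><Name>x</Name></Product> and the element name "Features" should return <Product><Feature id="1">a</Feature><Feature id="2">b</Feature><Name>x</Name></Product>.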

Feel free to contact me in case you need help.
-Vighnesh Bendre


Anonymous said…

Hi, Vighnesh.

I'm trying to process a large XML file with XPath.
Can you help me? How can I proceed with this?

Regards

August 7, 2012 at 8:46 PM

Hi Vighnesh,

I have a large xml file and I want to write it into another xml file with some updates, i.e. by adding extra nodes. How can I proceed? My email is rekhaingulkar@gmail.com

May 19, 2014 at 10:53 AM

Mark said…

Hello Vighnesh,
I have one small question, if you can help me...
I wonder why it is necessary to use the function 'RemoveFeaturesNode()'. I mean, when I find the node I want to remove, instead of calling RemoveFeaturesNode(), I could just do nothing and simply not write the current node to the XmlWriter.

Thanks.

October 8, 2014 at 2:44 PM

Vighnesh Bendre - MS technologies
