If you're not averse to Java, the XOM XML library has a nice NodeFactory class that you can subclass to control how the XML document is processed. For instance, it will let you parse a very large XML document like

<root>
  <rec></rec>
  <rec></rec>
  ...
</root>

while keeping only one <rec> in memory at a time. You control the node-building process, so you can throw away the ones you're done with. It's friendlier than SAX, and it's what I use for processing very large documents.

Cf. http://www.xom.nu/apidocs/nu/xom/NodeFactory.html

Kevin

On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee <[log in to unmask]> wrote:
> I'm working on a script that needs to be able to crosswalk at least a
> couple hundred XML files regularly, some of which are quite large.
>
> I've thought of a number of ways to go about this, but I wanted to bounce
> this off the list since I'm sure people here deal with this problem all the
> time. My goal is to make something that's easy to read/maintain without
> pegging the CPU and consuming too much memory.
>
> The performance and load I'm seeing from running the files through LibXML
> and SimpleXML on the large files is completely unacceptable. SAX is not out
> of the question, but I'm trying to avoid it if possible to keep the code
> more compact and easier to read.
>
> I'm tempted to streamedit out all line breaks since they occur in
> unpredictable places and put new ones at the end of each record into a temp
> file. Then I can read the temp file one line at a time and process using
> SimpleXML. That way, there's no need to load giant files into memory,
> create huge arrays, etc and the code would be easy enough for a 6th grader
> to follow. My proposed method doesn't sound very efficient to me, but it
> should consume predictable resources which don't increase with file size.
>
> How do you guys deal with large XML files?
>
> Thanks,
>
> kyle
>
> <rant>Why the heck does the XML spec require a root element,
> particularly since large files usually consist of a large number of
> records/documents? This makes it absolutely impossible to process a file of
> any size without resorting to SAX or string parsing -- which takes away
> many of the advantages you'd normally have with an XML structure. </rant>
>
> --
> ----------------------------------------------------------
> Kyle Banerjee
> Digital Services Program Manager
> Orbis Cascade Alliance
> [log in to unmask] / 503.999.9787
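
For a concrete picture of the record-at-a-time approach recommended in the reply at the top of this message, here is a minimal sketch. XOM is a third-party jar, so to stay self-contained the sketch uses the JDK's built-in StAX pull parser (javax.xml.stream) instead of XOM's NodeFactory. The idea is the same (handle one <rec>, then let it go), but this is not XOM's API. The names RecordStreamer, processRecords, and readRecordText are hypothetical, and the <root>/<rec> layout is taken from the example document above.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams <rec> elements one at a time with StAX.
class RecordStreamer {

    /**
     * Passes the text content of each <rec> element to the handler as soon
     * as that record has been read, and returns the number of records seen.
     * Only the current record's text is held in memory.
     */
    static int processRecords(Reader in, Consumer<String> handler) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: the input needs no DTD processing, so turn it off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        int count = 0;
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "rec".equals(reader.getLocalName())) {
                        handler.accept(readRecordText(reader));
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped as unchecked to keep the sketch's call sites short.
            throw new IllegalStateException("Could not parse XML input", e);
        }
        return count;
    }

    /** Collects all character data up to the end tag matching the current start tag. */
    private static String readRecordText(XMLStreamReader reader) throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        int depth = 1; // the reader is positioned on the opening <rec>
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            } else if (event == XMLStreamConstants.CHARACTERS
                    || event == XMLStreamConstants.CDATA) {
                text.append(reader.getText());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<root><rec>one</rec><rec>two</rec></root>";
        int n = processRecords(new StringReader(xml),
                rec -> System.out.println("record: " + rec));
        System.out.println(n + " records");
    }
}
```

For a real file you would pass a buffered file reader instead of a StringReader. The handler is where the crosswalk logic would go, and memory use should then track the size of the largest record, not the size of the file.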