When I need to deal with huge XML files, I use Perl's XML::Parser in
"stream" mode. It's blazing fast, but I have to admit, the code isn't very
pretty.
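
For what it's worth, the skeleton usually looks something like this (a
from-memory, untested sketch; the "record" element and the file name are
just placeholders):

    use strict;
    use warnings;
    use XML::Parser;

    my $text = '';

    # Handlers fire as the parser streams through the file, so memory
    # use stays flat no matter how large the input gets.
    my $parser = XML::Parser->new(
        Handlers => {
            Start => \&on_start,
            Char  => \&on_char,
            End   => \&on_end,
        },
    );
    $parser->parsefile('big.xml');

    sub on_start {
        my ($expat, $element, %attrs) = @_;
        $text = '' if $element eq 'record';    # reset per record
    }

    sub on_char {
        my ($expat, $string) = @_;
        $text .= $string;                      # accumulate text content
    }

    sub on_end {
        my ($expat, $element) = @_;
        crosswalk($text) if $element eq 'record';
    }

    sub crosswalk {
        my ($record_text) = @_;
        # transform one record's worth of data here
    }

Fast, but the logic for a single record ends up smeared across three
callbacks, which is exactly why it never reads well.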

There's also XML::LibXML::SAX <http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/SAX.pod>,
but I can't seem to find any substantive documentation on how this works.
(If anyone has any sample code that uses this, I'd love to see it. Please
e-mail me off-list as I don't want to de-rail this thread.)
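
That said, it advertises the generic Perl SAX2 (XML::SAX) interface, so I'd
*expect* a handler to look roughly like this (untested, pieced together from
the XML::SAX conventions, so corrections welcome; "record" is a placeholder):

    package MyHandler;
    use strict;
    use warnings;
    use base 'XML::SAX::Base';   # supplies no-op defaults for all SAX events

    sub start_element {
        my ($self, $el) = @_;
        # $el->{Name} holds the element name; $el->{Attributes} is a hashref
        $self->{in_record} = 1 if $el->{Name} eq 'record';
    }

    sub characters {
        my ($self, $chars) = @_;
        $self->{text} .= $chars->{Data} if $self->{in_record};
    }

    sub end_element {
        my ($self, $el) = @_;
        if ($el->{Name} eq 'record') {
            # process $self->{text} here, then reset for the next record
            $self->{in_record} = 0;
            $self->{text}      = '';
        }
    }

    package main;
    use XML::LibXML::SAX;

    my $parser = XML::LibXML::SAX->new(Handler => MyHandler->new);
    $parser->parse_uri('big.xml');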

Teemu's suggestion about XML::LibXML::Reader is definitely worth
considering. I've never clocked it against XML::Parser, but it seems like
it *should* be fast. And as Teemu demonstrated, it allows you to write nice
compact code.
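
For reference, the pull-parsing pattern looks roughly like this (my own
untested sketch, not Teemu's code; "record" stands in for whatever the real
record element is):

    use strict;
    use warnings;
    use XML::LibXML::Reader;

    # The reader walks the document one node at a time, so only the
    # current record is ever held in memory.
    my $reader = XML::LibXML::Reader->new(location => 'big.xml')
        or die "cannot open big.xml";

    # nextElement() skips straight to each matching element.
    while ($reader->nextElement('record')) {
        # copyCurrentNode(1) returns a detached deep copy of the record,
        # so the usual DOM/XPath calls work on it.
        my $record = $reader->copyCurrentNode(1);
        # crosswalk $record here
    }

The appealing part is that copyCurrentNode() hands back a DOM fragment, so
you keep per-record DOM/XPath convenience without ever loading the whole
file.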

Ron




On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee <[log in to unmask]> wrote:

> I'm working on a script that needs to be able to crosswalk at least a
> couple hundred XML files regularly, some of which are quite large.
>
> I've thought of a number of ways to go about this, but I wanted to bounce
> this off the list since I'm sure people here deal with this problem all the
> time. My goal is to make something that's easy to read/maintain without
> pegging the CPU and consuming too much memory.
>
> The performance and load I'm seeing from running the large files through
> LibXML and SimpleXML are completely unacceptable. SAX is not out of the
> question, but I'm trying to avoid it if possible to keep the code more
> compact and easier to read.
>
> I'm tempted to stream-edit out all the line breaks (they occur in
> unpredictable places), write a new one at the end of each record, and send
> the result to a temp file. Then I can read the temp file one line at a
> time and process each record using SimpleXML. That way, there's no need to
> load giant files into memory, create huge arrays, etc., and the code would
> be easy enough for a 6th grader to follow. My proposed method doesn't
> sound very efficient to me, but it should consume predictable resources
> that don't increase with file size.
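>
> In Perl terms (an untested sketch, with XML::LibXML standing in for
> SimpleXML; "record" is a stand-in for the real record element), I mean
> something like:
>
>     use strict;
>     use warnings;
>     use XML::LibXML;
>
>     open my $in,  '<', 'big.xml'     or die $!;
>     open my $out, '>', 'records.tmp' or die $!;
>
>     # Pass 1: drop the unpredictable line breaks and write exactly one
>     # complete record per line of the temp file.
>     my $buf = '';
>     while (my $chunk = <$in>) {
>         $chunk =~ s/\r?\n//g;
>         $buf .= $chunk;
>         while ($buf =~ s{\A(.*?</record>)}{}s) {
>             print {$out} "$1\n";
>         }
>     }
>     close $out;
>
>     # Pass 2: one record per line, so resources stay predictable
>     # regardless of the size of the original file.
>     open my $tmp, '<', 'records.tmp' or die $!;
>     while (my $line = <$tmp>) {
>         $line =~ s/\A.*?(?=<record\b)//s;   # trim prolog/root open tag
>         next unless $line =~ /\A<record\b/;
>         my $doc = XML::LibXML->load_xml(string => $line);
>         # crosswalk $doc here
>     }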
>
> How do you guys deal with large XML files? Thanks,
>
> kyle
>
> <rant>Why the heck does the XML spec require a root element,
> particularly since large files usually consist of a large number of
> records/documents? This makes it absolutely impossible to process a file
> of any real size without resorting to SAX or string parsing -- which takes
> away many of the advantages you'd normally have with an XML structure. </rant>
>
> --
> ----------------------------------------------------------
> Kyle Banerjee
> Digital Services Program Manager
> Orbis Cascade Alliance
> [log in to unmask] / 503.999.9787
>