This is something I've dealt with, and for a variety of reasons we
went with a streaming parser. I'm not sure about the quality of your
data, but we have to be prepared for seriously messed-up data. There
was no way I was going to build a process that would try to load a
15-million-record file and then fail at record 14 million because of a
syntax or encoding error. Nope, nope, nope. So not only do we use a
streaming parser, it's a two-stage streaming parser: the first stage
finds record boundaries and produces a well-formed version of each
record, and then the parser for the actual record is called to extract
the data for crosswalking.
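
To make that concrete, the shape of it is roughly this (an untested PHP
sketch, since you mention SimpleXML; it assumes records are non-nesting,
unprefixed <record> elements, which you'd adjust for your own data):

<?php
// Stage 1: scan the raw stream for record boundaries without ever
// parsing the whole file, and hand each record off as its own little
// well-formed document.
function stream_records($path, $handle)
{
    $in = fopen($path, 'r');
    $buffer = '';
    while (!feof($in)) {
        $buffer .= fread($in, 8192);
        // Pull every complete <record>...</record> out of the buffer.
        // If namespaces are declared on the wrapper element, this is
        // also where you'd re-declare them on the fragment.
        while (preg_match('~<record[\s>].*?</record>~s', $buffer, $m,
                          PREG_OFFSET_CAPTURE)) {
            $record = $m[0][0];
            $buffer = substr($buffer, $m[0][1] + strlen($record));
            $handle($record);
        }
    }
    fclose($in);
}

// Stage 2: parse one record at a time and crosswalk it. A record with
// a syntax or encoding problem only kills itself, not the rest.
stream_records('huge.xml', function ($xml) {
    $sxe = @simplexml_load_string($xml);
    if ($sxe === false) {
        error_log('skipping malformed record');
        return;
    }
    // ... crosswalk $sxe here ...
});

Memory use stays flat no matter how big the file gets, and a bad record
costs you exactly one record.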

/dev

--
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm

On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee <[log in to unmask]> wrote:
> I'm working on a script that needs to be able to crosswalk at least a
> couple hundred XML files regularly, some of which are quite large.
>
> I've thought of a number of ways to go about this, but I wanted to bounce
> this off the list since I'm sure people here deal with this problem all the
> time. My goal is to make something that's easy to read/maintain without
> pegging the CPU and consuming too much memory.
>
> The performance and load I'm seeing when running the large files through
> LibXML and SimpleXML are completely unacceptable. SAX is not out of the
> question, but I'm trying to avoid it if possible to keep the code more
> compact and easier to read.
>
> I'm tempted to stream-edit out all the line breaks, since they occur in
> unpredictable places, and write new ones at the end of each record into a
> temp file. Then I can read the temp file one line at a time and process
> each line with SimpleXML. That way there's no need to load giant files into
> memory, create huge arrays, etc., and the code would be easy enough for a
> 6th grader to follow. My proposed method doesn't sound very efficient to
> me, but it should consume predictable resources that don't increase with
> file size.
>
> How do you guys deal with large XML files? Thanks,
>
> kyle
>
> <rant>Why the heck does the XML spec require a root element,
> particularly since large files usually consist of a large number of
> records/documents? This makes it absolutely impossible to process a file of
> any size without resorting to SAX or string parsing -- which takes away
> many of the advantages you'd normally have with an XML structure. </rant>
>
> --
> ----------------------------------------------------------
> Kyle Banerjee
> Digital Services Program Manager
> Orbis Cascade Alliance
> [log in to unmask] / 503.999.9787


