On 09.06.2012 00:00, Kyle Banerjee wrote:
>>
>> Since you mentioned SimpleXML, Kyle, I assume you're using PHP?
>>
>
> Actually I'm using perl. For reasons not related to XML parsing, it is the
> preferred (but not mandatory) language.
>
> Based on a few tests and manual inspection, it looks like the ticket for me
> is going to have a two stage process where the first stage converts the file
> to valid XML and the second cuts through it with SAX.
>
> Originally, I was trying to avoid SAX, but the process has been prettier
> than expected so far. The XML has not been prettier than expected --
> it contains a number of issues including outright invalid XML, invalid
> characters, and hand coded HTML within some elements (i.e. string data not
> encoded as such). Gotta love library data. But screwed up stuff is
> employment security. If things actually worked, I'd be redundant...
>
> kyle

Since you're using Perl, I think you mean XML::Simple, which is a
tree-based parser: like a DOM parser, it loads the whole document into
memory. You also mentioned LibXML and are considering SAX parsing, so I
assume you've only used DOM-style parsing so far?

How about using an XML reader (a pull parser)? It's kind of like SAX, but
a whole lot cleaner and easier. Something like:

    use XML::LibXML;
    use XML::LibXML::Reader;

    my $reader = XML::LibXML::Reader->new( location => $filename_or_uri )
        or die "cannot read $filename_or_uri\n";
    while ( $reader->read ) {
        # Skip everything except the start of a <record> element.
        # nodeType is numeric, so compare it with ==, not eq.
        next unless $reader->nodeType == XML_READER_TYPE_ELEMENT
                 && $reader->name eq 'record';
        # Parse just this one record into its own small DOM tree.
        my $dom = XML::LibXML->load_xml( string => $reader->readOuterXml );
        # ...do something with the record element's DOM tree...
    }

This way only one record is held as a DOM tree at a time, instead of the
whole file.

Documentation: https://metacpan.org/module/XML::LibXML::Reader

HTH

-- 
Teemu Nuutinen, Digital Services, Helsinki University Library