On 09.06.2012 00:00, Kyle Banerjee wrote:
>>
>> Since you mentioned SimpleXML, Kyle, I assume you're using PHP?
>>
>
> Actually I'm using perl. For reasons not related to XML parsing, it is the
> preferred (but not mandatory) language.
>
> Based on a few tests and manual inspection, it looks like the ticket for me
> is going to be a two-stage process where the first stage converts the file
> to valid XML and the second cuts through it with SAX.
>
> Originally, I was trying to avoid SAX, but the process has been prettier
> than expected so far. The XML has not been prettier than expected --
> it contains a number of issues including outright invalid XML, invalid
> characters, and hand coded HTML within some elements (i.e. string data not
> encoded as such). Gotta love library data. But screwed up stuff is
> employment security. If things actually worked, I'd be redundant...
>
> kyle

Since you're using Perl, I think you mean XML::Simple, which (much like a
DOM parser) reads the whole document into memory. You also mentioned LibXML
and are considering SAX parsing, so I assume you've only used DOM-style
parsing so far? How about using an XML reader instead? It is a pull parser:
kind of like SAX, but a whole lot cleaner and easier. Something like:

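    use XML::LibXML;
    use XML::LibXML::Reader;

    my $reader = XML::LibXML::Reader->new( location => $filename_or_uri )
        or die "cannot read $filename_or_uri\n";
    while ( $reader->read ) {
        # Skip everything except opening <record> tags
        next unless $reader->name eq 'record'
            && $reader->nodeType == XML_READER_TYPE_ELEMENT;
        # Re-parse the record's markup into a small standalone DOM
        my $dom = XML::LibXML->load_xml( string => $reader->readOuterXml );
        # ...do something with the record element's DOM tree...
    }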
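
If serializing each record and parsing it again feels wasteful, the reader
can also hand back the current subtree as a DOM node via copyCurrentNode(1).
A minimal sketch of that variation follows. The sample data and the <title>
child element are made up for illustration, so adjust the names to whatever
your records actually contain:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::Reader;

# Hypothetical sample input; real data would use location => $file instead.
my $xml = <<'XML';
<collection>
  <record><title>First</title></record>
  <record><title>Second</title></record>
</collection>
XML

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

my @titles;
while ( $reader->read ) {
    # Only act on opening <record> tags
    next unless $reader->name eq 'record'
        && $reader->nodeType == XML_READER_TYPE_ELEMENT;

    # Deep-copy the current subtree into a DOM node (1 = include children)
    my $record = $reader->copyCurrentNode(1);

    # Ordinary DOM/XPath calls work on the copy
    push @titles, $record->findvalue('title');
}

print "$_\n" for @titles;
```

Only one record's tree is held as a DOM at a time, so memory use should stay
roughly flat however large the file is.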

Documentation: https://metacpan.org/module/XML::LibXML::Reader

HTH
--
Teemu Nuutinen, Digital Services, Helsinki University Library