Steve, I'm not sure if you were hoping for a Ruby-related answer to
your question (since you mentioned Nokogiri), but if you are, take a
look at ruby-marc's GenericPullParser [1] as an example of using a SAX
parser for this sort of thing. It doesn't quite answer your question,
but I think it might provide some guidance.

Basically, I think you're still going to have to use the SAX parser to
create record objects where you can build up your hierarchy logic, and
then simply move on to the next record if the conditions aren't met.

Even though you'd still need to build your objects, I think streaming
over the XML (and the constructed objects) will still be pretty fast
and efficient.
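
As a rough sketch of the idea (not how ruby-marc does it), one
alternative to a handful of booleans is to keep a stack of the
currently open elements and check the parent when the element you care
about opens. The example below uses Python's standard-library xml.sax
only so that it runs without extra dependencies; Nokogiri's SAX
document class has equivalent start_element, end_element and characters
callbacks. The element names, the sample document and the "type"
condition are simplified assumptions for illustration (no namespaces,
not real METS/MODS).

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect <title> text only when the parent is <titleInfo type=wanted_type>."""

    def __init__(self, wanted_type):
        super().__init__()
        self.wanted_type = wanted_type
        self.stack = []     # (name, attributes) for every currently open element
        self.buffer = None  # text chunks while inside a wanted <title>, else None
        self.titles = []

    def startElement(self, name, attrs):
        parent = self.stack[-1] if self.stack else None
        self.stack.append((name, dict(attrs.items())))
        # The nesting rule lives in one place: look at the parent on the stack.
        if (name == "title" and parent is not None
                and parent[0] == "titleInfo"
                and parent[1].get("type") == self.wanted_type):
            self.buffer = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "title" and self.buffer is not None:
            self.titles.append("".join(self.buffer).strip())
            self.buffer = None
        self.stack.pop()


def extract_titles(xml_text, wanted_type):
    """Stream over xml_text and return the matching title strings."""
    handler = TitleHandler(wanted_type)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


# Hypothetical, heavily simplified stand-in for a METS wrapper around MODS.
SAMPLE = """<mets>
  <dmdSec>
    <mods>
      <titleInfo type="alternative"><title>Alt title</title></titleInfo>
      <titleInfo type="uniform"><title>Main title</title></titleInfo>
    </mods>
  </dmdSec>
  <title>Not a MODS title</title>
</mets>"""

if __name__ == "__main__":
    print(extract_titles(SAMPLE, "uniform"))
```

Because the stack always reflects the current path, a deeper rule
(say, "only inside a particular dmdSec") becomes another check against
the stack rather than another flag to set and unset.
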

-Ross.

[1] https://github.com/ruby-marc/ruby-marc/blob/master/lib/marc/xml_parsers.rb#L27

On Fri, Jun 8, 2012 at 8:07 PM, Steve Meyer <[log in to unmask]> wrote:
> It is also worth noting that you can usually do SAX-style parsing in
> most XML parsing libraries that are normally associated with DOM style
> parsing and conveniences like XPath selectors. For example, Nokogiri
> does SAX and it is *very* fast:
>
> http://nokogiri.org/Nokogiri/XML/SAX/Document.html
>
> As a related question, when folks do SAX-style parsing and need to
> select highly conditional and deeply nested elements (think getting
> MODS title data only when a parent element's attribute matches a
> condition and it is all nested in a big METS wrapper), how are you
> keeping track of those nesting and conditional rules? I have relied on
> using a few booleans that get set and unset to track state, but it
> often feels sloppy.
>
> -steve
>
> On Fri, Jun 8, 2012 at 2:41 PM, Ethan Gruber <[log in to unmask]> wrote:
>> but I have gotten noticeably better
>> performance from Saxon/XSLT2 than PHP with DOMDocument or SimpleXML or
>> nokogiri and hpricot in Ruby.