I'm working on a Perl-based OAI harvester and have run into a problem. The module that I'm using, Net::OAI::Harvester, does a great job of parsing out the different OAI tagged "fields" so that they can be put into a MySQL table of retrieved OAI records for searching.
Unfortunately, while using the University of Michigan OAI Toolkit, I found that at least one repository has repeated tags, in particular multiple identifier tags. This is a problem because Net::OAI::Harvester seems to return the first (and, as far as I know how to use it, only the first) instance of a tag. Besides the loss of data (which is always bad), it is made worse here by the fact that the repository I'm trying to harvest usually places the URL of the repository item in the second identifier tag. As a result, the URL does not get saved to the database and the harvest is less than useful to our users.
Does anyone know a way to use Net::OAI::Harvester with oai_dc records so that multiple instances of a tag are captured and then concatenated with the first one?
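To illustrate what I'm after, here is a minimal sketch of the end result I'd like. The record values and the join_values helper are made up for illustration; they are not part of Net::OAI::Harvester, and the hash only stands in for whatever structure the module could give me.

```perl
use strict;
use warnings;

# Hypothetical data: one oai_dc record with every repeated element kept.
# This is NOT real harvester output, just the shape I would like to get.
my %record = (
    title      => ['An example title'],
    identifier => ['oai:example.org:1234', 'http://example.org/items/1234'],
);

# Hypothetical helper: join all instances of an element into one string
# that can be stored in a single MySQL column.
sub join_values {
    my ($rec, $element, $sep) = @_;
    $sep = ' | ' unless defined $sep;
    # A missing element yields an empty string rather than an error.
    return join $sep, @{ $rec->{$element} || [] };
}

print join_values(\%record, 'identifier'), "\n";
```

With something like this, the second identifier (the URL) would end up in the database alongside the first.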
I have spent some time trying a number of different approaches, including different libraries (such as XML::LibXML and XML::SAX::Parser), but I can't seem to get them to work with the input I get inside the Net::OAI::Harvester module, which has been run through the Storable module.
Unfortunately, the documentation that I have been able to find on the Web does not provide information on any methods that I could use.
Would it make more sense just to move to the University of Michigan Toolkit to harvest the XML records? I would prefer to continue with the Net::OAI::Harvester module if I can, because it lets me be flexible in what sorts of schemas I can harvest, not just unqualified Dublin Core.
That being said, I do have one other question: is there a way within Net::OAI::Harvester to output the actual metadata structure that's being harvested?
Thanks in advance for any assistance that you can provide!