On 11/20/2013 11:18 AM, Jonathan Rochkind wrote:
> On 11/20/13 11:40 AM, Scott Prater wrote:

> I would suggest one or the other -- the default of leaving bad bytes in
> your ruby strings is asking for trouble, and you probably don't want to
> do it, but was made the default for backwards compat reasons with older
> versions of ruby-marc. (See why I am reluctant to add another default
> that we don't think hardly anyone would actually want? :) )

Thanks for your usage suggestions and work on this, Jonathan.  I work 
mostly with marc4j, not ruby-marc, so I'm pretty unfamiliar with the 
capabilities of the gem.  My comments are more oriented towards general 
error handling when processing MARC streams.

I think the issue comes down to a distinction between a stream and a 
record.  Ideally, the ruby-marc library would keep pointers to which 
record it is in, where the record begins, and where the record ends in 
the stream.  If a valid header and end-of-record delimiter are in place, 
then the library should be able to reject the record if it contains 
garbage in between those two points, without compromising the integrity 
of the entire stream.  So my final output would not contain bad data; 
it would simply be missing some records, records that contained bad data.
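A rough sketch of that framing idea, independent of ruby-marc. It relies only on the ISO 2709 layout (a 24-byte leader whose first five bytes give the record length, and a 0x1D end-of-record byte). The names each_raw_record, clean? and fake_record are hypothetical helpers made up for illustration, and the UTF-8 validity test is just a stand-in for whatever "garbage" check you would really apply:

```ruby
require 'stringio'

END_OF_RECORD = "\x1D".b  # ISO 2709 record terminator
LEADER_LENGTH = 24        # fixed-size leader; bytes 0-4 hold the record length

# Hypothetical helper (not ruby-marc API): yields [offset, raw_bytes] for each
# record whose leader length and end-of-record delimiter agree. Raises when
# framing is lost, i.e. the next record can't be found.
def each_raw_record(io)
  return enum_for(:each_raw_record, io) unless block_given?

  offset = 0
  until io.eof?
    leader = io.read(LEADER_LENGTH).to_s
    length = leader[0, 5].to_s
    unless leader.bytesize == LEADER_LENGTH && length.match?(/\A\d{5}\z/) &&
           length.to_i > LEADER_LENGTH
      raise IOError, "no valid leader at byte #{offset}"
    end

    raw = leader + io.read(length.to_i - LEADER_LENGTH).to_s
    unless raw.bytesize == length.to_i && raw.end_with?(END_OF_RECORD)
      raise IOError, "no end-of-record delimiter for record at byte #{offset}"
    end

    yield [offset, raw]
    offset += raw.bytesize
  end
end

# Stand-in content check: "garbage" here just means bytes that are not valid UTF-8.
def clean?(raw)
  raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Builds a toy record for the demo: length plus padding as the leader, then payload.
def fake_record(payload)
  body = payload.b + END_OF_RECORD
  (format('%05d', LEADER_LENGTH + body.bytesize) + ' ' * (LEADER_LENGTH - 5)).b + body
end

stream = StringIO.new(fake_record('first') +
                      fake_record("bad \xFF\xFE bytes") +
                      fake_record('third'))

# The middle record is rejected, but its framing is intact, so the third
# record is still found.
kept, rejected = each_raw_record(stream).partition { |_offset, raw| clean?(raw) }
```

Because the offsets come from the leader rather than from the decoded content, a record full of bad bytes can be dropped without losing track of where the next one starts.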

Here's some (partial) pseudo ruby code of how I'd like to handle it:

require 'marc'

count = 0
reader = MARC::Reader.new('marc8.dat')
writer = MARC::XMLWriter.new('marc-utf8.xml')
reader.each do |record|
   count += 1
   begin
      # convert_to_utf is pseudocode, not an existing ruby-marc method
      utf8rec = record.convert_to_utf
      writer.write(utf8rec)
   rescue => exception
      warn "Skipping record #{count}: #{exception.message}"
   end
   # ... now read the next record ...
end
writer.close

This example doesn't capture the exception raised when the next record 
can't be retrieved because the stream is corrupt, but that would be the 
other addition I'd make.  The larger point is that reading a MARC stream 
should be handled as reading a sequence of MARC records encoded in that 
stream -- one bad record does not automatically invalidate the entire 
stream; it only invalidates it if the next record can't be found.
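A minimal sketch of how the two failure modes could be kept apart in plain Ruby. The helper process_stream is hypothetical, not part of ruby-marc; it takes any Enumerator, so for a real reader something like reader.to_enum(:each) would be the assumed way to wire it up:

```ruby
# Hypothetical helper: pulls records one at a time so that a failure to
# FETCH the next record (stream-level) is handled separately from a failure
# to PROCESS a record (record-level).
# Returns [processed_count, skipped, stream_error].
def process_stream(records)
  processed = 0
  skipped = []
  count = 0

  loop do
    begin
      record = records.next
    rescue StopIteration
      break                          # clean end of stream
    rescue StandardError => e
      return [processed, skipped, e] # next record can't be found: stop here
    end

    count += 1
    begin
      yield record
      processed += 1
    rescue StandardError => e
      skipped << [count, e.message]  # bad record: log it and keep going
    end
  end

  [processed, skipped, nil]
end

# Toy stream: three records, then the stream itself breaks.
source = Enumerator.new do |y|
  y << 'good'
  y << 'bad'
  y << 'good'
  raise IOError, 'stream corrupt'
end

processed, skipped, stream_error = process_stream(source) do |record|
  raise ArgumentError, 'bad bytes' if record == 'bad'
end
```

Only the stream-level error ends the run; the bad record in the middle is recorded in skipped and the records after it are still processed.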

-- Scott

-- 
Scott Prater
Shared Development Group
General Library System
University of Wisconsin - Madison
5-5415