LISTSERV 16.5 - CODE4LIB Archives

On 11/20/13 11:40 AM, Scott Prater wrote:
> Not sure what the details of our issue was on Monday -- but we do have
> records that are supposedly encoded in UTF-8, but nonetheless contain
> invalid characters.

Oh, and to clarify, in case you haven't figured it out already: if those are 
ISO 2709 binary records, you can ask the reader to handle invalid bytes in 
one of two ways:

# raise an exception on invalid bytes:
MARC::Reader.new("something.marc", :validate_encoding => true)

# replace invalid bytes with the unicode replacement char:
MARC::Reader.new("something.marc", :invalid => :replace)

Both options are already available in the current ruby-marc release.
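
If it helps to see the two behaviours without ruby-marc in the picture, here is a rough plain-Ruby sketch of the same ideas. This is not ruby-marc's implementation; the `check_utf8!` helper and the sample bytes are made up for illustration.

```ruby
# Plain-Ruby sketch, no ruby-marc required; the sample bytes are made up.
# \xE9 is a Latin-1 "e acute", which is not a valid byte sequence in UTF-8.
raw = "caf\xE9 au lait".dup.force_encoding("UTF-8")

# Roughly the "raise" idea: detect bad bytes and fail loudly.
# (check_utf8! is a hypothetical helper, not part of ruby-marc.)
def check_utf8!(str)
  unless str.valid_encoding?
    raise ArgumentError, "invalid byte sequence in #{str.encoding}"
  end
  str
end

# Roughly the "replace" idea: swap each bad byte for U+FFFD.
replaced = raw.scrub("\uFFFD")

puts raw.valid_encoding?
puts replaced.valid_encoding?
```

String#scrub needs Ruby 2.1 or later; on older rubies the usual workaround is a round trip through String#encode with :invalid => :replace.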

I would suggest one or the other. The default of leaving bad bytes in 
your Ruby strings is asking for trouble, and you probably don't want it; 
it was made the default only for backwards compatibility with older 
versions of ruby-marc. (See why I am reluctant to add another default 
that we think hardly anyone would actually want? :) )
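
To make "asking for trouble" concrete, here is a small plain-Ruby sketch (sample bytes made up, nothing ruby-marc specific). The string with bad bytes is created without complaint, and the error only surfaces later, when some other code needs real characters:

```ruby
# Plain Ruby, no ruby-marc needed; the sample bytes are made up.
title = "caf\xE9".dup.force_encoding("UTF-8")  # \xE9 is not valid UTF-8 here

# The string itself exists happily; only this check reveals the problem.
puts title.valid_encoding?

# The failure shows up far from where the record was read,
# for example when transcoding the value for output:
error = begin
  title.encode("UTF-16LE")
  nil
rescue Encoding::InvalidByteSequenceError => e
  e
end

puts error.class
```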

Oh, and you may also want to explicitly specify the expected encoding to 
avoid confusion:

MARC::Reader.new("something.marc", :external_encoding => "UTF-8", 
                 :validate_encoding => true)

(This also works with any other encoding that Ruby recognizes, for those 
with legacy, possibly international, data.)
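
For the legacy-data case, this is roughly what "an encoding recognized by ruby" buys you, sketched in plain Ruby. The bytes and the choice of ISO-8859-1 are assumptions for illustration, and how ruby-marc handles this internally is not shown here:

```ruby
# Plain Ruby sketch; the bytes and the ISO-8859-1 choice are illustrative.
latin1_bytes = "caf\xE9".b                     # raw bytes as read from disk

# Ruby knows this encoding by name, so it is a candidate external encoding.
legacy = Encoding.find("ISO-8859-1")

# Tag the bytes with their real encoding, then transcode to UTF-8.
as_utf8 = latin1_bytes.force_encoding(legacy).encode("UTF-8")

puts as_utf8.encoding
puts as_utf8.valid_encoding?
```

The point of declaring the encoding up front is that the same \xE9 byte that is garbage under UTF-8 is a perfectly good character once it is labelled correctly.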

This stuff is confusing to explain; there are so many permutations and 
combinations of circumstances involved.  But I'll try to improve the 
ruby-marc docs on this, as part of adding yet more options for 
MARC8 handling.