On 11/20/13 11:40 AM, Scott Prater wrote:
> Not sure what the details of our issue was on Monday -- but we do have
> records that are supposedly encoded in UTF-8, but nonetheless contain
> invalid characters.
Oh, and to clarify, in case you haven't figured it out already: if those
are ISO 2709 binary records, you can ask the reader to handle invalid
bytes differently (already available in the current ruby-marc release):
# raise:
MARC::Reader.new("something.marc", :validate_encoding => true)
# replace with unicode replacement char:
MARC::Reader.new("something.marc", :invalid => :replace)
I would suggest one or the other -- the default of leaving bad bytes in
your ruby strings is asking for trouble, and you probably don't want it,
but it was made the default for backwards compatibility with older
versions of ruby-marc. (See why I am reluctant to add another default
that hardly anyone would actually want? :) )
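
To make the difference between the two behaviors concrete, here is a rough plain-Ruby sketch of what "raise" versus "replace" means at the string level. It does not use ruby-marc at all and is not how the reader is implemented internally; `validate_bytes` and `replace_bad_bytes` are hypothetical helper names made up for illustration, and `String#scrub` assumes Ruby 2.1 or later.

```ruby
# Plain-Ruby illustration only; these helpers are hypothetical and are
# NOT part of ruby-marc.
REPLACEMENT = "\uFFFD" # Unicode replacement character

# "Raise" behavior: tag the bytes with the claimed encoding and fail
# loudly if they are not valid in it.
def validate_bytes(bytes, encoding = "UTF-8")
  str = bytes.dup.force_encoding(encoding)
  unless str.valid_encoding?
    raise ArgumentError, "invalid byte sequence in #{encoding}"
  end
  str
end

# "Replace" behavior: swap each invalid byte sequence for U+FFFD so the
# resulting string is always valid in the claimed encoding.
def replace_bad_bytes(bytes, encoding = "UTF-8")
  bytes.dup.force_encoding(encoding).scrub(REPLACEMENT)
end

# A Latin-1 e-acute byte (0xE9), which is not valid on its own in UTF-8.
bad = "caf\xE9 au lait".b

puts replace_bad_bytes(bad)

begin
  validate_bytes(bad)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Either way you end up with strings that are safe to concatenate, match against regexes, or serialize, which is exactly what the do-nothing default cannot promise.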
Oh, and you may also want to explicitly specify the expected encoding to
avoid confusion:
MARC::Reader.new("something.marc", :external_encoding => "UTF-8",
:validate_encoding => true)
(It will also work with any other encoding recognized by ruby, for those
with legacy, possibly international, data).
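
As a rough illustration of why declaring the encoding matters, here is a plain-Ruby sketch (again not using ruby-marc, and not a claim about what the reader does internally) showing the same bytes being invalid under one encoding label and perfectly fine under another. `to_utf8` is a hypothetical helper name.

```ruby
# Plain-Ruby illustration only; to_utf8 is a hypothetical helper and is
# NOT part of ruby-marc.

# Tag raw bytes with the encoding they were actually written in, then
# transcode them to UTF-8.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# "cafe" with an e-acute, as a legacy Latin-1 system would have written it.
latin1 = "caf\xE9".b

# Mislabeled as UTF-8, the bytes are invalid...
mislabeled_ok = latin1.dup.force_encoding("UTF-8").valid_encoding?

# ...but labeled correctly, they transcode cleanly.
converted = to_utf8(latin1, "ISO-8859-1")

puts mislabeled_ok
puts converted
```

The bytes never change; only the label does, which is why it is worth telling the reader up front what you expect.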
This stuff is confusing to explain; there are so many permutations and
combinations of circumstances involved. But I'll try to improve the
ruby-marc docs on this stuff, as part of adding yet more options for
MARC8 handling.