LISTSERV 16.5 - CODE4LIB Archives

I have implemented fairly complete and robust support for 
character encodings in ruby-marc when reading 'binary' MARC under ruby 1.9.

It's currently in a git branch, not yet released, and not yet in git 
master. https://github.com/ruby-marc/ruby-marc/tree/char_encodings

If anyone who uses this (or doesn't) has a chance to beta test it, it 
would be appreciated. One way to test: check out the repository with git, 
switch to the 'char_encodings' branch, and `rake install` to install it as 
a gem on your system.  These changes should _only_ affect use under ruby 
1.9, and only affect reading 'binary' (ISO 2709) MARC.

The new functionality is pretty extensively covered by automated tests, 
but there are some weird and complex interactions that can occur 
depending on exactly what you're doing, so bugs are possible. It was 
somewhat more complicated than one might expect to implement a complete 
solution here, in part because we _do_ have international users who use 
ruby-marc, with encodings that are neither MARC8 nor UTF8, and in fact 
non-MARC21.

If any of the other committers (or anyone else) wants to code review, 
you are welcome to.

POSSIBLE BACKWARDS INCOMPAT

Some previous 0.4.x versions, when running under ruby 1.9 only, would 
automatically _transcode_ non-unicode encodings to UTF-8 for you under 
the hood. The new version no longer does so automatically (although you 
can ask it to). It was not tenable to support that backwards compatibly.
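
To make the distinction concrete, here is a minimal sketch in plain Ruby (no ruby-marc involved; the variable names are just for illustration) of the difference between _tagging_ bytes with an encoding and _transcoding_ them. It assumes the source bytes are ISO-8859-1, which is only an example encoding:

```ruby
# One byte, 0xE9, which is "e with acute accent" in ISO-8859-1.
# Array#pack gives us a binary (ASCII-8BIT) string, like raw bytes off disk.
raw = [0xE9].pack("C*")

# Tagging: tell ruby what encoding the bytes are already in.
# The bytes themselves do not change.
tagged = raw.dup.force_encoding("ISO-8859-1")

# Transcoding: rewrite the bytes so the same character is
# represented in a different encoding.
transcoded = tagged.encode("UTF-8")

puts tagged.encoding              # ISO-8859-1
puts tagged.bytes.to_a.inspect    # [233]
puts transcoded.encoding          # UTF-8
puts transcoded.bytes.to_a.inspect # [195, 169]
```

Tagging is what the new version does by default; the second step is the part you now have to ask for.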

Everything else _ought_ to be backwards compatible with previous 0.4.x 
ruby-marc under ruby 1.9, while fixing many problems.

NEW FEATURES

All of these apply to ruby 1.9 only, and to reading binary MARC only.

* Do a pretty good job of setting encodings properly for your ruby 
environment, especially under standard UTF-8 usage.

* You _can_ and _do have to_ provide an argument for reading non-UTF8 
encodings (but sadly there is no support for MARC8).

* You can ask MARC::Reader to transcode to a different encoding when 
loading marc.

* You can ask MARC::Reader to replace bytes that are illegal in the 
believed source encoding with a replacement character (or the empty 
string) to avoid ruby "invalid UTF-8 byte" exceptions later, and 
sanitize your input.
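
As a rough illustration of what the last item means, here is a sketch using only plain Ruby String methods, not the MARC::Reader options themselves (those are in the docs linked below). Note that it assumes `String#scrub`, which only exists in ruby 2.1 and later, so it is newer than the 1.9 discussed here; the variable names are made up for the example:

```ruby
# "a", then a stray 0xFF byte (never legal in UTF-8), then "b",
# tagged as UTF-8 the way mis-labelled input would be.
dirty = [0x61, 0xFF, 0x62].pack("C*").force_encoding("UTF-8")

puts dirty.valid_encoding?   # false

# Replace each illegal byte with a visible marker...
replaced = dirty.scrub("?")

# ...or with the empty string, dropping it entirely.
dropped = dirty.scrub("")

puts replaced                  # a?b
puts dropped                   # ab
puts replaced.valid_encoding?  # true
```

Either way you end up with a string that is legal in its tagged encoding, so later string operations won't blow up on it.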

The new features are documented in inline comments; see:
http://rubydoc.info/github/ruby-marc/ruby-marc/MARC/Reader

I had trouble making the docs concise, sorry. I think I've been pounding 
my head against this stuff so much, realizing how complicated it ends up 
being, that I wasn't sure what to leave out.