Jonathan -- thanks so much for tackling this monster.

So far I've only run the tests. They all pass under MRI 1.9, but they all
fail under JRuby 1.6.7 in 1.9 mode, and one (test_read_write(XMLTest))
fails in 1.8 mode.

I'll try to find some time to look at what you're doing that might be
MRI-specific.

On Thu, Apr 19, 2012 at 5:55 PM, Jonathan Rochkind wrote:

> I have implemented fairly complete and robust proper support for character
> encodings in ruby-marc when reading 'binary' marc under ruby 1.9.
> It's currently in a git branch, not yet released, and not yet in git
> master.
> ruby-marc/tree/char_encodings
> If anyone who uses this (or doesn't) has a chance to beta test it, it
> would be appreciated. One way to test: check out with git, switch to the
> 'char_encodings' branch, and `rake install` to install as a gem to your
> system.  These changes should _only_ affect use under ruby 1.9, and only
> affect reading in 'binary' (ISO 2709) marc.
> The new functionality is pretty extensively covered by automated tests,
> but there are some weird and complex interactions that can occur depending
> on exactly what you're doing, so bugs are possible. It was somewhat more
> complicated than one might expect to implement a complete solution here, in
> part because we _do_ have international users who use ruby-marc, with
> encodings that are neither MARC8 nor UTF8, and in fact non-MARC21.
> If any of the other committers (or anyone else) wants to code review, you
> are welcome to.
> Some previous 0.4.x versions, when running under ruby 1.9 only, would
> automatically _transcode_ non-unicode encodings to UTF-8 for you under the
> hood. The new version no longer does so automatically (although you can ask
> it to). It was not tenable to support that backwards compatibly.
> Everything else _ought_ to be backwards compatible with previous 0.4.x
> ruby-marc under ruby 1.9, while fixing many problems.
> The following all apply to ruby 1.9 only, and to reading binary MARC only:
> * Do a pretty good job of setting encodings properly for your ruby
> environment, especially under standard UTF-8 usage.
> * You _can_ and _do have to_ provide an argument for reading non-UTF8
> encodings (but sadly there is no support for marc8).
> * You can ask MARC::Reader to transcode to a different encoding when
> loading marc.
> * You can ask MARC::Reader to replace bytes that are illegal in the
> believed source encoding with a replacement character (or the empty string)
> to avoid ruby "invalid UTF-8 byte" exceptions later, and sanitize your
> input.
> New features are documented in inline comments; see at:
> I had trouble making the docs concise, sorry. I think I've been pounding
> my head against this stuff so much, realizing how complicated it ends up
> being, that I wasn't sure what to leave out.
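
For anyone following along, the difference between the bullet points above
(declaring a source encoding, transcoding, and replacing illegal bytes) can
be sketched with nothing but core Ruby String methods. This is only an
illustration of the underlying mechanics, not the code in the
char_encodings branch; the helper names tag_bytes, transcode and sanitize
are made up for this example, and String#scrub assumes Ruby 2.1 or later.

```ruby
# Illustration only: core Ruby String API, not ruby-marc's implementation.
# The three helper names below are hypothetical.

# Declare which encoding the raw bytes are believed to be in.
# Only the encoding tag changes; the bytes stay the same.
def tag_bytes(raw, believed)
  raw.dup.force_encoding(believed)
end

# Transcode a correctly tagged string to another encoding.
# Here the bytes themselves are rewritten.
def transcode(str, target = "UTF-8")
  str.encode(target)
end

# Replace bytes that are illegal in the string's tagged encoding
# (String#scrub, Ruby 2.1+). Pass "" to simply drop them.
def sanitize(str, replacement = "?")
  str.scrub(replacement)
end

# Sample input: "cafe" with an accented e, as Windows-1252 bytes,
# tagged as binary the way bytes read from a binary file would be.
raw   = "caf\xE9".dup.force_encoding("BINARY")
latin = tag_bytes(raw, "Windows-1252")
utf8  = transcode(latin)               # now a valid UTF-8 string

# A byte (0xFF) that can never occur in UTF-8:
bad = tag_bytes("abc\xFFdef".dup.force_encoding("BINARY"), "UTF-8")
bad.valid_encoding?                    # false until sanitized
clean = sanitize(bad, "")              # illegal byte dropped
```

The point of keeping the steps separate is that tagging never touches the
bytes, so a wrong guess at the source encoding only shows up later, when
something transcodes or validates the string.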

Bill Dueber
Library Systems Programmer
University of Michigan Library