I have implemented fairly complete and robust support for character
encodings in ruby-marc when reading 'binary' MARC under ruby 1.9.
It's currently in a git branch, not yet released and not yet merged
into git master: https://github.com/ruby-marc/ruby-marc/tree/char_encodings
If anyone who uses ruby-marc (or doesn't) has a chance to beta test it,
it would be appreciated. One way to test: check the repository out with
git, switch to the 'char_encodings' branch, and run `rake install` to
install it as a gem on your system. These changes should _only_ affect
use under ruby 1.9, and only affect reading 'binary' (ISO 2709) MARC.
The new functionality is pretty extensively covered by automated tests,
but there are some weird and complex interactions that can occur
depending on exactly what you're doing, so bugs are possible. It was
somewhat more complicated than one might expect to implement a complete
solution here, in part because we _do_ have international users of
ruby-marc whose records are in encodings that are neither MARC8 nor
UTF8, and in fact are not MARC21 at all.
If any of the other committers (or anyone else) wants to do a code
review, you are welcome to.
POSSIBLE BACKWARDS INCOMPATIBILITY
Some previous 0.4.x versions, when running under ruby 1.9 only, would
automatically _transcode_ non-unicode encodings to UTF-8 for you under
the hood. The new version no longer does so automatically (although you
can ask it to). It was not tenable to keep supporting that in a
backwards-compatible way. Everything else _ought_ to be backwards
compatible with previous 0.4.x ruby-marc under ruby 1.9, while fixing
many problems.
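
To make the distinction concrete, here is a small sketch in plain Ruby
(standard library only, not ruby-marc code) of the difference between
merely _tagging_ a string with its encoding and actually _transcoding_
it. The sample bytes and the choice of Windows-1251 are my own
illustration, not something taken from the ruby-marc test suite.

```ruby
# Six bytes as they might sit in a file written in Windows-1251.
raw = "\xCF\xF0\xE8\xE2\xE5\xF2".dup.force_encoding("BINARY")

# Tagging: the bytes are untouched, only the encoding label changes.
tagged = raw.dup.force_encoding("Windows-1251")

# Transcoding: the bytes are rewritten into the target encoding.
transcoded = tagged.encode("UTF-8")

puts tagged.encoding       # Windows-1251
puts tagged.bytesize       # 6
puts transcoded.encoding   # UTF-8
puts transcoded.bytesize   # 12 (each Cyrillic letter is 2 bytes in UTF-8)
```

Tagging is cheap and leaves the data as it was on disk; transcoding
changes the bytes, which is why doing it silently can surprise people.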
NEW FEATURES
All applying to ruby 1.9 only, and to reading binary MARC only.
* Do a pretty good job of setting encodings properly for your ruby
environment, especially under standard UTF-8 usage.
* You _can_ and _do have to_ provide an argument for reading non-UTF8
encodings (but sadly there is no support for MARC8).
* You can ask MARC::Reader to transcode to a different encoding when
loading marc.
* You can ask MARC::Reader to replace bytes that are illegal in the
believed source encoding with a replacement character (or the empty
string) to avoid ruby "invalid UTF-8 byte" exceptions later, and
sanitize your input.
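
As a rough illustration of the last point, this is what replacing
illegal bytes looks like in plain Ruby. It is only a sketch of the idea,
not the code in MARC::Reader; the helper name `replace_invalid_bytes`
and the sample string are hypothetical.

```ruby
# Hypothetical helper: swap every byte that is illegal in the string's
# tagged encoding for a replacement string (which may be empty).
def replace_invalid_bytes(str, replacement = "?")
  str.chars.map { |ch| ch.valid_encoding? ? ch : replacement }.join
end

# A string tagged as UTF-8 containing one stray 0xFF byte.
dirty = "caf\xC3\xA9 \xFF ok".dup.force_encoding("UTF-8")
puts dirty.valid_encoding?            # false

clean = replace_invalid_bytes(dirty, "\uFFFD")
puts clean.valid_encoding?            # true

# Using the empty string simply drops the bad byte.
stripped = replace_invalid_bytes(dirty, "")
puts stripped.valid_encoding?         # true
```

Without a step like this, the bad byte stays in the string and tends to
surface later as an exception from some unrelated string operation.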
The new features are documented in inline comments; see:
http://rubydoc.info/github/ruby-marc/ruby-marc/MARC/Reader
I had trouble making the docs concise, sorry. I think I've been pounding
my head against this stuff for so long, realizing how complicated it
ends up being, that I wasn't sure what to leave out.