Jonathan -- thanks so much for tackling this monster.

I haven't done anything but run the tests, and while they all pass in MRI 1.9, they all fail under jruby 1.6.7 in 1.9 mode, and one fails (test_read_write(XMLTest)) in 1.8 mode. I'll try to find some time to look and see what you're doing that might be MRI specific.

On Thu, Apr 19, 2012 at 5:55 PM, Jonathan Rochkind wrote:

> I have implemented fairly complete and robust proper support for character
> encodings in ruby-marc when reading 'binary' marc under ruby 1.9.
>
> It's currently in a git branch, not yet released, and not yet in git
> master. https://github.com/ruby-marc/ruby-marc/tree/char_encodings
>
> If anyone who uses this (or doesn't) has a chance to beta test it, it
> would be appreciated. One way to test: check out with git, switch to the
> 'char_encodings' branch, and `rake install` to install as a gem to your
> system. These changes should _only_ affect use under ruby 1.9, and only
> affect reading in 'binary' (ISO 2709) marc.
>
> The new functionality is pretty extensively covered by automated tests,
> but there are some weird and complex interactions that can occur depending
> on exactly what you're doing, so bugs are possible. It was somewhat more
> complicated than one might expect to implement a complete solution here, in
> part because we _do_ have international users who use ruby-marc, with
> encodings that are neither MARC8 nor UTF8, and in fact non-MARC21.
>
> If any of the other committers (or anyone else) wants to code review, you
> are welcome to.
>
> POSSIBLE BACKWARDS INCOMPAT
>
> Some previous 0.4.x versions, when running under ruby 1.9 only, would
> automatically _transcode_ non-unicode encodings to UTF-8 for you under the
> hood. The new version no longer does so automatically (although you can ask
> it to). It was not tenable to support that backwards compatibly.
>
> Everything else _ought_ to be backwards compatible with previous 0.4.x
> ruby-marc under ruby 1.9, fixing many problems.
>
> NEW FEATURES
>
> All applying to ruby 1.9 only, and to reading binary MARC only.
>
> * Do a pretty good job of setting encodings properly for your ruby
> environment, especially under standard UTF-8 usage.
>
> * You _can_ and _do have to_ provide an argument for reading non-UTF8
> encodings. (but sadly no support for marc8).
>
> * You can ask MARC::Reader to transcode to a different encoding when
> loading marc.
>
> * You can ask MARC::Reader to replace bytes that are illegal in the
> believed source encoding with a replacement character (or the empty string)
> to avoid ruby "invalid UTF-8 byte" exceptions later, and sanitize your
> input.
>
> New features documented in inline comments, see at:
> http://rubydoc.info/github/ruby-marc/ruby-marc/MARC/Reader
>
> I had trouble making the docs concise, sorry, I think I've been pounding
> my head against this stuff so much realizing how complicated it ends up
> being that I wasn't sure what to leave out.
>

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library
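
For anyone unfamiliar with the Ruby 1.9 string primitives that the quoted feature list relies on, here is a minimal sketch of the three operations it mentions: tagging bytes with an encoding, transcoding, and replacing illegal bytes. It uses only core String methods (force_encoding, encode, valid_encoding?). The helper names tag_as, transcode and replace_invalid_utf8 are made up for illustration; this is not ruby-marc's code and does not show the MARC::Reader arguments, which are documented at the rubydoc link above.

```ruby
# Illustration only: core Ruby String methods (1.9+), not ruby-marc itself.
# All helper names here are hypothetical.

# Tagging: declare which encoding the bytes are in. No bytes are changed.
def tag_as(bytes, encoding)
  bytes.dup.force_encoding(encoding)
end

# Transcoding: convert the characters to a different encoding (bytes change).
def transcode(str, target_encoding)
  str.encode(target_encoding)
end

# Replacing illegal bytes in data believed to be UTF-8. The round trip
# through UTF-16BE forces Ruby to actually inspect every byte, a common
# workaround on rubies where UTF-8 to UTF-8 encoding is a no-op.
def replace_invalid_utf8(bytes, replacement = "?")
  tag_as(bytes, "UTF-8")
    .encode("UTF-16BE", :invalid => :replace, :replace => replacement)
    .encode("UTF-8")
end

# "cafe" with an e-acute, as ISO-8859-1 bytes straight off disk.
raw    = "caf\xE9".dup.force_encoding("BINARY")
latin1 = tag_as(raw, "ISO-8859-1")   # same 4 bytes, now interpreted as Latin-1
utf8   = transcode(latin1, "UTF-8")  # 5 bytes: the e-acute takes two in UTF-8

# 0xFF can never appear in valid UTF-8.
dirty   = "abc\xFFdef".dup.force_encoding("BINARY")
cleaned = replace_invalid_utf8(dirty)        # "abc?def"
dropped = replace_invalid_utf8(dirty, "")    # "abcdef"
```

The difference between the first two helpers is the backwards-incompatibility described in the quoted message: tagging only changes how Ruby interprets the bytes, while transcoding rewrites them.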