Ed,
Sure -- but this is one part of a much larger process. MarcEdit has two MARC processing algorithms: a strict one, and a loose one that can handle data that would otherwise be invalid for most processors (this is done because, in the real world, vendors send bad records...often). Anyway, character encoding is actually one of the last things MarcEdit handles before writing the processed file to disk. The reason is that MarcEdit reads and interacts with MARC data at the byte level, so the characterset is pretty much meaningless for the vast majority of the work it does. When writing to disk, though, .NET requires the filestream to be set to the correct encoding; otherwise data can be flattened and diacritics lost.
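To illustrate that last point, here's a minimal sketch (not MarcEdit's code; the file names are placeholders, and the UTF8 check stands in for the function described next):

    using System.IO;
    using System.Text;

    byte[] record = File.ReadAllBytes("records.mrc");  // placeholder path
    bool isUtf8 = true;  // in practice, the result of the byte-level check below

    // MARC8 has no .NET Encoding class, so Latin-1 acts here as a
    // byte-preserving stand-in (an assumption of this sketch).
    Encoding enc = isUtf8 ? Encoding.UTF8 : Encoding.GetEncoding("ISO-8859-1");

    // If this encoding doesn't match the data, characters are silently
    // substituted on write -- that's how diacritics get flattened.
    using (var writer = new StreamWriter("out.mrc", false, enc))
        writer.Write(enc.GetString(record));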
Essentially, at that last step, the record is passed to a function called RecognizeUTF8 that takes a byte array. The program then enumerates the bytes to determine whether the record is recognizable as UTF8, using a process based loosely on some of the work done by the International Components for Unicode project (http://site.icu-project.org/) -- which has some incredible C libraries that do much more than you'd ever need. While those libraries don't work in C#, they demonstrate some well-known methods for evaluating byte-level data for code page detection.
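The core of that enumeration is straightforward to sketch. This is an illustration of the general technique, not MarcEdit's actual source, and it omits the overlong-sequence and surrogate checks a production validator would want:

    // Illustrative sketch of byte-level UTF8 recognition, in the spirit
    // of MarcEdit's RecognizeUTF8 (not the actual implementation).
    static bool RecognizeUTF8(byte[] data)
    {
        bool sawMultiByte = false;
        int i = 0;
        while (i < data.Length)
        {
            byte b = data[i];
            int trailing;
            if (b <= 0x7F) { i++; continue; }          // plain ASCII
            else if ((b & 0xE0) == 0xC0) trailing = 1; // 110xxxxx lead byte
            else if ((b & 0xF0) == 0xE0) trailing = 2; // 1110xxxx lead byte
            else if ((b & 0xF8) == 0xF0) trailing = 3; // 11110xxx lead byte
            else return false;                         // invalid lead byte
            if (i + trailing >= data.Length) return false;  // truncated sequence
            for (int j = 1; j <= trailing; j++)
                if ((data[i + j] & 0xC0) != 0x80)      // trailers must be 10xxxxxx
                    return false;
            sawMultiByte = true;
            i += trailing + 1;
        }
        return sawMultiByte;
    }

One design choice worth noting: pure ASCII is technically valid UTF8, so whether an all-ASCII record counts as "recognized" is a policy decision -- this sketch requires at least one well-formed multi-byte sequence.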
Of course, one area where I part ways with that approach is that I'm not interested in detecting other charactersets, and (in my opinion) MARC data with poorly coded UTF8 needs to be forced to render as MARC8 until the invalid characters are corrected. So, in my process, invalid UTF8 data flags the record and forces the output into the mnemonic data format I use for MARC8-encoded data.
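In outline, then, the last step amounts to this (hypothetical helper names; the {eacute}-style notation is the MARCMaker-style mnemonic format MarcEdit uses):

    if (RecognizeUTF8(record))
        WriteAsUTF8(record);            // leader/09 = 'a'
    else
        WriteAsMARC8Mnemonics(record);  // e.g. e-acute becomes {eacute}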
Does that make sense?
--TR
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Ed Summers
Sent: Thursday, March 08, 2012 12:19 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
Hi Terry,
On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry <[log in to unmask]> wrote:
> This is one of the reasons you really can't trust the information found in position 9. It's also why, when I wrote MarcEdit, I used a mixed process for determining characterset -- a process that reads this byte and takes the information under advisement, but in the end treats it more as a suggestion, one part of a larger heuristic analysis of the record data to determine whether the information is in UTF8 or not. Fortunately, determining whether a set of data is UTF8 or something else is a fairly easy process. Determining what that something else is, is much more difficult, but generally not necessary.
Can you describe in a bit more detail how MarcEdit sniffs the record to determine the encoding? This has come up enough times w/ pymarc to make it worth implementing.
//Ed