On Thu, Mar 8, 2012 at 3:18 PM, Ed Summers <[log in to unmask]> wrote:
> Hi Terry,
>
> On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry <[log in to unmask]> wrote:
>
> This is one of the reasons you really can't trust the information found
> in position 9. This is one of the reasons why, when I wrote MarcEdit, I
> utilized a mixed process when working with data and determining character
> set -- a process that reads this byte and takes the information under
> advisement, but in the end treats it more as a suggestion and one part of
> a larger heuristic analysis of the record data to determine whether the
> information is in UTF-8 or not. Fortunately, determining whether a set of
> data is in UTF-8 or something else is a fairly easy process. Determining
> the something else is much more difficult, but generally not necessary.
>
> Can you describe in a bit more detail how MarcEdit sniffs the record
> to determine the encoding? This has come up enough times w/ pymarc to
> make it worth implementing.

One side comment here: while smart handling/automatic detection of
encodings would be a nice feature to have, it would help if pymarc could
operate in an 'agnostic', or 'raw', mode where it would simply preserve the
encoding that's there after a record has been read when writing the record.

[ Right now, pymarc does not have such a mode - if leader[9] == 'a', the
data is unconditionally utf8 encoded on output as per mbklein's patch. ]

 - Godmar
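For what it's worth, the "leader byte as suggestion, decode test as final word" approach Terry describes can be sketched in a few lines of Python. This is a hypothetical illustration, not MarcEdit's or pymarc's actual code; the function names are made up, and the key fact it relies on is simply that a strict UTF-8 decode fails on byte sequences that aren't valid UTF-8:

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the bytes are valid UTF-8.

    Note that pure-ASCII data is also valid UTF-8, so True means
    'safe to treat as UTF-8', not 'definitely authored as UTF-8'.
    """
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def sniff_marc_encoding(record_bytes: bytes) -> str:
    """Guess a MARC record's encoding (illustrative sketch only).

    Leader position 9 ('a' = UCS/Unicode per MARC 21) is taken under
    advisement, but the decode test has the final word.
    """
    leader_says_utf8 = len(record_bytes) > 9 and record_bytes[9:10] == b"a"
    if looks_like_utf8(record_bytes):
        return "utf-8"
    # Bytes are not valid UTF-8; pinning down the actual legacy
    # encoding (MARC-8, Latin-1, ...) is the hard part Terry mentions,
    # so we only flag the leader/content mismatch here.
    return "marc-8?" if leader_says_utf8 else "unknown-legacy"
```

A record whose leader claims UTF-8 but whose data fails the decode test would fall into the mismatch branch, which is exactly the untrustworthy-position-9 case being discussed.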