For what it's worth, my patch was a stopgap measure, and acknowledged as
such at the time. My proposal for a real, comprehensive solution was
detailed in a comment on a (now-closed) GitHub issue [1]. If I'd had the
time and the knowledge, I would have implemented it that way. If I'd had
the need, I would have made the time and gained the knowledge. As it was,
I submitted a patch to make Unicode handling (a) better than it was, and
(b) work as well as I needed it to.

[1] https://github.com/edsu/pymarc/issues/7#issuecomment-501460

On Thu, Mar 8, 2012 at 12:32 PM, Godmar Back wrote:

> On Thu, Mar 8, 2012 at 3:18 PM, Ed Summers wrote:
>
> > Hi Terry,
> >
> > On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry wrote:
> > > This is one of the reasons you really can't trust the information
> > > found in position 9. This is one of the reasons why when I wrote
> > > MarcEdit, I utilize a mixed process when working with data and
> > > determining characterset -- a process that reads this byte and takes
> > > the information under advisement, but in the end treats it more as a
> > > suggestion and one part of a larger heuristic analysis of the record
> > > data to determine whether the information is in UTF8 or not.
> > > Fortunately, determining if a set of data is in UTF8 or something
> > > else, is a fairly easy process. Determining the something else is
> > > much more difficult, but generally not necessary.
> >
> > Can you describe in a bit more detail how MARCEdit sniffs the record
> > to determine the encoding? This has come up enough times w/ pymarc to
> > make it worth implementing.
> >
>
> One side comment here; while smart handling/automatic detection of
> encodings would be a nice feature to have, it would help if pymarc could
> operate in an 'agnostic', or 'raw' mode where it would simply preserve the
> encoding that's there after a record has been read when writing the record.
>
> [ Right now, pymarc does not have such a mode - if leader[9] == 'a', the
> data is unconditionally utf8 encoded on output as per mbklein's patch. ]
>
>  - Godmar
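
For illustration only, here is a minimal sketch of the kind of check the
quoted thread describes: treat leader position 9 as a hint and let the
bytes themselves decide whether the record is UTF-8. This is not
MarcEdit's actual heuristic (the thread does not spell it out) and not
pymarc's API; the function names looks_like_utf8 and sniff_marc_encoding
are hypothetical, and the sketch assumes the raw record is available as a
bytes object with the 24-byte leader at the start.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def sniff_marc_encoding(record: bytes) -> str:
    """Guess "utf-8" or "other" for a raw MARC record (hypothetical helper).

    In MARC 21, leader position 9 is 'a' for UCS/Unicode and blank for
    MARC-8. Here it is only consulted when the bytes cannot settle the
    question on their own.
    """
    leader_says_unicode = record[9:10] == b"a"

    if all(byte < 0x80 for byte in record):
        # Only 7-bit bytes: valid UTF-8 either way, so the content gives
        # no evidence and the leader hint is all there is to go on.
        return "utf-8" if leader_says_unicode else "other"

    # Non-ASCII bytes present: a strict decode is strong evidence, since
    # text in a legacy 8-bit encoding rarely forms valid UTF-8 sequences.
    return "utf-8" if looks_like_utf8(record) else "other"
```

The sketch deliberately stops at "other" rather than naming the legacy
encoding, which matches the point made in the thread that identifying the
something else is much harder and generally not necessary.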