For what it's worth, my patch was a stopgap measure, and acknowledged as
such at the time. My proposal for a real, comprehensive solution was
detailed in a comment on a (now-closed) GitHub issue [1].
If I'd had the time and the knowledge, I would have implemented it that
way. If I'd had the need, I would have made the time and gained the
knowledge. As it was, I submitted a patch to make Unicode handling (a)
better than it was, and (b) work as well as I needed it to.
[1] https://github.com/edsu/pymarc/issues/7#issuecomment-501460
On Thu, Mar 8, 2012 at 12:32 PM, Godmar Back <[log in to unmask]> wrote:
> On Thu, Mar 8, 2012 at 3:18 PM, Ed Summers <[log in to unmask]> wrote:
>
> > Hi Terry,
> >
> > On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry
> > <[log in to unmask]> wrote:
> > > This is one of the reasons you really can't trust the information found
> > > in position 9. This is one of the reasons why when I wrote MarcEdit, I
> > > utilize a mixed process when working with data and determining
> > > characterset -- a process that reads this byte and takes the information
> > > under advisement, but in the end treats it more as a suggestion and one
> > > part of a larger heuristic analysis of the record data to determine
> > > whether the information is in UTF8 or not. Fortunately, determining if a
> > > set of data is in UTF8 or something else, is a fairly easy process.
> > > Determining the something else is much more difficult, but generally not
> > > necessary.
> >
> > Can you describe in a bit more detail how MARCEdit sniffs the record
> > to determine the encoding? This has come up enough times w/ pymarc to
> > make it worth implementing.
> >
> >
> One side comment here; while smart handling/automatic detection of
> encodings would be a nice feature to have, it would help if pymarc could
> operate in an 'agnostic', or 'raw' mode where it would simply preserve the
> encoding that's there after a record has been read when writing the record.
>
> [ Right now, pymarc does not have such a mode - if leader[9] == 'a', the
> data is unconditionally utf8 encoded on output as per mbklein's patch. ]
>
> - Godmar
>
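
For illustration only, here is a minimal Python sketch of the kind of check
discussed in the quoted thread: treat leader position 9 as a hint, and let
the bytes themselves decide whenever they can. This is not MarcEdit's actual
heuristic and not part of pymarc; the names looks_like_utf8 and
guess_encoding are hypothetical, and the "fall back on the leader when the
data is all 7-bit" rule is my own assumption.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def guess_encoding(record: bytes) -> str:
    """Guess the encoding of one raw MARC record (hypothetical helper).

    Returns "utf-8", "unknown" (not valid UTF-8, so something else), or
    "undetermined" (7-bit data only and no Unicode flag in the leader).
    """
    # Leader position 9 is 'a' when the record claims to be Unicode.
    # Assumption: we treat that as a hint, never as proof.
    declared_unicode = record[9:10] == b"a"

    # Bytes above 0x7F are the only ones that can tell us anything.
    has_high_bytes = any(b > 0x7F for b in record)

    if not has_high_bytes:
        # Nothing in the data settles the question; use the leader's hint.
        return "utf-8" if declared_unicode else "undetermined"

    # With high bytes present, a strict decode is the deciding test,
    # whatever the leader says.
    return "utf-8" if looks_like_utf8(record) else "unknown"
```

The strict decode is what makes the "UTF8 or something else" question easy:
bytes above 0x7F that are not well-formed UTF-8 sequences fail to decode.
Short runs of other encodings can occasionally pass by accident, so a real
implementation would want more evidence than this.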