On the matter of character input, the following email from the Unicode list may be of interest to those working with or developing a web UI. Note that Key Curry is a work in progress and has received a fair number of "it doesn't have" comments on the Unicode list. But it is a good basis.
Peter
------------------------------
From: [log in to unmask] on behalf of Ed Trager [[log in to unmask]]
Sent: Tuesday, April 17, 2012 2:41 PM
To: Unicode Mailing List
Subject: Key Curry : Attempting to make it easy to type world languages and orthographies on the web
A long time in the making, I am finally making "Key Curry" public!
"Key Curry" is a web application and set of web components that allows
one to easily type many world languages and specialized orthographies
on the web. Please check it out and provide me feedback:
http://unifont.org/keycurry/
In addition to supporting major world languages and orthographies, I
hope that "Key Curry" makes it easy for language advocates and web
developers to provide support for the orthographies of minority
languages -- many of which are not currently supported (or are only
poorly supported) by the major operating system vendors.
Under the hood, the software uses a javascript user interface
framework that I wrote called "Gladiator Components" along with the
popular "jQuery" javascript library as a foundation. I have used HTML
5 technologies such as localStorage to implement certain features.
Currently, Key Curry appears to work well in the latest versions of
Google Chrome, Firefox, and Safari on devices with standard QWERTY
keyboards (e.g. laptops, desktop computers, netbooks, etc.). Recent
versions of Opera and Internet Explorer version 9 appear to have bugs
which limit the ability of Key Curry to operate as designed. The app
is not likely to work well on older versions of any browser. I have
not yet tested IE 10 on Windows 8.
Although Key Curry appears to load flawlessly on the very few Android
and Apple iOS tablet and/or mobile devices that I have "dabbled" with,
the virtual keyboards on those devices are very different from
physical keyboards and I have not yet investigated that problem area
at all - so don't expect it to work on your iPad or other mobile
device.
Constructive criticism and feedback are most welcome. I have many
additional plans for Key Curry "in the works" - but I'll leave further
commentary to another day!
- Ed
-----------------------------------------
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Robert Haschart
> Sent: Thursday, April 19, 2012 2:23 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
>
> On 4/18/2012 12:08 PM, Jonathan Rochkind wrote:
> > On 4/18/2012 11:09 AM, Doran, Michael D wrote:
> >> I don't believe that is the case. Take UTF-8 out of the picture, and
> >> consider the MARC-8 character set with its escape sequences and
> >> combining characters. A character such as an "n" with a tilde would
> >> consist of two bytes. The Greek small letter alpha, if invoked in
> >> accordance with ANSI X3.41, would consist of five bytes (two bytes
> >> for the initial escape sequence, a byte for the character, and then
> >> two bytes for the escape sequence returning to the default character
> >> set).
> >
> > ISO 2709 doesn't care how many bytes your characters are. The
> > directory and offsets and other things count bytes, not characters.
> > (which was, in my opinion, the _right_ decision, for once with marc!)
> >
> > How bytes translate into characters is not a concern of ISO 2709.
> >
> > The majority of non-7-bit-ASCII encodings will have chars that are
> > more than one byte, either sometimes or always. This is true of MARC8
> > (some chars), UTF8 (some chars), and UTF16 (all chars), all of them.
> > (It is not true of Latin-1 though, for instance, I don't think).
> >
> > ISO 2709 doesn't care what char encodings you use, and there's no
> > standard ISO 2709 way to determine what char encodings are used for
> > _data_ in the MARC record. ISO 2709 does say that _structural_
> > elements like field names, subfield names, the directory itself,
> > separator chars, etc, all need to be (essentially, over-simplifying)
> > 7-bit-ASCII. The actual data itself is application dependent, 2709
> > doesn't care, and 2709 doesn't give any standard cross-2709 way to
> > determine it.
> >
> > That is my conclusion at the moment, helped by all of you all in this
> > thread, thanks!
>
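As an illustration of the bytes-versus-characters point made above, here is a minimal Python sketch. It is not taken from any of the tools mentioned in this thread. The helper name `field_length` is hypothetical, and it ignores indicators and subfield codes, which a real MARC field would also contain. Python's standard library has no MARC-8 codec, so the MARC-8 form of "n with tilde" is written out by hand as the combining tilde byte 0xE4 followed by the base letter.

```python
FIELD_TERMINATOR = b"\x1e"  # field terminator byte used in MARC 21 records


def field_length(text: str, encoding: str) -> int:
    """Length in BYTES of the encoded text plus the field terminator.

    Simplified: a real field also carries indicators and subfield codes.
    This is the kind of number a directory entry records, not a character count.
    """
    return len(text.encode(encoding)) + len(FIELD_TERMINATOR)


n_tilde = "\u00f1"  # LATIN SMALL LETTER N WITH TILDE, one character

# The same single character takes a different number of bytes per encoding.
for encoding in ("latin-1", "utf-8", "utf-16-be"):
    print(encoding, len(n_tilde.encode(encoding)))

# MARC-8 written by hand: combining tilde (0xE4) precedes the base letter "n".
marc8_n_tilde = b"\xe4n"
print("marc-8", len(marc8_n_tilde))

# Character count versus byte-based field length for one short string.
text = "Espa\u00f1a"
print(len(text), field_length(text, "latin-1"), field_length(text, "utf-8"))
```

The directory only needs the byte lengths, which is why the structure can stay agnostic about how those bytes map to characters.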
> The conclusion that I came to in the work I have done on marc4j (which is used heavily by SolrMarc)
> is that for any significant processing of MARC records, the only solution that makes sense is to
> translate the record data into Unicode characters as it is being read in. Of course as you and others
> have stated, determining what the data actually is, in order to correctly translate it to Unicode, is
> no easy task. The leader byte that merely indicates "is UTF8" or "is not UTF8" is wrong often enough
> in the real world that it is of little value when it indicates "is UTF-8" and is of even less value when
> it indicates "is not UTF-8".
>
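To make the leader-byte problem concrete, here is a rough Python sketch of a consistency check. It is not how marc4j does it; it is only one simple heuristic, and the function names are made up for this example. It assumes MARC 21, where leader position 09 is "a" for Unicode and blank for MARC-8, and it simply tests whether the raw bytes decode as strict UTF-8.

```python
def declared_encoding(record: bytes) -> str:
    """Read the character coding scheme from leader position 09 (MARC 21)."""
    flag = record[9:10]
    if flag == b"a":
        return "utf-8"
    if flag == b" ":
        return "marc-8"
    return "unknown"


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_record(record: bytes) -> str:
    """Compare what the leader claims with what the bytes look like."""
    declared = declared_encoding(record)
    valid = is_valid_utf8(record)
    has_high_bytes = any(b >= 0x80 for b in record)
    if declared == "utf-8" and not valid:
        return "leader says UTF-8, but the data is not valid UTF-8"
    if declared == "marc-8" and has_high_bytes and valid:
        return "leader says MARC-8, but the data looks like UTF-8"
    return "no mismatch detected"
```

A record that passes this check is not proven correct. Pure ASCII data, for example, is valid under either declaration, so real-world code needs considerably more than this.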
> Significant portions of the code I've added to marc4j deal with trying to determine what the encoding
> of that data actually is and trying to translate the data correctly into Unicode even when the data is
> incorrect.
>
> You also argued in another message that cataloger entry tools should
> give feedback to help the cataloger not create errors. I agree. I
> think one possible step towards this would be for the editor to work in Unicode, irrespective of
> the format in which the underlying system expects the data. If the underlying system expects
> MARC8, then the "save as" process should translate the data into MARC8 on output.
>
> -Robert Haschart
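The "edit in Unicode, translate on save" idea above could be sketched roughly as follows in Python. This is only an illustration of the feedback step, not any existing editor's behaviour. The function names are hypothetical, and because Python's standard library has no MARC-8 codec, the usage lines use "latin-1" as a stand-in target encoding.

```python
from typing import List, Tuple


def unencodable_characters(text: str, target_encoding: str) -> List[Tuple[int, str]]:
    """Return (index, character) pairs that the target encoding cannot represent."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(target_encoding)
        except UnicodeEncodeError:
            problems.append((index, char))
    return problems


def save_as(text: str, target_encoding: str) -> bytes:
    """Translate Unicode text to the target encoding, refusing lossy output."""
    problems = unencodable_characters(text, target_encoding)
    if problems:
        # Report positions so the editor can point the cataloger at them.
        raise ValueError(f"cannot encode in {target_encoding}: {problems}")
    return text.encode(target_encoding)


# "latin-1" stands in for the system's real target encoding (e.g. MARC-8).
print(save_as("Espa\u00f1ol", "latin-1"))
print(unencodable_characters("Espa\u00f1ol \u03b1", "latin-1"))
```

Reporting the offending positions before writing anything is what lets the editor give the cataloger feedback at entry time rather than producing a bad record.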