Hmm, you could theoretically assign chars in the Unicode Private Use
Area to the chars you need -- but then have your application replace
those chars with small images on rendering/display.
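For example, a minimal sketch in Python (the PUA codepoint U+E000 and
the image filename are arbitrary choices for illustration, not anything
standardized):

```python
# A "private" character for the 'wh' ligature -- U+E000 is an arbitrary
# pick from the Private Use Area, chosen here purely for illustration.
WH_LIGATURE = "\ue000"

def render_html(text):
    # Swap the private character for a small inline image on display only;
    # the stored text keeps the single PUA character.
    return text.replace(WH_LIGATURE, '<img src="wh-ligature.png" alt="wh"/>')

stored = WH_LIGATURE + "akarewarewa"
print(render_html(stored))
# -> <img src="wh-ligature.png" alt="wh"/>akarewarewa
```

The stored text stays an ordinary Unicode string, so searching and
indexing still see a single character rather than markup.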
This seems as clean a solution as you are likely to find. Your TEI
solution still requires chars-as-images for these unusual chars, right?
So this is no better with regard to copying-and-pasting, browser
display, and general interoperability than your TEI solution, but no
worse either -- it's pretty much the same thing. But it may be better in
terms of those considerations for chars that actually ARE already
Unicode codepoints.
If any of your "private" chars later become standard Unicode
codepoints, you could always globally replace your private codepoints
with the new standard ones.
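That migration could be a one-off global replacement -- a sketch in
Python, where both the private codepoint (U+E000) and its imagined
standardized successor (U+ABCD) are placeholders invented for
illustration, not real assignments:

```python
# Map each retired private codepoint to its newly standardized one.
# Both codepoints here are stand-ins, not real assignments.
MIGRATION = str.maketrans({0xE000: "\uabcd"})

def migrate(text):
    # One pass over the stored text replaces every private character.
    return text.translate(MIGRATION)

print(migrate("\ue000enua"))
```

str.maketrans/str.translate can carry many such mappings at once, so
the whole migration is a single cheap pass even over a large corpus.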
With 137K "private codepoints" available, you _probably_ wouldn't run
out. I think. You could try standardizing these "private" codepoints
with people in contexts/communities similar to yours -- it looks like
there are several existing efforts to document shared uses of "private
codepoints" for chars that do not have official Unicode codepoints;
they are mentioned in the Wikipedia article.
[Reading that Wikipedia article also taught me something new about
MARC21 and Unicode -- a topic generally on top of my pile these days --
"The MARC 21 standard uses the [Private Use Area] to encode East Asian
characters present in MARC-8 that have no Unicode encoding." Who knew?]
Jonathan
Jakob Voss wrote:
> Hi Stuart,
>
>
>> These have been included because they are in widespread use in a current
>> written culture. The problems I personally have are down to characters
>> used by a single publisher in a handful of books more than a hundred
>> years ago. Such characters are explicitly excluded from Unicode.
>>
>> In the early period of the standardisation of the Māori language there
>> were several competing ideas of what to use as a character set. One of
>> those included a 'wh' ligature as a character. Several works were
>> printed using this ligature. This ligature does not qualify for
>> inclusion in Unicode.
>>
>
> That is a matter of discussion. If you do not call it a 'ligature',
> the chances of getting it included are higher.
>
>
>> To see how we handle the text, see:
>>
>> http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html
>>
>> The underlying representation is TEI/XML, which has a mechanism to
>> handle such glyphs. The things I'm still unhappy with are:
>>
>> * getting reasonable results when users cut-n-paste the text/image HTML
>> combination to some other application
>> * some browsers still like line-breaking on images in the middle of words
>>
>
> That's interesting, and reminds me of the treatment of mathematical
> formulae in journal titles, which mostly end up as ugly images.
>
> In Unicode you are allowed to assign private characters
>
> http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Private_use_characters
>
> The U+200D ZERO WIDTH JOINER could also be used but most browsers will
> not support it - you need a font that supports your character anyway.
>
> http://blogs.msdn.com/michkap/archive/2006/02/15/532394.aspx
>
> In summary: Unicode is just a subset of all characters that have been
> used for written communication, and whether a character gets included
> depends not only on objective properties but also on lobbying and
> other circumstances. The deeper you dig, the nastier Unicode gets --
> as with all complex formats and standards.
>
> Cheers
> Jakob
>
> P.S: Michael Kaplan's blog also contains a funny article about emoji:
> http://blogs.msdn.com/michkap/archive/2010/04/27/10002948.aspx
>
>