It seems that different people are seeing different things in their
respective viewers (i.e., some are OK and others are like what I am
seeing).
When I use wget and view the local file in Firefox (3.0.4, Linux Suse
11.0) I see:
http://cuvier.cisti.nrc.ca/~gnewton/tictoc1.gif
[gif used as it is not lossy]
The text is clearly not correct.
The file I got with wget is:
http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt
Is this just a question of different client software (and/or OSes)
viewing or mangling the content?
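One way to take the viewer out of the equation is to look at the raw bytes rather than at how any client renders them. The following is only a rough sketch: the function name looks_double_encoded is made up for illustration, and it assumes the second encoding pass treated the bytes as Latin-1 (a Windows-1252 pass would need a different codec).

```python
def looks_double_encoded(raw: bytes) -> bool:
    """Heuristic: True if raw looks like UTF-8 that was UTF-8 encoded twice.

    Assumption (not from the thread): the extra pass read the bytes as Latin-1.
    """
    try:
        text = raw.decode("utf-8")
        # Undo one layer: map each character back to a single byte,
        # then check whether the result is itself valid UTF-8.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not UTF-8 at all, or contains characters that cannot be the
        # product of a Latin-1 misreading, or the peeled bytes are not UTF-8.
        return False
    # Pure ASCII survives the round trip unchanged and is not flagged.
    return repaired != text
```

It could be run line by line over the bytes of the file fetched with wget (opened in binary mode), so that no editor or browser gets a chance to reinterpret the content.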
-glen
-------------------------------------------------------
Thanks for tracking this down, Godmar.
I've emailed tictocs and we'll see what they say.
-Glen :-)
------------------------------------------------------------------
From: Godmar Back <[log in to unmask]>
Sender: Code for Libraries <[log in to unmask]>
To: [log in to unmask]
Subject: Re: [CODE4LIB] Character problems with tictoc
Date: Mon, 21 Dec 2009 13:20:08 -0500
Message-ID: <[log in to unmask]>
The string in question is double-encoded, that is, a string that's in
UTF-8 already was run through a UTF-8 encoder.
The string is "Acta Ortopedica" where the 'e' is really '\u00e9' aka
'Latin Small Letter E with Acute'. [1]
In UTF-8, the e-acute is two-byte encoded as C3 A9. If you run the
bytes C3 A9 through a UTF-8 encoder, C3 ('\u00c3' - Capital A with
tilde) becomes C3 83 and A9 ('\u00a9' - copyright sign) becomes C2 A9.
C3 83 C2 A9 is exactly what JISC is serving; what it should be serving
is C3 A9.
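The byte arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch, not what the tictocs server actually runs; the variable names are illustrative, and the Latin-1 decode is an assumption standing in for "treat each byte as its own character".

```python
# Proper UTF-8 for e-acute (U+00E9) is the two bytes C3 A9.
correct = "\u00e9".encode("utf-8")
assert correct == b"\xc3\xa9"

# Double encoding: misread each byte as a separate character
# (C3 -> U+00C3, A9 -> U+00A9), then UTF-8 encode again.
# Assumption: the misreading is modelled here as a Latin-1 decode.
double = correct.decode("latin-1").encode("utf-8")
assert double == b"\xc3\x83\xc2\xa9"

# Reversing the two steps recovers the original character.
repaired = double.decode("utf-8").encode("latin-1").decode("utf-8")
assert repaired == "\u00e9"
```

The last step only works when every character of the mangled text fits in Latin-1, which holds for this example.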
Send email to them.
- Godmar
[1] http://www.utf8-chartable.de/
2009/12/21 Glen Newton <[log in to unmask]>
>
> [I realise there was a recent related 'Character-sets for dummies'[1]
> discussion]
>
> I am using tictocs[2] list of journal RSS feeds, and I am getting
> gibberish in places for diacritics. Below is an example:
>
> in emacs:
> 221 Acta Ortop dica Brasileira http://www.scielo.br/rss.php?pid=1413-7852&lang=en 1413-7852
> in Firefox:
> 221 Acta Ortop dica Brasileira http://www.scielo.br/rss.php?pid=1413-7852&lang=en 1413-7852
>
> Note that the emacs view is the same for both a file saved from
> Firefox and a direct download using 'wget'.
>
> Is this something on my end, or are the tictocs people not serving
> proper UTF-8?
>
> The HTTP header from wget claims UTF-8:
> > wget -S http://www.tictocs.ac.uk/text.php
> > --2009-12-21 12:47:59-- http://www.tictocs.ac.uk/text.php
> > Resolving www.tictocs.ac.uk... 130.88.101.131
> > Connecting to www.tictocs.ac.uk|130.88.101.131|:80... connected.
> > HTTP request sent, awaiting response...
> > HTTP/1.1 200 OK
> > Date: Mon, 21 Dec 2009 17:42:05 GMT
> > Server: Apache/2.2.13 (Unix) mod_ssl/2.2.13 OpenSSL/0.9.8k PHP/5.3.0 DAV/2
> > X-Powered-By: PHP/5.3.0
> > Content-Type: text/plain; charset=utf-8
> > Connection: close
> > Length: unspecified [text/plain]
> ><....stuff removed>
>
> Can someone validate if they are also experiencing this issue?
>
> Thanks,
> Glen
>
> [1]https://listserv.nd.edu/cgi-bin/wa?S2=CODE4LIB&q=&s=character-sets+for+dummies&f=&a=&b=
> [2]http://www.tictocs.ac.uk/text.php
>
> --
> Glen Newton | [log in to unmask]
> Researcher, Information Science, CISTI Research
> & NRC W3C Advisory Committee Representative
> http://tinyurl.com/yvchmu
> tel/tél: 613-990-9163 | facsimile/télécopieur 613-952-8246
> Canada Institute for Scientific and Technical Information (CISTI)
> National Research Council Canada (NRC)| M-55, 1200 Montreal Road
> http://www.nrc-cnrc.gc.ca/
> Institut canadien de l'information scientifique et technique (ICIST)
> Conseil national de recherches Canada | M-55, 1200 chemin Montréal
> Ottawa, Ontario K1A 0R6
> Government of Canada | Gouvernement du Canada
> --