On behalf of Charles Riley:
---------- Forwarded message ----------
From: Riley, Charles <[log in to unmask]>
Date: 23 February 2016 at 05:37
Subject: [camms-ccaam] Common encoding errors
Hi all,
This is something I’ve noticed happening fairly regularly, and probably
with increasing frequency lately: a class of problems with records
containing either escaped entity references from HTML or XML (like
‘ ’), or accented characters that have become corrupted in a data
migration (like ‘français
<https://openlibrary.org/works/OL10004281W/Les_archets_français>’). I was
asked by another librarian if I could point them to any resources that deal
with this class of issues, and rounded up a few that I thought would be
good to share. Here’s what I came across, in terms of examples and
explanations for some of the more common cases:
http://markmcb.com/2011/11/07/replacing-ae%E2%80%9C-ae%E2%84%A2-aeoe-etc-with-utf-8-characters-in-ruby-on-rails/
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
(But treat this list with caution in using it to search; there will be
false positives for a search for ‘amp;’, for example.)
http://www.i18nqa.com/debug/utf8-debug.html (See also associated links on
this page.)
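For anyone who wants to experiment, here is a minimal Python sketch of the two cases described above. It is only an illustration under stated assumptions, not a tested cleanup tool: the function names are made up for this example, the mojibake repair assumes the text was UTF-8 that got misread as Windows-1252 (the common case covered in the links above, but not the only one), and the pattern only finds complete entity references that begin with '&' and end with ';', which is one way to avoid the false positives a bare search for 'amp;' would give.

```python
import html
import re

# Matches a complete entity reference such as &amp; &#233; or &#xE9;
# Requiring the leading '&' avoids hits on ordinary words like "stamp;".
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def find_entities(text):
    """Return the entity references found in text, for review before fixing."""
    return ENTITY_RE.findall(text)


def unescape_fully(text, max_rounds=3):
    """Undo HTML/XML escaping, including double escaping like '&amp;eacute;'.

    max_rounds is an arbitrary safety limit chosen for this sketch.
    """
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


def repair_mojibake(text):
    """Reverse UTF-8 bytes that were misread as Windows-1252 (assumed cause).

    If the round trip fails, the text was probably not damaged in this way,
    so it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

Running find_entities over a batch of records first, and reviewing the matches by hand, is probably safer than applying either fix blindly, since a record may legitimately contain an ampersand or an accented character that only looks suspicious.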
Hope this helps!
Charles Riley
*Charles Riley*
*Interim Librarian for African Studies and Catalog Librarian*
*Sterling Memorial Library*
*Yale University*
*(203)432-7566 or (203)432-9301*
--
Andrew Cunningham
[log in to unmask]