Hello:

I'm wondering if anyone else has run into this situation with OCLC WorldCat
Z39.50 servers. When we search Voilà (Canada's national union catalogue,
built on WorldCat)
<https://library-archives.canada.ca/eng/collection/basics/pages/voila.aspx>
via Z39.50 for ILL purposes, records catalogued in French come back with
corrupted characters where accented characters are expected. We're seeing
this in Ex Libris Alma, where Voilà is an external search source maintained
by Alma (we can't see the parameters Alma is using to connect).

For example, searching Voilà for ISBN 9782920325241 returns a list of
records whose summaries display without accents. However, checking the MARC
record itself shows that where there should be accented characters, there
are instead literal <U+00fd> placeholders in place of the expected Unicode
code points.

The 245 for the first record returned for ISBN 9782920325241 appears as
follows in Alma:
245 1 0 $$a R<U+00fd>ever le nouveau monde : $$b une <U+00fd>uvre d'art
public de Michel Goulet offerte par la ville de Montr<U+00fd>eal <U+00fd>a
la ville de Qu<U+00fd>ebec dans le cadre des c<U+00fd>el<U+00fd>ebrations
du 400e anniversaire de sa fondation / $$c textes, Louise D<U+00fd>ery et
Michel Goulet.

   - Where we see R<U+00fd>ever, we should see Rêver.
   - Where we see <U+00fd>uvre, we should see œuvre.
   - Where we see Montr<U+00fd>eal, we should see Montréal.
   - Where we see <U+00fd>a, we should see à.

I don't *think* this is an example of Unicode canonicalization into a
normalized form using combining characters, as <U+00fd> is not a combining
character. If that were the case, we would see e<U+0301> for é, a<U+0300>
for à, etc.
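
A quick way to check that reasoning is Python's standard unicodedata module.
This is only a minimal sketch; the describe() helper is a name made up for
illustration. It shows what a decomposed (NFD) é looks like, with the
combining mark after its base letter, and confirms that U+00FD has combining
class 0, i.e. it is not a combining character.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text]


composed = "\u00e9"  # é as a single precomposed code point
decomposed = unicodedata.normalize("NFD", composed)  # base letter + combining mark

print(describe(composed))
print(describe(decomposed))

# A combining class of 0 means "not a combining character".
print(unicodedata.combining("\u00fd"))  # U+00FD, the code point seen in Alma
print(unicodedata.combining("\u0301"))  # U+0301, a real combining acute accent
```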

Instead, it looks like a case of corruption introduced by incorrect
decoding. This can occur when the Z39.50 source sends UTF-8 and the client
interprets it as MARC-8, or vice versa.
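
To illustrate the general mechanism, here is a rough Python sketch. Python's
standard library has no MARC-8 codec, so Latin-1 stands in as the "wrong"
encoding; that substitution is an assumption for illustration only, and
misdecode() is a made-up helper name. The last line shows that a lone 0xFD
byte read as Latin-1 becomes U+00FD, which would be consistent with what Alma
displays, though I can't confirm that is what Alma actually does.

```python
def misdecode(text: str, written_as: str, read_as: str) -> str:
    """Encode text one way and decode it another, replacing undecodable bytes."""
    return text.encode(written_as).decode(read_as, errors="replace")


# UTF-8 bytes read as Latin-1: each accented letter turns into two characters.
print(misdecode("Montréal", "utf-8", "latin-1"))

# Latin-1 bytes read as UTF-8: the accented letter becomes U+FFFD.
print(misdecode("Montréal", "latin-1", "utf-8"))

# A lone 0xFD byte is U+00FD in Latin-1 and is never valid in UTF-8.
print(repr(b"\xfd".decode("latin-1")))
print(repr(b"\xfd".decode("utf-8", errors="replace")))
```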

That said, I have tested a few different approaches to querying Voilà using
the yaz-client command line Z39.50 client (version 5.28.0 on Ubuntu), and
it seems that Voilà itself might be returning the corrupted characters.

For example, when I set the negotiation character set to MARC-8 (matching
the LDR[09] of the returned record) and the display character set to UTF-8,
the results on the screen and in the file show corruption:

$ yaz-client -m testvoila.mrc
Z> negcharset MARC-8
Character set negotiation : MARC-8
Z> charset
Negotiation character set `MARC-8'
Records in charset yes
Charneg version 3
Display character set is `UTF-8'
MARC character set is `none'
Query character set is `none'
Z> authentication USERNAME/PASSWORD
Authentication set to Open (USERNAME/PASSWORD)
Z> open fsz3950.oclc.org:210/gclac
Connecting...OK.
Sent initrequest.
Connection accepted by v3 target.
UserInformationfield:
{
  OID: 1 2 840 10003 10 1000 17 1
  {
    ANY (len=66)
  }
}
OCLC UserInformation:
{
  motd 'You are searching the FirstSearch 5.0 Z39.50 Server!'
  dblist {
    'laccat'
  }
}
Options: search present delSet triggerResourceCtrl scan sort
extendedServices namedResultSets
Elapsed: 0.167063
Z> find @attr 1=7 9782920325241
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 6, setno 1
records returned: 0
Elapsed: 0.190875
Z> show
Sent presentRequest (1+1). Records: 1 [laccat]
Record type: USmarc
02802cam 2200685 a 4500
008 20081121s2008 quca b 000 p eng
015 $a 20089424808E $2 can
015 $a 20089424808 $2 can
016 $a (AMICUS)000034543125
016 7 $a 015299625 $2 Uk
020 $a 9782920325241 $q (Galerie de l'UQAM)
020 $a 2920325248
020 $a 9782980985249 $q (Galerie Simon Blais)
020 $a 2980985244
040 $a NLC $b eng $c NLC $d CDX $d BWX $d SUC $d UKMGB $d OCLCF $d NLC $d
OCLCA $d REB $d OCLCA $d OCLCQ $d OCLCO
041 0 $a eng $a fre
043 $a n-cn-qu
050 4 $a N6549.G686 $b A4 2008
055 02 $a NB249*
055 3 $a N6549 G686 $b A4 2009
082 04 $a 730.92 $2 22
084 $a cci1icc $2 lacc
100 1 $a D�ery, Louise, $d 1955-
245 10 $a R�ever le nouveau monde : $b une �uvre d'art public de Michel
Goulet offerte par la ville de Montr�eal �a la ville de Qu�ebec dans le
cadre des c�el�ebrations du 400e anniversaire de sa fondation / $c textes,
Louise D�ery et Michel Goulet.
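
For reference, the LDR[09] check mentioned above can be sketched in Python as
follows. In MARC 21, leader position 09 is blank for MARC-8 and "a" for
UCS/Unicode. The leader_encoding() name and both sample leaders are
hypothetical: the whitespace in the leader pasted in this message has been
collapsed, so the spacing below is my assumption of what the full 24
characters look like.

```python
def leader_encoding(leader: str) -> str:
    """Report the character coding scheme declared in MARC 21 leader position 09."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters")
    code = leader[9]
    if code == " ":
        return "MARC-8"
    if code == "a":
        return "UCS/Unicode"
    return f"unknown ({code!r})"


# Hypothetical leaders with assumed spacing; position 09 is the tenth character.
marc8_leader = "02802cam  2200685 a 4500"
unicode_leader = "02802cam a2200685 a 4500"

print(leader_encoding(marc8_leader))
print(leader_encoding(unicode_leader))
```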

In the file with the downloaded record (looking at it with Vim in binary
mode) you can see the same <fd> characters as we see in Alma for the same
record:

02802cam 2200685 a
4500008004300000015002200043015002100065016002500086016001800111020003900129020001500168020004100183020001500224040008500239041001300324043001200337050002400349055001100373055002400384082001500408084001800423100002600441245024000467246005500707260004700762300005200809336002600861337002800887338002700915500002000942500003500962504005500997546003201052550006201084600002701146600003301173648002001206650004501226650006601271650006601337650004101403650002201444650003801466651001301504651002201517700002701539710002301566710002501589710002401614856009401638938004701732945000801779947003001787948014101817948003801958948004701996948002902043948003202072949001202104^^20081121s2008
quca b 000 p eng ^^ ^_a20089424808E^_2can^^ ^_a20089424808^_2can^^
^_a(AMICUS)000034543125^^7 ^_a015299625^_2Uk^^ ^_a9782920325241^_q(Galerie
de l'UQAM)^^ ^_a2920325248^^ ^_a9782980985249^_q(Galerie Simon Blais)^^
^_a2980985244^^
^_aNLC^_beng^_cNLC^_dCDX^_dBWX^_dSUC^_dUKMGB^_dOCLCF^_dNLC^_dOCLCA^_dREB^_dOCLCA^_dOCLCQ^_dOCLCO^^0
^_aeng^_afre^^ ^_an-cn-qu^^ 4^_aN6549.G686^_bA4 2008^^02^_aNB249*^^
3^_aN6549 G686^_bA4 2009^^04^_a730.92^_222^^ ^_acci1icc^_2lacc^^1
^_aD<fd>ery, Louise,^_d1955-^^10^_aR<fd>ever le nouveau monde :^_bune
<fd>uvre d'art public de Michel Goulet offerte par la ville de Montr<fd>eal
<fd>a la ville de Qu<fd>ebec dans le cadre des c<fd>el<fd>ebrations du 400e
anniversaire de sa fondation /^_ctextes, Louise D<fd>ery et Michel Goulet.
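
To check this without Vim, a small Python sketch like the one below can
tally every byte at or above 0x80 in the raw record. The function names and
the sample bytes are made up for illustration. If every diacritic shows up
as the same single byte value (0xFD here), the distinction between acute,
grave, and circumflex was presumably already lost before the record reached
the client.

```python
from collections import Counter


def high_bytes(data: bytes) -> Counter:
    """Count every byte value >= 0x80 in a raw MARC record."""
    return Counter(b for b in data if b >= 0x80)


def report(data: bytes) -> list[str]:
    """Format the tally as '0xNN: count' lines, sorted by byte value."""
    return [f"0x{value:02X}: {count}" for value, count in sorted(high_bytes(data).items())]


# Hypothetical sample modelled on the 100 and 245 fields shown above.
sample = b"D\xfdery, Louise, R\xfdever le nouveau monde"
print(report(sample))
```

To run it against the downloaded file, read testvoila.mrc in binary mode
(open("testvoila.mrc", "rb").read()) and pass the bytes to report().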

If anyone has run into this with OCLC WorldCat Z39.50 and figured out a
solution, I would love to hear it. It would even be helpful to know if you
have a connection to an OCLC WorldCat Z39.50 server and aren't seeing these
kinds of problems. I know most of you probably won't have authentication
credentials for Voilà, but the setup has to be the same for other OCLC
WorldCat Z39.50 servers... right?

In Z39.50 and character encoding madness, I remain,
Dan