
On Fri, 19 Jan 2007, Erik Hatcher wrote:

> Tod,
>
> Great information.  I apologize for being a late comer to the game
> and bringing up FAQs.
>
> What about date normalization?
>
> One thing that must be considered when doing faceted browsing is that
> it works best with some pre-processed data, such as years rather than
> full dates.  The question becomes where does the logic for stripping
> out the years belong?  Solr could do it if configured with a custom
> analyzer for certain fields, or the client could do it.  Is there
> XSLT to do this sort of thing with dates available?

I know XSLT 2.0 can handle them far better due to its support for types.
However, MARC still has oddities which would probably need to be addressed
directly.  If doing it entirely in XSLT, I'd probably pipeline it and run
several transformations in a row.
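The same pipelining idea is easy to sketch in a scripting language: each pass is a small, single-purpose transformation, run in sequence.  (This is a rough sketch; the function names and normalization rules here are my own illustration, not anything standardized.)

```python
import re

def strip_isbd_punctuation(value: str) -> str:
    """Drop trailing ISBD punctuation such as ' :', ' /', ' ;', '.'."""
    return re.sub(r"[\s:;,/.]+$", "", value).strip()

def strip_brackets(value: str) -> str:
    """Remove square brackets used to mark supplied data, e.g. '[1922]'."""
    return value.replace("[", "").replace("]", "")

def normalize_date(value: str) -> str:
    """Reduce a date string to its first four-digit year, if any."""
    m = re.search(r"\d{4}", value)
    return m.group(0) if m else value

# Run the passes in order, just as you would chain XSLT transformations.
PIPELINE = [strip_isbd_punctuation, strip_brackets, normalize_date]

def run_pipeline(value: str) -> str:
    for step in PIPELINE:
        value = step(value)
    return value

print(run_pipeline("1922]."))  # -> 1922
```

Each stage stays trivial to test on its own, which is the main payoff of pipelining over one monolithic transform.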

There's also been work done to provide libraries and the like in XSLT.
EXSLT comes to mind right away.

One example of a MARC oddity I ran into recently: a report required the
260 $c field, and I got complaints that the dates were malformed.  Why?  They
appeared like 1922].  Those with some cataloging experience can guess the
problem.  The whole 260 field looks like this: $a [Chicago: $b some
publisher $c 1922].

I'm not entirely sure how that would get parsed into MARCXML in the first
place.
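To spell out the problem: the opening "[" lives in $a, so anything that extracts $c on its own sees an unbalanced closing bracket.  A rough scripting-language sketch of cleaning that up (the subfield layout is the example above; the cleanup rules are my own illustration):

```python
import re

# The 260 field from the example above, split by subfield.
FIELD_260 = {"a": "[Chicago :", "b": "some publisher", "c": "1922]."}

def clean_date(subfield_c: str) -> str:
    """Strip stray brackets and punctuation, then keep the 4-digit year."""
    stripped = re.sub(r"[\[\]\s.]+", " ", subfield_c).strip()
    m = re.search(r"\d{4}", stripped)
    return m.group(0) if m else stripped

print(clean_date(FIELD_260["c"]))  # -> 1922
```

This kind of per-subfield fix-up is exactly the sort of string surgery that is a one-liner in a scripting language but tedious in XSLT 1.0.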

There are techniques to deal with this in XSLT, but string manipulation
is generally more cumbersome in that language than in a scripting
language, as you mention.

In XSLT 2.0 I'd probably have a template/function to strip out
punctuation, then something to normalize dates.

Which reminds me, I need to start reviewing some XSLT/Cocoon for the
pre-conference ;).


Jonathan T. Gorman
Research Information Specialist
University of Illinois at Champaign-Urbana
216 Main Library - MC522
1408 West Gregory Drive
Urbana, IL 61801
Phone: (217) 244-4688


>
>       Erik
>
>
> On Jan 19, 2007, at 5:58 AM, Tod Olson wrote:
>
>> On Jan 19, 2007, at 4:07 AM, Erik Hatcher wrote:
>>
>>> On Jan 17, 2007, at 3:26 PM, Andrew Nagy wrote:
>>>> One thing I am hoping that can come out of the preconference is a
>>>> standard XSLT doc.  I sat down with my metadata librarian to
>>>> develop our
>>>> XSLT doc -- determining what fields are to be searchable what fields
>>>> should be left out to help speed up results, etc.
>>>>
>>>> It's pretty easy, I think you will be amazed how fast you can have a
>>>> functioning system with very little effort.
>>>
>>> You're quite right with that last statement.
>>>
>>> I am, however, skeptical of a purely MARC -> XSLT -> Solr solution.
>>> The MARC data I've seen requires some basic cleanup (removing dots at
>>> the end of subjects, normalizing dates, etc) in order to be useful as
>>> facets.  While XSLT is powerful, this type of data manipulation is
>>> better (IMO) done with scripting languages that allow for easy
>>> tweaking in a succinct way.  I'm sure XSLT could do everything that
>>> you'd want done; you can also drive screws in with a hammer :)
>>
>> So the punctuation stripping has already been done in XSLT.
>>
>> LoC has a MARCXML -> MODS XSLT stylesheet [1] which strips out the
>> evil
>> ISBD punctuation. I've generally found mapping from MODS to be more
>> convenient than mapping from MARC, so while it's an extra step, it
>> does
>> save a little programmer time since some of the hidden hierarchy in
>> the
>> MARC data is made explicit in the MODS structure.
>>
>> If hopping through MODS is unacceptable, the LoC has the punctuation-
>> stripping nicely tucked away into a MARC Conversion Utility Stylesheet
>> that you could use directly in a MARC XML -> Solr transformation. [2]
>>
>> [1] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS.xsl
>> [2] http://www.loc.gov/marcxml/xslt/MARC21slimUtils.xsl
>>
>>
>> Tod Olson <[log in to unmask]>
>> Programmer/Analyst
>> University of Chicago Library
>