We run into this problem fairly regularly, and in fact, ran into it on
Monday with ruby-marc.
The way we've traditionally handled it is to put our marc stream through
a cleanup preprocessor before passing it off to a marc parser (ruby-marc
or marc4j).
The preprocessor can do one of two things:
1) Skip the bad record in the marc stream and move on; or
2) Substitute the bad characters with some default character, and
write it out.
In both cases we log the error as a warning, and include a byte offset
where the bad character occurs, and the record ID, if possible. This
allows us to go back and fix the errors in a stream in a batch;
generally, the encoding errors fall into four or five common categories
(cutting and pasting data from Windows is a typical cause).
In either case, what we DON'T want is to halt the processing altogether.
Generally, we're dealing with thousands, sometimes millions, of MARC
records in a stream; it's very frustrating to get halfway through the
stream, then have the parser throw an exception and halt. Halting the
processing should be the strategy of last resort, to be called only when
the stream has become so corrupted you can't go on to the next record.
I'd want the default to be option 1 (skip the bad record). Let the user determine what
changes need to be made to the data; the parser's job is to parse, not
infer and create. Overwriting data could also lead to the misperception
that everything is okay, when it really isn't.
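
To make the two preprocessor options concrete, here is a minimal sketch in
plain Ruby. It is illustrative only, not the actual preprocessor: the method
names (first_bad_offset, clean_marc_stream) and the mode: flag are made up
for the example, it does not use ruby-marc at all, and it assumes the
records are supposed to be UTF-8 (a MARC-8 stream would need a different
validity check). It identifies records by their position in the stream
rather than by record ID.

```ruby
require 'logger'

# End-of-record byte defined by the MARC (ISO 2709) transmission format.
RECORD_TERMINATOR = "\x1D"

# Hypothetical helper: byte offset of the first invalid UTF-8 byte in
# +record+, or nil if the whole record is valid.
def first_bad_offset(record)
  offset = 0
  record.each_char do |ch|
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

# Hypothetical preprocessor. Copies records from +input+ to +output+ and
# never raises on a bad byte. Depending on +mode+ it either
#   :skip       - drops the bad record and moves on (option 1), or
#   :substitute - replaces bad bytes with +replacement+ (option 2).
# Either way it logs a warning with the byte offset of the first bad byte.
def clean_marc_stream(input, output, mode: :skip, replacement: '?',
                      logger: Logger.new($stderr))
  stream_pos = 0   # byte offset of the current record within the stream
  index = 0        # 1-based position of the record in the stream

  input.each_line(RECORD_TERMINATOR) do |raw|
    index += 1
    # Assumption: records are meant to be UTF-8.
    record = raw.dup.force_encoding(Encoding::UTF_8)
    bad = first_bad_offset(record)

    if bad
      logger.warn("record #{index}: invalid byte at stream offset " \
                  "#{stream_pos + bad} (#{mode})")
      record = mode == :substitute ? record.scrub(replacement) : nil
    end

    output.write(record) if record
    stream_pos += raw.bytesize
  end
end
```

This would be called with two IO objects opened in binary mode, for example
clean_marc_stream(File.open('in.mrc', 'rb'), File.open('out.mrc', 'wb')),
where the file names are placeholders. One caveat with :substitute: if the
replacement does not have the same byte length as the bytes it replaces, the
record length in the leader and the field offsets in the directory no longer
match the data, which is one more reason to prefer skipping by default.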
-- Scott
On 11/20/2013 08:32 AM, Jon Stroop wrote:
> Coming from nowhere on this...is there a place where it would be
> convenient to flag which behavior the user (of the library) wants? I
> think you're correct that most of the time you'd just want to blow
> through it (or replace it), but for the situation where this isn't the
> case, I think the Right Thing to do is raise the exception. I don't
> think you would want to bury it in some assumption made internal to the
> library unless that assumption can be turned off.
>
> -Jon
>
>
> On 11/19/2013 07:51 PM, Jonathan Rochkind wrote:
>> ruby-marc users, a question.
>>
>> I am working on some Marc8 to UTF-8 conversion for ruby-marc.
>>
>> Sometimes, what appears to be an illegal byte will appear in the Marc8
>> input, and it can not be converted to UTF8.
>>
>> The software will support two alternatives when this happens: 1)
>> Raising an exception. 2) Replacing the illegal byte with a replacement
>> char and/or omitting it.
>>
>> I feel like most of the time, users are going to want #2. I know
>> that's what I'm going to want nearly all the time.
>>
>> Yet, still, I am feeling uncertain whether that should be the default.
>> Which should be the default behavior, #1 or #2? If most people most
>> of the time are going to want #2 (is this true?), then should that be
>> the default behavior? Or should #1 still be the default behavior,
>> because by default bad input should raise, not be silently recovered
>> from, even though most people most of the time won't want that, heh.
>>
>> Jonathan
--
Scott Prater
Shared Development Group
General Library System
University of Wisconsin - Madison
5-5415