When I first started working on marc4j, it behaved as suggested here,
i.e. it expected records to be correctly formed in almost every
respect and threw an exception when it encountered an error. Worse, it
did so in a way that didn't even allow processing to continue with the
next record, since the Reader was left in an inconsistent state when
the exception was thrown.
The approach I took in creating the MarcPermissiveStreamReader was to
move as far as possible towards the other approach being suggested
here, i.e. flag the error, fix it as best it can, and allow the program
to proceed. To this end, marc4j has an ErrorHandler class that tracks
all of the errors it encounters as it processes a record. The
ErrorHandler is used by the MarcPermissiveStreamReader in general, as
well as by the Marc8 to UTF-8 translation code, to note what errors
were encountered, how severe they are, and what corrective action was
taken.
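
To make the flag-fix-and-continue pattern concrete, here is a minimal
sketch in Python. It is not marc4j's actual API: the names ErrorLog,
RecordError, Severity and decode_field are all hypothetical, and ASCII
stands in for MARC-8 because Python ships no MARC-8 codec. The strict
flag corresponds to option #1 (raise); the default corresponds to
option #2 (replace and record what was done).

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity levels, not marc4j's own constants.
    INFO = 0
    MINOR = 1
    MAJOR = 2


@dataclass
class RecordError:
    record_id: str
    severity: Severity
    message: str
    action: str  # the corrective action that was taken


@dataclass
class ErrorLog:
    errors: list = field(default_factory=list)

    def add(self, record_id, severity, message, action):
        self.errors.append(RecordError(record_id, severity, message, action))

    def worst(self):
        # Highest severity seen so far, or None if nothing was logged.
        return max((e.severity for e in self.errors), default=None)


def decode_field(raw: bytes, record_id: str, log: ErrorLog,
                 strict: bool = False) -> str:
    """Decode raw bytes as ASCII (an assumed stand-in for MARC-8)."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        if strict:
            raise  # option #1: let the caller deal with it
        # Option #2: note the first bad byte, then substitute U+FFFD
        # for every undecodable byte and carry on.
        log.add(record_id, Severity.MINOR,
                f"undecodable byte 0x{raw[exc.start]:02x} "
                f"at offset {exc.start}",
                "replaced with U+FFFD")
        return raw.decode("ascii", errors="replace")
```

With this shape the caller decides afterwards what to do with a record,
for example by checking log.worst() before indexing it, rather than
having the parser make that decision by halting.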
In our implementation at UVa, these error messages are included in the
records that are built and sent to the Solr index, so that they can
later be reviewed and (perhaps) eventually fixed. At present, I think
our index of 6.3M records has close to 600K records containing errors
of one sort or another.
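
As a rough illustration of carrying the error messages along with the
indexed record, here is a small Python sketch. The field name
"marc_error" and the function build_index_doc are made up for the
example and say nothing about the actual UVa schema; the point is only
that a record with errors is still indexed and can be found again
later.

```python
def build_index_doc(record_id, title, errors):
    """Build a dict to send to the index.

    errors is a list of (severity, message) tuples collected while
    parsing the record; it may be empty.
    """
    doc = {"id": record_id, "title": title}
    if errors:
        # "marc_error" is a hypothetical multi-valued field name.
        doc["marc_error"] = [f"{sev}: {msg}" for sev, msg in errors]
    return doc


docs = [
    build_index_doc("u1", "Clean record", []),
    build_index_doc("u2", "Damaged record",
                    [("MINOR", "undecodable byte replaced")]),
]

# Records needing review are the ones that carry the error field.
flagged = [d["id"] for d in docs if "marc_error" in d]
```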
On 11/20/2013 10:26 AM, Jonathan Rochkind wrote:
> I am not sure how you ran into this problem on Monday with ruby-marc,
> since ruby-marc doesn't currently handle Marc8 conversion to UTF-8 at
> all -- how could you have run into a problem with Marc8 to UTF8
> conversion? But that is what I am adding.
> But yeah, using a preprocessor is certainly one option, that will not
> be taken away from people. Although hopefully adding Marc8->UTF8
> conversion to ruby-marc might remove the need for a preprocessor in
> many cases.
> So again, we have a bit of a paradox, that I have in my own head too.
> Scott suggests that "In either case, what we DON'T want is to halt the
> processing altogether." And yet, still, that the default behavior
> should be raising an exception -- that is, halting processing
> altogether, right?
> So hardly anyone hardly ever is going to want the default behavior,
> but everyone thinks it should be default anyway, to force people to
> realize what they're doing? I am not entirely objecting to that --
> it's why I brought it up here, but it does seem odd, doesn't it? To
> say something should be default that hardly anyone hardly ever will want?
> On 11/20/13 10:10 AM, Scott Prater wrote:
>> We run into this problem fairly regularly, and in fact, ran into it on
>> Monday with ruby-marc.
>> The way we've traditionally handled it is to put our marc stream through
>> a cleanup preprocessor before passing it off to a marc parser (ruby-marc
>> or marc4j).
>> The preprocessor can do one of two things:
>> 1) Skip the bad record in the marc stream and move on; or
>> 2) Substitute the bad characters with some default character, and
>> write it out.
>> In both cases we log the error as a warning, and include a byte offset
>> where the bad character occurs, and the record ID, if possible. This
>> allows us to go back and fix the errors in a stream in a batch;
>> generally, the bad encoding errors fall into four or five common errors
>> (cutting and pasting data from Windows is a typical cause).
>> In either case, what we DON'T want is to halt the processing altogether.
>> Generally, we're dealing with thousands, sometimes millions, of MARC
>> records in a stream; it's very frustrating to get halfway through the
>> stream, then have the parser throw an exception and halt. Halting the
>> processing should be the strategy of last resort, to be called only when
>> the stream has become so corrupted you can't go on to the next record.
>> I'd want the default to be option 1. Let the user determine what
>> changes need to be made to the data; the parser's job is to parse, not
>> infer and create. Overwriting data could also lead to the misperception
>> that everything is okay, when it really isn't.
>> -- Scott
>> On 11/20/2013 08:32 AM, Jon Stroop wrote:
>>> Coming from nowhere on this...is there a place where it would be
>>> convenient to flag which behavior the user (of the library) wants? I
>>> think you're correct that most of the time you'd just want to blow
>>> through it (or replace it), but for the situation where this isn't the
>>> case, I think the Right Thing to do is raise the exception. I don't
>>> think you would want to bury it in some assumption made internal to the
>>> library unless that assumption can be turned off.
>>> On 11/19/2013 07:51 PM, Jonathan Rochkind wrote:
>>>> ruby-marc users, a question.
>>>> I am working on some Marc8 to UTF-8 conversion for ruby-marc.
>>>> Sometimes, what appears to be an illegal byte will appear in the Marc8
>>>> input, and it can not be converted to UTF8.
>>>> The software will support two alternatives when this happens: 1)
>>>> Raising an exception. 2) Replacing the illegal byte with a replacement
>>>> char and/or omitting it.
>>>> I feel like most of the time, users are going to want #2. I know
>>>> that's what I'm going to want nearly all the time.
>>>> Yet, still, I am feeling uncertain whether that should be the default.
>>>> Which should be the default behavior, #1 or #2? If most people most
>>>> of the time are going to want #2 (is this true?), then should that be
>>>> the default behavior? Or should #1 still be the default behavior,
>>>> because by default bad input should raise, not be silently recovered
>>>> from, even though most people most of the time won't want that, heh.