Some of the problems in your first cut are:
1. Offsets for regex are given in terms of lines. MARC files don't have
newlines in them, unless you're Millennium, in which case they can be
inserted every 200,000 bytes to keep things interesting.
2. Byte matches match byte values, so "20 byte 4" is looking for the
binary value 4, not the ASCII digit "4" (byte value 0x34).
3. Sometimes you need to prime the buffer before you can do a regexp match.
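As an aside on point 2, here is a small Python sketch of the difference between the raw value and the ASCII digit. It is only an illustration, not part of the magic file, and the leader bytes are a made-up sample.

```python
# Made-up 24-byte MARC leader, for illustration only.
leader = b"00714cam a2200205 a 4500"

# Position 20 holds the ASCII digit "4", whose byte value is 0x34 (52).
ascii_digit = leader[20]

# A magic line such as "20 byte 4" compares against the raw value 4,
# so it would not match this leader. Comparing against the character
# "4" (as "20 string 4" or "20 byte 0x34" would) does match.
matches_raw_value = ascii_digit == 4
matches_ascii_digit = ascii_digit == ord("4")

print(ascii_digit, matches_raw_value, matches_ascii_digit)
```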
Is this good enough?
# MARC 21 Magic (First cut)
# indicator count must be "2"
10 string 2
# leader must end in "4500"
>20 string 4500
# leader must start with five digits, a record status, and a record type
>>0 regex ^([0-9]{5})[acdnp][acdefgijkmoprt][abcims] MARC Bibliographic
>>0 regex ^([0-9]{5})[acdnp][z] MARC Authority
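
For checking sample files without going through file, the intent of these rules (all three leader checks must pass) can be sketched in Python. This is a rough equivalent under my own assumptions, not how file evaluates magic; the function name classify_leader and the sample leaders are made up for illustration.

```python
import re

# Same leader patterns as the magic rules: five digits, a record status,
# then either a bibliographic record type and level, or "z" for authority.
BIB = re.compile(rb"^[0-9]{5}[acdnp][acdefgijkmoprt][abcims]")
AUTH = re.compile(rb"^[0-9]{5}[acdnp]z")


def classify_leader(data: bytes) -> str:
    """Label the first 24 bytes of a file (hypothetical helper)."""
    leader = data[:24]
    if len(leader) < 24:
        return "data"
    # Indicator count at position 10 must be the ASCII digit "2".
    if leader[10:11] != b"2":
        return "data"
    # Leader must end in "4500".
    if leader[20:24] != b"4500":
        return "data"
    if BIB.match(leader):
        return "MARC Bibliographic"
    if AUTH.match(leader):
        return "MARC Authority"
    return "data"


# Made-up sample leaders, plus a PDF header as a negative case.
print(classify_leader(b"00714cam a2200205 a 4500"))
print(classify_leader(b"00400nz  a2200133n  4500"))
print(classify_leader(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<<"))
```

To try it on a real file, pass in the first 24 bytes read in binary mode, e.g. open(path, "rb").read(24).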
Simon
On Wed, Mar 23, 2011 at 8:09 PM, William Denton wrote:
> Has anyone figured out the magic necessary for file to recognize MARC
> files?
>
> If you don't know it, file is a Unix command that tells you what kind of
> file a file is. For example:
>
> $ file 101015_001.mp3
> 101015_001.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS,
> layer III, v1, 192 kbps, 44.1 kHz, Stereo
>
> $ file P1000026.jpg
> P1000026.jpg: JPEG image data, EXIF standard, comment: "AppleMark"
>
> It's a really useful command. I assume it's on OSX, but I don't know. You
> can get it for Windows with Cygwin.
>
> The problem is, file doesn't grok MARC:
>
> $ file catalog.01.mrc
> catalog.01.mrc: data
>
> I took a stab at getting the magic defined, but it didn't work. I'll
> include what I used below. You can put it into a magic.txt file, and then
> use
>
> file -m magic.txt some_file.mrc
>
> to test it. It'll tell you the file is MARC Bibliographic ... but it also
> thinks that PDFs, JPEGs, and text files are MARC. That's no good.
>
> It'd be great if the MARC magic got into the central magic database so
> everyone would be able to recognize various MARC file types.
>
> Bill
>
>
> # --- clip'n'test
> # MARC 21 for Bibliographic Data
> # http://www.loc.gov/marc/bibliographic/bdleader.html
> #
> # This doesn't work properly
>
> 0 string x
>
> >5 regex [acdnp]
> >6 regex [acdefgijkmoprt]
> >7 regex [abcims]
> >8 regex [\ a]
> >9 regex [\ a]
> >10 byte x
> >11 byte x
> >12 string x
> >17 regex [\ 12345678uz]
> >18 regex [\ aciu]
> >19 regex [\ abc] MARC Bibliographic
>
> #>20 byte 4
> #>21 byte 5
> #>22 byte 0
> #>23 byte 0 MARC Bibliographic
>
> # --- end clip'n'test
>
> --
> William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
>