On May 3, 2010, at 2:47 PM, Aaron Rubinstein wrote:

>> 1. MARC the data format -- too rigid, needs to go away
>> 2. MARC21 bib data -- very detailed, well over 1,000 different data
>> elements, some well-coded data (not all); unfortunately trapped in #1
> 
> For the sake of my own understanding, I would love an explanation of the 
> distinction between #1 and #2...


Item #1

The first item (#1) is MARC, the data structure -- a container for holding various types of bibliographic information. From one of my older publications [1]:

  ...the MARC record is a highly structured piece of information.
  It is like a sentence with a subject, predicate, objects,
  separated with commas, semicolons, and one period. In data
  structure language, the MARC record is a hybrid sequential/random
  access record.

  The MARC record is made up of three parts: the leader, the
  directory, the bibliographic data. The leader (or subject in our
  analogy) is always represented by the first 24 characters of each
  record. The numbers and letters within the leader describe the
  record's characteristics. For example, the length of the record
  is in positions 1 to 5. The type of material the record
  represents (authority, bibliographic, holdings, et cetera) is
  signified by the character at position 7. More importantly, the
  characters from positions 13 to 17 represent the base. The base
  is a number pointing to the position in the record where the
  bibliographic information begins.
  
  The directory is the second part of a MARC record. (It is the
  predicate in our analogy.) The directory describes the record's
  bibliographic information with directory entries. Each entry
  lists the types of bibliographic information (items called
  "tags"), how long the bibliographic information is, and where the
  information is stored in relation to the base. The end of the
  directory and all variable length fields are marked with a
  special character, the ASCII character 30.
  
  The last part of a MARC record is the bibliographic information.
  (It is the object in our sentence analogy.) It is simply all the
  information (and more) on a catalog card. Each part of the
  bibliographic information is separated from the rest with the
  ASCII character 30. Within most of the bibliographic fields are
  indicators and subfields describing in more detail the fields
  themselves. The subfields are delimited from the rest of the
  field with the ASCII character 31.
  
  The end of a MARC record is punctuated with an end-of-record
  mark, ASCII character 29. The ASCII characters 31, 30, and 29
  represent our commas, semicolons, and periods, respectively.
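
To make the layout above concrete, here is a rough Python sketch that builds a tiny MARC-like record and then reads it back using nothing but the leader, the directory, and the three delimiters. It is an illustration under assumptions, not a real MARC library: the function names are invented here, the sample field values are made up, and the 12-character directory entry (3-digit tag, 4-digit length, 5-digit starting position) comes from the MARC 21 specification rather than from the excerpt. The excerpt counts leader positions from 1; the slices below count from 0.

```python
FT = b"\x1e"  # ASCII 30: ends the directory and every field
SF = b"\x1f"  # ASCII 31: introduces each subfield
RT = b"\x1d"  # ASCII 29: ends the record


def build_record(fields):
    """Assemble leader + directory + data from (tag, bytes) pairs (hypothetical helper)."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FT
        # Directory entry: tag, field length, offset of the field from the base.
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    directory += FT
    base = 24 + len(directory)          # where the bibliographic data begins
    length = base + len(body) + 1       # +1 for the record terminator
    leader = f"{length:05d}nam  22{base:05d}   4500".encode("ascii")
    return leader + directory + body + RT


def parse_record(raw):
    """Read the record back by following the leader and the directory."""
    leader = raw[:24]
    length = int(leader[0:5])           # "positions 1 to 5" in the excerpt
    base = int(leader[12:17])           # "positions 13 to 17" in the excerpt
    directory = raw[24:base - 1]        # drop the terminator ending the directory
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        field_length = int(entry[3:7])
        start = int(entry[7:12])
        # Jump straight to the field; the final byte is its terminator.
        data = raw[base + start:base + start + field_length - 1]
        fields.append((tag, data))
    return length, base, fields


def subfields(data):
    """Split a variable field into its indicators and (code, value) pairs."""
    parts = data.split(SF)
    indicators = parts[0].decode("ascii")
    return indicators, [(p[:1].decode("ascii"), p[1:].decode("utf-8")) for p in parts[1:]]


# Made-up sample data, for illustration only.
sample = [
    ("001", b"12345"),
    ("245", b"10" + SF + b"aWalden /" + SF + b"cHenry David Thoreau."),
]
record = build_record(sample)
length, base, parsed = parse_record(record)
```

The directory is what makes the record "hybrid": a program can read the record front to back, or it can look up a tag in the directory and jump directly to the field it wants.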

At the time, MARC -- the data structure -- was really cool. Consider the environment in 1965. No hard disks. Tape drives instead. Data storage was expensive. The medium had to be read from beginning to end. No (or rarely any) random data access. Thus, the record and field lengths were relatively short. (No MARC record can be longer than 99,999 characters, and no MARC field can be longer than 9,999 characters.) Remember too the purpose of MARC -- to transmit the content of catalog cards. Given the leader, the directory, and the bibliographic sections of a MARC record, all preceded by pseudo checksums and delimited by non-printable ASCII characters, the MARC record -- the data structure -- comes with a plethora of checks and balances. Very nice.

Fast forward to the present day. Disk space is cheap. Tapes are not the norm. More importantly, the wider computing environment uses XML as its data structure of choice. If libraries are about sharing information, then we need to communicate with that wider community in its language. The language of the Net is XML, not MARC. Not only is MARC -- the data structure -- stuck on 50-year-old technology, but more importantly it is not the language of the people with whom we want to share.


Item #2

Our bibliographic data (item #2) is the metadata of the Web. While it is important, and it adds a great deal of value, it is not as important as it used to be. It too needs to change. Remember, MARC was originally designed to print catalog cards. Author. Title. Pagination. Series. Notes. Subject headings. Added entries. Looking back, these were relatively simple data elements, but what about system numbers? ISBN numbers? Holdings information? Tables of contents? Abstracts? Ratings? We have stuffed these things into MARC every which way and we call MARC flexible.

More importantly, and as many have said previously, string values in MARC records lead to maintenance nightmares. Instead, as in a relational database model, values need to be described using keys -- pointers -- to the canonical values. This makes find/replace operations painless, enables the use of different languages, and brings numerous other advantages.
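
A small Python sketch of the idea, with everything in it hypothetical (the identifiers, the table names, the titles, and the abbreviated heading used for the correction). Records hold keys; the canonical strings live in one place.

```python
# Hypothetical authority tables: key -> canonical value(s).
authors = {"a1": "Kilgour, Frederick Gridley"}
subjects = {"s1": {"en": "Libraries", "fr": "Bibliothèques"}}

# Bibliographic records store pointers, not strings.
records = [
    {"title": "Title A", "author_id": "a1", "subject_ids": ["s1"]},
    {"title": "Title B", "author_id": "a1", "subject_ids": ["s1"]},
]


def display(record, lang="en"):
    """Resolve the keys into strings at display time."""
    return {
        "title": record["title"],
        "author": authors[record["author_id"]],
        "subjects": [subjects[key][lang] for key in record["subject_ids"]],
    }


def correct_author(author_id, new_heading):
    """One update in the authority table; no record is touched."""
    authors[author_id] = new_heading


correct_author("a1", "Kilgour, Frederick G.")
```

After the single call to correct_author, every record that points at "a1" displays the new heading, and the same record can be shown with French subject labels by calling display(record, "fr").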

ISBD is also a pain. Take the following string:

  Kilgour, Frederick Gridley (1914–2006)

There is way too much punctuation going on here. Yes, as a human you can figure it out, but a computer is stupid and needs to have things made explicit. Do all values in all fields inside parentheses denote dates? No. Are all names presented in last-name, first-name order? No. Something like this would be much better:

  <author>
    <first_name>Frederick</first_name>
    <last_name>Kilgour</last_name>
    <birth_year>1914</birth_year>
    <death_year>2006</death_year>
  </author>

The example above is unambiguous.
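
To see the difference from the computer's point of view, here is a rough sketch using only the Python standard library. The regular expression is one guess at the punctuation convention, not a rule taken from ISBD, and the function and element names are assumptions made for the illustration.

```python
import re
import xml.etree.ElementTree as ET

PUNCTUATED = "Kilgour, Frederick Gridley (1914\u20132006)"

EXPLICIT = """<author>
  <first_name>Frederick</first_name>
  <last_name>Kilgour</last_name>
  <birth_year>1914</birth_year>
  <death_year>2006</death_year>
</author>"""

# A guess: "Last, First (born-died)", with a hyphen or an en dash between years.
GUESS = re.compile(
    r"^(?P<last>[^,]+), (?P<first>.+?) \((?P<born>\d{4})[-\u2013](?P<died>\d{4})\)$"
)


def guess_from_string(text):
    """Infer the parts from punctuation; returns None when the guess fails."""
    match = GUESS.match(text)
    return match.groupdict() if match else None


def read_explicit(text):
    """No guessing: each value is labeled by its element name."""
    return {child.tag: child.text for child in ET.fromstring(text)}


guessed = guess_from_string(PUNCTUATED)
explicit = read_explicit(EXPLICIT)
# Parentheses do not always hold dates, so the guess breaks here.
failed = guess_from_string("Library of Congress (Washington, D.C.)")
```

The pattern happens to work for the Kilgour string, but it returns nothing for a heading whose parentheses hold a place instead of dates. The explicit version needs no pattern at all.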

MARCXML removes no punctuation. It retains all of the ISBD "encoding" as well as the archaic field codes (110, 245, 650, etc.). MODS goes a step beyond this. It replaces the field codes with human-readable labels/words. It also breaks out some of the subfields denoted by ISBD into explicitly labeled fields. MARCXML is a step in the right direction. MODS goes even further. Neither really goes far enough.

Why isn't this metadata as necessary as it used to be? Because with the advent of full text indexing it is possible to use computers to determine the "aboutness" of a document as well as how a document relates to other documents. Controlled vocabularies are not useless, just less useful and less necessary than previously.

Well, that is enough for now. Dinner time.

[1] http://infomotions.com/musings/marc-reader/

-- 
Eric Lease Morgan