Eric Morgan wrote:

>> [I also put this on AUTOCAT. Apologies if you also follow that. This
>> falls at the intersection of hand-cataloging, data processing and
>> simple AI.]...
> 
> Tim, yours is a perfect example of a supervised machine learning classification process. The process works very much like your computer's spam filter. Here's how:
> 
>   1. collect a set of data that you know is
>      library-written
> 
>   2. collect a set of data that you know is
>      publisher-sourced
> 
>   3. count, tabulate, and vectorize the
>      features of your data -- measure the data's
>      characteristics and associate them with
>      a collection
> 
>   4. model the data -- use any one of a number
>      of clustering algorithms to associate
>      the data with one collection or another,
>      such as Naive Bayes
> 
>   5. optionally, test the accuracy of the model
> 
>   6. save the model
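
In code, those six steps can be surprisingly compact. The following is a minimal sketch, assuming Python with the scikit-learn and joblib libraries; the tiny dataset is invented purely for illustration and stands in for real summaries:

  # a minimal sketch of steps 1-6, assuming scikit-learn and joblib;
  # the four example summaries below are invented for illustration
  import joblib
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline

  # steps 1 & 2: collect labeled data, here two tiny made-up sets
  texts = [
      "a concise summary written by a cataloger",
      "another concise summary written by a cataloger",
      "an exciting new bestseller praised by reviewers",
      "the exciting new bestseller everybody is praising",
  ]
  labels = [ "library", "library", "publisher", "publisher" ]

  # steps 3 & 4: vectorize the features and model the data
  model = make_pipeline(CountVectorizer(), MultinomialNB())
  model.fit(texts, labels)

  # step 5: optionally, test the accuracy of the model
  print(model.score(texts, labels))

  # step 6: save the model
  joblib.dump(model, "model.bin")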


The crucial part of a supervised machine learning process is the training step, and each sub-step can (and probably should) be tweaked given one's particular situation. There are a number of things to consider:

  * Identifying correct & accurate sets of training data is difficult. First, data often does not fall neatly into distinct categories. While a book may be written by a single individual, the book may fall into a number of different subjects or genres. Second, the distinction between one category and another may be so subtle that even a computer, given a very large set of sample data, may not be able to consistently choose between them. Third, binary classification is easy (spam versus ham), and classification into a flat list of categories is not too difficult, but hierarchical classification is very difficult.

  * Measuring the data -- counting, tabulating, and vectorizing -- is fraught with nuance. For example, what are you going to count? Individual words? Phrases? Numbers? Will you exclude stop words? Will you stem the features, lemmatize them, or do neither? Will you merely count and tabulate the words, or will you use something like TF-IDF to create a more "relevant" list of words and scores? To what degree will you test the accuracy of the data, and if to a high degree, then what technique will you use? (A sketch of a few of these choices follows this list.)

  * Modeling the data -- This is the "magic happens here" step. What algorithm are you going to use, and how are you going to parameterize it? Your choices will depend on many things, such as: the size & scope of the data, whether or not the data is numeric, the desire for a true/false classification versus a degree of certainty, the size & scope of your computer(s), the degree of real distinctiveness between the different data sets, etc. Entire dissertations are written on this topic.
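
To make the last two bullets concrete, here is a sketch of a few of the knobs, assuming Python with scikit-learn; every parameter value shown is illustrative, not a recommendation:

  # a sketch of a few measuring & modeling choices, assuming scikit-learn
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline

  # measuring: TF-IDF scores instead of raw counts, English stop words
  # removed, single words and two-word phrases counted, and rare
  # features (appearing in fewer than two documents) ignored
  vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=2)

  # modeling: Naive Bayes is a common first choice; swapping in another
  # classifier (LinearSVC, LogisticRegression, etc.) is a one-line change
  model = make_pipeline(vectorizer, MultinomialNB())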

Not coincidentally, there are computer processes that help with the writing of these sorts of computer programs; there are techniques for determining which of the various combinations of settings -- which ways of "turning the knobs" -- are the most effective. Computer programs used to create... machine learning programs. Yikes!!
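
A grid search is one such technique, and it might look something like the following sketch, again assuming scikit-learn; both the tiny dataset and the grid of settings are invented for illustration:

  # a sketch of automated knob-turning via a grid search, assuming scikit-learn
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.model_selection import GridSearchCV
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import Pipeline

  # a tiny invented dataset; in practice, use your real summaries
  texts = [
      "concise scope and contents note written by a cataloger",
      "another concise contents note written by a cataloger",
      "an exciting new bestseller praised by reviewers everywhere",
      "the exciting new bestseller everybody is praising",
  ]
  labels = [ "library", "library", "publisher", "publisher" ]

  # enumerate some settings of the knobs
  pipeline = Pipeline([ ("tfidf", TfidfVectorizer()), ("nb", MultinomialNB()) ])
  knobs = {
      "tfidf__stop_words" : [ None, "english" ],
      "tfidf__ngram_range": [ (1, 1), (1, 2) ],
      "nb__alpha"         : [ 0.1, 1.0 ],
  }

  # try every combination with 2-fold cross-validation; keep the best
  search = GridSearchCV(pipeline, knobs, cv=2)
  search.fit(texts, labels)
  print(search.best_params_)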

When it comes to the use case alluded to in the original posting, this is what I would do:

  1) Identify a "large" set of library-written MARC
     records, at least 50.

  2) Identify a similarly large set of publisher-
     sourced MARC records.

  3) Loop through each MARC record, read the 520
     field, and save the result as a file in a
     directory named "library" or a directory named
     "publisher", accordingly. (A sketch of this
     step follows the list.)

  4) Run train.py against the directories.

  5) Identify a set of MARC records which contain
     values in the 520 field.

  6) Loop through each of these additional records,
     read the 520 field, and save the result as a
     file in a directory called, say, "unclassified".

  7) Run classify.py against the unclassified
     directory.

  8) The result will be a list of labels/filenames
     -- classifications.
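
Steps 3 and 6 amount to a short loop over the records. The sketch below assumes the pymarc library and an input file named library.mrc, neither of which is prescribed above; run it once per collection, pointing it at the appropriate input file and output directory:

  # a sketch of step 3, assuming pymarc and a file of library-written
  # records named library.mrc; repeat for the publisher-sourced records
  import os
  from pymarc import MARCReader

  os.makedirs("library", exist_ok=True)
  with open("library.mrc", "rb") as handle:
      for i, record in enumerate(MARCReader(handle)):
          fields = record.get_fields("520")   # the summary note(s), if any
          if not fields:
              continue
          name = os.path.join("library", "%06d.txt" % i)
          with open(name, "w") as out:
              out.write(fields[0].value())    # the concatenated subfields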

You will then want to repeat the whole process for the purposes of "turning the knobs". For example:

  * increase the size of your datasets but keep
    them similarly sized; this is not as easy as
    you might think

  * use different techniques to measure your data

  * use different modeling algorithms (see the sketch below)
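
For example, different algorithms can be compared in a short loop. The sketch below assumes scikit-learn and assumes the library and publisher directories from step 3 have been gathered under a single parent directory, here called corpus (the name is mine, not prescribed above):

  # a sketch of comparing modeling algorithms, assuming scikit-learn and
  # a parent directory (corpus) whose subdirectories name the classes
  from sklearn.datasets import load_files
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline
  from sklearn.svm import LinearSVC

  # read corpus/library and corpus/publisher; directory names become labels
  data = load_files("corpus", encoding="utf-8")

  # score each candidate algorithm with 5-fold cross-validation
  for classifier in ( MultinomialNB(), LogisticRegression(), LinearSVC() ):
      model  = make_pipeline(TfidfVectorizer(), classifier)
      scores = cross_val_score(model, data.data, data.target, cv=5)
      print(type(classifier).__name__, scores.mean())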

What is really cool about this whole process is that it is immensely scalable. For example, one could classify a whole set of documents and feel okay about the result. Then, a year later, given more expertise and additional sets of data, the process could be tweaked and the whole lot re-classified. The computer doesn't care about touching each item more than once; it will touch each item as many times as you tell it to. Yes, there is a lot of work up front, and the work requires additional skills, but the result can definitely supplement & enhance the work that is already being done.

We, as a profession, need to go beyond the use of computers to merely automate things. We need -- ought -- to learn how to exploit computers to really & truly take advantage of their ability to store vast amounts of data, organize it into information, widely share the information, consume ("read") the information, analyze the information, and output knowledge which is then verified by a person as true, useful, relevant, understandable, etc.

(Again, the whole lot of this posting has been saved in a zip file temporarily accessible at http://dh.crc.nd.edu/tmp/classification.zip)

--
Eric Lease Morgan, Librarian