On Oct 30, 2017, at 12:04 PM, Julie Swierczek <[log in to unmask]> wrote:
> I'm looking for some help with parsing dates. I have inherited over 430 HTML finding aids that I need to migrate to ArchivesSpace. I have an excel template that creates EADs from folder lists, and I use that to get the finding aids into ArchivesSpace. (It is a simplified version of the Spreadsheet from Heaven: http://clir.pacscl.org/2012/03/19/excel-to-xml-the-spreadsheet-from-heaven/.) Mine has columns for box number, series, folder title, normalized single or begin date, normalized end date, free-form date, and notes. The tricky part is filling in the normalized date columns. They need to be in the format YYYY-MM-DD, YYYY-M, or YYYY. So far, I have found dates in the following formats, and I've indicated after the '|' how I'd like them to be normalized:
>
> 1947|1947
> August 1947|1947-08
> August 3, 1947|1947-08-03
> August 3-7, 1947|1947-08-03/1947-08-07
> July 24, 1914 - January 30, 1915|1914-07-24/1915-01-30…
>
> ...So, while I am off trying to use OpenRefine and Excel, with the world's sloppiest regex and GREL to see if I can somehow transform the majority of these dates…
Yes, I suggest OpenRefine and the liberal use of its cool find/replace capabilities. For example, convert a month name to a number:
value.replace( /August/, 08 )
Convert spaces to dashes:
value.replace( / /, ‘-‘ )
Etc.
You may also benefit from OpenRefine’s “clustering” tools.
Yes, doing this by hand may be the brute force method, but if you use something like OpenRefine, then your skill set will increase, and will be able to do so much more in the future. Consider the use of OpenRefine to be an investment.
—
Eric Morgan
University of Notre Dame
|