Dan,
This. Is. Awesome. And your interpretation of the date string "1933, 1937-1938, 1941" is correct - I meant to say it should be 1933/1941. This sort of error is exactly why I wanted to approach this programmatically, and not type the dates by hand. I used student employees to copy the data from the HTML pages into spreadsheets, and to check for spelling errors. However, I didn't want to use students to type the dates. I feel like that would be risking the creation of too much metacrap. I can't even type them correctly myself, so I can't expect students to have 100% accuracy, either.
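For anyone following along, here is a minimal sketch of the interpretation described above: a string that lists several years and year ranges is collapsed to its earliest and latest year, joined with a slash. This is not Dan's actual code. The function name `year_span` is hypothetical, and the sketch assumes every four-digit number in the string is a year.

```python
import re


def year_span(date_string):
    """Collapse a list of years and year ranges into 'earliest/latest'."""
    # Assumption: every standalone four-digit number in the string is a year.
    years = [int(y) for y in re.findall(r"\b\d{4}\b", date_string)]
    if not years:
        return date_string  # leave values with no recognizable year untouched
    first, last = min(years), max(years)
    return str(first) if first == last else f"{first}/{last}"


print(year_span("1933, 1937-1938, 1941"))  # prints 1933/1941
```

Values with no four-digit year are returned unchanged, so they can be found and reviewed by hand later.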
Also, for anyone else following from home, I have to say why I love this solution compared to all the others.
1) I have over 400 spreadsheets, some with over 1000 lines. While I *could* use OpenRefine or Excel for a certain amount of date cleaning, that assumes I am interested in - and have the time for - opening each file individually and working on the dates one spreadsheet at a time. I can set this script up to run through a bunch of CSV files. I don't need to look at them. (And, yes, I know how to set up a task in OpenRefine and save it and use it again later - and I was working on building one of those - but that is more time-consuming than I want this task to be.)
2) This doesn't use Ruby or Perl or other tools that I don't know and don't have time to use now. I said I can handle basic Python, and that's what this is.
3) This is written simply and clearly, and doesn't do too much of 'let's prove how awesome I am by using as few lines of code as possible', which is really hard for newbies to interpret and change. (You know what I'm talking about - something that a newbie would write in 200 lines and someone else says, "Yeah, you idiot, I can do that in two lines". Cf. ALL OF STACK OVERFLOW.)
4) Building on point number 3, this is written simply and clearly enough that I can figure out how to modify it further if I come across any other date cases that I haven't discovered so far. I would even feel confident enough to submit a pull request if I do develop solutions for other date formats for this.
5) Further, this is written simply and clearly enough that I can use this as a model for figuring out how to write other Python stuff to handle other similar tasks. This is now my favorite thing in all of GitHub. (I wish GitHub had a special facet for 'newbie friendly' stuff. I know that is somewhat subjective, but I can't tell you how many 'easy' tools have been recommended to me that would take me roughly a week to figure out how to run once, and possibly another month of troubleshooting error messages to get them to actually work. Cf. http://tpverso.com/an-open-letter-to-open-source-projects-for-lams/)
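The batch workflow from point 1 (run one cleanup over a whole folder of CSV files without opening any of them) might look roughly like the sketch below. It is an illustration, not the script being discussed. The function name `clean_csv_folder`, the `_clean` output suffix, and the idea of a single named date column are all assumptions made for the example.

```python
import csv
from pathlib import Path


def clean_csv_folder(folder, column, transform, suffix="_clean"):
    """Apply transform to one column of every .csv file in folder.

    Originals are left alone; each result is written next to its source
    file with `suffix` added to the name. Returns the paths written.
    """
    written = []
    for path in sorted(Path(folder).glob("*.csv")):
        if path.stem.endswith(suffix):
            continue  # skip output left over from an earlier run
        with path.open(newline="", encoding="utf-8") as src:
            reader = csv.DictReader(src)
            fieldnames = reader.fieldnames or []
            rows = list(reader)
        if column not in fieldnames:
            continue  # this spreadsheet has no such column
        for row in rows:
            # Short rows yield None for missing cells; treat those as empty.
            row[column] = transform(row[column] or "")
        out_path = path.with_name(path.stem + suffix + ".csv")
        with out_path.open("w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames,
                                    extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
        written.append(out_path)
    return written
```

Writing to new files instead of overwriting the originals means a bad transform can be fixed and rerun without losing the students' transcription work.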
I again want to thank Dan for this code and I also want to commend it to everyone else's attention as the sort of code that is really friendly to newbies. If you are thinking of writing a tool and you want to be able to share it with institutions of all sizes, with a really low barrier to entry (e.g., the knowledge of how to put a .py file in a directory, change the filename in the .py file, and then run 'python test.py'), then this is a good model of how code should be written. Also, while I am on my soapbox, here's a great model for documentation: https://github.com/CarletonArchives/BagBatch.
Thus Endeth the Lecture.
Dan, thanks again. This just made my semester.
Julie Swierczek
Transformer of Dates