Hi Code4Lib folks,

I was wondering if anyone has experience cleaning up OCR text. In particular, I am trying to figure out how to deal with the random line breaks that come from OCR.

I am trying to parse out a bibliography with regex. I think I've figured out which queries I need to run to break it up so I can turn it into a tab-delimited text file, but I noticed that the text does the classic OCR thing of inserting line breaks wherever they physically fall on the page. This is a problem because it splits each annotation across a bunch of lines rather than leaving it as one block that I can manipulate into a database.

So I am wondering whether anyone who has worked with OCR text before has a suggested way to clean up those line breaks without doing 300+ pages by hand. Any thoughts would be welcome.

Matt Sherman
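
To make the question concrete, here is a rough sketch of the kind of cleanup being asked about. It is not a tested solution for this particular bibliography, and it rests on two assumptions that may not hold for the actual OCR output: that entries are separated by at least one blank line, and that a hyphen at the end of a line marks a word split by the page layout rather than a real hyphenated compound. The function name `unwrap_ocr_lines` is made up for illustration.

```python
import re


def unwrap_ocr_lines(text: str) -> str:
    """Join hard-wrapped lines inside each blank-line-separated block.

    Assumption: a blank line marks the boundary between two entries.
    Returns one entry per output line.
    """
    # Normalize Windows/old-Mac line endings so the patterns below only see "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split into blocks on one or more blank (whitespace-only) lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for block in blocks:
        # Assumption: a hyphen right before a line break is a layout hyphen,
        # so "anno-\ntation" becomes "annotation".
        block = re.sub(r"(\w)-\n\s*(\w)", r"\1\2", block)
        # Replace every remaining line break (and surrounding spaces) with one space.
        block = re.sub(r"\s*\n\s*", " ", block)
        cleaned.append(block.strip())
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = (
        "Smith, J. A study of anno-\n"
        "tation methods in\n"
        "libraries. 1999.\n"
        "\n"
        "Doe, A. Another\n"
        "entry. 2001.\n"
    )
    print(unwrap_ocr_lines(sample))
```

With each entry on its own line, the field-splitting regexes can then run against whole annotations. If the OCR output has no blank lines between entries, the block split would need a different cue, such as a line that starts with an author name or an entry number.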