Let’s try this again without my hitting ‘send’ when I want to send it to drafts.  (Yay, mystery meat navigation in cell phone interfaces)

>> On May 12, 2022, at 2:40 PM, Danielle Reay <[log in to unmask]> wrote:
>> 
>> Hello,
>> 
>> We have a faculty member looking to create a dataset from an annotated
>> bibliography she compiled. Right now it exists as a word file and as a pdf.
>> The entries are relatively structured with a citation and an abstract, but
>> the document is about 150 pages long with multiple entries per page. Rather
>> than manually copy and paste everything to create the spreadsheet/csv, I
>> wanted to ask for suggestions or approaches to doing this by either
>> scraping or extracting structured data from the pdf. Thanks very much in
>> advance!

I haven’t had to do this in years, but I used to do it quite a bit.  (Including trying to extract information from our school’s course catalog and build cross-linked websites from it)

First, try to start with whatever you have that’s the lowest level… in this case, the Word document.  From that, see if there’s any semantic content (did they use character or paragraph styles for formatting, or is it just ‘bold’ and ‘italic’?).
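If it’s a .docx, you can check for styles without even opening Word… a .docx is just a zip archive with the body text in word/document.xml.  Something like this Perl sketch (filename and details hypothetical, so treat it as a starting point) would tally which paragraph styles actually got applied:

    use strict;
    use warnings;
    use Archive::Zip qw(:ERROR_CODES);

    # A .docx is a zip archive; the body text lives in word/document.xml.
    my $zip = Archive::Zip->new();
    $zip->read('bibliography.docx') == AZ_OK or die "can't read docx";
    my $xml = $zip->contents('word/document.xml');

    # Tally the paragraph styles that were actually applied.
    my %styles;
    $styles{$1}++ while $xml =~ /<w:pStyle w:val="([^"]+)"/g;
    printf "%6d  %s\n", $styles{$_}, $_ for sort keys %styles;

If all you see is something like ‘Normal’ everywhere, there’s no semantic markup to exploit.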

If there’s real semantic markup, then you might want to use something to extract data directly from those styles… but that’s incredibly rare.

So instead, export to something that keeps just enough formatting not to lose information, but for which there are lots of parsers.  I tend to like HTML or RTF (Rich Text Format).

Depending on exactly what’s in the file, you might even be able to export to just plain text and not lose too much.
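If you’d rather script the export than click through Word, LibreOffice can do the conversion headlessly from the command line (filter names vary a bit between versions, so check yours):

    soffice --headless --convert-to html bibliography.docx
    soffice --headless --convert-to txt bibliography.docx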

From there, I tend to go through cycles of parsing and cleanup.  There are lots of parsers for bibliographic data out there these days, but it’s amazing what sorts of errors you end up with in manually maintained files.  (I’ve even given a talk or two about it.)

Basically, run the parser, and then look at what it missed.  I tend to write stuff that either does in-line replacement (regex-type stuff) or removes items as it finds them and moves them to a new file… there’s a sketch of that second approach below.

Both approaches have issues: sometimes things get parsed wrong, and it’s easier to restore the original file and clean it up there than in the new format, especially once you’ve found the problematic patterns your parser is having trouble with.
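As a rough illustration of the remove-and-move cycle (the filenames and the entry pattern here are all made up… a real file will need a pattern tuned to what’s actually in it):

    use strict;
    use warnings;

    # Hypothetical layout: entries are blank-line-separated blocks,
    # citation on the first line, abstract on the lines after it.
    local $/ = "";   # paragraph mode: read blank-line-delimited chunks

    open my $in,   '<', 'bibliography.txt' or die $!;
    open my $csv,  '>', 'entries.csv'      or die $!;
    open my $left, '>', 'leftovers.txt'    or die $!;

    print $csv qq{"citation","abstract"\n};

    while (my $chunk = <$in>) {
        chomp $chunk;
        if ($chunk =~ /\A(.+?)\n(.+)\z/s) {    # split at the first newline
            my ($cite, $abs) = ($1, $2);
            for ($cite, $abs) {
                s/\s+/ /g;    # in-line cleanup: collapse runs of whitespace
                s/"/""/g;     # CSV-escape embedded double quotes
            }
            print $csv qq{"$cite","$abs"\n};
        } else {
            # Didn't match: set it aside for manual cleanup, fix the
            # source (or the regex), and run the whole thing again.
            print $left $chunk, "\n\n";
        }
    }

Each pass should shrink leftovers.txt; when something in there is easier to fix in the original document, fix it there and re-export.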

Again, I haven’t done this for years, so I don’t know what new tools are out there, but I used to do much of my work in Perl, as it has really good regular expression / string manipulation support.  I know there are some PDF libraries for Perl, but I’ve luckily been able to get original source content and never needed to parse them directly.

Once you extract the data, I would try to set your faculty member up with a bibliography management tool so you can hopefully avoid having to do this again.

-Joe



Sent from a mobile device with a crappy on screen keyboard and obnoxious "autocorrect"