Danielle,

.docx files are just a zipped collection of XML and image files. You can see this by changing the extension to .zip (on a copy of the file) and then exploring the contents; the main body text lives in word/document.xml. It should be possible to parse the data out of the XML file(s) and build a structure from it.

Erich

On Thursday, May 12, 2022 at 14:39, Danielle Reay eloquently inscribed:

> Hello,
>
> We have a faculty member looking to create a dataset from an annotated
> bibliography she compiled. Right now it exists as a word file and as a
> pdf. The entries are relatively structured with a citation and an
> abstract, but the document is about 150 pages long with multiple entries
> per page. Rather than manually copy and paste everything to create the
> spreadsheet/csv, I wanted to ask for suggestions or approaches to doing
> this by either scraping or extracting structured data from the pdf.
> Thanks very much in advance!
>
> Danielle Reay
>
> Digital Scholarship Technology Manager
> Drew University
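
A rough sketch of the unzip-and-parse approach suggested in the reply, using only the Python standard library. It is an illustration, not a tested solution for this particular document. The entry layout is an assumption: each entry is taken to be a citation paragraph followed by one or more abstract paragraphs, with an empty paragraph between entries. The function names and the file names in the usage note are hypothetical. Tabs and manual line breaks inside a paragraph are ignored here.

```python
import csv
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of every paragraph in word/document.xml, in order."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join all <w:t> pieces.
        text = "".join(t.text or "" for t in p.iter(W_NS + "t"))
        paragraphs.append(text.strip())
    return paragraphs


def group_entries(paragraphs):
    """Group paragraphs into entries.

    ASSUMPTION: the first paragraph of each block is the citation, the
    rest is the abstract, and blocks are separated by empty paragraphs.
    """
    entries, current = [], []
    for text in paragraphs + [""]:  # trailing "" flushes the last entry
        if text:
            current.append(text)
        elif current:
            entries.append(
                {"citation": current[0], "abstract": " ".join(current[1:])}
            )
            current = []
    return entries


def write_csv(entries, out_path):
    """Write the entries to a two-column CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["citation", "abstract"])
        writer.writeheader()
        writer.writerows(entries)
```

With hypothetical file names, usage would look like `write_csv(group_entries(docx_paragraphs("bibliography.docx")), "bibliography.csv")`. If the real document separates entries some other way (a paragraph style, numbering, or a hanging indent), `group_entries` is the part to adapt.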