Dear Jason,

I only have experience creating PDFs with Apache FOP, but you can embed metadata in a PDF file.  One approach is to use Adobe's XMP standard, which is also an ISO standard.
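
To make the XMP idea a bit more concrete, here is a minimal Python sketch that builds an x:xmpmeta element carrying a title, authors, and a DOI as Dublin Core properties. The function name build_xmpmeta, the sample values, and the choice to put the DOI in dc:identifier are all my own assumptions for illustration, and I don't know which of these fields (if any) the reference managers actually read. How the element gets embedded depends on your toolchain; as far as I know, FOP accepts an x:xmpmeta element inside fo:declarations.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by XMP, RDF, and Dublin Core.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Ask ElementTree to serialize with the conventional prefixes.
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Build a namespace-qualified tag name in ElementTree's {uri}tag form."""
    return "{%s}%s" % (NS[prefix], tag)


def build_xmpmeta(title, authors, doi):
    """Return an x:xmpmeta element (as a string) describing one article.

    Hypothetical helper: the field choices here are assumptions, not a
    statement of what any reference manager expects.
    """
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative (rdf:Alt) in XMP.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list (rdf:Seq), one entry per author.
    seq = ET.SubElement(ET.SubElement(desc, q("dc", "creator")), q("rdf", "Seq"))
    for name in authors:
        ET.SubElement(seq, q("rdf", "li")).text = name

    # Assumption: carry the DOI in dc:identifier with a "doi:" prefix.
    ET.SubElement(desc, q("dc", "identifier")).text = "doi:" + doi

    return ET.tostring(xmpmeta, encoding="unicode")


if __name__ == "__main__":
    # Sample values are placeholders, not a real article.
    print(build_xmpmeta("An Example Article", ["A. Author", "B. Author"],
                        "10.1234/example"))
```

Generating the element from your existing article records, rather than typing it by hand, should keep the embedded metadata consistent with whatever you print on the cover page.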

Have you tried adding XMP to your PDFs to see what sort of support is available from Zotero, Mendeley, etc.?  I looked at one example PDF, but it doesn't appear to include any embedded metadata, so that might be a good place to start.  Another advantage of a tool like Apache FOP is that it can help ensure that the resulting PDF meets accessibility standards and guidelines such as PDF/UA.
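
If you want a quick way to check whether a given PDF already carries an XMP packet, something like the following rough Python sketch could help. It simply scans the raw bytes for an x:xmpmeta element, so it assumes the metadata stream is stored uncompressed and uses the conventional "x" prefix; a result of None therefore doesn't prove the file has no metadata. The function name find_xmp_packet and the file name in the comment are hypothetical.

```python
def find_xmp_packet(pdf_bytes):
    """Return the first x:xmpmeta element found in the raw bytes, or None.

    Assumption: the XMP metadata stream is stored without compression,
    so the XML is visible as plain text inside the file.
    """
    start = pdf_bytes.find(b"<x:xmpmeta")
    if start == -1:
        return None
    end_tag = b"</x:xmpmeta>"
    end = pdf_bytes.find(end_tag, start)
    if end == -1:
        return None
    # Decode leniently; XMP is normally UTF-8 but the file may be damaged.
    return pdf_bytes[start:end + len(end_tag)].decode("utf-8", errors="replace")


# Example use (article.pdf is a placeholder file name):
#     with open("article.pdf", "rb") as f:
#         print(find_xmp_packet(f.read()))
```

A proper PDF library would be more reliable for production use, but this is enough to see at a glance what a publisher's PDF contains.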

In any event, I'd love to hear more about what approach you take once you find out what works best.

All my best,


-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Jason Best
Sent: Tuesday, 26 March, 2019 12:54 PM
To: [log in to unmask]
Subject: [CODE4LIB] Best practices for improving metadata extractability from journal articles?

I’m working with our journal to improve the quality of the metadata that can be extracted from PDFs of individual journal articles by reference management software like Zotero, Mendeley, EndNote, etc. The only description I’ve found of the metadata extraction process is from Zotero, which “sends the first few pages of a PDF to the web service, which uses a variety of extraction algorithms and known metadata from CrossRef, paired with DOI and ISBN lookups, to build a parent item for the PDF”. What I haven’t found yet is a description of how to format the text of a PDF to ensure that the article metadata can be reliably extracted by reference managers. Most of these journal articles were published before we were issuing DOIs (or even before DOIs existed), so I’ll be adding a cover page to all the PDFs with title, authors, issue, pages, DOI (issued retroactively), ISSN, etc. I’d like to format these pages in a way that ensures optimal extraction of metadata, emphasizing, of course, the DOI and ISSN. In my experience, Mendeley can sometimes extract the article metadata fairly well even without a DOI lookup, so I’d like to aim for a format that is easily parsable in this way and does not rely 100% on a DOI lookup. Does anyone have any experience or suggestions on how to craft such a page to work well across different reference managers?


Jason Best
Director of Biodiversity Informatics
Botanical Research Institute of Texas
1700 University Drive
Fort Worth, Texas 76107

817-332-4441 ext. 230