> Does anyone have experience with web scraping publications to train an LLM? One of our researchers is looking for a good source on condensed matter and materials science. They've tried arXiv but couldn't find enough publications specifically on materials science as a subcategory. They were hoping for about 400,000 publications.
>
> --
> Janine Pino (she/her)
> Data Librarian
> Research Library & Information Services
> Office of Institutional Planning
> Oak Ridge National Laboratory
> Email: [log in to unmask]
More exactly, what is the question the researcher is trying to address?
Creating a large-language model from scratch is very expensive in terms of money, time, and skill. Moreover, one will need access to not-small computers complete with GPUs, but I suspect your National Laboratory has these types of resources.
Harvesting things from arXiv is possible; it is not technically too difficult. Instead, the challenge comes with throttling: arXiv limits downloads to a few items every few seconds, and consequently downloading 400,000 items takes a long time.
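For what it is worth, below is a minimal sketch of such a harvest against the arXiv API. The query string (cat:cond-mat.mtrl-sci, the materials science subcategory), the batch size, and the three-second delay are my own illustrative assumptions, not arXiv policy; consult arXiv's terms of use before running anything at scale.

  #!/usr/bin/env python3
  # a sketch of polite harvesting from the arXiv API; the query, batch
  # size, and delay are illustrative assumptions, not arXiv policy
  import time
  import urllib.request

  ARXIV_API = 'http://export.arxiv.org/api/query'
  QUERY     = 'cat:cond-mat.mtrl-sci'    # materials science subcategory
  BATCH     = 100                        # records per request
  DELAY     = 3                          # seconds between requests

  def harvest(total):
      '''Yield batches of Atom XML, sleeping between requests.'''
      for start in range(0, total, BATCH):
          url = (f'{ARXIV_API}?search_query={QUERY}'
                 f'&start={start}&max_results={BATCH}')
          with urllib.request.urlopen(url) as response:
              yield response.read().decode('utf-8')
          time.sleep(DELAY)              # honor the rate limit

  # parse each batch of Atom XML downstream; here we just count bytes
  for feed in harvest(500):
      print(len(feed), 'bytes of Atom XML')

At one hundred records every three seconds, 400,000 records works out to a few hours of wall-clock time just for the metadata; harvesting the full text takes much longer, which is the throttling point above.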
Harvesting things from the far end of OAI-PMH repositories is not too difficult either, but it does require practice: identify OAI-PMH repositories of interest, harvest their metadata, reverse-engineer the metadata's identifiers so they point to actual documents rather than landing pages, harvest the content, and curate the result.
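To make the metadata-harvesting step concrete, here is a sketch assuming the third-party Sickle library (pip install sickle) and arXiv's OAI-PMH endpoint; the set name is my assumption, and any repository of interest would work the same way.

  #!/usr/bin/env python3
  # a sketch of OAI-PMH metadata harvesting, assuming the third-party
  # Sickle library; the endpoint and set name (arXiv's condensed matter
  # set) are illustrative, and any repository works the same way
  from sickle import Sickle

  repository = Sickle('http://export.arxiv.org/oai2')
  records = repository.ListRecords(metadataPrefix='oai_dc',
                                   set='physics:cond-mat',
                                   ignore_deleted=True)

  for record in records:
      # record.metadata is a dictionary of Dublin Core fields; the
      # identifier values are the things to reverse-engineer into
      # pointers to actual documents rather than landing pages
      for identifier in record.metadata.get('identifier', []):
          print(identifier)

Sickle follows OAI-PMH resumption tokens automatically, so the loop pages through the whole set on its own.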
Another possibility is to query Web of Science and use its free API to curate a collection of bibliographic records. The API does not support the acquisition of abstracts, but given a DOI from Web of Science, one can get the corresponding abstract from a service called OpenAlex.
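Along those lines, here is a sketch of fetching an abstract from OpenAlex given a DOI. OpenAlex stores abstracts as an inverted index (word to positions), so the words must be put back into reading order; the DOI and email address below are placeholders.

  #!/usr/bin/env python3
  # a sketch of getting an abstract from OpenAlex given a DOI; the DOI
  # and mailto address are placeholders, and the mailto parameter simply
  # puts the request in OpenAlex's "polite pool"
  import requests

  def get_abstract(doi, email='librarian@example.edu'):
      '''Return the abstract of the work identified by the given DOI.'''
      url  = f'https://api.openalex.org/works/doi:{doi}'
      work = requests.get(url, params={'mailto': email}).json()
      inverted = work.get('abstract_inverted_index')
      if not inverted:
          return None
      # OpenAlex stores abstracts as word -> positions; invert the
      # index to put the words back into reading order
      positions = {p: word for word, places in inverted.items()
                           for p in places}
      return ' '.join(positions[p] for p in sorted(positions))

  print(get_abstract('10.1234/hypothetical-doi'))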
Scraping is possible but ugly, and I run the other way because there is too much cruft in the result.
Fun with modern-day collection development practice.
--
Eric Lease Morgan
Navari Family Center for Digital Scholarship
University of Notre Dame