Hello,
Maybe you can take a look at OpenAlex<https://openalex.org/>, which offers a very broad, open, multi-disciplinary knowledge base of bibliographic metadata for scholarly outputs, along with a robust, well-documented API. The entity-relationship model<https://help.openalex.org/how-it-works> behind the metadata catalog includes concept-type entities aligned with Wikidata concepts, which can help you build your corpus of metadata.
Depending on what you want to train an LLM for, and whether you need the full text, you can then use the DOI to retrieve the full text online.
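As a rough sketch of the approach above (untested; the helper names are mine, but the `filter=concepts.id:…`, `per-page`, and `cursor` parameters are documented OpenAlex API features, and you would look up the concept ID for "materials science" via the /concepts endpoint first):

```python
# Build OpenAlex API URLs for harvesting works tagged with a concept,
# using cursor paging to walk the whole result set.
import urllib.parse

OPENALEX = "https://api.openalex.org"

def concept_search_url(name: str) -> str:
    """URL to look up a concept ID by name (e.g. 'materials science')."""
    return f"{OPENALEX}/concepts?search={urllib.parse.quote(name)}"

def works_page_url(concept_id: str, cursor: str = "*", per_page: int = 200) -> str:
    """URL for one page of works tagged with the given concept."""
    return (f"{OPENALEX}/works"
            f"?filter=concepts.id:{concept_id}"
            f"&per-page={per_page}"
            f"&cursor={urllib.parse.quote(cursor)}")
```

To page through results you would fetch each URL, read `meta.next_cursor` from the JSON response, and pass it back as the next cursor until it comes back empty; each work record carries a `doi` field you can then use for full-text retrieval.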
Géraldine Geoffroy
Géraldine Geoffroy
Bibliothèque de l'EPFL
Rolex Learning Center
Station 20
1015 Lausanne
go.epfl.ch/bibliotheque<https://www.epfl.ch/campus/library/fr/bibliotheque/>
+41 21 693 87 34
[log in to unmask]<mailto:[log in to unmask]>
Follow @EPFLlibrary
________________________________
From: Code for Libraries <[log in to unmask]> on behalf of Deen, Julia <[log in to unmask]>
Sent: Friday, April 26, 2024 7:29 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] web scraping to train LLM
The Open Science Framework (OSF) has an API<https://developer.osf.io/#> - not sure what their rules are for data mining, but maybe worth looking into!
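A minimal, untested sketch of a first request against the OSF v2 API (a JSON:API service; the preprints endpoint and `page[size]` parameter are taken from the developer docs linked above, and the helper name is mine):

```python
# Build a URL for one page of OSF preprint records from the v2 API.
OSF_API = "https://api.osf.io/v2"

def preprints_url(page_size: int = 100) -> str:
    """URL for one page of preprint records, using JSON:API-style paging."""
    return f"{OSF_API}/preprints/?page[size]={page_size}"
```

The JSON:API response puts records under `data` and a `links.next` URL for the following page, so a harvester would just follow `links.next` until it is null.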
Julia Deen (they/them)
Data Services Librarian
Davis Family Library 209
[log in to unmask]
Schedule an appointment with me<https://middlebury.libcal.com/appointments/jdeen>
https://www.data-is-plural.com/
________________________________
From: Code for Libraries <[log in to unmask]> on behalf of Joe Hourclé <[log in to unmask]>
Sent: Friday, April 26, 2024 11:15 AM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: [CODE4LIB] web scraping to train LLM
>
> On Apr 26, 2024, at 9:36 AM, Pino, Janine <[log in to unmask]> wrote:
>
> Hello,
>
> Does anyone have experience with web scraping publications to train LLM? One of our researchers is looking for a good source on condensed matter and materials science. They've tried arXiv but couldn't find enough publications specifically on materials science as a subcategory. They were hoping for about 400,000 publications.
You might not need to do any scraping. Searching for “materials science corpus” led me to:
https://www.nature.com/articles/s41524-022-00784-w
I’m not sure exactly what Nature’s rules are for this sort of work, but for science articles published there, you are required to share your data freely.
(They might not be able to share the whole thing, depending on what they agreed to when getting access to their training corpus, but any open publications should be fair game)
-Joe
(Not affiliated)