The Open Science Framework (OSF) has an API <https://developer.osf.io/>. I'm not sure what their rules are for data mining, but it may be worth looking into!
Julia Deen (they/them)
Data Services Librarian
Davis Family Library 209
Schedule an appointment with me<https://middlebury.libcal.com/appointments/jdeen>
https://www.data-is-plural.com/
________________________________
From: Code for Libraries on behalf of Joe Hourclé
Sent: Friday, April 26, 2024 11:15 AM
Subject: Re: [CODE4LIB] web scraping to train LLM
>
> On Apr 26, 2024, at 9:36 AM, Pino, Janine <[log in to unmask]> wrote:
>
> Hello,
>
> Does anyone have experience with web scraping publications to train an LLM? One of our researchers is looking for a good source on condensed matter and materials science. They've tried arXiv but couldn't find enough publications specifically in the materials science subcategory. They were hoping for about 400,000 publications.
You might not need to do any scraping. Searching for “materials science corpus” led me to:
https://www.nature.com/articles/s41524-022-00784-w
I’m not sure exactly what Nature’s rules are for this sort of work, but authors of science articles published there are required to share their data freely.
(They might not be able to share the whole thing, depending on what they agreed to when getting access to their training corpus, but any open publications should be fair game.)
-Joe
(Not affiliated)