>
> On Apr 26, 2024, at 9:36 AM, Pino, Janine <[log in to unmask]> wrote:
>
> Hello,
>
> Does anyone have experience with web scraping publications to train an LLM? One of our researchers is looking for a good source on condensed matter and materials science. They've tried arXiv but couldn't find enough publications specifically on materials science as a subcategory. They were hoping for about 400,000 publications.
You might not need to do any scraping. Searching for “materials science corpus” led me to:
https://www.nature.com/articles/s41524-022-00784-w
I’m not sure exactly what Nature’s rules are for this sort of work, but for science articles published there, authors have to freely share their data.
(They might not be able to share the whole thing, depending on what they agreed to when getting access to their training corpus, but any open publications should be fair game.)
-Joe
(Not affiliated)