You could potentially advise them to take a look at CORE (https://core.ac.uk/services), which already harvests metadata and full text from many open repositories and offers downloads and sync mechanisms. Downloads of older copies of the data are free and licensed ODC-BY. If they need the most recent data, that can be licensed for a fee or accessed via a CORE “Sustaining” membership (if your institution has one, or is willing to take one out; membership has other benefits besides access to the dataset, so it may be worth looking at https://core.ac.uk/membership#membership-levels).
This gives access to a large amount of data, simplifies the licensing question, avoids adding extra strain to the OA repository infrastructure, and (IMO) supports a valuable service. (Disclaimer: I was involved in the project that first established CORE, which aimed to “enrich scholarly data using state-of-the-art text and data mining technologies”, but I’m not in any way involved with the current service. I just think it’s a really good idea!)
Owen
> On 29 Apr 2024, at 22:15, Fitchett, Deborah <[log in to unmask]> wrote:
>
> With open data sources, also emphasise checking for any throttling limits, and if the site doesn't specify any then go super slow: it may not be resourced to deal with a large number of queries, and it's a bit rude to accidentally DoS attack an open site. 😉
>
> There are currently discussions on the DSpace list about how to block AI/other scrapers from repositories because of the havoc they cause. Our own repository is having general performance issues, so even the harvesters we actively *want* running on our site have had to throttle back to 1 request per 10 seconds(!!)
>
> Deborah
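
[Editorial sketch, not part of the original message.] The throttling advice above could look roughly like the following in Python, using only the standard library. It reads any Crawl-delay or Request-rate a site publishes in robots.txt and otherwise falls back to a slow default. The names polite_delay, Throttle and DEFAULT_DELAY_SECONDS are hypothetical, and the 10-second fallback is an assumption borrowed from the "1 request per 10 seconds" figure above, not a standard.

```python
import time
import urllib.robotparser

# Assumption: with no published limit, fall back to 1 request per 10 seconds.
DEFAULT_DELAY_SECONDS = 10.0


def polite_delay(robots_txt, user_agent="*", default=DEFAULT_DELAY_SECONDS):
    """Return the number of seconds to wait between requests.

    Uses Crawl-delay or Request-rate from the robots.txt text if present,
    otherwise the slow default.
    """
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    if delay is not None:
        return float(delay)
    rate = parser.request_rate(user_agent)
    if rate is not None and rate.requests:
        return rate.seconds / rate.requests
    return default


class Throttle:
    """Blocks so that successive wait() calls are at least `delay` seconds apart."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

In use, a script would fetch the site's robots.txt once, build a Throttle from polite_delay(...), and call wait() before every request, whether the request is made through Beautiful Soup scraping or an API client. This only covers pacing; any rate limits stated in a site's terms of use or API documentation still need to be checked by hand.
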
>
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Pino, Janine
> Sent: Saturday, April 27, 2024 2:58 AM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] web scraping to train LLM
>
>
> Yeah, I'm a little nervous about providing advice in this situation. I do not want to recommend Scopus or Web of Science; we've had vendor complaints about people going over the data limit. I am going to emphasize open data sources and crediting the data to be safe. They are using Beautiful Soup and APIs to get the data.
>
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Pikas, Christina K.
> Sent: Friday, April 26, 2024 10:03 AM
> To: [log in to unmask]
> Subject: [EXTERNAL] Re: [CODE4LIB] web scraping to train LLM
>
> There be dragons! In particular don't mention "scraping" anywhere within distance of A. C. S. Open collections are probably your best bet. Maybe something from NIST for reference data and then things like Semantic Scholar.
>
> Many/most publishers have hastily constructed "NO AI" rules ... which forbid everything, even things which are clearly fair use.
>
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Pino, Janine
> Sent: Friday, April 26, 2024 9:40 AM
> To: [log in to unmask]
> Subject: [EXT] [CODE4LIB] web scraping to train LLM
>
> Hello,
>
> Does anyone have experience with web scraping publications to train an LLM? One of our researchers is looking for a good source on condensed matter and materials science. They've tried arXiv but couldn't find enough publications specifically on materials science as a subcategory. They were hoping for about 400,000 publications.
>
> Thanks,
>
> Janine Pino (she/her)
> Data Librarian
> Research Library & Information Services
> Office of Institutional Planning
> Oak Ridge National Laboratory
> Email: [log in to unmask]
> Phone: 865.341.2465