Have you tried using Semantic Scholar? Their API is pretty decent.

On Fri, Apr 26, 2024, 8:03 AM Abner, Kayla <[log in to unmask]> wrote:

> Pre-AI mania, vendors might share that data upon request for research. So
> you could ask WOS or Scopus, or check their text and data mining policy to
> see what their required steps are to get the data. However, as others have
> mentioned, vendors have been very finicky about data mining since AI has
> become such a hot topic.
>
> ----
> Kayla Abner
> (she/her)
> Digital Scholarship Librarian
> Digital Initiatives and Preservation
> Library, Museums and Press
> University of Delaware
> [log in to unmask]
> Book time to meet with me <https://calendly.com/kabner-gx9j/consultation>
>
> **The University of Delaware, a land grant institution, is located on land
> that was and continues to be vital to the web of life of the Nanticoke and
> Lenni-Lenape people. We express gratitude and honor the people who have
> inhabited, cultivated, and nourished this land for thousands of years, even
> after their attempted forced removal during the colonial era and early
> federal period. The University of Delaware also financially benefitted from
> the expropriation of Indigenous territories in the region colonially known
> as Montana. View the full Living Land Acknowledgement
> <https://sites.udel.edu/antiracism-initiative/committees/american-indian-and-indigenous-relations/living-land-acknowledgement/#Living_Land_Acknowledgement>.**
>
> ________________________________
> From: Code for Libraries <[log in to unmask]> on behalf of Pino,
> Janine <[log in to unmask]>
> Sent: Friday, April 26, 2024 10:57 AM
> To: [log in to unmask] <[log in to unmask]>
> Subject: Re: [CODE4LIB] web scraping to train LLM
>
> Yeah, I'm a little nervous about providing advice in this situation. I do
> not want to recommend Scopus or Web of Science; we've had vendor complaints
> about people going over the data limit. I am going to emphasize open data
> sources and crediting the data to be safe. They are using Beautiful Soup
> and APIs to get the data.
>
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Pikas,
> Christina K.
> Sent: Friday, April 26, 2024 10:03 AM
> To: [log in to unmask]
> Subject: [EXTERNAL] Re: [CODE4LIB] web scraping to train LLM
>
> There be dragons! In particular don't mention "scraping" anywhere within
> distance of A. C. S. Open collections are probably your best bet. Maybe
> something from NIST for reference data and then things like Semantic
> Scholar.
>
> Many/most publishers have hastily constructed "NO AI" rules ... which
> forbid everything, even things which are clearly fair use.
>
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Pino,
> Janine
> Sent: Friday, April 26, 2024 9:40 AM
> To: [log in to unmask]
> Subject: [EXT] [CODE4LIB] web scraping to train LLM
>
> Hello,
>
> Does anyone have experience with web scraping publications to train an LLM?
> One of our researchers is looking for a good source on condensed matter and
> materials science. They've tried arXiv but couldn't find enough
> publications specifically on materials science as a subcategory. They were
> hoping for about 400,000 publications.
>
> Thanks,
>
> Janine Pino (she/her)
> Data Librarian
> Research Library & Information Services
> Office of Institutional Planning
> Oak Ridge National Laboratory
> Email: [log in to unmask]
> Phone: 865.341.2465
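
For anyone who wants to try the Semantic Scholar suggestion at the top of this thread, here is a rough sketch of what a query might look like. It is only an illustration, not something anyone in the thread has confirmed using. It assumes the public Academic Graph bulk search endpoint and its `query`, `fieldsOfStudy`, `fields`, and `token` parameters; the function names, the default field list, and the `max_papers` cap are my own choices. Check the current API documentation and license terms before pulling anything at scale.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: Semantic Scholar Academic Graph bulk paper search.
BULK_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search/bulk"


def build_search_url(query,
                     fields_of_study="Materials Science",
                     fields=("title", "abstract", "externalIds", "openAccessPdf"),
                     token=None):
    """Build a bulk-search URL (hypothetical helper, no network access)."""
    params = {
        "query": query,
        "fieldsOfStudy": fields_of_study,
        "fields": ",".join(fields),
    }
    if token:
        # Continuation token returned by the previous page, if any.
        params["token"] = token
    return BULK_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def iter_papers(query, max_papers=1000, **kwargs):
    """Yield paper records page by page, stopping at max_papers."""
    token = None
    yielded = 0
    while yielded < max_papers:
        url = build_search_url(query, token=token, **kwargs)
        with urllib.request.urlopen(url, timeout=30) as resp:
            page = json.load(resp)
        for paper in page.get("data", []):
            yield paper
            yielded += 1
            if yielded >= max_papers:
                return
        token = page.get("token")
        if not token:
            # No continuation token means there are no further pages.
            return


if __name__ == "__main__":
    # Small smoke run: print a few titles rather than the full 400,000.
    for paper in iter_papers("condensed matter", max_papers=5):
        print(paper.get("title"))
```

Keeping the URL builder separate from the fetch loop makes it easy to inspect the request before sending anything, which seems prudent given the concerns raised above about data limits and vendor terms.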