Have you tried using Semantic Scholar? Their API is pretty decent.
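
In case it helps, here is a minimal sketch of paging through the Semantic Scholar Graph API paper search with only the Python standard library. It is an illustration, not a tested harvesting pipeline: the helper names (build_search_url, fetch_page), the chosen fields, and the "Materials Science" field-of-study filter are my own assumptions, so check them against the current API documentation, rate limits, and terms of use before running anything at scale.

```python
import json
import urllib.parse
import urllib.request

# Paper relevance search endpoint of the Semantic Scholar Graph API.
BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query, fields=("title", "abstract", "externalIds"),
                     limit=100, offset=0, field_of_study="Materials Science"):
    """Build a search URL (hypothetical helper, not part of any library)."""
    params = {
        "query": query,
        "fields": ",".join(fields),       # comma-separated list of fields to return
        "limit": limit,                   # page size requested from the API
        "offset": offset,                 # starting position for paging
        "fieldsOfStudy": field_of_study,  # assumed filter value; verify in the docs
    }
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def fetch_page(url, timeout=30):
    """Fetch one page of results and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling fetch_page(build_search_url("condensed matter")) should return a JSON object whose "data" key holds the matching papers. For a target in the hundreds of thousands of records, the bulk search endpoint or the downloadable datasets are probably a better fit than paging through this one.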
On Fri, Apr 26, 2024, 8:03 AM Abner, Kayla <[log in to unmask]> wrote:
> Pre-AI mania, vendors might share that data upon request for research. So
> you could ask WOS or Scopus, or check their text and data mining policy to
> see what their required steps are to get the data. However, as others have
> mentioned, vendors have been very finicky about data mining since AI has
> become such a hot topic.
>
>
> ----
>
> Kayla Abner
>
> (she/her)
>
> Digital Scholarship Librarian
>
> Digital Initiatives and Preservation
>
> Library, Museums and Press
>
> University of Delaware
>
> [log in to unmask]<mailto:[log in to unmask]>
>
> Book time to meet with me<https://calendly.com/kabner-gx9j/consultation>
>
>
>
> **The University of Delaware, a land grant institution, is located on land
> that was and continues to be vital to the web of life of the Nanticoke and
> Lenni-Lenape people. We express gratitude and honor the people who have
> inhabited, cultivated, and nourished this land for thousands of years, even
> after their attempted forced removal during the colonial era and early
> federal period. The University of Delaware also financially benefitted from
> the expropriation of Indigenous territories in the region colonially known
> as Montana. View the full Living Land Acknowledgement<
> https://sites.udel.edu/antiracism-initiative/committees/american-indian-and-indigenous-relations/living-land-acknowledgement/#Living_Land_Acknowledgement
> >.**
>
> ________________________________
> From: Code for Libraries <[log in to unmask]> on behalf of Pino,
> Janine <[log in to unmask]>
> Sent: Friday, April 26, 2024 10:57 AM
> To: [log in to unmask] <[log in to unmask]>
> Subject: Re: [CODE4LIB] web scraping to train LLM
>
> Yeah, I'm a little nervous about providing advice in this situation. I do
> not want to recommend Scopus or Web of Science; we've had vendor complaints
> about people going over the data limit. I am going to emphasize open data
> sources and crediting the data to be safe. They are using Beautiful Soup
> and APIs to get the data.
>
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Pikas,
> Christina K.
> Sent: Friday, April 26, 2024 10:03 AM
> To: [log in to unmask]
> Subject: [EXTERNAL] Re: [CODE4LIB] web scraping to train LLM
>
> There be dragons! In particular don't mention "scraping" anywhere within
> distance of A. C. S. Open collections are probably your best bet. Maybe
> something from NIST for reference data and then things like Semantic
> Scholar.
>
> Many/most publishers have hastily constructed "NO AI" rules ... which
> forbid everything, even things which are clearly fair use.
>
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Pino,
> Janine
> Sent: Friday, April 26, 2024 9:40 AM
> To: [log in to unmask]
> Subject: [EXT] [CODE4LIB] web scraping to train LLM
>
> Hello,
>
> Does anyone have experience with web scraping publications to train an LLM?
> One of our researchers is looking for a good source on condensed matter and
> materials science. They've tried arXiv but couldn't find enough
> publications specifically on materials science as a subcategory. They were
> hoping for about 400,000 publications.
>
> Thanks,
>
> Janine Pino (she/her)
> Data Librarian
> Research Library & Information Services
> Office of Institutional Planning
> Oak Ridge National Laboratory
> Email: [log in to unmask]<mailto:[log in to unmask]>
> Phone: 865.341.2465
>