Just a ping to the list about our upcoming, informal, totally rad Code4Lib Slack's “Spark in the Dark” (#sparkinthedark) talk next week.
We have a wildly informal, super fun and mad informative call next week, Tuesday, November 21st at 4 PM Eastern / 1 PM Pacific, on Text Analysis at Scale by Corey Harper & Jessica Cox (see their blurb below). If you’re interested, you can join the call here: https://stanford.zoom.us/j/4167209074
And if you haven’t yet, join us on our Code4Lib Slack channel, #sparkinthedark.
— — Talk details — —
Spark at Elsevier: Tools for Text Analysis at Scale
This talk is a hybrid of a talk on Citing Sentences analysis given at PyGotham 2017 and a second talk on AnnotationQuery Use Cases presented internally to Elsevier.
The first half of the talk will focus on doing Natural Language Processing (NLP) in a Python-based Spark environment using PySpark. Examples will be drawn from a Citing Sentences project underway within Elsevier Labs (http://labs.elsevier.com/). The goal of this project is to build and analyze citation networks to understand the diffusion and flow of ideas through the scientific research landscape. Much like users of a social network, scientists want to understand how others are ‘talking’ about their papers. Are they supporting their work? Disagreeing with it? Is it being referred to as a discovery? PySpark code will be demoed using the Community Edition of DataBricks, and the talk will cover using the DataBricks environment to manage Spark clusters. A DataBricks notebook and sample dataset will be provided at the end of the talk.
The second half of the talk will introduce AnnotationQuery. Recently open-sourced by Elsevier Labs, AnnotationQuery is designed as a set of composable (and extensible) functions that lets users query annotations generated from full-text content at scale. We will introduce our internal Content Analysis Toolbench (CAT3) annotation format. We will then use another set of DataBricks notebooks, this time in Scala, to show how AnnotationQuery combines structural and natural language content to build powerful text mining pipelines. We will focus on a use case about extracting units and measures contained within article text. These measurements can then be used in a variety of analyses of experimental conditions and entity properties, from mouse bioterium temperatures to compressive strengths of concrete.
Data Operations Engineer
Digital Library Systems and Services
Stanford, CA 94305