Kaggle has a challenge based on this data that is "asking for your help to develop text and data mining tools that can help the medical community develop answers to high priority scientific questions" [1]. Some medical professionals, including at least one epidemiologist, have weighed in on the discussion boards. I think submissions have to be in Kaggle's notebook format [2], but ideas and approaches are also being posted outside of it.

art
---
1. https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
2. https://www.kaggle.com/docs/notebooks


________________________________
From: Code for Libraries <[log in to unmask]> on behalf of Eric Lease Morgan <[log in to unmask]>
Sent: Friday, March 20, 2020 11:40 AM
To: [log in to unmask] <[log in to unmask]>
Subject: [CODE4LIB] public health or medical research

Do you know of any researcher or scholar in the realm of public health or medicine that may need/want to read the flood of scholarship being generated by Covid-19?

As you may or may not know, the Distant Reader is designed to read large amounts of narrative texts, such as scholarly journal articles. The Gates Foundation, the Allen Institute for AI, and their friends have made freely available a data set of 13,000 full text scholarly articles on the topic of covid-19. [1]

I have downloaded the data set and fed it to the Reader, and the initial results are here:

 https://carrels.distantreader.org/library/covid-19/

The results are okay, but they can be improved in a number of ways. For example, I can easily create a full text (Solr) index to the data set. I can create a network diagram illustrating the relationship of a given word to other nearby words. I could apply various types of machine learning to the Reader's output, such as topic modeling and classification, to look for patterns and anomalies.

To do some of these things, additional resources may be needed, such as data processing power, data visualization skills, and some cyberinfrastructure. I have been in touch with my XSEDE colleagues at IU, and they seem more than amenable to helping, but the whole thing would be GREATLY improved and MUCH MORE relevant if we were working with somebody who has specific questions to answer -- somebody from the fields of public health, medicine, etc.

Do you know the names of anybody in public health, medicine, or some other discipline who might want to read -- use & understand -- the literature being generated?

Be safe.

[1] data set - https://pages.semanticscholar.org/coronavirus-research

--
Eric Morgan
University of Notre Dame