The goal of the Coleridge Initiative at NYU is to use data to transform the way governments access and use data for the social good. We are a fast-growing university-based startup that has already created dozens of pilot projects, worked with over 100 federal, state and local agencies, and trained over 450 agency staff. Our program directors (Julia Lane, Rayid Ghani, and Frauke Kreuter) have designed and implemented training programs, research projects and a secure data facility that are attracting national attention, including from the Commission on Evidence-Based Policymaking and the Federal Data Strategy.
Our team works with government agencies to break down data barriers around the secure use of confidential data. We do this in two ways. We have developed a secure environment for data (the Administrative Data Research Facility, or ADRF https://coleridgeinitiative.org/computing ), and are building new tools for data stewardship, data discovery and collaboration with some of the top scientists in the nation. We work with government agencies to (1) identify critical agency problems, (2) train staff to solve them, and (3) create products that have value. You can read more about our work at https://coleridgeinitiative.org.
Role & Responsibilities
We are seeking an enthusiastic, analytically minded Research Information Scientist with extensive experience working with data and research processes, as well as demonstrated experience in information or content management. The Research Information Scientist will be the lead on the full life cycle of data ingestion and storage in the ADRF. This is detail-oriented work, and the successful candidate will have complementary technical skills in data management, programming, and user experience as well as knowledge of current technologies, metadata standards and encoding standards (e.g. XML).
The Research Information Scientist will design and develop highly robust, repeatable and scalable workflow patterns to ingest, integrate and publish a wide variety of data from internal and external sources. The successful candidate will be responsible for ensuring that the ADRF’s data workflows and pipelines are enterprise-grade (reliable, scalable and secure) and for maintaining infrastructure and operations to support data science activities. The Research Information Scientist will focus on performance tuning, quickly identifying bottlenecks through review of SQL execution plans to maximize ADRF resource utilization and system performance. The successful candidate will also work directly with ADRF development and operations team members, as well as collaborators and clients, to build out semi-automated approaches to data management, with an emphasis on data quality automation as the Coleridge Initiative builds to scale.
The Research Information Scientist’s responsibilities will include:
Managing the data ingestion process and troubleshooting and resolving any resulting issues, ensuring the integrity and security of data housed in the ADRF
Performing preliminary quality assessment on data files, correcting obvious issues and then formatting files for ingestion
Contributing, as part of a team, to ADRF platform enhancement projects using appropriate technologies in research and large-scale data management (e.g., Hadoop and its contemporaries, parallel databases, cloud services), and/or interactive visualization and specialized data presentation interfaces
Implementing and documenting data ingestion best practices
Required Experience & Skills
Master’s Degree in Information Science, Library Science, or Computer Science
Proven experience successfully managing the full ETL and data preparation life cycle of large datasets in a data warehouse
Proficiency in programming; required skills: ETL, metadata harvesting, distributed ETL programming and debugging, PySpark, AWS Glue jobs, and AWS Glue development endpoints
Experience with relational and non-relational databases and other data storage and access technologies, such as MySQL, PostgreSQL, Aurora, Citus Data, Oracle, Hadoop, Spark, and/or AWS Athena
Strong communication skills and the ability to work well as part of a team
Additional Desired Experience & Skills
Experience with development of web applications and APIs using open source software
Experience working with large scale administrative datasets
Knowledge of key open source software resources
Prior experience with SQL and with database technologies such as PostgreSQL
Demonstrated ability to write analytical reports
Please include a resume and cover letter.
Brought to you by code4lib jobs: https://jobs.code4lib.org/jobs/42488-research-information-scientist