From Preserving Data to Preserving Research: Curation of Process and
Context
<http://timbusproject.net/events/events/206-from-preserving-data-to-preserving-researchcuration-of-process-and-context->

The TIMBUS and Wf4Ever projects are offering a half-day tutorial at the
10th International Conference on Preservation of Digital Objects (iPres)
2013, in Lisbon, Portugal on September 2, 2013.
http://ipres2013.ist.utl.pt/index.html

ABSTRACT

In the domain of eScience, investigations are increasingly collaborative.
Most scientific and engineering domains benefit from building on the
outputs of other research: by sharing information to reason over and data
to incorporate in the modeling task at hand. This raises the need for
preserving and sharing entire eScience workflows and processes for later
reuse. We need to define which information is to be collected, create means
to preserve it and approaches to enable and validate the re-execution of a
preserved process. This includes and goes beyond preserving the data used
in the experiments, as the process underlying its creation and use is
essential.

The TIMBUS project and Wf4Ever project team up for this half-day tutorial
to provide an introduction to the problem domain and discuss solutions for
the curation of eScience processes.

TUTORIAL LEVEL

Introductory level

DURATION

Half-day

OUTLINE OF THE CONTENT

The tutorial will cover the following topics:

*Introduction to Process and Context Preservation*: The introduction will
motivate the need for process and context preservation, illustrate how this
task is difficult in an evolving domain, and introduce a use case for the
rest of the tutorial to illustrate approaches and tools.

*Data Citation*: Data forms the basis of the results of many research
publications, and thus needs to be referenced with the same accuracy as
bibliographic data. Only if data can be identified with high precision can
it be reused, validated, verified and reproduced. Citing a specific data
set is, however, not trivial: it may exist in many specifications and
instances, can be very large, and its location might change. We will give
an overview of existing approaches to overcoming these challenges.
Further, we will address the issue of citing data held in databases,
especially dynamic data sets where data is added or updated on a regular
basis.

*Re-usability and traceability of workflows and processes*: The processes
creating and interpreting data are complex objects. Curating and preserving
them requires special effort, as they are dynamic, and highly dependent on
software, configuration, hardware, and other aspects. We will discuss these
issues in detail, and provide an introduction to two complementary
approaches.

The first approach is based on the concept of Research Objects, which
takes a workflow-centric view and thereby aims to facilitate reuse and
reproducibility. It allows data and methods to be packaged as one Research
Object that can be shared and cited, and thus enables publishers to grant
access to the actual data and methods that contribute to the findings
reported in scholarly articles.

A second approach focuses on describing and preserving a process and the
context it is embedded in. The artifacts that may need to be captured range
from data, software and accompanying documentation, to legal and human
resource aspects. Some of this information can be automatically extracted
from an existing process, and tools for this will be presented. Ways to
archive the process and to perform preservation actions on the process
environment, such as recreating a controlled execution environment or
migration of software components, are presented. Finally, the challenge of
evaluating the re-execution of a preserved process is discussed, addressing
means of establishing its authenticity.

INTENDED AUDIENCE

The tutorial is targeted at researchers, publishers and curators in
eScience disciplines who want to learn about methods of ensuring the
long-term availability of experiments forming the basis of scientific
research.

EXPECTED LEARNING OUTCOMES

Tutorial participants will understand:

- Motivations and challenges of process preservation

- Motivations, stakeholders and challenges of making data citable

- How Data is Cited Today: OECD report on data citability, Google search
of data sets, requirements, guidelines, metadata, locators and identifiers,
approaches to naming schemes and properties.

- Available technologies for identifiers: Archival Resource Key (ARK),
Digital Object Identifiers (DOI), Extensible Resource Identifier (XRI),
HANDLE, Life Science ID (LSID), Object Identifiers (OID), Persistent
Uniform Resource Locators (PURL), URI/URN/URL, Universally Unique
Identifier (UUID)

- Approaches and initiatives for citing data: CODATA, DataCite, OpenAIRE;
challenges and opportunities: granularity, scalability, complexity and
evolving data sets; current research questions

- Ontologies needed to capture research objects: Core Ontology of the RO
family of vocabularies, workflow-centric ROs, provenance traces, life cycle
of research objects.

- Wf4Ever Toolkit / technological infrastructure for the preservation
and efficient retrieval and reuse of scientific workflows: software
architecture, functionalities, software interfaces to functionalities,
reference implementation as services and clients:

  - Collect, manage and preserve aggregations of scientific workflows and
    related objects and annotations

  - Workflow sharing through a social website

  - Execution of workflows

  - Testing completeness, execution, repeatability and other desired quality
    features

  - Testing the ability of a Research Object to achieve its original purpose
    after changes to its resources

  - Recommendations of relevant users, Research Objects and their aggregated
    resources

  - Converting workflows into Research Objects

  - Search for workflows by input parameters or frequency of use

  - Collaborative environment

  - Access and use of research objects and aggregated resources

  - Synchronization with remote repositories

  - Visualization of correlation between similar objects

- TIMBUS context model and tools to semi-automatically capture the
relevant context of a business process for preservation

  - The scope of context regarding business process preservation:
    technology, application and business context, aligned with enterprise
    architecture

  - The context meta-model, with domain independent and domain specific
    aspects

  - Demonstration of a context model instance of example processes (in the
    eScience domain)

  - Tools to automatically capture some parts of the context (software
    dependencies, data formats, licenses, ...)

  - Outlook on reasoning and preservation planning, based on the context model

BIOGRAPHY OF THE PRESENTER(S)

*Angela Dappert* is a researcher at the Digital Preservation Coalition,
working on the FP7 project TIMBUS. She also serves on the PREMIS Editorial
Committee. In both capacities she is involved with the issues of modeling
and defining metadata for computational environments. She has worked at the
British Library on data carrier stabilization, digital asset registration,
preservation planning and characterization, eJournal ingest, and digital
metadata standards. Before this she worked for Schlumberger, the University
of California, Stanford University and Siemens. She has been involved in
numerous initiatives in the area of digital preservation (Planets, SCAPE,
TIMBUS). She has been lecturing extensively on this subject as part of
WePreserve, Planets, TIMBUS and other training initiatives on digital
preservation.

*Daniel Garijo* is a PhD student in the Ontology Engineering Group at the
Universidad Politecnica de Madrid. His research activities focus on
e-Science and the Semantic Web, specifically on how to increase the
understandability of scientific workflows using provenance and metadata. He
is a member of the W3C Provenance Working Group, and he is currently part
of the Wf4Ever project.

*Rudolf Mayer* is a researcher at Secure Business Austria, as well as at
the Department of Software Technology and Interactive Systems at the Vienna
University of Technology. His research interests cover digital
preservation, specifically the preservation of processes, information
retrieval (specifically on text documents and music), data analysis and
machine learning. He has many years of lecturing experience in these
subjects. He has been involved in the DELOS and PLANETS projects, and
currently works on digital preservation aspects in the FP7 projects APARSEN
and TIMBUS.

*Raul Palma* is a researcher at Poznan Supercomputing and Networking Center
(PSNC). His research interests cover digital preservation, particularly of
scientific methods, provenance and evolution of digital artifacts, ontology
engineering and distributed technologies. He has participated in several EU
projects, including the Network of Excellence Knowledge Web, NeOn, e-Lico
and Wf4Ever. He has many years of lecturing experience in related topics,
at both university and private institutions. He has authored or
co-authored several vocabularies and ontologies, such as the Research
Object evolution Ontology, Ontology Metadata Vocabulary (OMV) and different
extensions for describing ontologies and related resources, models for
collaborative ontology construction and digital multimedia repositories.

*Stefan Pröll* is a researcher at SBA Research. His primary research focus
is digital preservation, especially security aspects of digital archives,
including authenticity and provenance of digital objects. Further areas of
interest are databases and data citation. He is currently working on the
FP7 projects APARSEN and TIMBUS, focusing on security- and
provenance-related topics. Before he joined SBA in April 2011, he worked
for international organizations in the areas of Web development, Linux
server and database administration.

*Andreas Rauber* is Associate Professor at the Department of Software
Technology and Interactive Systems at the Vienna University of Technology.
He is involved in several research projects in the field of Digital
Libraries, focusing on the organization and exploration of large
information spaces, as well as Web archiving and digital preservation. His
research interests cover the broad scope of digital libraries, including
specifically text and music information retrieval and organization,
information visualization, as well as data analysis and neural computation.
He has been involved in numerous initiatives in the area of digital
preservation (DELOS, DPE, Planets, SCAPE, TIMBUS, APARSEN). He has been
lecturing extensively on this subject at different universities, as part of
the DELOS and nestor summer schools on digital preservation, as well as
during a range of training events on digital preservation.