Implementing the Kepler Workflow Interface into the Cheshire Digital Library

The project seeks to develop and implement the Kepler/Ptolemy scientific workflow system as an interface to the Cheshire 3 digital library framework. The aim is to enable researchers in both the humanities and scientific disciplines to use the Kepler/Cheshire software to conduct analyses and perform distributed processing in several different software and hardware environments; and to coordinate the export and import of data from one environment to another. We intend to use the Kepler/Cheshire interface to provide researchers with capabilities ranging from discovering information to publishing results, thus comprising a Virtual Research Environment. In particular, we intend to work with the Arts and Humanities Data Service to develop a number of transactional services for the humanities. 

Aims and Objectives  

The overall aim of the project is to integrate established, automated workflow technologies into the Cheshire 3 Digital Library framework. This will provide researchers with an easy-to-use yet powerful system for executing workflows, and will enable them to generate publishable results from relevant text data more easily.

There is an increasing need for workflow capabilities across both humanities and science sectors. In using software and online resources, researchers in these sectors typically carry out tasks involving the design and execution of a series of steps, or workflow. A researcher begins by identifying and accessing initial data sets, and proceeds through additional steps using software tools such as Web services, modelling and simulation programs, image processing programs and visualization software. Each of these steps progressively transforms the initial data, and researchers need to keep track of what was done and why. Adding to the complexity, researchers are often attempting to run quantitative and repeatable analyses and models in more than one software and hardware environment.

To address these difficulties, the Kepler initiative has developed a generic tool and environment that builds on existing technologies and will work in a wide range of applications to capture, automate, and manage researchers' actions as they carry out workflows. We wish to build on this tool by providing a range of digital library capabilities, enabling researchers to discover information from JISC services and then reuse it in any number of ways.

The outputs of the project will form a major contribution to the teaching and research environments across a spectrum of projects and services, particularly multi-disciplinary projects involving resources in the humanities. For example, it will allow users of AHDS and other services to automate complex workflows, without having to become expert programmers. 

This capability should relieve researchers of repetitive tasks so that they can focus on their particular area of expertise. It will also give researchers increased capabilities to communicate and work together – searching for, integrating, and sharing data and workflows in large-scale collaborative environments. In effect, the project will add transactional capabilities to services which are at present focused on content. 

The specific objectives are to:

  • Devise an interface for a workflow creation and execution process so that users may design, execute, monitor, and communicate analytical procedures repeatedly with minimal effort.
  • Implement this as part of the Cheshire 3 digital library framework.
  • Incorporate this integration into data-grid systems, through support of the Storage Resource Broker, and Grid workflow patterns.
  • In doing so, address issues of data and process provenance, user interaction, reporting and logging.
  • Test the implementation on large, complex, and heterogeneous data sets, particularly from the Arts and Humanities Data Service (AHDS).
  • Evaluate the implementation with QUB and AHDS, and introduce improvements based on user feedback.

Project Methodology  

Kepler is actually a set of workflow steps designed to operate in the Ptolemy dataflow system and includes aspects such as web service interaction. We wish to build an interface to this dataflow system, which will allow us to plug different execution models into the Cheshire 3 workflows for use in the Cheshire 3 Digital Library framework. The technical steps to do this are set out below. Once this has been completed, we will work with the AHDS and QUB to test and evaluate the workflows on a range of humanities-based content; we will also work with SDSC to do the same with projects in the science domain. 

In terms of Kepler/Cheshire 3 integration, the technical steps are as follows:

  • Configure a Kepler web service actor to use SRW (Search/Retrieve Web service). We will first need to investigate whether the Ptolemy/Kepler compiler can compile the WSDL (Web Services Description Language) for SRW, which has proven to be more complex than that of conventional transactional services. If it cannot, we will need either to build the object definitions by hand, or to treat the service as document/literal and use regular XML processing on the back end to parse the response. This will let us use Kepler to interact with Cheshire as a 'black box', i.e. the only operations available are those allowed by SRW (primarily search and scan). If possible, the actor should configure itself from the Explain response, but this is not a priority.
  • Write a series of Kepler actors to interact directly with Cheshire3 objects, for example a PreParser actor, a Parser actor, a Transformer actor and so forth. This will be more complex than the web-service-based interaction above and will require linking the Java-based Kepler with the Python-based Cheshire3. The expected implementation will rely on TCP/IP sockets so that it can be distributed in the grid environment. Transporting objects around the network will require serialisation, potentially using Python's fast 'pickle' serialisation, in a similar way to the distributed processing module for PVM or MPI. However, the requirement to maintain Session objects between calls may mean that it is easier for Kepler to interact with a Python daemon than with the Cheshire3 objects directly.
  • Write a Cheshire3-to-Kepler handler to be instantiated primarily as a PreParser or Transformer, and possibly as a Normaliser. This will let us call Kepler workflows from Cheshire and import the results back into the local environment. For this we will need a link between a constantly running Java Kepler execution server and the Cheshire3 objects that need to call it.

There are three reasons why we have selected Kepler/Ptolemy: a) it is the only available system which allows one to plug different execution models into workflows; b) it is a mature system which is already widely used and supported in the e-Science and cyberinfrastructure communities; c) it leverages related joint development work that we are undertaking with the San Diego Supercomputer Center.
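
As an illustration of the fallback described in the first step, in which the SRW response is treated as document/literal and parsed with regular XML processing, a minimal Python sketch follows. The sample response is hand-written for illustration and is not output from a Cheshire3 server. The function name parse_search_response and the un-namespaced title element inside recordData are assumptions made to keep the example short; a real record would carry whatever schema was requested. Only the SOAP 1.1 envelope namespace, the SRW namespace and the SRW element names are taken from the published specifications.

```python
import xml.etree.ElementTree as ET

SRW_URI = "http://www.loc.gov/zing/srw/"
NS = {"srw": SRW_URI}

# Hand-written sample for illustration only, not real server output.
SAMPLE_RESPONSE = """\
<SOAP:Envelope xmlns:SOAP="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP:Body>
    <srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">
      <srw:version>1.1</srw:version>
      <srw:numberOfRecords>2</srw:numberOfRecords>
      <srw:records>
        <srw:record>
          <srw:recordSchema>info:srw/schema/1/dc-v1.1</srw:recordSchema>
          <srw:recordPacking>xml</srw:recordPacking>
          <srw:recordData><title>First record</title></srw:recordData>
          <srw:recordPosition>1</srw:recordPosition>
        </srw:record>
        <srw:record>
          <srw:recordSchema>info:srw/schema/1/dc-v1.1</srw:recordSchema>
          <srw:recordPacking>xml</srw:recordPacking>
          <srw:recordData><title>Second record</title></srw:recordData>
          <srw:recordPosition>2</srw:recordPosition>
        </srw:record>
      </srw:records>
    </srw:searchRetrieveResponse>
  </SOAP:Body>
</SOAP:Envelope>
"""


def parse_search_response(xml_text):
    """Return (total hit count, list of record summaries) from an SRW response."""
    root = ET.fromstring(xml_text)
    # iter() also checks the root, so this works with or without the SOAP wrapper.
    response = next(root.iter("{%s}searchRetrieveResponse" % SRW_URI), None)
    if response is None:
        raise ValueError("no searchRetrieveResponse element found")
    total = int(response.findtext("srw:numberOfRecords", default="0", namespaces=NS))
    records = []
    for record in response.findall("srw:records/srw:record", NS):
        records.append({
            "position": int(record.findtext("srw:recordPosition", default="0", namespaces=NS)),
            "schema": record.findtext("srw:recordSchema", default="", namespaces=NS),
            # Assumed payload layout: a bare <title> child inside recordData.
            "title": record.findtext("srw:recordData/title", namespaces=NS),
        })
    return total, records
```

Calling parse_search_response(SAMPLE_RESPONSE) returns the hit count together with one dictionary per record, which a web service actor could then pass on to later workflow steps.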

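The socket-based link described in the second and third steps might be sketched as follows. This is a minimal Python-to-Python illustration under stated assumptions, not the project's implementation: the 4-byte length prefix, the function names, the request dictionary layout and the session dictionary standing in for a Cheshire3 Session object are all hypothetical. A Java-based Kepler actor would need either a pickle-compatible decoder or a language-neutral serialisation in place of pickle.

```python
import pickle
import socket
import struct
import threading

# 4-byte big-endian length prefix so the receiver knows where a message ends.
HEADER = struct.Struct("!I")


def send_object(sock, obj):
    """Pickle obj and send it as one length-prefixed message."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exact(sock, size):
    """Read exactly size bytes, or raise if the peer closes first."""
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            raise ConnectionError("socket closed before a full message arrived")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def recv_object(sock):
    """Receive one length-prefixed message and unpickle it.

    pickle.loads must only be used on data from a trusted peer.
    """
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return pickle.loads(_recv_exact(sock, length))


def serve_connection(conn, handlers):
    """Answer requests on one connection, keeping session state between calls."""
    session = {"calls": 0}  # hypothetical stand-in for a Cheshire3 Session
    while True:
        try:
            request = recv_object(conn)
        except ConnectionError:
            break  # client has disconnected
        session["calls"] += 1
        result = handlers[request["op"]](request["data"], session)
        send_object(conn, {"result": result, "calls": session["calls"]})


def demo():
    """Run two requests against a daemon thread over a local socket pair."""
    client, server = socket.socketpair()
    handlers = {"upper": lambda data, session: data.upper()}
    worker = threading.Thread(target=serve_connection, args=(server, handlers))
    worker.start()
    replies = []
    for text in ("cheshire", "kepler"):
        send_object(client, {"op": "upper", "data": text})
        replies.append(recv_object(client))
    client.close()
    worker.join()
    server.close()
    return replies
```

Because the daemon holds the session for the lifetime of the connection, the caller does not have to serialise it on every call, which is the motivation given above for preferring a Python daemon over direct interaction with the Cheshire3 objects.
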
Implications/ Deliverables/ Stakeholders  

The project will support an environment which will allow researchers across a range of disciplines to analyse information discovered using the Cheshire 3 digital library system. It will make available to humanities researchers a number of important tools originating from the e-Science community; and for the scientific researchers it will integrate these tools with a digital library framework. Strategically, it will introduce a range of transactional services to the humanities sector, including visualization technologies, data mining systems and the use of statistical programmes. Finally, it will investigate the implementation of distributed execution (grid services) to enable researchers to use computational resources on the internet in a distributed workflow.

Project Staff

Project Manager

Clare Llewellyn
Liverpool University Library
PO Box 123, Liverpool L69 3DA
Telephone: (0151) 794 2696
Fax: (0151) 794 2681
Email: m2cal@liverpool.ac.uk
 

Project Team 

John Harrison, Research Associate
Liverpool University Library
PO Box 123, Liverpool L69 3DA
Tel: (0151) 794 2696
Fax: (0151) 794 2681
Email: johnph@liverpool.ac.uk

Fabio Corubolo, Research Associate
Liverpool University Library
PO Box 123, Liverpool L69 3DA
Tel: (0151) 794 2696
Fax: (0151) 794 2681
Email: f.corubolo@liverpool.ac.uk

Project Partners  

Sheila Anderson, Arts and Humanities Data Service
26-29 Drury Lane, 3rd Floor, London WC2B 5RL
Tel: (0207) 848 1988
Fax: (0207) 848 1989
Email: sheila.anderson@ahds.ac.uk  

Paul Ell, Centre for Data Digitisation and Analysis
School of Sociology and Social Policy, The Queen's University of Belfast
Belfast, Northern Ireland BT7 1NN
Tel: (0)28 90273408
Fax: (0)28 90320668
Email: p.ell@qub.ac.uk  

Ilkay Altintas, Manager Scientific Automation Technologies Lab
SDSC, University of California,
San Diego, 9500 Gilman Drive, MC 0505, La Jolla, CA 92093-0505, USA.
Tel: (858) 822 5453
Fax: (858) 822 3693
Email: altintas@sdsc.edu
