In grid environments, resources are distributed and may belong to multiple organizations. Although they collectively offer large amounts of computing power, it is not easy to coordinate them to support long-running scientific applications. The availability of computing resources and network connections in grid environments can change at any time. As the number of resources used by an application scales across multiple administrative domains, the failure rate can increase sharply and become unpredictable. The ability to adapt to these changes is essential to the development of reliable applications and services in a multi-institutional grid computing environment.
We have developed a leader-based group membership protocol that adapts group membership in the presence of process failures, hardware failures, network partitions, and network recovery. The protocol is described in detail in this draft paper. We have implemented the protocol in the eXtended Virtual Machine (XVM) system and conducted a number of experiments on a 64-node Linux cluster and the DOE Science Grid testbed at ORNL. Although the protocol is currently implemented on top of virtual machine technology, we believe that it can be applied to other distributed applications and middleware systems as well.
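To make the leader-based approach concrete, the following is a minimal sketch of how a membership view might react to failure and recovery events. It assumes a deterministic leader rule (the lowest-ranked live member leads), which makes re-election implicit when the leader fails; the class and method names are illustrative, not the XVM interface.

```python
# Minimal sketch of a leader-based membership view. The deterministic
# rule "lowest live rank is leader" is an assumption for illustration,
# not necessarily the rule used by the XVM protocol.

class MembershipView:
    def __init__(self, members):
        # 'members' is the set of member ranks currently believed alive.
        self.members = set(members)

    def leader(self):
        # Deterministic leader rule: the lowest-ranked live member.
        return min(self.members)

    def on_failure(self, rank):
        # Remove a failed member; if it was the leader, the rule above
        # re-elects automatically on the next leader() call.
        self.members.discard(rank)
        return self.leader()

    def on_recovery(self, rank):
        # Re-admit a recovered member, e.g. after a partition heals.
        self.members.add(rank)
        return self.leader()


view = MembershipView([0, 1, 2, 3])
view.on_failure(0)    # leader 0 fails; member 1 takes over
view.on_recovery(0)   # member 0 rejoins and resumes leadership
```

Because leadership is a pure function of the live-member set, every member that agrees on the membership also agrees on the leader, which is what makes this style of protocol attractive in partitionable environments.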
In the past two years, two XVM prototypes have been developed. In the first system, we implemented every feature of our protocol and tested it on top of a Linux cluster. We are currently writing documentation describing XVM in detail to help other software developers who are interested in testing or using it. In our second prototype, we implemented only the leader re-election feature for a non-partitioning environment. We have made this version interoperable with Globus and added mechanisms to restart the web server and MDS servers. (See SC 02 demo snapshots.) Documentation is in progress.
User-defined Fault Monitoring and Notification and the Slave MONitor (SMON) system
While remote job submissions can be performed based on a workflow model, we adopt a data-flow model for the development of fault monitoring and notification mechanisms. As a proof of concept, we have developed a prototype lightweight sensor program, the Slave MONitor (SMON) system, which allows users to define monitoring conditions and their associated notification mechanisms. We are currently writing SMON's documentation. In the near future, we plan to investigate implementing the above membership protocol in a group of sensors so that they can coordinate their monitoring and notification efforts.
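The idea of user-defined conditions paired with notification actions can be sketched as follows. This is an illustrative data-flow-style sensor, not SMON's actual interface: the `Sensor`, `watch`, and `sample` names, and the metric dictionary, are all assumptions made for the example.

```python
# Illustrative sketch of a lightweight sensor that evaluates
# user-defined monitoring conditions against each metric sample and
# fires the associated notification callbacks. Names are hypothetical.

class Sensor:
    def __init__(self):
        self.rules = []  # list of (condition, notify) pairs

    def watch(self, condition, notify):
        # condition: predicate over one metric sample (a dict).
        # notify: callback invoked when the condition holds.
        self.rules.append((condition, notify))

    def sample(self, metrics):
        # Evaluate every rule against the latest sample and return
        # the results of the notifications that fired.
        fired = []
        for condition, notify in self.rules:
            if condition(metrics):
                fired.append(notify(metrics))
        return fired


# Example rule: notify when the node's load exceeds a threshold.
sensor = Sensor()
sensor.watch(lambda m: m["load"] > 0.9,
             lambda m: f"high load: {m['load']:.2f}")
```

Separating the condition (what to watch) from the notification (what to do) is what lets users plug in their own fault definitions without changing the sensor itself, which is the property the data-flow model is meant to provide.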
Our demo program is currently running at http://sleepy.ccs.ornl.gov:8080/examples/jsp/indexFMS.html.