Published June 1, 1994 | public
Book Section - Chapter Open

Efficient checkpointing over local area networks

Abstract

Parallel and distributed computing on clusters of workstations is becoming very popular because it provides a cost-effective route to high-performance computing. In these systems, the bandwidth of the communication subsystem (using Ethernet technology) is about an order of magnitude lower than the bandwidth of the storage subsystem. Hence, storing a state in a checkpoint is much more efficient than comparing states over the network. In this paper we present a novel checkpointing approach that enables efficient performance over local area networks. The main idea is to use two types of checkpoints: compare-checkpoints (where the states of the redundant processes are compared to detect faults) and store-checkpoints (where the state is only stored). The store-checkpoints reduce the rollback needed after a fault is detected, without performing many unnecessary comparisons. As a particular example of this approach, we analyzed the DMR checkpointing scheme with store-checkpoints. Our main result is that the execution-time overhead can be significantly reduced when store-checkpoints are introduced. We have implemented a prototype of the new DMR scheme and run it on workstations connected by a LAN. The experimental results match the analytical results and show that in some cases the overhead of DMR checkpointing schemes over LANs can be improved by as much as 20%.
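
The trade-off between the two checkpoint types can be illustrated with a small timing model. This is only a sketch under stated assumptions and is not the analysis or the prototype from the paper: the names (Costs, block_time, total_time), the cost values, the oldest-first scan used to locate the last agreeing stored state, and the assumption that a transient fault does not recur during re-execution are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Illustrative time costs (arbitrary units, assumed values)."""
    interval: float = 1.0   # useful work done in one interval
    store: float = 0.05     # saving the state locally (cheap)
    compare: float = 0.5    # comparing two states over the LAN (expensive)


def block_time(n, costs, use_store):
    """Time to run n intervals that end in one compare-checkpoint."""
    # With store-checkpoints, every interval except the last ends in a store.
    stores = (n - 1) if use_store else 0
    # The compare-checkpoint itself stores the state and then compares it.
    return n * costs.interval + stores * costs.store + costs.store + costs.compare


def total_time(n_intervals, compare_every, faulty, costs=Costs(), use_store=True):
    """Total execution time for a given set of faulty interval indices.

    Assumption: a fault corrupts one of the two redundant processes, is
    detected at the next compare-checkpoint, and does not recur on retry.
    """
    t = 0.0
    for start in range(0, n_intervals, compare_every):
        end = min(start + compare_every, n_intervals)
        t += block_time(end - start, costs, use_store)
        bad = [i for i in range(start, end) if i in faulty]
        if not bad:
            continue  # states matched, move on to the next block
        first_bad = bad[0]
        if use_store:
            # Compare stored states oldest-first until the first mismatch.
            examined = min(first_bad - start + 1, end - 1 - start)
            t += examined * costs.compare
            redo = end - first_bad      # roll back to the last agreeing store
        else:
            redo = end - start          # roll back to the last compare-checkpoint
        t += block_time(redo, costs, use_store)
    return t


if __name__ == "__main__":
    for use_store in (True, False):
        print(use_store, total_time(8, 4, {6}, use_store=use_store))
```

In this toy model a fault late in a block favors store-checkpoints, because only the tail of the block is re-executed, while a fault in the first interval of a block makes them slightly more expensive, because the stores and the extra comparisons buy no shorter rollback.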

Additional Information

© 1994 IEEE. Reprinted with Permission. The research reported in this paper was supported in part by the NSF Young Investigator Award CCR-9457811, by the Sloan Research Fellowship, by a grant from the IBM Almaden Research Center, San Jose, California, and by a grant from the AT&T Foundation.

Files

ZIVftpds94.pdf (401.8 kB)
md5:a95d7c57f47b4e8486ec74a13b37ccdf

Additional details

Created: August 22, 2023
Modified: October 16, 2023