For example,
- etc: contains the configuration files
- bin: contains the executable programs and scripts
- doc: contains the program documentation
I would like to keep the program module design from the proposal, as it still works well.
Here are my designs for the user interfaces.
User interface for submitting a job
Users submit their jobs with the 'qsub' command, using the same standard PBS directives as usual in the submitted script file. For example,
###########example1.qsub#############
#!/bin/bash
#PBS -N Example1Job
#PBS -o /home/user1/my-mpi-app.out.1.txt
lamboot
mpirun -np 10 /home/user1/my-mpi-app.bin
##################################
To let the framework know which programs in the script it should take care of, that is, which ones will be checkpointed and automatically restarted, users must add a directive on the line just before each MPI program in the script. For example,
###########example2.qsub#############
#!/bin/bash
#PBS -N Example2Job
#PBS -o /home/user1/my-mpi-app.out.2.txt
lamboot
#FT CHECKPOINT SYSTEM_INITIATED
mpirun -np 10 /home/user1/my-mpi-app1.bin
#FT CHECKPOINT PERIOD=10m
mpirun -np 10 /home/user1/my-mpi-app2.bin
#FT CHECKPOINT CKPT_SCHEDULER=/home/user1/my_checkpoint_scheduler
mpirun -np 10 /home/user1/my-mpi-app3.bin
##################################
SYSTEM_INITIATED means this program will be checkpointed by the system (for systems that run a checkpoint scheduling program).
PERIOD=XXm means this program will be checkpointed periodically, every XXm (XX is a number, e.g. PERIOD=10m).
CKPT_SCHEDULER=
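As a rough illustration of how these lines could be picked up, here is a minimal bash sketch that scans a job script and pairs each "#FT CHECKPOINT" line with the command on the next non-empty line. The function names parse_ft_directives and period_to_seconds and the "MODE|VALUE|COMMAND" output format are assumptions made for this example, not part of the design. The sketch also assumes that the "m" suffix in PERIOD means minutes.

```shell
#!/bin/bash
# Sketch only: function names and the output format are assumptions.

# parse_ft_directives FILE
# Print one "MODE|VALUE|COMMAND" line for each "#FT CHECKPOINT" directive
# in the job script FILE. A directive applies to the next non-empty line,
# which is taken to be the MPI program to protect.
parse_ft_directives() {
    local line spec=""
    local re='^#FT[[:space:]]+CHECKPOINT[[:space:]]+([^[:space:]]+)'
    while IFS= read -r line || [ -n "$line" ]; do
        if [[ $line =~ $re ]]; then
            spec=${BASH_REMATCH[1]}
        elif [ -n "$spec" ] && [ -n "$line" ]; then
            case $spec in
                *=*) printf '%s|%s|%s\n' "${spec%%=*}" "${spec#*=}" "$line" ;;
                *)   printf '%s||%s\n' "$spec" "$line" ;;
            esac
            spec=""
        fi
    done < "$1"
}

# period_to_seconds VALUE
# Convert a PERIOD value such as "10m" to seconds (assumes minutes).
# Returns non-zero for any other form.
period_to_seconds() {
    local re='^([0-9]+)m$'
    [[ $1 =~ $re ]] || return 1
    echo $(( 10#${BASH_REMATCH[1]} * 60 ))
}
```

Run on example2.qsub, parse_ft_directives would print one line per protected mpirun command, with an empty VALUE field for SYSTEM_INITIATED and the period or scheduler path for the other two modes.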
For convenience in checkpointing, a checkpoint script will be generated for each job (named by job id), e.g. checkpoint.job1.sh. Users' custom checkpoint scripts (or programs) can therefore be written in any language: they only need to call "checkpoint.job1.sh", and the currently running program of the job will be checkpointed.
I have some quick ideas for making this design realistic, but the description would be a little long, so I would prefer not to go into it right now (it will certainly be included in the program documentation).
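To make the idea concrete, a user's custom checkpoint scheduler could be as small as the following bash sketch. It assumes the generated script can be called by path (here ./checkpoint.job1.sh) and that it exits non-zero when a checkpoint fails. Both are assumptions, as are the function name run_ckpt_scheduler and its arguments.

```shell
#!/bin/bash
# Sketch of a user-written checkpoint scheduler (hypothetical names).

# run_ckpt_scheduler SCRIPT INTERVAL ROUNDS
# Call the generated checkpoint script SCRIPT every INTERVAL seconds,
# ROUNDS times in total. Stop early if a checkpoint attempt fails.
run_ckpt_scheduler() {
    local script=$1 interval=$2 rounds=$3 i
    for (( i = 1; i <= rounds; i++ )); do
        sleep "$interval"
        if ! "$script"; then
            echo "checkpoint attempt $i failed" >&2
            return 1
        fi
    done
}

# Example: checkpoint job 1 every 10 minutes, 6 times.
# run_ckpt_scheduler ./checkpoint.job1.sh 600 6
```

The scheduler knows nothing about how the checkpoint is taken; it only decides when to call the generated script, which is the separation the design aims for.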
Configuration Files
This framework needs a configuration file so that the running fault tolerance daemon knows which nodes are in the cluster and which of them will be used (the rest will be standby nodes).
This is an example configuration file:
###########FT.CONFIG##############
PRIMARY_HEAD 192.168.29.1
SECONDARY_HEAD 192.168.29.200
PRIMARY_FT 222.222.222.1
SECONDARY_FT 222.222.222.200
BROADCAST_FT 222.222.222.255
NETMASK_FT 255.255.255.0
NETWORK_FT 222.222.222.0
FT_NODES
222.222.222.2
222.222.222.3
END
RUNNING_NODES
192.168.29.2
END
##################################
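The layout above is simple enough to read with standard tools. Here is a minimal bash sketch, assuming single-valued keys are written as "KEY value" on one line and that each list runs from its section name to the next END. The helper names ft_config_value and ft_config_list are made up for illustration.

```shell
#!/bin/bash
# Sketch only: helper names are assumptions, not part of the framework.

# ft_config_value FILE KEY
# Print the value of a single "KEY value" line.
ft_config_value() {
    awk -v key="$2" '$1 == key && NF == 2 { print $2; exit }' "$1"
}

# ft_config_list FILE SECTION
# Print the entries listed between a "SECTION" line and the next "END".
ft_config_list() {
    awk -v section="$2" '
        $1 == section && NF == 1 { inside = 1; next }
        inside && $1 == "END"    { exit }
        inside && NF > 0         { print $1 }
    ' "$1"
}

# Example usage:
# ft_config_value ft.config PRIMARY_HEAD
# ft_config_list  ft.config FT_NODES
```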
As I proposed to use network alias addresses in this framework, each node in the cluster will have 2 IP addresses. One is the working address, which user programs use for communication. The other is the alias address, which the fault tolerance framework uses to identify nodes and to ssh into them in order to control them and achieve fault tolerance, e.g. by changing some configuration of a standby node to imitate the environment of the failed working node.
Thus, the user must assign these parameters (as seen in the example ft.config file) and let the program take care of the rest.
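For readers unfamiliar with alias addresses, the sketch below shows one way a second address could be attached to a node. It is an assumption, not the framework's actual mechanism: it presumes Linux with the iproute2 "ip" tool, an interface called eth0, and an alias label "ft". The helper names are hypothetical. It only prints the command, so it can be tried without root privileges.

```shell
#!/bin/bash
# Sketch only: prints the command instead of running it.
# Assumptions: Linux with iproute2, interface eth0, alias label "ft".

# netmask_to_prefix MASK
# Convert a dotted netmask (e.g. 255.255.255.0) to a prefix length (24).
# Assumes MASK is a valid, contiguous netmask.
netmask_to_prefix() {
    local IFS=. octet bits=0
    for octet in $1; do
        while (( octet > 0 )); do
            bits=$(( bits + (octet & 1) ))
            octet=$(( octet >> 1 ))
        done
    done
    echo "$bits"
}

# alias_add_command ADDR NETMASK BROADCAST [DEV]
# Print the iproute2 command that would add ADDR as a labelled alias.
alias_add_command() {
    local addr=$1 prefix dev=${4:-eth0}
    prefix=$(netmask_to_prefix "$2")
    echo "ip addr add $addr/$prefix broadcast $3 dev $dev label $dev:ft"
}

# Example with the FT values from the configuration file:
# alias_add_command 222.222.222.2 255.255.255.0 222.222.222.255
```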
--------------------------------------------------------------------------
This is my basic idea of how users will use my package. I know it may not be very good, so all comments are very welcome.

