Wednesday, June 6, 2007

Program Usage Design

The program's directory layout will look like that of any other standard Linux program.
For example:
- etc: contains the configuration file
- bin: executable programs and scripts
- doc: the program documentation

I would like to keep the program module design from the proposal, as it still works well.

Here are my designs for the user interfaces.

User interface for submitting a job
Users will use the 'qsub' command to submit their jobs, with the same standard PBS directives in the submitted script file. For example,

###########example1.qsub#############
#!/bin/bash
#PBS -N Example1Job
#PBS -o /home/user1/my-mpi-app.out.1.txt

lamboot
mpirun -np 10 /home/user1/my-mpi-app.bin
##################################

But to let the framework know which programs in the script it should take care of (that is, checkpoint and automatically restart), users must add a directive on the line just before the MPI program in the script. For example,

###########example2.qsub#############
#!/bin/bash
#PBS -N Example2Job
#PBS -o /home/user1/my-mpi-app.out.2.txt

lamboot

#FT CHECKPOINT SYSTEM_INITIATED
mpirun -np 10 /home/user1/my-mpi-app1.bin

#FT CHECKPOINT PERIOD=10m
mpirun -np 10 /home/user1/my-mpi-app2.bin

#FT CHECKPOINT CKPT_SCHEDULER=/home/user1/my_checkpoint_scheduler
mpirun -np 10 /home/user1/my-mpi-app3.bin
##################################

SYSTEM_INITIATED means the program will be checkpointed by the system (for systems that run a checkpoint scheduling program).

PERIOD=XXm means the program will be checkpointed periodically, every XX minutes (XX is a number).

CKPT_SCHEDULER= means the program's checkpointing will be scheduled by the user's custom script, given by its path.
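
To make the three directive forms concrete, here is a minimal sketch of how the framework might recognize them. The function name parse_ft_directive and its output format are hypothetical, and it assumes that the "m" suffix of PERIOD means minutes and that scheduler paths contain no spaces.

```bash
#!/bin/bash
# Hypothetical helper (not part of PBS/TORQUE): parse one "#FT CHECKPOINT ..."
# line and print "<mode> [argument]". PERIOD is converted to seconds,
# assuming the "m" suffix means minutes.
parse_ft_directive() {
    local line="$1" arg minutes
    local re='^#FT[[:space:]]+CHECKPOINT[[:space:]]+([^[:space:]]+)[[:space:]]*$'

    # Any line that is not an #FT CHECKPOINT directive is rejected.
    [[ "$line" =~ $re ]] || return 1
    arg="${BASH_REMATCH[1]}"

    case "$arg" in
        SYSTEM_INITIATED)
            echo "system" ;;
        PERIOD=*m)
            minutes="${arg#PERIOD=}"     # strip the key
            minutes="${minutes%m}"       # strip the unit suffix
            [[ "$minutes" =~ ^[0-9]+$ ]] || return 1
            # 10# forces base 10 so that "08" is not read as octal.
            echo "period $((10#$minutes * 60))" ;;
        CKPT_SCHEDULER=?*)
            echo "scheduler ${arg#CKPT_SCHEDULER=}" ;;
        *)
            return 1 ;;
    esac
}

# Example:
#   parse_ft_directive '#FT CHECKPOINT PERIOD=10m'    # prints: period 600
```

Because the directives start with '#', the shell and PBS treat them as comments, so a script carrying them still runs unchanged on a cluster without the framework.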

For convenience in checkpointing, a checkpoint script will be generated for each job (named by job ID), e.g. checkpoint.job1.sh. Users' custom checkpoint scripts (or programs) can therefore be written in any language: they only need to call "checkpoint.job1.sh", and the currently running program of the job will be checkpointed.
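
As an illustration, a user's custom scheduler could be as small as the sketch below. It assumes the generated script takes no arguments and reports failure through its exit status; the function name run_scheduler, the fixed interval, and the rounds parameter are hypothetical choices for this example.

```bash
#!/bin/bash
# Hypothetical custom checkpoint scheduler: call the generated checkpoint
# script at a fixed interval. Assumes the script takes no arguments and
# signals failure with a non-zero exit status.
run_scheduler() {
    local script="$1" interval="$2" rounds="$3" i=0

    # rounds=0 means "keep going until this scheduler is killed".
    while [ "$rounds" -eq 0 ] || [ "$i" -lt "$rounds" ]; do
        sleep "$interval"
        "$script" || echo "checkpoint failed (exit status $?)" >&2
        i=$((i + 1))
    done
}

# Example: checkpoint job 1 every 600 seconds until killed.
#   run_scheduler ./checkpoint.job1.sh 600 0
```

A smarter scheduler could replace the fixed sleep with any policy the user likes, since all it has to do in the end is run the generated script.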

I have some quick ideas for making this design realistic, but the description would be a bit long, so I would rather not discuss it right now (it will certainly be included in the program documentation).

Configuration Files
This framework needs a configuration file so that the running fault-tolerance daemon knows which nodes are in the cluster and which of them will be used (the rest will be standby nodes).

This is an example of the configuration file:
###########FT.CONFIG##############
PRIMARY_HEAD 192.168.29.1
SECONDARY_HEAD 192.168.29.200
PRIMARY_FT 222.222.222.1
SECONDARY_FT 222.222.222.200
BROADCAST_FT 222.222.222.255
NETMASK_FT 255.255.255.0
NETWORK_FT 222.222.222.0

FT_NODES
222.222.222.2
222.222.222.3
END

RUNNING_NODES
192.168.29.2
END
##################################

As I proposed using network alias addresses in this framework, each node in the cluster will have two IP addresses. One is the working address, which user programs use for communication. The other is the alias address, which the fault-tolerance framework uses to identify nodes and to ssh into them for control, e.g. to change some configuration of a standby node so that it imitates the environment of the failed working node.

Thus, the user must assign these parameters (as seen in the example ft.config file) and let the program take care of the rest.
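
To show that this format is easy to read from a script, here is a minimal sketch of a loader. The function name load_ft_config and the variable names are hypothetical, it needs bash 4.2 or newer for associative arrays, and it assumes that lines starting with '#' (such as the banner lines in the example) can be ignored.

```bash
#!/bin/bash
# Hypothetical loader for the FT.CONFIG format shown above (bash 4.2+).
# Single-valued keys go into FT_CONF; the FT_NODES ... END and
# RUNNING_NODES ... END blocks go into arrays of the same name.
load_ft_config() {
    local file="$1" section="" key value
    declare -gA FT_CONF=()
    declare -ga FT_NODES=() RUNNING_NODES=()

    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while read -r key value || [ -n "$key" ]; do
        case "$key" in
            ''|'#'*) continue ;;                      # blank or banner line
            FT_NODES|RUNNING_NODES) section="$key" ;; # start of a block
            END) section="" ;;                        # end of a block
            *)
                if [ "$section" = "FT_NODES" ]; then
                    FT_NODES+=("$key")
                elif [ "$section" = "RUNNING_NODES" ]; then
                    RUNNING_NODES+=("$key")
                else
                    FT_CONF[$key]="$value"
                fi ;;
        esac
    done < "$file"
}

# Example:
#   load_ft_config /etc/ft.config
#   echo "${FT_CONF[PRIMARY_HEAD]}"     # head node working address
#   echo "${#FT_NODES[@]} FT nodes"
```

The path in the example is only a placeholder; the real location of the file is not decided here.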

--------------------------------------------------------------------------
That is my basic thinking on how users will use my package. I know it may not be very good, so all comments are greatly welcomed.

Sunday, March 25, 2007

Improving HA-OSCAR, soc proposal

Improving HA-OSCAR

Narate Taerat
nta008@latech.edu
Louisiana Tech University


Motivations:
For any Beowulf cluster, system availability is paramount. When the master node fails, the whole cluster becomes unavailable. Fortunately, High-Availability OSCAR (HA-OSCAR) provides master-node redundancy and automatic fail-over. But HA-OSCAR focuses only on master-node failure and does not deal with compute-node failure at all.

Compute-node failure is a pain for most MPI applications: even a single node failure may cause the whole application to fail. Fortunately, we have Berkeley Lab Checkpoint/Restart (BLCR) with LAM/MPI, which allows an MPI application to be checkpointed and restarted from its last checkpoint. Cluster users might write a shell script to checkpoint their programs automatically, but they still have to restart their applications manually. Moreover, as long as the failed nodes have not been recovered, the application cannot be restarted, since BLCR with LAM/MPI requires the same environment to restart from the checkpoint file. Enabling automatic compute-node fail-over will also enable automatic process restart, because the fail-over restores the same environment.

In addition, there is no mechanism to preserve the job queue on the master node when a failure occurs, so users have to resubmit their jobs manually.

Thus, improving HA-OSCAR with job-queue fault tolerance, automatic checkpoint/restart, and compute-node fail-over will make cluster users' lives easier. All a user has to do is submit a job and leave the machine, without worrying that the job will fail.

Besides, this project will also enhance the HA-OSCAR backbone package and bring it up to date with the most recent OSCAR release, for example by updating some obsolete components.

Objectives:
1. Enable a fault-tolerant TORQUE job queue in OSCAR and HA-OSCAR
2. Integrate automatic checkpoint/restart into HA-OSCAR
3. Add automatic compute-node failover in HA-OSCAR
4. Polish and update obsolete HA-OSCAR components

Plan:
HA-OSCAR release 1.2.1 will be used as the base for development. Berkeley Lab Checkpoint/Restart (BLCR) will be used as the checkpoint/restart mechanism, and the MPI implementation will be LAM/MPI, or OpenMPI if it is available with BLCR.

I have already created a prototype for this project, which serves as a proof of concept that HA-OSCAR, LAM/MPI, TORQUE, and my components can provide automatic fail-over for both compute-node and head-node failures. Figure 1 shows the program modules from the prototype and an overview of the working scenario.


Figure 1 Module diagram

Basically, the job-submission module takes care of job-info replication: it provides job information to the job-queue replica on the HA-OSCAR standby head, so that jobs are not lost when the primary job queue crashes. The compute-node fail-over module provides automatic compute-node fail-over, using an active/standby pattern, and automatically restarts the affected job from the checkpoint provided by the automatic checkpointing module. The compute-node monitoring module monitors both active and standby compute nodes and invokes the fail-over module when a failure is detected. The job-submission, automatic checkpointing, compute-node fail-over, and monitoring modules will be taken over by the standby master node when the primary master node fails.
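
The interaction between the monitoring and fail-over modules can be sketched as a single monitoring pass. This is only an illustration, not the prototype's code: monitor_once, PROBE_CMD, FAILOVER_CMD, and probe_with_ping are hypothetical names, and using ping as the health probe is an assumption.

```bash
#!/bin/bash
# Hypothetical monitoring pass: probe every node once and invoke the
# fail-over handler for each node that does not answer.
# PROBE_CMD and FAILOVER_CMD are illustrative names, not HA-OSCAR APIs.
PROBE_CMD="${PROBE_CMD:-probe_with_ping}"
FAILOVER_CMD="${FAILOVER_CMD:-echo FAILOVER}"

probe_with_ping() {
    # One ICMP echo request with a 2-second timeout (Linux iputils ping).
    ping -c 1 -W 2 "$1" > /dev/null 2>&1
}

monitor_once() {
    local node
    for node in "$@"; do
        if ! $PROBE_CMD "$node"; then
            # The node did not respond: hand it to the fail-over module.
            $FAILOVER_CMD "$node"
        fi
    done
}

# Example: one pass over two alias addresses.
#   monitor_once 222.222.222.2 222.222.222.3
```

Keeping the probe and the fail-over action as replaceable commands means the same loop could run on the standby master node after it takes over.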

The project will start with a review of the component design in the first week, followed by four weeks of implementing all modules and making them more configurable and easier to use. The sixth week is the midterm testing phase: the alpha program will be tested and the midterm report will be submitted. In the three weeks after the midterm submission, the alpha program will be polished and the complete installation script will be finalized. After that, in the tenth week, final testing of the program will take place and the installation manuals and reports will be finalized. The last week is an extension period reserved as a buffer for unpredictable events.



Figure 2 Project time line

Deliverable:
The OSCAR/HA-OSCAR enhancement with job-queue preservation, automatic checkpoint/restart, and automatic compute-node fail-over, based on the work of my graduate student colleagues [1] [2] [3], will be delivered at the end of the project. It will be deployed as installation scripts and sources in a tarball. The alpha-tested program will be delivered at the mid-term evaluation.

About me:
I am a graduate student at Louisiana Tech University and work in the Extreme Computing Research (XCR) group, which maintains the HA-OSCAR project, under the supervision of Dr. Chokchai (Box) Leangsuksun. I have focused on reliability improvement in high-performance computing for only three months, as I became a graduate student in winter 2006. However, I am a fast learner: within those three months I learned OSCAR and HA-OSCAR, became a maintenance member of the HA-OSCAR project, and shared responsibility with other team members for the minor release HA-OSCAR 1.2.1.

Moreover, I have a strong foundation in computer programming, especially in C, C++, and Java. I also work with shell scripts and Perl.

References:
[1] Sunil Rani, Chokchai Leangsuksun, Anand Tikotekar, Vishal Rampure, Stephen L. Scott, “Toward efficient failure detection and recovery in HPC”

[2] Kshitij Limaye, Anand Tikotekar, Box Leangsuksun, “Fault tolerance-enabled HPC resource management with HA-OSCAR framework”

[3] Anand Tikotekar, Chokchai Leangsuksun, Stephen L. Scott, “On the Survivability of Standard MPI Application”