How to run ESTEL in parallel on a cluster

This article describes how to run parallel jobs in ESTEL on HPC clusters.

Beowulf clusters are dedicated high-performance computing facilities such as Blue Crystal. If you plan to run ESTEL on a network of workstations, use the article about networks of workstations instead.

= Prerequisites =
 * TELEMAC system installed and configured for MPI. See the installation article if necessary.
 * PBS queuing system on the cluster.
 * Fortran compiler and local directory in the PATH (see below).

= Adjusting the PATH for batch jobs =
When you submit a job to a queue on a cluster, the environment in which the job eventually runs is not necessarily the one you had when you submitted it. This is particularly important for the TELEMAC system, as the job submitted to the PBS queue needs access to the Fortran compiler and to the local folder in order to run. Various solutions exist depending on the platform. On Blue Crystal, the easiest is to load all the required modules in the  file.
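In practice this usually means adding `module load` lines to a shell start-up file that batch jobs source. The sketch below uses hypothetical module names and an assumed TELEMAC path; check `module avail` on your cluster for the real names:

```shell
# Lines to add to the shell start-up file read by batch jobs.
# Both module names below are placeholders, not the real Blue Crystal ones.
module load mpi/openmpi               # hypothetical MPI module
module load languages/intel-fortran   # hypothetical Fortran compiler module

# Assumed location of the local TELEMAC folder; adjust to your installation.
export PATH="$HOME/telemac/bin:$PATH"
```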

= Submitting a job =
When all prerequisites are satisfied, it is quite easy to submit a TELEMAC job to the PBS queue. A script in the  directory takes care of most aspects of the job submission. How nice... Get the script from the  subversion repository if necessary.

The syntax is the following:

where:
 * is a name for the job you are submitting
 * is the number of nodes to use. Only one processor per node is supported at the moment.
 * is the PBS walltime, i.e. the simulation will stop if this time is reached. The syntax is standard PBS syntax: hh:mm:ss or a single number in seconds.
 * is the name of the TELEMAC code to run, i.e. probably estel2d or estel3d.
 * is the name of the steering file to use for the simulation.
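The submission script essentially turns these arguments into the header of a PBS job file. A rough sketch of that mapping, using standard PBS directives with purely illustrative values:

```shell
#!/bin/bash
# Sketch of the PBS header the submission script could generate.
#PBS -N myjob               # job name (first argument)
#PBS -l nodes=12            # number of nodes; one processor per node
#PBS -l walltime=10:00:00   # walltime, here in hh:mm:ss form
```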

Note that  is clever enough (!) to adjust the number of parallel processors in the steering file automatically to match the argument. However, the keyword  must already be present in the steering file: only its value is adjusted at the moment...
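The adjustment itself can be sketched in a few lines of shell. This assumes the relevant keyword is PARALLEL PROCESSORS (its usual TELEMAC name) and uses a throwaway sample steering file; neither detail is taken from the actual submission script:

```shell
# Create a small sample steering file (illustration only).
printf 'TITLE = example\nPARALLEL PROCESSORS = 1\n' > cas

# Rewrite the value of the keyword to match the requested node count.
# If the keyword is absent, nothing is changed, mirroring the limitation below.
NPROCS=12
sed "s/^PARALLEL PROCESSORS.*/PARALLEL PROCESSORS = ${NPROCS}/" cas > cas_par

grep 'PARALLEL PROCESSORS' cas_par   # prints: PARALLEL PROCESSORS = 12
```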

For instance, for ESTEL-3D one could use:

This would submit a job on 12 nodes, using one processor on each node, with a PBS walltime of 10 hours, to run a case named "cas" with the code ESTEL-3D. You would end up with a new file called , which is the file actually submitted to PBS, and also a new steering file, , which is a copy of  but with the correct number of parallel processors.
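For orientation, the file submitted to PBS might look broadly like the sketch below. Every name in it (job name, executable, steering file) is illustrative, as is the direct `mpirun` launch; inspect the file generated on your own system for the real contents:

```shell
#!/bin/bash
#PBS -N myjob
#PBS -l nodes=12
#PBS -l walltime=10:00:00

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Launch ESTEL-3D on the adjusted steering file (names are placeholders).
mpirun -np 12 estel3d cas_par
```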

= Limitations =
All these limitations are being worked on and will hopefully be addressed soon.
 * Cannot add the keyword  if it is not already present
 * Cannot deal with multiple processors and/or cores per node
 * Cannot generate a list of the processes to kill on each processor in case of a crash