How to run ESTEL in parallel on a cluster
This article describes how to run parallel jobs in ESTEL on HPC clusters.
Here, "cluster" means a Beowulf-style, dedicated high performance facility such as Blue Crystal. If you plan to run ESTEL on a network of workstations, use the article about networks of workstations instead.
Prerequisites
- TELEMAC system installed and configured for MPI. See the installation article if necessary.
- PBS queuing system on the cluster.
- Fortran compiler and local directory in the PATH (see below).
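A quick way to check most of these points is to query the relevant commands from an interactive shell. Keep in mind that the batch environment can differ from the interactive one, as explained in the next section, and that the exact compiler name depends on your installation:
$ which mpirun                          # MPI launcher available?
$ which qsub                            # PBS queuing system available?
$ echo $PATH | tr ':' '\n' | grep -Fx . # is the local directory "." in the PATH?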
Adjusting the PATH for batch jobs
When you submit a job to a queue on a cluster, the environment available when the job eventually runs is not necessarily the one you had when you submitted it. This is particularly important for the TELEMAC system, as the job submitted to the PBS queue needs access to the Fortran compiler and to the local folder (./) in order to run. Various solutions exist depending on the platform. On Bluecrystal, the easiest is to load all required modules in the .bashrc file.
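As a rough sketch, the relevant part of .bashrc could look like the lines below. The module names are placeholders only; run "module avail" to list the modules actually installed on your cluster:
# Load the compiler and MPI environment for every shell, including batch jobs.
# Module names below are placeholders; adapt them to your cluster.
module load languages/intel-fortran
module load mpi/openmpi
# Make sure the local directory is in the PATH, as required by TELEMAC.
export PATH=$PATH:.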
Submitting a job
When all prerequisites are satisfied, it is quite easy to submit a TELEMAC job to the PBS queue. A script, qsub-telemac, exists in the /path/to/systel90/bin/ directory and takes care of most aspects of the job submission. Get the script from the estel-bin subversion repository if necessary.
The syntax is the following:
$ qsub-telemac jobname nbnodes walltime code case
where:
- jobname is a name for the job you are submitting.
- nbnodes is the number of nodes to use. Only one processor per node is supported at the moment.
- walltime is the PBS walltime, i.e. the simulation will stop if this time is reached. The syntax is standard PBS syntax: hh:mm:ss or a single number in seconds.
- code is the name of the TELEMAC code to run, i.e. probably estel2d or estel3d.
- case is the name of the steering file to use for the simulation.
Note that qsub-telemac is clever enough to adjust the number of parallel processors in the steering file automatically to match the argument nbnodes. However, the keyword PARALLEL PROCESSORS needs to be present in the steering file already: only its value is adjusted at the moment.
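For example, the steering file could simply contain the keyword with a placeholder value, which qsub-telemac then overwrites:
/ Placeholder value; qsub-telemac replaces it to match the nbnodes argument.
PARALLEL PROCESSORS : 1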
For instance, for ESTEL-3D one could use:
$ qsub-telemac test 12 10:00:00 estel3d cas
This would submit a job on 12 nodes, using one processor on each node, with a PBS walltime of 10 hours, to run a case named "cas" with the code ESTEL-3D. You would end up with a new file called test, which is the file actually submitted to PBS, and also a new steering file, cas-test, which is a copy of cas but with the correct number of parallel processors.
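For reference, the generated test file is an ordinary PBS submission script. A minimal sketch of what it might contain is shown below; the exact content written by qsub-telemac will differ depending on your installation, so treat the launch command and paths as illustrative:
#!/bin/bash
#PBS -N test
#PBS -l nodes=12
#PBS -l walltime=10:00:00
# Run from the directory the job was submitted from, then launch ESTEL-3D
# on the adjusted steering file.
cd $PBS_O_WORKDIR
estel3d cas-test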
Limitations
All these limitations are being worked on and hopefully will be addressed soon.
- Cannot add the keyword PARALLEL PROCESSORS if it is not already present (a manual workaround is sketched below)
- Cannot deal with multiple processors and/or cores per node
- Cannot generate a list of the processes to kill on each processor in case of crash
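In the meantime, the first limitation is easy to work around by adding the keyword to the steering file by hand before submitting, for instance:
$ echo 'PARALLEL PROCESSORS : 1' >> cas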