How to run ESTEL in parallel on a cluster

From SourceWiki
Jump to navigation Jump to search
 
(One intermediate revision by the same user not shown)
Line 3: Line 3:
 
'''This article describes how to run parallel jobs in [[Estel | '''ESTEL''']] on HPC clusters.'''
 
[http://en.wikipedia.org/wiki/Beowulf_cluster Beowulf clusters] are true high-performance computing facilities such as [http://www.acrc.bris.ac.uk Blue Crystal]. If you plan to run [[Estel | '''ESTEL''']] on a network of workstations, use this [[How to run ESTEL in parallel | article about networks of workstations]] instead.
  
 
= Prerequisites =
* TELEMAC system installed and configured for MPI. See the [[Install the TELEMAC system | installation article]] if necessary.
* [http://en.wikipedia.org/wiki/Portable_Batch_System PBS] queuing system on the cluster.
* Fortran compiler and local directory in the <code>PATH</code> (see below).
 
= Adjusting the PATH for batch jobs =
When you submit a job to a queue on a cluster, the environment available when the job eventually runs is not necessarily the one you had when you submitted it. This is particularly important for the TELEMAC system as the job submitted to the PBS queue needs access to the Fortran compiler and to the local folder (<code>./</code>) to run. Various solutions exist depending on the platform. On Blue Crystal, the easiest is to load all required modules in the <code>.bashrc</code> file.
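For example, the end of your <code>.bashrc</code> could look like the sketch below. The module names are placeholders only, not the actual names on any particular cluster; load whichever Fortran compiler and MPI modules your TELEMAC installation was built against, and make sure the local folder is in the <code>PATH</code>:
<code><pre>
# Hypothetical module names -- replace with the ones used to build TELEMAC:
module load languages/intel-fortran
module load mpi/mpich

# Make the local folder (./) visible to batch jobs:
export PATH=./:$PATH
</pre></code>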
  
 
= Submitting a job =
 
When all prerequisites are satisfied, it is easy to submit a TELEMAC job to the PBS queue. A script in the <code>/path/to/systel90/bin/</code> directory takes care of most aspects of the job submission; get it from the <code>estel-bin</code> subversion repository if necessary.
 
The syntax is the following:
 
<code><pre>
$ qsub-telemac jobname nbnodes walltime code case
</pre></code>

where:
* <code>jobname</code> is a name for the job you are submitting.
* <code>nbnodes</code> is the number of nodes to use. Only one processor per node is supported at the moment.
* <code>walltime</code> is the PBS walltime, i.e. the simulation will be stopped if this time is reached. The syntax is standard PBS syntax: hh:mm:ss or a single number in seconds.
* <code>code</code> is the name of the TELEMAC code to run, i.e. probably '''estel2d''' or '''estel3d'''.
* <code>case</code> is the name of the steering file to use for the simulation.
 
Note that <code>qsub-telemac</code> automatically adjusts the number of parallel processors in the steering file to match the argument <code>nbnodes</code>. However, the keyword <code>PARALLEL PROCESSORS</code> must already be present in the steering file; only its value is adjusted at the moment.
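For example, a steering file that is ready for <code>qsub-telemac</code> simply needs to contain a line such as the one below (an illustrative fragment, not a complete steering file); the value itself does not matter as it is overwritten at submission time:
<code><pre>
PARALLEL PROCESSORS = 2
</pre></code>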
  
 
For instance, for '''ESTEL-3D''' one could use:
 
<code><pre>
$ qsub-telemac test 12 10:00:00 estel3d cas
</pre></code>

This would submit a job on 12 nodes, using one processor on each node, with a PBS walltime of 10 hours to run a case named "cas" with the code '''ESTEL-3D'''. You would end up with a new file called <code>test</code>, which is the file actually submitted to PBS, and a new steering file, <code>cas-test</code>, which is a copy of <code>cas</code> with the correct number of parallel processors.
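Once submitted, the job can be followed with the usual PBS commands, for example:
<code><pre>
$ qstat -u $USER       # list your queued and running jobs
$ qdel <job_id>        # remove a job from the queue if something went wrong
</pre></code>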
 
 
 
  
 
= Limitations =  
 
All these limitations are being worked on and will hopefully be addressed soon:
 
* Cannot add the keyword <code>PARALLEL PROCESSORS</code> if it is not present
* Cannot deal with multiple processors and/or cores per node
* Cannot generate a list of the processes to kill on each processor in case of a crash
