Condor

From SourceWiki

Condor: Making best use of the computers in the teaching labs

Introduction

The Condor Project enables us to run batch jobs on idle CPUs found in a pool of desktop PCs, such as those in offices and computer labs in the department. Condor is particularly useful for 'high-throughput' computing. An example of this is the use of an ensemble of independent model simulations to explore parameter-space.

Basic commands

You can run the following commands from the submission host of your condor pool. In Geography, the submission hosts are condor and dylan (Note that these are Linux servers).

You can review the status of all the computers in the 'condor pool' using the command condor_status:

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@GEOG-B224.gg WINNT51    INTEL  Owner     Idle     0.000  1661  0+01:30:04
slot2@GEOG-B224.gg WINNT51    INTEL  Owner     Idle     0.010  1661  0+01:30:05
slot1@geog-a105.gg WINNT51    INTEL  Claimed   Busy     1.130  1661  0+01:02:57
slot2@geog-a105.gg WINNT51    INTEL  Claimed   Busy     1.130  1661  0+01:02:58
slot1@geog-c200.gg WINNT51    INTEL  Unclaimed Idle     0.000  1662  0+00:00:04
slot2@geog-c200.gg WINNT51    INTEL  Unclaimed Idle     0.040  1662  0+00:00:00
...
...
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

       INTEL/WINNT51   191    78     109         4       0          0        0

               Total   191    78     109         4       0          0        0

Typically you will get several screenfuls of output, so I've chopped out a large swathe of the listing. This leaves us with just a few examples from the top of the listing and the final summary, given at the end.

From the listing, you can see that:

  • The PC called GEOG-B224 has someone logged into it, indicated by the keyword Owner, but it is not working hard and so is marked as Idle.
  • In contrast, geog-a105 is marked as Claimed, indicating that it has been grabbed by Condor, and it is working hard, so it is marked Busy.
  • The third possible state is exemplified by geog-c200, which is neither claimed nor in interactive use.

The final summary tells us that, at the time of writing, the pool contains 191 PCs, 78 of which have a user logged in, 109 are claimed by Condor and 4 remain unclaimed.
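If you would rather not scroll through the whole listing, the per-state tally can be rebuilt with a few lines of shell. The following is only a sketch: the function name count_states is made up for this page, and it assumes that every slot line starts with "slot" and that the State is the 4th column, as in the listing above.

```shell
#!/bin/sh
# count_states: read condor_status output on stdin and print how many
# slots are in each State (4th column of the listing shown above).
# Assumption: every slot line begins with "slot", as in that listing.
count_states() {
    awk '/^slot/ { n[$4]++ } END { for (s in n) print s, n[s] }' | sort
}

# Demonstration on the six sample lines from the listing above.
count_states <<'EOF'
slot1@GEOG-B224.gg WINNT51    INTEL  Owner     Idle     0.000  1661  0+01:30:04
slot2@GEOG-B224.gg WINNT51    INTEL  Owner     Idle     0.010  1661  0+01:30:05
slot1@geog-a105.gg WINNT51    INTEL  Claimed   Busy     1.130  1661  0+01:02:57
slot2@geog-a105.gg WINNT51    INTEL  Claimed   Busy     1.130  1661  0+01:02:58
slot1@geog-c200.gg WINNT51    INTEL  Unclaimed Idle     0.000  1662  0+00:00:04
slot2@geog-c200.gg WINNT51    INTEL  Unclaimed Idle     0.040  1662  0+00:00:00
EOF
# prints:
# Claimed 2
# Owner 2
# Unclaimed 2
```

On the submission host, the same function can be fed from the live pool with condor_status | count_states.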

Another view of the state-of-play is given by condor_q:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 51.0   ggdagw          4/22 13:21   0+01:30:07 R  0   3.7  test.bat          
...
...
 51.15  ggdagw          4/22 13:21   0+01:33:46 I  0   3.4  test.bat   
...
...
150 jobs; 44 idle, 106 running, 0 held   

Here we see that:

  • Job 51.0 was submitted at 13:21 on the 22nd of April and has been running for just over an hour and a half. Its executable is called 'test.bat'.
  • Job 51.15 is idling in the queue, rather than running.
  • In total 106 of the 150 jobs submitted to condor are running, and accordingly 44 are still waiting to run and so are idle.

Submitting a simple script job

OK, so much for other people's jobs. Let's submit some of our own! I have prepared some examples to make this as easy as possible. To get these examples, you can cut & paste the following onto your (Linux) command line (If you have a Windows submission host, the same files will work for you, but you will need to use TortoiseSVN to get them):

svn co https://svn.ggy.bris.ac.uk/subversion-open/condor/trunk ./condor
cd condor/examples/example1

Without further ado, let's submit our first job to the pool:

condor_submit win.submit

If you look inside win.submit, you will see that it is a short and reasonably self-explanatory file:

Universe   = vanilla
Notification = never
requirements = OpSys == "WINNT51" && Arch == "INTEL"
Output = test.out
Error = test.err
Log = test.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
Executable = test.bat
Queue

The key lines to note for the moment are:

  • the output of the job will collect in test.out
  • any errors will go to test.err
  • test.log will record the mechanics of sending the job to a remote PC
  • and that test.bat calls the shots!
  • the Queue keyword causes the job to enter the queue of jobs to be run

The executable file test.bat is a short and simple batch file (aka 'shell script' in Linux-speak):

echo Hello from a Condor batch file running on:
hostname
echo The date is:
date /T
echo and the time is:
time /T

The upshot of all this jiggery-pokery is the contents of test.out. All being well, this file should have been created by now (if not, consult condor_status and condor_q for more information on the state-of-play), and should look something like:

Hello from a Condor batch file running on:
geog-c211
The date is:
22/04/2009 
and the time is:
15:36
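If test.out has not appeared yet, you can poll for it instead of checking by hand. This is a minimal sketch in plain shell and not part of the examples: the helper name wait_for_file and the default timeout of 300 seconds are arbitrary choices made for illustration.

```shell
#!/bin/sh
# wait_for_file FILE [TIMEOUT_SECONDS]
# Polls once a second until FILE exists and is non-empty.
# Returns 0 on success, or 1 if the timeout (default 300 s) expires first.
wait_for_file() {
    file=$1
    timeout=${2:-300}
    waited=0
    while [ ! -s "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "gave up waiting for $file after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example use after condor_submit (waits up to ten minutes):
#   wait_for_file test.out 600 && cat test.out
```

The test is for a non-empty file rather than mere existence, so that an empty placeholder does not count as finished output.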

Congratulations! You've run your first Condor job. That's the hardest part out of the way. Honest! All that remain are a few more details. So far so good? OK, let's move on to the next example.

Submitting a compiled executable

Move to the next example directory by typing:

cd ../example2

In this example, we'll run a time-honoured "hello, world" job. I've included a pre-compiled executable for you to run. Note that I am assuming that your jobs will be run on Windows-based PCs, even though we are submitting from a Linux server. This is fine, since Condor can handle it, but we must remember it and not get our operating systems and compiled executables in a muddle.

To this end, I compiled hello_world.exe on my PC, as a DOS (or console) program. I hope it will run on your PCs too. When running jobs that you have compiled yourself, you must remember to compile them on Windows and to make sure that they will run on another Windows PC (not just your own), and then use, for example, the ssh file transfer tool to bring them over to the submission host. If you are in Geography and do not have a Windows C/C++ or Fortran compiler, contact me and I can probably compile your code for you.

OK, now that's all straight, let's run the job. We can do this as before:

condor_submit win.submit

Note that I have slightly modified the contents of win.submit:

...
Output = hello_world.out
Error = hello_world.err
Log = hello_world.log
Executable = hello_world.exe
Queue

And, all being well, we have a friendly message in hello_world.out:

Hello World!

Jobs which use inputs

Sometimes we need our batch job to read some inputs from file. We'll do this in the next example:

cd ../example3
condor_submit win.submit

In this case, we've added the file test.in to the bones of example1. This is used in test.bat:

...
type test.in

In DOS-speak, this command will just echo the contents of this file, which will end up at the bottom of test.out:

...
tada! I'm from the input..

and we reference the input file in win.submit:

...
transfer_input_files = test.in
...

Well, that wasn't too bad, eh?!

Using command line arguments

In addition to input files, we may find that we need to pass a program some command line arguments. This is easy enough also.

cd ../example4

An additional line in win.submit is:

arguments = 6

and the result of the program is:

8 is the 6th Fibonacci number
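The program in example4 is a compiled executable, so its source is not shown here. As a rough stand-in, the shell sketch below shows how a single command line argument can drive such a calculation. It is not the example's own code, and it assumes the convention F(1) = F(2) = 1, which is the one consistent with the output above.

```shell
#!/bin/sh
# fib N: print the Nth Fibonacci number, counting F(1) = F(2) = 1.
# Hypothetical stand-in for the example4 executable, for illustration only.
fib() {
    n=$1
    a=0   # F(0)
    b=1   # F(1)
    i=0
    while [ "$i" -lt "$n" ]; do
        next=$((a + b))
        a=$b
        b=$next
        i=$((i + 1))
    done
    echo "$a"
}

# The value 6 plays the role of the "arguments = 6" line in win.submit.
echo "$(fib 6) is the 6th Fibonacci number"
# prints: 8 is the 6th Fibonacci number
```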

Running an ensemble of jobs

Some of the real benefits of Condor come when we want to submit a number of jobs at the same time. Such an 'ensemble' is often used to explore parameter-space. Note that such an ensemble is often referred to as a 'cluster' in Condor-speak. This refers to a cluster of jobs, and is not to be confused with a cluster of servers used in some High Performance Computing setup.

cd ../example5

In this case, we're going to submit several jobs which all make use of the same Fortran program. Let's take a look at the salient parts of win.submit on this occasion:

...
transfer_input_files = input.nml
Output = test.out
Error = test.err
Log = test.log
Executable = fort_nml_io.exe
initialdir = job.$(Process)
Queue 3

The executable program file called fort_nml_io.exe is expecting to read an input file with the name input.nml. We would like to feed the program with different versions of the input file (providing different values for some parameters) for the different runs in our ensemble. However, we can't have multiple files with the same name in the same directory. (Things are a good deal easier using C/C++, since we can pass the name of the relevant input file as a command line argument. Fortran did not support command line arguments as standard until Fortran 2003.)

To get around this, we can create different sub-directories to hold the relevant files for each run. These will be called job.0, job.1, job.2 etc. A nifty piece of scripting allows us to refer to these in the submission file as job.$(Process). The number of jobs submitted is determined by the value given to Queue on the last line. In this case we have specified Queue 3, so 3 jobs will be submitted, with $(Process) taking the values 0, 1 and 2 in turn.
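For your own ensembles you will need to create these sub-directories yourself. The sketch below shows one way to do so from the Linux command line. It writes into a scratch directory called ensemble_sketch (an arbitrary name), and the namelist group 'params' and variable 'x' are invented for illustration; they are not the contents of the real input.nml in example5.

```shell
#!/bin/sh
# Create job.0, job.1, ... each holding its own input.nml.
# njobs must match the number given to Queue in the submission file.
njobs=3
top=ensemble_sketch   # scratch directory, so nothing in example5 is overwritten

i=0
while [ "$i" -lt "$njobs" ]; do
    mkdir -p "$top/job.$i"
    # Hypothetical namelist: group 'params' and variable 'x' are placeholders.
    cat > "$top/job.$i/input.nml" <<EOF
&params
  x = $i
/
EOF
    i=$((i + 1))
done

ls "$top"
```

Each run then finds a file called input.nml in its own initialdir, with a different parameter value in each.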

We run the ensemble in the usual way:

condor_submit win.submit

Take a look at the contents of each of the sub-directories. Notice that any output files created by the executable are also transferred back from the PC which ran it, and placed in the relevant sub-directory.

Further Examples

Well, in 5 small steps you've covered the majority of the ground required to make effective use of a Condor pool. However, if you would like some more information on Condor, the following pages are useful:

  • A few more examples of submission scripts are given by the authors of Condor in the following page: Condor quick start

Condor and energy saving measures

In an effort not to waste energy, we have set PCs which have been idle for several hours to shut down (or hibernate). These PCs are periodically 'woken up' (once early in the morning and again at lunchtime), so that they are ready to receive jobs again. If there are no calls on them for work, the PCs will shut down again after a few more hours have elapsed.

The Carbon Trust tells us that to produce 1 kWh of grid electricity, 0.537 kg of CO2 is emitted. We calculate that we can save up to 70 tons of CO2 emissions annually from the Geography department, through enabling the shutdown facility in Condor. To put this in context, (as of 2009) the average UK household emits 5.3 tons annually.
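To get a feel for these figures, the arithmetic can be run on the command line. This sketch uses only the numbers quoted above and assumes that the tons are metric tonnes of 1000 kg. The function name co2_summary is made up for illustration, and since the 70 tons is an upper estimate, so are the results.

```shell
#!/bin/sh
# co2_summary: convert the quoted annual CO2 saving into the equivalent
# electricity and into a number of average UK households.
co2_summary() {
    awk 'BEGIN {
        kg_per_kwh = 0.537   # kg of CO2 per kWh of grid electricity
        saved_tons = 70      # upper estimate of the annual saving
        household  = 5.3     # tons of CO2 per average UK household per year
        kg_per_ton = 1000    # assumption: metric tonnes
        printf "%.0f kWh\n", saved_tons * kg_per_ton / kg_per_kwh
        printf "%.1f households\n", saved_tons / household
    }'
}

co2_summary
# prints:
# 130354 kWh
# 13.2 households
```

In other words, the quoted saving corresponds to roughly 130,000 kWh of electricity a year, or the annual emissions of about 13 average households.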

Appendices

If you would like to send your job to a particular machine, you should augment the requirements line in your submission script as below:

requirements = ..... && TARGET.Machine=="geog-XXX.ggy.bris.ac.uk"

You should check that the machine is unclaimed prior to submitting, using condor_status.
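The check can be scripted along the following lines. This is a sketch only: the function name is_unclaimed is made up, it expects the short host name (e.g. geog-c200) rather than the fully qualified one, and it assumes the column layout of the condor_status listing shown earlier on this page.

```shell
#!/bin/sh
# is_unclaimed HOST: read condor_status output on stdin and succeed (exit 0)
# if at least one slot on HOST is in the Unclaimed state (4th column).
# Host names are compared case-insensitively; pass the short name only.
is_unclaimed() {
    awk -v host="$1" '
        BEGIN { want = "@" tolower(host) "." }
        index(tolower($1), want) == 0 { next }   # not a slot on this host
        $4 == "Unclaimed" { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Demonstration on one line from the sample listing.
printf '%s\n' 'slot1@geog-c200.gg WINNT51 INTEL Unclaimed Idle 0.000 1662 0+00:00:04' |
    is_unclaimed geog-c200 && echo "geog-c200 has a free slot"
# prints: geog-c200 has a free slot
```

On the submission host, the same test could guard the submission itself, for example: condor_status | is_unclaimed geog-c200 && condor_submit win.submit.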