Condor
Condor: Making best use of the computers in the teaching labs
Introduction
The Condor Project enables us to run batch jobs on the pool of desktop computers around the department that would otherwise be standing idle. Condor is particularly useful for 'high-throughput' computing, such as an ensemble of independent model simulations used to explore parameter space.
Basic commands
You can run the following commands from the submission host of your Condor pool. For Geography, this is condor.ggy.bris.ac.uk (note that this is a Linux server).
You can review the status of all the machines in the 'condor pool' using the command condor_status:
Name               OpSys    Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@GEOG-B224.gg WINNT51  INTEL  Owner     Idle     0.000  1661  0+01:30:04
slot2@GEOG-B224.gg WINNT51  INTEL  Owner     Idle     0.010  1661  0+01:30:05
slot1@geog-a105.gg WINNT51  INTEL  Claimed   Busy     1.130  1661  0+01:02:57
slot2@geog-a105.gg WINNT51  INTEL  Claimed   Busy     1.130  1661  0+01:02:58
slot1@geog-c200.gg WINNT51  INTEL  Unclaimed Idle     0.000  1662  0+00:00:04
slot2@geog-c200.gg WINNT51  INTEL  Unclaimed Idle     0.040  1662  0+00:00:00
...
...
              Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/WINNT51   191    78     109         4       0          0        0
        Total   191    78     109         4       0          0        0
Typically you will get several screenfuls of output, so I've chopped out the middle part of the listing, leaving just a few entries at the top and the final summary at the end.
From the listing, you can see that:
- The PC called GEOG-B224 has someone logged into it, indicated by the keyword Owner, but it is not working hard, as it is Idle.
- In contrast, geog-a105 is marked as Claimed, indicating that it has been grabbed by Condor, and it is working hard, as it is Busy.
- The third possible state is exemplified by geog-c200, which is marked as Unclaimed: it is neither claimed by Condor nor in interactive use.
The final summary tells us that at the time of writing, the pool contains 191 slots (the PCs in the listing above each contribute two, slot1 and slot2), 78 of which have a user logged in, 109 are claimed by Condor and 4 remain unclaimed.
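If you want to keep an eye on these totals from a script, the summary is easy to pick apart. The following is only a sketch in Python: it works on text that you have already captured from condor_status (here, the summary lines from the listing above), and the function name parse_totals is my own invention, not part of Condor.

```python
# Sketch: pull the pool totals out of text captured from condor_status.
# SAMPLE is copied from the summary at the end of the listing above.
SAMPLE = """\
              Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/WINNT51   191    78     109         4       0          0        0
        Total   191    78     109         4       0          0        0
"""


def parse_totals(text):
    """Return the final 'Total' row as a dict keyed by column name.

    parse_totals is a hypothetical helper, not a Condor tool.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # The header row starts with 'Total' followed by column names.
    header = next(r for r in rows if r[0] == "Total" and not r[1].isdigit())
    # The last row starting with 'Total' followed by numbers is the grand total.
    final = next(r for r in reversed(rows) if r[0] == "Total" and r[1].isdigit())
    return dict(zip(header, (int(value) for value in final[1:])))


totals = parse_totals(SAMPLE)
print(totals["Owner"], totals["Claimed"], totals["Unclaimed"])
```

The three states printed here should add up to the figure in the Total column, which is a handy sanity check on the parsing.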
Another view of the state-of-play is given by condor_q:
 ID      OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
 51.0    ggdagw   4/22 13:21   0+01:30:07  R  0   3.7  test.bat
...
...
 51.15   ggdagw   4/22 13:21   0+01:33:46  I  0   3.4  test.bat
...
...
150 jobs; 44 idle, 106 running, 0 held
Here we see that:
- job 51.0 was submitted at 13:21 on the 22nd of April, has been running for just over an hour and a half, and that the job executable is called 'test.bat'
- We can also see that job 51.15 is idling, rather than running.
- In total 106 of the 150 jobs submitted to condor are running, and accordingly 44 are still waiting to run and so are idle.
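The last line of the condor_q listing can be read by a script in much the same spirit. This is a sketch only: it assumes the summary line has exactly the wording shown in the listing above (other versions of Condor may word it differently), and parse_queue_summary is a made-up name.

```python
import re

# The summary line, copied from the condor_q listing above.
SUMMARY = "150 jobs; 44 idle, 106 running, 0 held"


def parse_queue_summary(line):
    """Split a condor_q summary line into named counts.

    Hypothetical helper; assumes the exact wording shown in SUMMARY.
    """
    match = re.search(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held", line)
    if match is None:
        raise ValueError("not a recognised condor_q summary line: %r" % line)
    jobs, idle, running, held = (int(group) for group in match.groups())
    return {"jobs": jobs, "idle": idle, "running": running, "held": held}


counts = parse_queue_summary(SUMMARY)
print("%d of %d jobs are running" % (counts["running"], counts["jobs"]))
```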
Submitting a simple script job
OK, so much for other people's jobs. Let's submit some ourselves. I have prepared some examples to make this as easy as possible. To get these, you can cut & paste the following onto your command line (I assume Linux here, but the same files will work for a Windows submission host too):
svn co http://source.ggy.bris.ac.uk/subversion-open/condor/trunk ./condor
cd condor/examples/example1
Without further ado, let's submit our first job to the pool:
condor_submit win.submit
If you look inside win.submit, you will see that it is a short and reasonably self-explanatory file:
Universe = vanilla
Notification = never
requirements = OpSys == "WINNT51" && Arch == "INTEL"
Output = test.out
Error = test.err
Log = test.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
Executable = test.bat
Queue
The key lines to note for the moment are:
- the output of the job will collect in test.out
- any errors will go to test.err
- test.log will record the mechanics of sending the job to a remote PC
- and that test.bat calls the shots!
- the Queue keyword causes the job to enter the queue of jobs to be run
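If you find yourself writing many submission files that differ only in a few names, you could generate them. Below is a rough sketch in Python which reproduces the lines of win.submit shown above; make_submit and its arguments are hypothetical names of my own, and the requirements line is simply copied from the example, so you would need to adapt it to your own pool.

```python
def make_submit(executable, stem, queue=1):
    """Return the text of a submission file modelled on win.submit above.

    make_submit is a hypothetical helper; 'stem' names the .out, .err
    and .log files, and 'queue' is the number of jobs to queue.
    """
    lines = [
        "Universe = vanilla",
        "Notification = never",
        'requirements = OpSys == "WINNT51" && Arch == "INTEL"',
        "Output = %s.out" % stem,
        "Error = %s.err" % stem,
        "Log = %s.log" % stem,
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT_OR_EVICT",
        "Executable = %s" % executable,
        # A bare Queue submits one job; Queue N submits N jobs.
        "Queue" if queue == 1 else "Queue %d" % queue,
    ]
    return "\n".join(lines) + "\n"


submit_text = make_submit("test.bat", "test")
print(submit_text)
```

You would write the returned text to a file and pass that file to condor_submit in the usual way.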
The executable file test.bat is a short and simple batch file (aka 'shell script' in Linux-speak):
echo Hello from a Condor batch file running on:
hostname
echo The date is:
date /T
echo and the time is:
time /T
The upshot of all the electrickery is the contents of test.out. All being well, this should have been created by now (if not, consult condor_status and condor_q for more information on the state-of-play):
Hello from a Condor batch file running on:
geog-c211
The date is:
22/04/2009
and the time is:
15:36
Congratulations! You've run your first Condor job. That's the hardest part out of the way. All we have now are a few more details. So far so good? OK, let's move on to the next example.
Submitting a compiled executable
cd ../example2
In this example, we'll run the time-honoured "hello, world" job. I've included a precompiled executable for you to run. Note that I am assuming that the jobs will be run on Windows-based PCs. Indeed, we are submitting from a Linux server, but running our jobs on Windows PCs. This is fine for Condor, but we must remember it and not get confused.
To this end, I compiled hello_world.exe on my PC, as a DOS (or console) program. I hope it will run on your PCs too. When running jobs that you have compiled yourself, you must remember to compile them on Windows, check that they will run on another Windows PC (not just your own), and then use, for example, the ssh file transfer tool to bring them over to the submission host.
OK, now that's all straight, let's run the job. We can do this as before:
condor_submit win.submit
Note that I have slightly modified the contents of win.submit:
...
Output = hello_world.out
Error = hello_world.err
Log = hello_world.log
Executable = hello_world.exe
Queue
And, all being well, we have a friendly message as output:
Hello World!
Jobs which use inputs
Sometimes we need our batch job to read some inputs from file. We'll do this in the next example:
cd ../example3
condor_submit win.submit
In this case, we've added the file test.in to the bones of example1. This is used in test.bat:
...
type test.in
In DOS-speak, this command will just echo the contents of this file, which will end up at the bottom of test.out:
...
tada! I'm from the input..
and we reference the input file in win.submit:
...
transfer_input_files = test.in
...
Well, that wasn't too bad, eh?!
Using command line arguments
In addition to input files, we may find that we need to pass a program some command line arguments. This is easy enough too.
cd ../example4
An additional line in win.submit is:
arguments = 6
and the result of the program is:
8 is the 6th Fibonacci number
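The source of the example program isn't shown here, so as an illustration of what happens to the argument, here is a stand-in sketch in Python. It assumes the counting convention implied by the output above (the 1st and 2nd Fibonacci numbers are both 1); the names fibonacci, ordinal and main are mine, not those of the real program.

```python
def fibonacci(n):
    """Return the n-th Fibonacci number, counting F(1) = F(2) = 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return b


def ordinal(n):
    """Return n with its English suffix, e.g. 6 -> '6th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "%d%s" % (n, suffix)


def main(argv):
    """argv holds the command line arguments, as Condor would pass them."""
    n = int(argv[0])
    return "%d is the %s Fibonacci number" % (fibonacci(n), ordinal(n))


# Equivalent to 'arguments = 6' in the submission file.
message = main(["6"])
print(message)
```

Note that the argument arrives as a string, so the program has to convert it to a number itself.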
Running an ensemble of jobs
Some of the real benefits of Condor come when we want to submit a number of jobs at the same time. Such an 'ensemble' is often used to explore parameter-space.
cd ../example5
In this case, we're going to submit several jobs using the same Fortran program. Let's take a look at the contents of win.submit on this occasion:
Universe = vanilla
Notification = never
requirements = OpSys == "WINNT51" && Arch == "INTEL"
transfer_input_files = input.nml
Output = test.out
Error = test.err
Log = test.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
Executable = fort_nml_io.exe
initialdir = job.$(Process)
Queue 3
The executable program file called fort_nml_io.exe is expecting to read an input file with the name input.nml. Of course, we would like to feed the program with different versions of the input file (providing different values for some parameters) for the different runs. However, we can't have multiple files with the same name in the same directory. (Things would be a good deal easier if the program took the name of its input file as a command line argument, as is routine in C. Standard Fortran only gained support for command line arguments with Fortran 2003, so many Fortran programs read a file with a fixed name instead.)
To get around this, we create different subdirectories to hold the relevant files for each run. These will be called job.0, job.1, job.2 etc. Condor expands $(Process) to the number of each job, which allows us to refer to these directories in the submission file as job.$(Process). The number of jobs submitted is determined by the value given to Queue on the last line. In this case we have specified Queue 3 and so 3 jobs will be submitted, numbered from 0 to 2.
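For a larger ensemble you will not want to create the subdirectories by hand. The sketch below shows one way to do it in Python, writing a different input.nml into each job.N directory. Please treat the namelist contents as pure invention: the group name params and the variable x are placeholders, since the real contents depend on what fort_nml_io.exe expects, and make_job_dirs is a hypothetical helper.

```python
import os
import tempfile


def make_job_dirs(root, values):
    """Create job.0, job.1, ... under root, one per parameter value.

    Each directory gets its own input.nml. The namelist group 'params'
    and the variable 'x' are placeholders, not the real input format.
    """
    paths = []
    for process, value in enumerate(values):
        job_dir = os.path.join(root, "job.%d" % process)
        os.makedirs(job_dir, exist_ok=True)
        path = os.path.join(job_dir, "input.nml")
        with open(path, "w") as handle:
            handle.write("&params\n  x = %s\n/\n" % value)
        paths.append(path)
    return paths


# Use a scratch directory here; in practice root would be the
# directory holding win.submit.
root = tempfile.mkdtemp()
created = make_job_dirs(root, [0.1, 0.2, 0.3])
print(sorted(os.listdir(root)))
```

The number of values in the list should match the number given to Queue, since the directories are numbered from 0 in the same way as the jobs.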
Run the ensemble in the usual way:
condor_submit win.submit
and then take a look at the contents of each of the subdirectories. Notice also that any output files created by the executable will also be transferred back from the PC which ran it, and placed in the relevant subdirectory.
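Once the jobs have finished, gathering the results back together is the mirror image of setting them up. The following Python sketch reads test.out from each job.N subdirectory, skipping any job that has not yet written one. The function name collect_outputs is hypothetical, and the demonstration at the end uses made-up file contents in a scratch directory rather than real results.

```python
import os
import tempfile


def collect_outputs(root, n_jobs, name="test.out"):
    """Return {job number: file contents} for each job.N/name that exists."""
    results = {}
    for process in range(n_jobs):
        path = os.path.join(root, "job.%d" % process, name)
        if os.path.isfile(path):
            with open(path) as handle:
                results[process] = handle.read()
    return results


# Demonstration with made-up contents: job.0 has finished, job.1 has not.
demo_root = tempfile.mkdtemp()
for process in range(2):
    os.makedirs(os.path.join(demo_root, "job.%d" % process))
with open(os.path.join(demo_root, "job.0", "test.out"), "w") as handle:
    handle.write("finished\n")

outputs = collect_outputs(demo_root, 2)
print(outputs)
```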
Further Examples
Condor and energy saving measures
PCs that have been idle for a considerable length of time are set to shut down (or hibernate). These PCs are periodically 'woken up' (once early in the morning and again at lunchtime), so that they are ready to receive jobs again. If there are no calls on them for work, the PCs will shut down again after a few hours have elapsed.