
'Parallel: Using more than one processor at a time'

Introduction

Computers. They're handy, right? Email, buying stuff from Amazon... You could even write a program and create your own application. Indeed, for people of a certain mindset--people like you and me--they let us do all sorts of interesting things, like simulating the natural world and, in the process, looking at questions like, "will Greenland melt if we carry on polluting the atmosphere as we are?" and "what will be the consequences if it does melt?".

Sometimes it's handy to have more than one computer. Let's say that we have a new whizz-bang weather model that takes 26 hours to work out what the weather will do tomorrow. "Very nice. Superb accuracy," you say, "but, alas, about as much use as a chocolate teapot." In order for the model to be useful in the real world, we need it to run faster. We need to divide up the work it does and run it over two or more computers. We need to enter the world of parallel programming.

"Hippee!" we cry, but a word of caution. Getting models to work reliably in parallel is a lot--I'll say it again, 'a lot--harder than getting them to work on a single processor. Before setting out down the road, it is well worth checking that you really do need your model to run faster, and that you've explored all other avenues to speed up your model before embarking on a parallelisation project.

Still keen? OK, let's get stuck in.

A Necessary Aside

We'll look at some parallel code in a moment--I promise--but before we do, we must take a look at some simple serial code and ask ourselves whether we've been as smart as we can and made the code run as fast as it can. I know you're keen to crack on, but trust me, it's really important to make sure that your serial code is optimised before converting it into parallel code. If you don't, any inefficiencies in your code will just be magnified when it is run on several processors at the same time, and that's not what you want.

Let's get some examples involving 2-dimensional arrays and nested loops from another short course on how to profile your code:

svn co http://source.ggy.bris.ac.uk/subversion-open/profiling/trunk ./profiling
cd profiling/examples/example2
make

and follow the accompanying notes.

Setting up your loops the wrong way round is bad enough for serial code. If you were to incorporate those same loops into some parallel code, you could very well end up creating a so-called false sharing condition, and your parallel code would end up running very slowly indeed.
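
To make the point concrete, here is a minimal, illustrative sketch (not taken from the profiling examples themselves) of the cache-friendly ordering for nested loops over a 2-dimensional array in Fortran: arrays are stored column-major, so the first index should vary fastest, i.e. in the innermost loop.

program loop_order
  implicit none
  integer, parameter :: n = 5000
  real :: a(n,n)
  integer :: i, j

  ! Fortran stores a(i,j) column by column, so let the first index, i,
  ! vary fastest by placing it in the innermost loop
  do j = 1, n
     do i = 1, n
        a(i,j) = real(i + j)
     end do
  end do
end program loop_order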

Righteo, forewarned is forearmed. We know that before we embark on any code parallelisation, we need to profile our code and get rid of any inefficiencies. Once we've done that, we're ready to start the task of making it run in parallel. Indeed, if you are lucky, you may find that your newly optimised code has been sped up sufficiently and you don't need to do anything else to it!

OpenMP

There are a number of different ways to create parallel programs, and we're going to start with one approach, called OpenMP. There are a number of reasons for this:

  1. It's widely available,
  2. it's good for the multi-core processors that we find in a lot of computers these days,
  3. it's fairly easy to use,
  4. and it's based upon the not-so-mind-bending concept of threads.
[Image: A schematic showing a multi-threaded process]

At this point, we could launch ourselves into a long and detailed discussion of threads, the OpenMP runtime environment, pre-processor macro statements and the like. But we won't, because it's less fun. Let's just try an example instead.

OK, to get the examples, log in to a Linux box and cut & paste the following onto the command line:

svn co http://source.ggy.bris.ac.uk/subversion-open/parallel/trunk ./parallel

Hello, world

Right, now do the following:

cd parallel/examples/example1
make
./omp_hello_f90.exe

Tada! Just like the old classic 'hello, world', but this time run in parallel on as many processors as you have available on your machine. Good eh? Now, how did that all work?

Take a look inside the file omp_hello_f90.f90. First up we have used a Fortran90 module containing routines specific to OpenMP:

use omp_lib

This gives us access to routines like:

omp_get_num_threads()

The rest of the program is straightforward Fortran code, except for some comment lines starting with !$omp, such as:

!$omp parallel private(nthreads, tid)
...
!$omp end parallel

These lines specify that the master thread should fork a parallel region. From the code, you will see that all the threads on the team will get their thread ID--through calls to the OpenMP library--and will print it. The master thread will also ask how many threads have been spawned by the OpenMP runtime environment, and will print the total.

Notice that the variables nthreads and tid have been marked as private. This means that separate copies of these variables will be kept for each thread. This is essential, or else the print statement, 'Hello, world from thread = ', would get all mixed up, right? Try deliberately mucking things up. Go on, see what happens if you delete tid from the private list.
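
If you don't have the source to hand, the heart of the program looks something along these lines (a minimal sketch of the same idea, not a verbatim copy of omp_hello_f90.f90):

program omp_hello
  use omp_lib
  implicit none
  integer :: nthreads, tid

  ! fork a team of threads; each thread gets its own copy of nthreads and tid
  !$omp parallel private(nthreads, tid)
  tid = omp_get_thread_num()
  print *, 'Hello, world from thread = ', tid
  if (tid == 0) then
     ! only the master thread asks for, and prints, the team size
     nthreads = omp_get_num_threads()
     print *, 'Number of threads = ', nthreads
  end if
  !$omp end parallel
end program omp_hello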

Look inside the Makefile and notice that the use of OpenMP has been flagged to the compiler: -fopenmp in this case, as we are using gfortran. It would be just -openmp if you were using ifort. You would get a compile-time error if you tried to compile the code without this flag.
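
For reference, compiling the Fortran version by hand with gfortran would look something like the line below; the exact flags in the Makefile may differ slightly:

gfortran -fopenmp omp_hello_f90.f90 -o omp_hello_f90.exe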

There is also a C version of the Fortran90 example in omp_hello_c.c.

Work Sharing inside Loops

OK, so far so good and we can run some code in parallel. As I mentioned in the introduction, however, the real benefits start to appear when we can divide up a task between different processors and let them get on with things in parallel. That way the program will run faster overall. In OpenMP-speak, this is worksharing. Let's look at a straightforward example:

cd ../example2

Again we have C and Fortran90 versions of the same program. Take a look inside schedule_f90.f90. Here we see that we have a parallel section in which each thread enquires of its ID and the master enquires of the total, as in example1. Notice this time that some variables are marked as shared, in particular the array a. Inside the parallel section, we have a do loop preceded by the comment:

!$omp do schedule(static,chunk)

This statement is the key to the worksharing. It means that the total number of iterations will be divided into chunks. The static keyword indicates that each chunk of (in this case 10) contiguous iterations will be farmed out to each thread in turn, in round-robin fashion. We see that we are just initialising the values of the array in the loop. All threads need access to the same array, and that's why it is marked as shared.
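
Pulling those pieces together, a cut-down sketch of this kind of worksharing loop might look as follows (variable names and sizes are illustrative rather than copied from schedule_f90.f90):

program omp_schedule
  use omp_lib
  implicit none
  integer, parameter :: n = 1000
  integer, parameter :: chunk = 10
  real :: a(n)
  integer :: i, tid

  !$omp parallel shared(a) private(i, tid)
  tid = omp_get_thread_num()

  ! each thread is handed chunks of 10 contiguous iterations in turn
  !$omp do schedule(static,chunk)
  do i = 1, n
     a(i) = real(i)      ! all threads initialise parts of the same, shared array
  end do
  !$omp end do

  !$omp end parallel
end program omp_schedule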

You can replace the static schedule with dynamic. In this case, chunks are handed out to threads as they become available, so a relatively quiet processor may pick up several chunks while a busier one takes fewer, and the share of work done by each processor may not be equal. Try changing the chunk size too.
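
The only change needed is to the schedule clause in the directive, for example:

!$omp do schedule(dynamic,chunk)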

Reductions

A reduction combines elements from the different threads to form a single result. Examples of reductions could be summations or products. These can be handy sometimes.

cd ../example3

In this example we'll look at an iterative algorithm to compute a value for pi and we'll see how we can speed this up by spreading the iteration over several processors. There are C and Fortran90 examples which you can look at.

Note that in both the reduction_ programs we explicitly control the number of threads that we want OpenMP to use (set to 4 initially) through a call to:

omp_set_num_threads(NUM_THREADS)

(A function call in C and a subroutine call in Fortran.)

In the C code, we flag the parallel reduction using the pragma statement:

#pragma omp parallel for reduction(+:sum) private(x)

We have a similar comment line in the Fortran code. In this language, we use a do loop, and have a corresponding end parallel do comment line for the benefit of OpenMP:

!$omp parallel do reduction(+:sum) private(x)
...
!$omp end parallel do

In both cases we see that sum is indicated as the reduction variable, and that x is kept private to each thread.
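
Putting the pieces together, a cut-down Fortran version of the parallel pi calculation might look something like this (a sketch of the general approach, rather than the exact contents of the example file; the number of steps and variable names such as step are illustrative):

program reduction_pi
  use omp_lib
  implicit none
  integer, parameter :: num_steps   = 1000000
  integer, parameter :: NUM_THREADS = 4
  integer :: i
  real(kind=8) :: x, sum, step, pi

  call omp_set_num_threads(NUM_THREADS)   ! explicitly request 4 threads
  step = 1.0d0 / num_steps
  sum  = 0.0d0

  ! each thread accumulates a private partial sum; OpenMP adds the partial
  ! sums together at the end because sum is the reduction variable
  !$omp parallel do reduction(+:sum) private(x)
  do i = 1, num_steps
     x = (i - 0.5d0) * step              ! midpoint of the i-th interval
     sum = sum + 4.0d0 / (1.0d0 + x*x)
  end do
  !$omp end parallel do

  pi = step * sum
  print *, 'pi is approximately ', pi
end program reduction_pi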

Now compile and run the programs, timing them as they run, e.g.:

make
time ./serial_pi_f.exe
time ./reduction_pi_f.exe

Note how much quicker the parallel code runs. You could try changing the number of threads that we utilise (probably up to a maximum of 8, since the hardware most likely has a maximum of 8 processor cores), recompiling using make and re-timing the runs.

Matrix Multiplication Examples

cd ../example4
make
time ./serial_mm_fort.exe
time ./omp_mm_fort.exe

Homework: Is there a better way to arrange the loops for these C and Fortran examples?