MPI: Message passing for distributed memory computing

Introduction

MPI-1[ref]--the first incarnation of the standard--arrived in 1994 in response to the need for a portable means to program the growing number of distributed memory computers appearing in the marketplace. MPI stands for Message Passing Interface, and as its name suggests, it is an API rather than a new programming language. At the time of writing, MPI can be used in C, C++, Fortran-77 and Fortran-90/95 programs. We will see that MPI-1 contained little on the topic of I/O. This was rectified in 1997 with the arrival of MPI-2[ref], which contained the MPI-IO standard (supporting parallel I/O) along with additional functionality to support the dynamic creation of processes and one-sided communication models.

We can extend Flynn's original taxonomy[ref] with the acronym SPMD--Single Program Multiple Data. This emphasises the fact that, using e.g. MPI, we can write a single program that will execute on a computer composed of multiple compute elements, each with its own--not shared--memory space.

Hello World

The quintessential start. Programs in C, Fortran-77 and Fortran-90.

MPI_Init() and MPI_Finalize(). MPI_Comm_size() and MPI_Comm_rank(), which use the communicator MPI_COMM_WORLD.

These programs assume that all processes can write to the screen. This is not a safe assumption.

Send and Receive

Processes send their messages back to the master process, and it then prints to screen. A much safer program.

MPI_Send(), MPI_Recv().

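The triple (address, count, datatype) describes the message buffer: where the data starts, how many elements it holds, and the type of each element. The same pattern appears in both MPI_Send() and MPI_Recv().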

Tags. Can use these as a form of filtering: junk mail, bin; handwritten & perfumed, open now!; bill, open later!

Communicators. Messages do not pass between different communicators. We can create custom communicators.

A Common Bug

If all processes are waiting to receive prior to sending, then we will have deadlock. See the example of a pairwise exchange.

Some Parallelisation Examples

Numerical Integration using the Trapezoidal Rule

First, numerical integration using the trapezoidal rule. The results we get from this program are highly sensitive to the number of compute elements (CE) when we have a relatively small number of trapezoids. For example, I get a worse estimate if I run this example with 4 processes rather than 3. Can you see why?
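One possible explanation can be explored without MPI at all. The C sketch below assumes that the program hands each process n / p trapezoids using integer division. The original program is not reproduced here, so this is only an assumption about its structure. The integrand and the function names are likewise invented for illustration, and the "processes" are emulated by an ordinary loop.

```c
/* Hypothetical integrand: f(x) = x * x, whose integral over [0, 1] is 1/3. */
double integrand(double x)
{
    return x * x;
}

/* Serial trapezoidal rule on [a, b] using n trapezoids of width h. */
double trapezoid(double a, double b, int n, double h)
{
    if (n <= 0)
        return 0.0;                       /* nothing to integrate */
    double sum = (integrand(a) + integrand(b)) / 2.0;
    for (int i = 1; i < n; i++)
        sum += integrand(a + i * h);
    return sum * h;
}

/* Emulate p processes: each "rank" integrates n / p trapezoids of its own.
 * ASSUMPTION: the work is divided with integer division, so the final
 * n % p trapezoids are never assigned to anybody. */
double split_estimate(double a, double b, int n, int p)
{
    double h = (b - a) / n;
    int local_n = n / p;                  /* integer division drops the remainder */
    double total = 0.0;
    for (int rank = 0; rank < p; rank++) {
        double local_a = a + rank * local_n * h;
        double local_b = local_a + local_n * h;
        total += trapezoid(local_a, local_b, local_n, h);
    }
    return total;
}
```

Under these assumptions, split_estimate(0.0, 1.0, 6, 3) covers all six trapezoids and returns 73/216, roughly 0.338, whereas split_estimate(0.0, 1.0, 6, 4) covers only four of them and returns 22/216, roughly 0.102. The effect shrinks as the number of trapezoids grows, because the dropped remainder becomes a smaller fraction of the interval.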

Estimation using Monte Carlo Techniques

Notice that the accuracy does not increase monotonically. Monte Carlo techniques are suited to parallelisation. In particular, they are robust to the loss of a compute element.
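The robustness claim can be illustrated with a small sketch in plain C, assuming the familiar dartboard estimate of pi. The Tally type, the function names and the simple random number generator are all choices made for this example, and the generator is for illustration only rather than serious use. Each worker produces an independent tally, and the estimate is formed from whichever tallies arrive.

```c
#include <stdint.h>

typedef struct {
    long hits;     /* darts that landed inside the quarter circle */
    long throws;   /* darts thrown in total */
} Tally;

/* Small linear congruential generator returning a value in [0, 1).
 * Illustration only: not a high-quality generator. */
static double uniform01(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return (*state >> 8) / 16777216.0;    /* keep the top 24 bits */
}

/* The work of one compute element: throw darts at the unit square. */
Tally throw_darts(long throws, uint32_t seed)
{
    Tally t = { 0, throws };
    uint32_t state = seed;
    for (long i = 0; i < throws; i++) {
        double x = uniform01(&state);
        double y = uniform01(&state);
        if (x * x + y * y <= 1.0)
            t.hits++;
    }
    return t;
}

/* Combine whichever tallies are available. A missing worker only
 * reduces the sample size; it does not invalidate the estimate. */
double estimate_pi(const Tally *tallies, int count)
{
    long hits = 0, throws = 0;
    for (int i = 0; i < count; i++) {
        hits += tallies[i].hits;
        throws += tallies[i].throws;
    }
    return throws > 0 ? 4.0 * hits / throws : 0.0;
}
```

In an MPI program each process would call something like throw_darts() with its own seed, and the tallies could be summed with a collective such as MPI_Reduce(). Calling estimate_pi() on three tallies out of four still gives a usable estimate, just from fewer throws.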

Exercises

Experiment with the free parameters (number of trapezoids, number of processes) in the numerical integration example.
Experiment with the number of throws at the dartboard. What happens to the accuracy of the estimate on average? Ask yourself, how accurate an estimate do I need in order to solve my problem?
What happens if the tags don't match? (ans. deadlock)
Create two custom communicators: master & evens, and master & odds. Write a 'Chinese whispers' program that cycles messages around the two communicators in a round-robin fashion, randomly morphing a character.

Non-Blocking Communication

Synchronisation, Blocking and the role of Buffers

Consider independent 'compute elements'. Synchronised communication requires that both sender and receiver are ready. Through the introduction of a buffer, a sender can deposit a message before the receiver is ready. MPI_Recv() only returns when the message has been received, however. Hence the term blocking.

Having processors often idle, waiting to receive messages when they could be getting on with something useful, will degrade performance. One option is to use non-blocking sends and receives.

MPI_Isend(), MPI_Irecv().

In fact, we would like our algorithms to work as asynchronously as possible.

Latency Hiding: first class letters, coal, canals & power stations

With increasingly parallel architectures, latency will only get worse.

There will be an inevitable latency between the time a message is sent and when it is received. Say ~24hrs for a first class letter. If we sat twiddling our thumbs while we waited for the letter, we wouldn't get much done. If on the other hand, we can profitably spend our time working on something until the letter arrives, then we have effectively hidden the latency time. Think of a coal-fired power station that receives its fuel by canal barge. The barge may take a long time to travel between the pit and the boilers. If, however, the power station has a sufficient buffer, then the time spent on the canal doesn't matter.

Asynchronous

We want our codes to be as asynchronous (and latency tolerant) as possible.

Collective Communications

Load Balancing

Sending Fewer Messages: Pack and Derived Types
