Revision as of 15:45, 17 December 2012

Benchmarking UM Version4.5 on different Architectures

Preamble

Cluster/Parallel file systems are often a bottleneck. Timings are for writing to local disk, unless specified otherwise.
If the model is not filesystem-bound, it is often (MPI message) latency-bound.
Only the master process writes output; this can lead to load-balance issues, which hinder scaling.
Worst-case message latencies for a cohort of processors are what matter for scaling. The vast majority of messages are either ~100 bytes or ~1KB in size, so latencies are reported for these key message sizes.
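
To make the worst-case point concrete, here is a minimal sketch that picks the slowest link out of a set of measured pairwise latencies. The function name, the rank pairs and the sample values are all hypothetical and only illustrate the idea; they are not measurements from any machine discussed here.

```python
def worst_case_latency(pair_latencies_us):
    """Return (pair, latency_us) for the slowest link in a cohort.

    pair_latencies_us maps a (rank_a, rank_b) tuple to a measured
    one-way latency in microseconds.
    """
    if not pair_latencies_us:
        raise ValueError("need at least one measured pair")
    # The slowest pair bounds any step in which every rank must wait
    # for its neighbours before continuing.
    pair = max(pair_latencies_us, key=pair_latencies_us.get)
    return pair, pair_latencies_us[pair]


# Hypothetical sample input: latencies in microseconds for small messages.
sample = {
    (0, 1): 1.15,  # assumed: same node, different socket
    (0, 2): 2.4,   # assumed: different nodes
    (1, 3): 2.4,   # assumed: different nodes
    (2, 3): 1.15,  # assumed: same node, different socket
}

if __name__ == "__main__":
    pair, latency = worst_case_latency(sample)
    print(f"slowest pair {pair}: {latency} us")
```

In this toy input the inter-node links set the pace, even though half of the pairs communicate much faster.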

IMB ping-pong message latency
	0 bytes	128 bytes	1024 bytes
between nodes	~2.0us	~2.4us	~4.7us

IMB ping-pong message latency
	0 bytes	128 bytes	1024 bytes
between sockets	~0.70us	~1.15us	~2.0us
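
For readers unfamiliar with how a ping-pong figure is obtained: one side sends a message, the other echoes it back, and the latency is taken as half the averaged round-trip time. The sketch below shows only that measurement method, using a local socket pair and a thread from the Python standard library as a stand-in for two MPI ranks. It is not IMB, the function names are made up for illustration, and its numbers are not comparable with the MPI latencies in the tables.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from a stream socket (recv may return less)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def pingpong_latency_us(msg_bytes, iterations=1000):
    """Mean one-way latency in microseconds: round-trip time divided by two."""
    if msg_bytes < 1:
        raise ValueError("this stream-based stand-in needs at least 1 byte")
    ping, pong = socket.socketpair()
    payload = b"x" * msg_bytes

    def echo():
        # The "pong" side: return every message unchanged.
        for _ in range(iterations):
            pong.sendall(recv_exact(pong, msg_bytes))

    worker = threading.Thread(target=echo)
    worker.start()
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            ping.sendall(payload)
            recv_exact(ping, msg_bytes)
        elapsed = time.perf_counter() - start
        worker.join()
    finally:
        ping.close()
        pong.close()
    return elapsed / (2 * iterations) * 1e6


if __name__ == "__main__":
    # The two message sizes highlighted in the preamble.
    for size in (128, 1024):
        print(f"{size} bytes: {pingpong_latency_us(size):.2f} us")
```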

The last line of this table shows a real problem scaling beyond 16 cores. Load balance? (Latencies are much better than QDR IB.)
Would like to try to improve file writing performance and re-run.

 =Intel Westmere=