Latest revision as of 14:36, 24 May 2013

Benchmarking UM Version4.5 on different Architectures

Preamble

Cluster/Parallel file systems are often a bottleneck. Timings are for writing to local disk, unless specified otherwise.
If the model is not filesystem-bound, it is often (MPI message) latency-bound.
Only the master process writes output; this can lead to load-balance issues, which hinder scaling.
Worst-case message latencies for a cohort of processors are what matter for scaling. The vast majority of messages are either ~100 bytes or ~1KB in size. Latencies are reported for these key message sizes.

IMB ping-pong message latency
	0 bytes	128 bytes	1024 bytes
between nodes	~2.0us	~2.4us	~4.7us

IMB ping-pong message latency
	0 bytes	128 bytes	1024 bytes
between sockets	~0.70us	~1.15us	~2.0us
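
As a rough illustration of how the latencies above relate to message size, the sketch below fits a simple linear model, t(n) = t0 + n * per-byte cost, through the 0-byte and 1024-byte figures and compares it with the 128-byte measurement. The linear model and the names `LATENCY_US`, `fit_two_point` and `predict` are assumptions made for this example; they are not part of IMB or the UM.

```python
# Ping-pong latencies quoted above, in microseconds, keyed by message size in bytes.
LATENCY_US = {
    "between nodes": {0: 2.0, 128: 2.4, 1024: 4.7},
    "between sockets": {0: 0.70, 128: 1.15, 1024: 2.0},
}


def fit_two_point(samples):
    """Fit t(n) = t0 + n * per_byte from the 0-byte and largest-message samples."""
    largest = max(samples)
    t0 = samples[0]
    per_byte = (samples[largest] - t0) / largest
    return t0, per_byte


def predict(samples, nbytes):
    """Estimate the latency (in us) of an nbytes message under the linear model."""
    t0, per_byte = fit_two_point(samples)
    return t0 + nbytes * per_byte


for path, samples in LATENCY_US.items():
    t0, per_byte = fit_two_point(samples)
    print(
        f"{path}: t0 = {t0:.2f} us, {per_byte * 1000:.2f} ns/byte, "
        f"predicted 128-byte latency = {predict(samples, 128):.2f} us "
        f"(measured ~{samples[128]} us)"
    )
```

Between nodes the linear estimate for 128 bytes (about 2.34us) is close to the measured ~2.4us. Between sockets it underestimates (about 0.86us against ~1.15us), so the model is only a first approximation there.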

The last line of this table shows a real problem scaling beyond 16 cores. Is load balance the cause? (Latencies are much better than with QDR IB.)
We would like to improve file-writing performance and re-run.

@@ Line 7: / Line 7: @@
 * If the model is not filesystem-bound, it is often (MPI message) latency-bound.
 * Only the master process writes output; this can lead to load-balance issues, which hinder scaling.
-* Worst-case message latencies for a cohort of processors are what matter for scaling.
+* Worst-case message latencies for a cohort of processors are what matter for scaling.  The vast majority of messages are either ~100 bytes or ~1KB in size.  Latencies are reported for these key message sizes.
-=Intel Westmere=
-* Emerald.  Intel E5649 (2.53GHz)
+=Emerald=
+* Intel Westmere E5649 (2.53GHz)
 * QDR Infiniband (non-RoCE)
 * GCOMv3.1
@@ Line 51: / Line 52: @@
 |}
-=Intel SandyBridge=
+=Intel SandyBridge Test System=
 * Test system: Quad socket, 8-core E5-4650L (2.60GHz) (L for Low power)
@@ Line 91: / Line 93: @@
 |-
 || 8x4 || 32 || ~65
+|-
+|}
+=Polaris=
+* Intel E5-2670 @ 2.60GHz
+* Infiniband: Mellanox Technologies MT27500 Family [ConnectX-3]
+* Lustre
+==FAMOUS==
+{| border="1" cellpadding="10"
+|| Domain Decomposition || Number of Cores || Model-years/day
+|-
+|| 4x4 || 16 || ~330
+|-
+|| 8x4 || 32 || ~330
+|-
+|}
+==HadCM3==
+{| border="1" cellpadding="10"
+|| Domain Decomposition || Number of Cores || Model-years/day
+|-
+|| 4x4 || 16 || ~51
+|-
+|| 8x4 || 32 || ~73
+|-
+|| 16x4 || 64 || ~73
 |-
 |}
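
To make the scaling in the Polaris tables easier to read, the following sketch computes speedup and parallel efficiency relative to the smallest core count listed for each model. The baseline choice (16 cores rather than a single core) and the names `POLARIS` and `scaling` are assumptions made for this example. The inputs are the approximate model-years/day figures from the tables.

```python
# Throughput in model-years/day per core count, copied from the Polaris tables above.
POLARIS = {
    "FAMOUS": {16: 330, 32: 330},
    "HadCM3": {16: 51, 32: 73, 64: 73},
}


def scaling(results):
    """Return {cores: (speedup, efficiency)} relative to the smallest core count."""
    base_cores = min(results)
    base_rate = results[base_cores]
    out = {}
    for cores, rate in sorted(results.items()):
        speedup = rate / base_rate
        # Efficiency = achieved speedup divided by the increase in core count.
        efficiency = speedup / (cores / base_cores)
        out[cores] = (speedup, efficiency)
    return out


for model, results in POLARIS.items():
    for cores, (speedup, eff) in scaling(results).items():
        print(f"{model:7s} {cores:3d} cores: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

On these figures FAMOUS gains nothing from 16 to 32 cores (50% efficiency relative to 16 cores), and HadCM3 stops improving beyond 32 cores.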