Difference between revisions of "UM version4.5 benchmarks"


Revision as of 15:31, 17 December 2012

Benchmarking UM Version4.5 on different Architectures

Preamble

  • Cluster and parallel file systems are often a bottleneck.
  • If the model is not filesystem-bound, it is often bound by MPI message latency.
  • Only the master process writes output, which can lead to load-balance issues that hinder scaling.

AMD Bulldozer

Intel Westmere

  • Emerald.
  • QDR Infiniband (non-RoCE)
  • GCOMv3.1

FAMOUS

Domain Decomposition    Number of Cores    Model-years/day
4x3                     12                 ~313
6x4                     24                 ~360
12x3                    36                 ~424

HadCM3

Domain Decomposition    Number of Cores    Model-years/day
4x3                     12                 ~24
6x4                     24                 ~40
12x3                    36                 ~60
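
The scaling in the two Westmere tables can be compared by computing speedup and parallel efficiency relative to the 12-core run. The following is a minimal sketch, not part of the original benchmark scripts; the names `WESTMERE` and `scaling` are made up for illustration, and the figures are the approximate values from the tables above.

```python
# Throughput figures copied from the Westmere tables above (model-years/day).
# The table values are approximate ("~"), so the results are approximate too.
WESTMERE = {
    "FAMOUS": [(12, 313.0), (24, 360.0), (36, 424.0)],
    "HadCM3": [(12, 24.0), (24, 40.0), (36, 60.0)],
}


def scaling(runs):
    """Return (cores, speedup, efficiency) relative to the smallest run."""
    base_cores, base_rate = runs[0]
    rows = []
    for cores, rate in runs:
        speedup = rate / base_rate                    # throughput gain
        efficiency = speedup / (cores / base_cores)   # gain per added core
        rows.append((cores, speedup, efficiency))
    return rows


for model, runs in WESTMERE.items():
    for cores, speedup, eff in scaling(runs):
        print(f"{model:7s} {cores:3d} cores  speedup {speedup:4.2f}  efficiency {eff:4.0%}")
```

On these numbers HadCM3 keeps roughly 83% efficiency at 36 cores, while FAMOUS falls to roughly 45%.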

Intel SandyBridge

  • Test system: quad-socket, 8-core Xeon E5-4650L (2.60GHz; the L denotes low power)
  • 20MB L3 cache


MPI message latency

Message size       0 bytes    128 bytes    1024 bytes
between sockets    ~0.70us    ~1.15us      ~2.0us
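
One way to read the latency figures is to separate the fixed start-up cost (the 0-byte latency) from the cost of each additional byte. The sketch below is only an illustration under that assumption; `LATENCY_US` and `incremental_cost` are hypothetical names, and the inputs are the approximate values from the table above.

```python
# (message size in bytes, latency in microseconds), from the table above.
LATENCY_US = [(0, 0.70), (128, 1.15), (1024, 2.0)]


def incremental_cost(points):
    """Per-byte cost (us/byte) between consecutive measurements."""
    return [
        (n1, n2, (t2 - t1) / (n2 - n1))
        for (n1, t1), (n2, t2) in zip(points, points[1:])
    ]


for lo, hi, cost in incremental_cost(LATENCY_US):
    # Convert us/byte to ns/byte for readability.
    print(f"{lo:5d}-{hi:<5d} bytes: {cost * 1000:.2f} ns/byte")
```

With these inputs the first 128 bytes cost roughly 3.5 ns/byte and the following 896 bytes roughly 0.95 ns/byte, so for small messages the ~0.70us start-up cost dominates.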

FAMOUS

Domain Decomposition    Number of Cores    Model-years/day
4x2                     8                  ~327
8x2                     16                 ~450
8x4                     32                 ~480
  • The last row of this table shows a real problem scaling beyond 16 cores: doubling to 32 cores raises throughput only from ~450 to ~480 model-years/day. Load balance?
  • Would like to improve file-writing performance and re-run.
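
To gauge how much of the 32-core shortfall a fixed, non-scaling cost (such as output written by the master process) could explain, one can fit a simple Amdahl-style model to the 8-core and 16-core runs and extrapolate. This is a rough sketch under the assumption that time per model-year is `t(p) = a + b / p`; the model and the function names are illustrative and were not used in the original benchmarking.

```python
def fit_fixed_cost(p1, rate1, p2, rate2):
    """Fit t(p) = a + b / p (days per model-year) through two runs.

    a is the cost that does not shrink with core count,
    b is the perfectly parallel work.
    """
    t1, t2 = 1.0 / rate1, 1.0 / rate2
    b = (t1 - t2) / (1.0 / p1 - 1.0 / p2)
    a = t1 - b / p1
    return a, b


def predicted_rate(a, b, cores):
    """Model-years/day predicted by the fitted model."""
    return 1.0 / (a + b / cores)


# FAMOUS on SandyBridge: ~327 at 8 cores, ~450 at 16 cores (table above).
a, b = fit_fixed_cost(8, 327.0, 16, 450.0)
print(f"non-scaling share of run time at 8 cores: {a / (a + b / 8):.0%}")
print(f"predicted at 32 cores: {predicted_rate(a, b, 32):.0f} model-years/day (measured ~480)")
```

Under this assumption the model predicts roughly 554 model-years/day at 32 cores, which is above the measured ~480. A constant serial cost alone would therefore not account for the whole shortfall, which is consistent with the load-balance question raised above, although two data points are too few to draw a firm conclusion.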

HadCM3

Domain Decomposition    Number of Cores    Model-years/day
8x2                     16                 ~48
8x4                     32                 ~65