UM version4.5 benchmarks

From SourceWiki

Latest revision as of 14:36, 24 May 2013

Benchmarking UM Version4.5 on different Architectures

Preamble

  • Cluster/Parallel file systems are often a bottleneck. Timings are for writing to local disk, unless specified otherwise.
  • If the model is not filesystem-bound, it is often bound by MPI message latency.
  • Only the master process writes output. This can lead to load-balance issues, which hinder scaling.
  • Worst-case message latencies across a cohort of processors are what matter for scaling. The vast majority of messages are either ~100 bytes or ~1KB in size, so latencies are reported for these key message sizes.


Emerald

  • Intel Westmere E5649 (2.53GHz)
  • QDR Infiniband (non-RoCE)
  • GCOMv3.1


IMB ping-pong message latency

Path             0 bytes   128 bytes   1024 bytes
between nodes    ~2.0us    ~2.4us      ~4.7us
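
As a rough illustration of what the latency figures above imply, the sketch below derives the incremental per-byte cost between the reported message sizes. It assumes a simple linear cost model (fixed start-up latency plus a per-byte term), which is an assumption made here and not something stated on this page. The function names are hypothetical and the inputs are the approximate (~) values from the table.

```python
# Approximate Emerald IMB ping-pong latencies from the table (microseconds),
# keyed by message size in bytes.
latency_us = {0: 2.0, 128: 2.4, 1024: 4.7}


def per_byte_cost_ns(size_a, size_b, table):
    """Incremental cost per byte (nanoseconds) between two message sizes."""
    return (table[size_b] - table[size_a]) * 1000.0 / (size_b - size_a)


def implied_bandwidth_mb_s(size_a, size_b, table):
    """Incremental bandwidth implied by two points.

    Bytes per microsecond is numerically equal to MB/s (1 MB = 1e6 bytes).
    """
    return (size_b - size_a) / (table[size_b] - table[size_a])


for a, b in [(0, 128), (128, 1024)]:
    print(f"{a}-{b} bytes: {per_byte_cost_ns(a, b, latency_us):.2f} ns/byte, "
          f"{implied_bandwidth_mb_s(a, b, latency_us):.0f} MB/s incremental")
```

For messages this small the fixed start-up cost dominates, which is consistent with the preamble's point that latency, not bandwidth, is what matters for scaling.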

FAMOUS

Domain Decomposition   Number of Cores   Model-years/day
4x3                    12                ~313
6x4                    24                ~360
12x3                   36                ~424
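
To make the scaling in this table easier to read, the following sketch computes speedup and parallel efficiency relative to the smallest run (12 cores). The helper name `relative_scaling` is hypothetical, and the throughput values are the approximate (~) figures from the table, so the results are indicative only.

```python
# (cores, model-years/day) for FAMOUS on Emerald, taken from the table.
famous_emerald = [(12, 313.0), (24, 360.0), (36, 424.0)]


def relative_scaling(results):
    """Return (cores, speedup, efficiency) relative to the first entry.

    speedup    = throughput / baseline throughput
    efficiency = speedup / (cores / baseline cores)
    """
    base_cores, base_rate = results[0]
    rows = []
    for cores, rate in results:
        speedup = rate / base_rate
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows


for cores, speedup, efficiency in relative_scaling(famous_emerald):
    print(f"{cores:3d} cores: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

The same function can be applied to any of the other tables on this page by swapping in their (cores, model-years/day) pairs.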

HadCM3

Domain Decomposition   Number of Cores   Model-years/day
4x3                    12                ~24
6x4                    24                ~40
12x3                   36                ~60


Intel SandyBridge Test System

  • Test system: quad-socket, 8-core E5-4650L (2.60GHz) (L for low power)
  • 20MB L3 cache
  • GCOMv3.1


IMB ping-pong message latency

Path               0 bytes   128 bytes   1024 bytes
between sockets    ~0.70us   ~1.15us     ~2.0us

FAMOUS

Domain Decomposition   Number of Cores   Model-years/day
4x2                    8                 ~327
8x2                    16                ~450
8x4                    32                ~480

  • The last line of this table shows a real problem scaling beyond 16 cores: doubling from 16 to 32 cores raises throughput only from ~450 to ~480 model-years/day. Load balance may be the cause, since message latencies here are much better than over QDR InfiniBand.
  • To do: improve file-writing performance and re-run.
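
One way to probe the load-balance question is to check whether a fixed serial fraction alone could explain the 32-core result. The sketch below fits Amdahl's law to the 8-core and 16-core points and extrapolates to 32 cores. Amdahl's law is an assumed model brought in here for illustration, not an analysis from this page, and the function names are hypothetical.

```python
def amdahl_serial_fraction(p1, rate1, p2, rate2):
    """Serial fraction s implied by two throughput measurements.

    Assumes run time T(p) is proportional to s + (1 - s) / p, so that
    rate2 / rate1 = T(p1) / T(p2). Solving for s gives the expression below.
    """
    r = rate2 / rate1
    return (r / p2 - 1.0 / p1) / ((1.0 - 1.0 / p1) - r * (1.0 - 1.0 / p2))


def amdahl_predicted_rate(s, p_ref, rate_ref, p):
    """Throughput at p cores predicted from a reference point (p_ref, rate_ref)."""
    t_ref = s + (1.0 - s) / p_ref
    t_new = s + (1.0 - s) / p
    return rate_ref * t_ref / t_new


# Approximate FAMOUS figures from the SandyBridge test system table.
s = amdahl_serial_fraction(8, 327.0, 16, 450.0)
predicted_32 = amdahl_predicted_rate(s, 16, 450.0, 32)
print(f"implied serial fraction: {s:.3f}")
print(f"predicted model-years/day at 32 cores: {predicted_32:.0f}")
```

Under these assumptions the fit gives a serial fraction of roughly 0.09 and predicts well above the measured ~480 model-years/day at 32 cores. That would suggest an overhead that grows with core count, such as load imbalance from single-process output, on top of any fixed serial work. With only three approximate data points this is indicative at best.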

HadCM3

Domain Decomposition   Number of Cores   Model-years/day
8x2                    16                ~48
8x4                    32                ~65


Polaris

  • Intel E5-2670 @ 2.60GHz
  • Infiniband: Mellanox Technologies MT27500 Family [ConnectX-3]
  • Lustre


FAMOUS

Domain Decomposition   Number of Cores   Model-years/day
4x4                    16                ~330
8x4                    32                ~330

HadCM3

Domain Decomposition   Number of Cores   Model-years/day
4x4                    16                ~51
8x4                    32                ~73
16x4                   64                ~73