Attending organisations

  • ARCHER (Tier-1), Cirrus (Tier-2): EPCC, The University of Edinburgh
  • Cumulus (Tier-2): University of Cambridge
  • Isambard (Tier-2): GW4, University of Bath
  • Aberystwyth University, Supercomputing Wales
  • University of Southampton

Actions

  • (EPCC) Check with EPSRC about the status of the Autumn RAP Open Access to Tier-2 call

Meeting format

  • National service RSE groups will post an update to the HPC-UK website each month highlighting recent activity
  • The invitation to join each meeting will point to these updates as a potential starting point for discussion and will request other items for the agenda
  • All will be invited to add upcoming training and events of interest, and there will be a standing agenda item covering training and events

RSE Updates

Cumulus

  • Have been investigating how well MPI HPC applications perform over high-performance Ethernet compared to InfiniBand, up to the typical job size for Cumulus (a maximum of 32 nodes)
    • Presenting the work at the MPICH User Group meeting in September
    • The short story is that there is no significant difference in performance when using Ethernet rather than IB, even for all-to-all-heavy codes such as CASTEP (see the timing sketch after this list)
    • Will write up and circulate
  • Question about whether the Tier-2 RAP Open Access call will run in the autumn
  • If Tier-2 is to be used to cover the gap between ARCHER and ARCHER2, work on porting researchers across needs to start soon
    • What is the plan for RSE support for ARCHER projects on Tier-2? Will it be supported by ARCHER CSE, Tier-2 RSE or both?
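
As a rough illustration of the all-to-all communication pattern that dominates codes such as CASTEP, the sketch below times repeated Alltoallv exchanges. This is a minimal example rather than the actual test harness used above: it assumes mpi4py and NumPy are available, and the message size is an arbitrary placeholder.

```python
# Minimal Alltoallv timing sketch (NOT the harness used for the results above).
# Assumes mpi4py and NumPy; the per-rank message size is a placeholder.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

count = 64 * 1024                                # doubles sent to each rank
send = np.ones(count * size, dtype="d")
recv = np.empty(count * size, dtype="d")
counts = np.full(size, count, dtype="i")         # same count for every rank
displs = (np.arange(size) * count).astype("i")   # contiguous blocks

reps = 100
comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    comm.Alltoallv([send, (counts, displs), MPI.DOUBLE],
                   [recv, (counts, displs), MPI.DOUBLE])
comm.Barrier()
if rank == 0:
    print(f"mean Alltoallv time: {(MPI.Wtime() - t0) / reps * 1e3:.3f} ms")
```

Running this with, e.g., `mpirun -np 128 python alltoallv_timer.py` on each interconnect and comparing the reported times gives a first-order view of how the fabrics differ for this pattern.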

Isambard

  • The first set of RAP projects is on the system now - there do not seem to have been any major issues porting their work and getting up and running
  • Bath have been working on getting VASP, QE and NAMD users onto system
    • Plans to get OpenFOAM users up and running soon
  • Workshop in Cardiff in July run by GW4 RSE and Arm
    • Looked at the performance of VASP with ArmPL compared to Cray LibSci:
      • std and ncl versions: 10-30% faster with ArmPL than with LibSci
      • gam version: 20% slower with ArmPL than with LibSci
      • Investigations ongoing
  • Question about the portability of the CASTEP licence
    • The licence covers the source code, so anyone with an Academic licence can run on any system
  • Have also been looking at running research software in the commercial cloud
    • Attended the RSE Summit organised by Microsoft in Brussels
    • One of the outcomes was the Research Software Reactor: https://github.com/research-software-reactor
    • Provides one-click blueprints that allow researchers to run in the commercial cloud (Azure at the moment)

ARCHER, Cirrus

  • Have been looking at MPI performance across different platforms and different MPI libraries using the Intel MPI Benchmarks (IMB)
  • This started because we had seen poor CASTEP scaling on Arm-based systems, which was tracked down to Alltoallv scaling issues
  • Full set of results and analysis at: https://github.com/hpc-uk/archer-benchmarks/tree/master/synth/IMB

  • The most detailed analysis so far is of Alltoallv performance: https://github.com/hpc-uk/archer-benchmarks/blob/master/synth/IMB/analysis/IMB_Alltoallv_compare.ipynb (a minimal parsing sketch follows this list)
    • Arm-based systems achieve only 10-20% of the Alltoallv performance of Cray Aries dragonfly, irrespective of interconnect technology and MPI library, for fully populated nodes
    • Arm: halving the number of MPI ranks per node doubles MPI performance on Cray Aries for medium to large messages
    • Arm: halving the number of MPI ranks per node increases performance by 5-6x on IB for medium to large messages
    • Have not yet investigated the effect of tuning parameters
    • Topology seems to have a large effect on Omni-Path (OPA) performance even though all runs are within a single non-blocking switch: compare Cumulus to Tesseract in the results
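
For a quick first look at IMB output without opening the notebook above, the sketch below parses the standard IMB-MPI1 Alltoallv results table (columns: #bytes, #repetitions, t_min, t_max, t_avg) and prints the time ratio between two systems at each message size. The output file names are hypothetical placeholders.

```python
# Minimal sketch: extract the per-message-size t_avg column from a
# standard IMB-MPI1 Alltoallv output file and compare two systems.
# The file names below are hypothetical placeholders.
import re

def parse_imb(path):
    """Return {message size in bytes: t_avg in usec} from an IMB output file."""
    times = {}
    with open(path) as f:
        for line in f:
            # Data rows look like: "     1024   1000   12.34   14.56   13.12"
            m = re.match(r"\s*(\d+)\s+\d+\s+[\d.]+\s+[\d.]+\s+([\d.]+)\s*$", line)
            if m:
                times[int(m.group(1))] = float(m.group(2))
    return times

aries = parse_imb("archer_alltoallv.out")        # hypothetical file names
ib = parse_imb("thunderx2_ib_alltoallv.out")
for nbytes in sorted(set(aries) & set(ib)):
    print(f"{nbytes:>9} B  IB/Aries time ratio: {ib[nbytes] / aries[nbytes]:.2f}")
```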

Other topics

Cloud bursting

  • Has cloud bursting from on-prem HPC to commercial cloud been tried by anyone and, if so, how well has it worked?

  • University of Liverpool HPC have a solution with AWS that is supposed to do this using HTCondor
    • Latest reports are that this approach is working fine
  • Storage is a challenge when running HPC workloads in the cloud.

  • Software licences are usually OK.

  • The approach to portability differs between HPC and cloud
    • HPC portability is usually achieved by recompiling source code against hardware-specific libraries and MPI
    • Cloud portability is usually achieved by shipping binaries in VMs/containers that can run on “any” hardware

Do you need IB for shared NFS file systems?

  • If NFS traffic is carried over TCP/IP (e.g. IP over IB), you are unlikely to see performance improvements from IB compared with high-bandwidth Ethernet (e.g. 25-50 Gbps); a rough way to check is sketched below
    • Some NFS vendors may provide direct NFS via ibverbs (RDMA), which could improve performance
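
A minimal sketch for sanity-checking this on a given system: time a large sequential read from a file on each NFS mount and compare. The mount paths below are hypothetical, and the page cache needs to be dropped (or a file larger than RAM used) between runs for the numbers to be meaningful.

```python
# Minimal sketch: rough sequential-read bandwidth of a file on an NFS
# mount, for comparing e.g. an IPoIB-mounted path with a 25GbE one.
# Mount paths are hypothetical; drop the page cache between runs
# (or use a file larger than RAM) to avoid measuring memory bandwidth.
import os
import sys
import time

def read_bandwidth_gbps(path, block_size=4 * 1024 * 1024):
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(block_size):
            pass
    elapsed = time.perf_counter() - start
    return 8 * size / elapsed / 1e9  # Gbit/s, to match the link speeds above

if __name__ == "__main__":
    # Usage: python nfs_bw.py /nfs-ipoib/testfile /nfs-eth/testfile
    for path in sys.argv[1:]:
        print(f"{path}: {read_bandwidth_gbps(path):.1f} Gbit/s")
```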

Upcoming training, events and meetings

  • ATI Data Study Event, Bristol, Aug 2019
    • Data will be provided
    • Azure resources will be provided to analyse the data
  • RSE for HPC Meeting, Birmingham, PM 16 Sep 2019
    • Associated with RSEConUK19, Birmingham
    • James Grant (Bath) and Jo Beech-Brandt (EPCC) organising
    • Registration link
  • RSEConUK 2019, Birmingham, 17-19 Sep 2019
    • Tickets selling out fast!
  • Upcoming ARCHER Training Opportunities. Full details and registration at http://www.archer.ac.uk/training/index.php
    • OpenMP on GPUs, Virtual Tutorial, Wednesday 28th August 2019 15:00
    • Enabling distributed kinetic Monte Carlo simulations for catalysis and materials science, Webinar, Wednesday 25th September 2019 15:00
    • Programming the ARM64 processor, University of Cambridge, 30 Sep - 1 Oct 2019
    • Hands-on Introduction to HPC for Life Scientists, University of Birmingham, 30 Oct - 1 Nov 2019
    • Shared-memory Programming with OpenMP - Online course, four Wednesday afternoons, 13, 20, 27 November and 4 December 2019

Date of next meeting

1400 BST, Tue 3 Sep 2019