ECMWF’s ten-year Scalability Programme has reached its crucial halfway stage. Key operational benefits have already been achieved, and extensive exploratory work shows the way forward for our high-performance computing systems. The Programme is key to supporting future developments in weather forecasting and climate prediction.
The Scalability Programme is ensuring that ECMWF’s high-performance computing can support the Centre’s ambitious targets to deliver more accurate predictions, using higher resolution and more complex modelling, greater use of ensembles and vastly increased volumes of data of all forms. The Programme also aims to keep the Centre’s computing energy usage at sustainable levels. Progress and plans were documented by the Scalability team in a special topic paper for ECMWF committees in 2019.
Over the first phase of the Programme, the existing ECMWF prediction system has been benchmarked on some of the largest supercomputing facilities in the world. A US Department of Energy (DoE) INCITE award gave ECMWF access to the largest machine in the world, called Summit, allowing both the efficiency and scalability of the Integrated Forecasting System (IFS) to be tested on central processing units (CPUs), and the performance of one of its most costly components, the spectral transforms, to be demonstrated on graphics processing units (GPUs).
The most promising options to optimise the performance of the existing code infrastructure have been explored. For example, mixed-precision arithmetic, concurrent execution of model components, overlapping of computation and communication, and the use of more efficient CPU-type processors and interconnects are expected to provide code speed-ups of about a factor of three as soon as the new machine has been fully tested.
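As a rough illustration of the mixed-precision idea mentioned above, the following Python sketch stores values in single precision but compares two ways of accumulating them. It is a toy example and not taken from the IFS: the helper names (`to_single`, `sum_single`, `sum_mixed`) are hypothetical, and single precision is emulated with the standard `struct` module.

```python
import struct


def to_single(x: float) -> float:
    """Round a double-precision Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_single(values):
    """Accumulate entirely in (emulated) single precision."""
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v))
    return total


def sum_mixed(values):
    """Keep the inputs in single precision, but accumulate in double precision."""
    total = 0.0
    for v in values:
        total += to_single(v)
    return total


# Hypothetical test data: 100,000 copies of 0.1, whose exact sum is 10,000.
values = [0.1] * 100_000
reference = 10_000.0

err_single = abs(sum_single(values) - reference)
err_mixed = abs(sum_mixed(values) - reference)

print(f"error with single-precision accumulation: {err_single:.6f}")
print(f"error with double-precision accumulation: {err_mixed:.6f}")
```

The point of the sketch is that precision can be reduced where it does little harm (here, storage of the inputs) and kept where rounding errors would otherwise build up (here, the running total). Which parts of a real forecast model tolerate reduced precision has to be established by testing.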
Weather and climate dwarfs, a concept developed as part of the Programme, divide the forecasting model into functional units (e.g. the advection scheme) which have specific computational patterns. Using dwarfs, the potential optimisation with new processor technologies can be explored efficiently and the technologies benchmarked. The dwarf concept also provides a way to estimate the sustained performance of the full forecasting system on large-scale, future supercomputers.
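The dwarf idea can be sketched in a few lines of Python: one functional unit is isolated behind a small interface so that it can be timed and checked on its own. The example below is only an illustration under stated assumptions. It uses a first-order upwind advection step on a periodic one-dimensional grid, which is a textbook scheme and not the IFS advection scheme, and the names `upwind_advection_step` and `run_dwarf` are hypothetical.

```python
import timeit


def upwind_advection_step(field, courant):
    """One first-order upwind step on a periodic 1D grid.

    Assumes 0 <= courant <= 1; field[i - 1] wraps around at i = 0.
    """
    return [field[i] - courant * (field[i] - field[i - 1]) for i in range(len(field))]


def initial_field(n_points):
    """A simple square pulse occupying the second quarter of the grid."""
    return [1.0 if n_points // 4 <= i < n_points // 2 else 0.0 for i in range(n_points)]


def run_dwarf(n_points=1_000, n_steps=50, courant=0.5):
    """Run the isolated kernel for a fixed number of steps and return the final field."""
    field = initial_field(n_points)
    for _ in range(n_steps):
        field = upwind_advection_step(field, courant)
    return field


# Correctness check: on a periodic grid the scheme conserves the total amount.
result = run_dwarf()
mass_before = sum(initial_field(1_000))
mass_after = sum(result)

# Benchmark: time the same kernel in isolation.
repeats = 3
elapsed = timeit.timeit(run_dwarf, number=repeats)
updates = repeats * 1_000 * 50
print(f"mass before: {mass_before}, mass after: {mass_after:.6f}")
print(f"grid-point updates per second: {updates / elapsed:.0f}")
```

Because the kernel has a fixed input, a conserved quantity to check and a single entry point, the same test can be rerun unchanged after the step function is reimplemented for a different processor type, which is the sense in which a dwarf supports benchmarking.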
The most computationally costly model components have been tested on GPUs, and throughput speed-ups of more than a factor of 20 have been achieved compared with CPUs. The first ever implementation of a forecast model component on a Field Programmable Gate Array (FPGA) improved the time to solution by a factor of 2.5 and reduced energy use by at least a factor of 10.
ECMWF has built its entire performance enhancement strategy on co-design of numerical methods and algorithms with code implementation. This strategy means that scientific and computing performance can be traded off against each other, and extensions to the modelling system will be future-proof.
Observational data, model output and product output are growing in both volume and diversity. Hence a key focus of the Scalability Programme has been on tools for managing fast and flexible data access and minimising data transfer across the memory hierarchy. This has already led to modernisation of the data handling system and a fivefold speed-up in product generation, alongside greater system robustness.
The Kronos workflow simulation and benchmark generator software is entirely new and a major step towards realistic capacity benchmarking. For the first time, Kronos can provide a full workflow representation including IFS model output and product generation. This has made the benchmarks much closer to the reality of running our operational system.
The Scalability Programme has established ECMWF as the leading centre performing cutting-edge research at the interface between computational science and weather and climate science, while working with its Member States and drawing on computational expertise more widely. Leadership and involvement in European programmes are drawing in additional funding of some €1.5 million per year from 2015 to 2021.
Next phase
The two main focuses will be ‘performance portability’ and ‘data-centric workflows’. Many elements of these already exist from developments in the first phase, but their full implementation throughout the entire forecasting system will still require a significant commitment.