ECMWF’s world-class supercomputing facility is at the core of its operational and research activities and is typically upgraded every four or five years.

At the start of the year, ECMWF signed a contract with Atos worth over 80 million euros for a new system made up of four Atos BullSequana XH2000 clusters. It will deliver about five times the performance of the current system, allowing the Centre to run higher-resolution ensemble forecasts and significantly improve the prediction of extreme weather events well ahead of time.

Representatives from Atos and ECMWF. In January ECMWF signed a contract with Atos for the supply of the BullSequana XH2000 supercomputer, authorised by the Council in December 2019.
© Stephen Shepherd photography

In addition to maintaining the current operational infrastructure for forecast production, research and Member State use, the Centre completed significant preparatory work for the installation of the new machines and the migration of applications and data. The Scalability Programme also entered a new implementation phase, driving forward cutting-edge work to improve the efficiency and scalability of computer code to exploit the potential of future IT infrastructures.

Operational supercomputing facility

The Cray XC40 high-performance computing facility (HPCF) in Reading provided a good and stable service throughout 2020, processing more than half a million jobs per day on average, with availability over 99.6%. Forecast production was uninterrupted during the COVID-19 pandemic.

The Cray system will be replaced by the Atos machines in the Centre’s new data centre in Bologna, Italy, to be operational in 2022.

With all data centre operations moving to Bologna, the migration comprises much more than just high-performance computing (HPC) applications such as ECMWF’s Integrated Forecasting System (IFS). Scientific applications must also be migrated, along with the hundreds of petabytes of operational and research data held in the Data Handling System.

By November 2020, the meteorological data archive contained 372 petabytes of primary data and 138 petabytes in the secondary data store. On average, around 290 terabytes of new data are added to the archive daily, and 210 terabytes are retrieved. Data stewardship remained important to limit the growth of the archive. Throughout the year, the Centre made strong progress in transferring data from Oracle to IBM tape drives.
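
A quick back-of-the-envelope calculation, using only the figures above and deliberately ignoring the deletions and compaction that stewardship adds, shows why limiting growth matters:

```python
# Rough archive-growth arithmetic based on the figures quoted above.
# Stewardship effects (deletions, compaction) are deliberately ignored.

DAILY_ADDED_TB = 290   # average new data archived per day
PRIMARY_PB = 372       # primary archive size in November 2020

annual_growth_pb = DAILY_ADDED_TB * 365 / 1000   # TB -> PB
print(f"Gross annual growth: ~{annual_growth_pb:.0f} PB")
print(f"Relative growth: ~{100 * annual_growth_pb / PRIMARY_PB:.0f}% per year")
# -> roughly 106 PB per year, i.e. about 28% of the November 2020 archive
```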

The first stages of the implementation phase were successfully completed in 2020. Test systems at the Atos factory in France and the Centre’s UK data centre allowed the Centre and Member States to start preparing for the main system in Bologna.

Details of the ICT system design for Bologna, and of how Member State users will interact with and access the facilities, were presented at the IT User Forum (formerly the Computing Representatives’ meeting) in October.

Despite a necessary slowdown of activities at the height of the COVID-19 crisis, the Bologna building site progressed well through the year, made possible by the strong involvement of Italy and the Emilia-Romagna Region and the commitment of ECMWF staff in Bologna.

By the end of the year, the data centre was close to completion, with the handover to ECMWF planned to take place after full testing and commissioning. The Cray contract was extended to cover delays in the availability of the new data centre, and the Atos test systems mean that any delay can be used productively for preparation work ahead of the main system deliveries.

Bologna data centre. The new facility is planned to become operational in 2022. Left: external façades completed; right: racks inside the Data Hall.

For day-to-day operations, the security and availability of ECMWF’s office environments remain a high priority. A significant milestone in 2020 was the operational roll-out of a new SSH (Secure Shell) service allowing secure remote access to ECMWF computer systems.
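
The report gives no technical details of the new service; purely as an illustration of the key-based SSH access such a service provides, here is a minimal Python sketch using the paramiko library (the hostname, username and key path are hypothetical):

```python
import os
import paramiko

# Hypothetical gateway host and credentials - for illustration only.
HOST = "ssh-gateway.example.int"
USER = "someuser"
KEY = os.path.expanduser("~/.ssh/id_ed25519")

client = paramiko.SSHClient()
client.load_system_host_keys()                  # verify the server against known_hosts
client.connect(HOST, username=USER, key_filename=KEY)

_, stdout, _ = client.exec_command("hostname")  # run a trivial remote command
print(stdout.read().decode().strip())
client.close()
```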

The advent of COVID-19 brought a sudden need for home-working at scale. ECMWF’s IT infrastructure was well prepared, with robust and well-proven technologies such as Windows and Linux Virtual Desktops, laptop deployment and remote management, and video-conferencing applications already in place. Increased system capacity, coupled with training and equipment for staff, allowed for a smooth transition to remote working.

Almost four months of intensive effort came to fruition towards the end of the year with the smooth switchover to a new firewall cluster in the Reading data centre. This major upgrade reduced complexity and doubled the capacity to each network zone, in particular eliminating the network congestion that had caused slight delays to the dissemination of forecast products earlier in the year.

The Centre’s IT security was also put to the test in October, when the meteorological community was subjected to a large-scale cyber-attack targeting email services. Several teams in ECMWF, together with the security teams in the Member States, worked hard to successfully protect staff and organisational data while minimising disruption.

Atos HPC test systems

In mid-January, Atos provided a core group of ECMWF developers with access to a Familiarisation System, a four-node system hosted in the Atos factory in Angers, France. Featuring the full Atos software stack, the system allowed developers to make a quick start on porting libraries and codes to the new platform.

To allow porting and testing work for the new Atos machines to start as quickly as possible, a temporary Test and Early Migration System (TEMS) was installed at ECMWF’s data centre in the UK in February. The TEMS is a 60-node, air-cooled cluster with half a petabyte of high-performance parallel storage.

After the initial integration of the TEMS into the Centre’s systems, ECMWF installed the environment ‘module’ system and other commonly required third-party software packages. Soon after, several teams across the organisation started to explore which combinations of compilers and Message Passing Interface (MPI) implementations work best in different scenarios.
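
The report does not say how these combinations were evaluated; as a purely illustrative sketch, the screening can be thought of as a build-and-test loop over a toolchain matrix (the compiler and MPI names below are examples, not the actual TEMS modules):

```python
from itertools import product

# Hypothetical toolchain options - the actual TEMS modules may differ.
compilers = ["gcc", "intel", "aocc"]
mpi_impls = ["openmpi", "intelmpi"]

def build_and_test(compiler: str, mpi: str) -> bool:
    """Stand-in for 'module load <compiler> <mpi> && build && run tests'."""
    print(f"testing {compiler} + {mpi}")
    return True  # in reality: the test suite's pass/fail status

results = {(c, m): build_and_test(c, m) for c, m in product(compilers, mpi_impls)}
working = [combo for combo, ok in results.items() if ok]
print("working combinations:", working)
```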

The test system was opened to Member State developers at the beginning of July. It is fully representative of the software environment of the future system, makes it possible to port and test all libraries, utilities and applications, and is being used to develop the monitoring and operations tools that will be needed for the final system.

Test system. The HPC Test and Early Migration System supplied by Atos was installed at ECMWF in February 2020.

Scalability

The Centre’s Scalability Programme encompasses internal and collaborative projects to ensure ECMWF can exploit the full potential of future computing architectures. A four-year implementation phase started in January and concentrates on the headline themes of portable and performant code; data-centric workflows; and machine learning – themes that form the core of the emphasis on computational science in ECMWF’s next ten-year Strategy to 2030.

A key target is to prepare the prediction system for combined CPU–GPU processor and deeper-memory architectures, so that a larger variety of technologies can be considered for the HPC infrastructure that will follow the Atos BullSequana system. ECMWF won another HPC resource award through the DOE INCITE programme on Summit in the US, at the time the world’s second-largest HPC infrastructure, to test km-scale ensemble simulations at scale and assess both the scientific and the computing performance of future prediction systems.

Partnerships created through external funding continue to drive progress in this area, notably EU-funded projects such as ESCAPE-2, EPiGRAM-HS, LEXIS and MAESTRO and the ESiWACE-2 and HiDALGO centres of excellence. A joint initiative launched by Atos and ECMWF in October will complement this work. This new Centre of Excellence in HPC, AI and Quantum computing for Weather and Climate will be a platform for collaborative research and development between ECMWF and Atos and their key technology partners AMD, DDN, Mellanox and Nvidia.

Its focus will be on investigating and exploiting state-of-the-art technologies for weather and climate applications: advancing machine learning methods, exploiting GPU technology, and developing or providing tools and libraries to improve the usability and utilisation of operational HPC systems. The EU-funded MAELSTROM project, to be coordinated by ECMWF, will focus in particular on advancing machine learning methods to support data processing and computing along the entire prediction workflow.

During the year, ECMWF contributed to the Strategic Research Agenda (SRA-4) of the European Technology Platform for High-Performance Computing (ETP4HPC), resulting in the inclusion of weather and climate prediction as a feature application. This has also brought prediction into a new multi-disciplinary digital technology activity, the trans-continuum initiative joining ETP4HPC, ECSO, BDVA, 5G IA, EU-Maths-In, CLAIRE, AIOTI and HiPEAC.

The Centre was granted Observer status in the European Open Science Cloud (EOSC) Association. Being an Observer will allow ECMWF to follow developments in this area and endeavour to contribute to the Association’s vision through the Strategic Research and Innovation Agenda.

Alongside this scalability work, ECMWF worked with ESA and EUMETSAT to define the baseline architecture and first-generation deliverables of a new European Commission initiative called Destination Earth (DestinE), which could substantially accelerate the community developments in this area.

Single precision to accelerate forecast production

ECMWF is investigating ways of reducing the computational cost of producing weather forecasts by using reduced numerical precision in the calculations.

The use of single precision – 32-bit calculations rather than the traditional 64-bit (double precision) – will be key in moving towards finer-resolution operational ensemble forecasts. It will free up vital computational resources for forecast production and will thus maximise the benefits from the investment in ECMWF’s new HPCF.
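
A toy illustration of the saving (not IFS code; the grid size is arbitrary): halving the word size halves the memory and bandwidth needed per model field.

```python
import numpy as np

# An arbitrary 2-D 'model field' - illustrative size only.
nlat, nlon = 2048, 4096

field64 = np.random.rand(nlat, nlon)      # double precision (64-bit)
field32 = field64.astype(np.float32)      # single precision (32-bit)

print(f"float64 field: {field64.nbytes / 1e6:.0f} MB")
print(f"float32 field: {field32.nbytes / 1e6:.0f} MB")   # exactly half

# Rounding differences appear around the 7th significant digit -
# typically far below other sources of forecast error.
print("max abs difference:", float(np.abs(field64 - field32).max()))
```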

A single-precision variant of the IFS was developed in collaboration with the University of Oxford and will go into operations in 2021. The use of single precision allows a higher vertical resolution in the ensemble forecast at no extra cost, with a concomitant improvement in forecast skill.

This work is being extended to the NEMO community ocean model used operationally at ECMWF, in collaboration with the Barcelona Supercomputing Center.

The single-precision variant of NEMO has been tested in long ocean-only integrations, and early results indicate favourable forecast performance compared with double precision. At the same time, as in the atmosphere, the single-precision NEMO integrations are up to 40% cheaper.
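
Precision testing of this kind amounts to running the same integration in both precisions and checking that the difference stays below the model’s error against observations; a minimal sketch with a toy time-stepping loop (not NEMO):

```python
import numpy as np

def integrate(dtype, steps=10_000):
    """Toy damped-oscillator time stepping in a given floating-point precision."""
    x = dtype(1.0)
    v = dtype(0.0)
    dt = dtype(0.01)
    damping = dtype(0.1)
    for _ in range(steps):
        v = v - dt * (x + damping * v)
        x = x + dt * v
    return float(x)

x64 = integrate(np.float64)
x32 = integrate(np.float32)
print(f"double: {x64:.8f}  single: {x32:.8f}  diff: {abs(x64 - x32):.2e}")
# Acceptance criterion: the single/double difference should be small
# compared with the model's error against observations.
```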

The next stage of testing will consider coupled simulations. The technical infrastructure to allow the first ever fully single-precision coupled atmosphere–ocean simulations has already been established. This development will be especially important for the seasonal forecasting system, where the ocean model accounts for about 60% of the computational cost.
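
To put the potential saving into perspective, simple arithmetic (assuming the roughly 40% ocean-only saving carried over unchanged to the coupled system, which remains to be verified):

```python
# Back-of-the-envelope saving for the seasonal forecasting system.
ocean_share = 0.60   # fraction of seasonal-forecast cost spent in the ocean model
nemo_saving = 0.40   # cost reduction observed in single-precision NEMO runs

coupled_saving = ocean_share * nemo_saving
print(f"Potential saving on the coupled system: ~{coupled_saving:.0%}")  # ~24%
```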

Single precision. The chart shows the change in sea-surface temperature (SST) error (K) compared to observations when moving from double precision to single precision in NEMO simulations at the high-resolution operational configuration of 0.25° global resolution, over a 26-year period. The change in error is practically zero across most of the globe.
