The following documents are the deliverables from Year 2 of the CRESTA project. Many of these provide updates to the deliverables produced in Year 1.

D2.1.2 Architectural developments towards exascale

High Performance Computing (HPC) is a growing market. It is increasingly seen as vital for a nation's scientific and industrial competitiveness, and more countries are providing funding for research into HPC; China, for instance, has grown its investment so rapidly over the last few years that the fastest machine in the world is now Chinese. The quest to build a supercomputer with Exascale performance requires significant technological advances, particularly given the limited power budget that such a machine will have. These advances may come from the wider computing market, where the enormous growth in mobile computing is driving research into power-efficient technology, or from research funding aimed specifically at HPC.

To understand what an Exascale machine may look like, it is informative to examine trends in the relevant technology. Underlying trends in both semiconductor and communication technology drive advances across the computing landscape, and these in turn lead to advances in the system building blocks: processors, memory, interconnect and software. Company roadmaps make several trends clear: first, the growth of heterogeneous systems that combine different types of processor, such as a traditional general-purpose CPU with a GPU; second, the move towards integrating components into System-on-Chip (SoC) silicon; and third, the growth in licensing intellectual property, such as processor designs, to other manufacturers.

Several factors are important when considering HPC system architecture trends, including performance, programmability and usability, power usage and efficiency, cost of procurement and cost of ownership. The TOP500 list provides 20 years' worth of data from which to analyse architecture trends. There has been a move towards commodity components over custom technology; however, this has emphasised raw floating-point performance at the cost of improvements in memory, interconnect and I/O. The development of the Cray XC30 system provides a concrete example of these trends.

An Exascale machine is only useful if it has applications capable of using it. The CRESTA co-design applications provide an excellent source of information on the impact of architecture trends on application performance and design. Heterogeneous systems are seen as inevitable; however, they must become easier for application developers to exploit. This will be achieved by providing better integration, particularly through a single addressable memory space, and more importantly through the provision of standard, well-supported programming models and languages. Highly parallel systems with millions of processors will need a matching high-performance interconnect if applications are to exploit them fully. Although the wider market may provide advances in power-efficient processor technology, funding for HPC-specific research into interconnects, programming models and application development will be required.

D2.3.1 Operating systems at the extreme scale

Standard commodity operating systems have evolved to serve the needs of desktop users and business application servers, which have very different requirements to HPC systems and applications. In general, commodity operating systems are not fit-for-purpose, even for current petascale machines, without extensive customisation.

The impact of operating system activities on application performance is not fully understood and is hard to predict. Many HPC systems are configured or customised by a trial-and-error approach, dealing with particular performance problems as they occur, rather than by applying a systematic method.

Specialised operating systems, developed specifically for HPC machines, trade rich functionality for high performance. Scalability is achieved by only implementing a subset of “normal” operating system services, which impairs the usability of the system by application programmers.

Design decisions for specialised HPC operating systems are often influenced by, and sometimes compromised by, design decisions for novel HPC hardware. One example is that the BlueGene/L hardware did not provide cache-coherency between the two processing cores in a node, which prevented the operating system from supporting shared memory.

The desire to make specialised systems more usable encourages the re-introduction of functionality that can have a negative effect on performance and scalability. Thread scheduling was not supported by the BlueGene/P operating system but has been re-introduced in the BlueGene/Q operating system. This increases usability for application programmers but introduces a source of unpredictable load-imbalance that could reduce scalability, especially at extreme scale.

Specialised HPC operating systems have been continuously researched and developed for at least 20 years, driven (at least in part) by emergent trends in hardware design. Current systems demonstrate that excellent operating system scalability up to petascale is achievable. Although it is possible for major advances to be made in operating system development via disruptive technologies, currently there is no consensus on the direction required.

D2.4.1 Alternative use of fat nodes

This report summarises the work that was undertaken in Task 2.4 “Alternative use of fat nodes” as part of CRESTA’s WP2 on “Underpinning and cross-cutting technologies”. More specifically, the report presents research into different ideas for the use of fat nodes on future systems, ranging from practical to more speculative approaches:

  • Co-location of workloads;
  • Offload servers;
  • Background processing of MPI communication;
  • Micro-kernels.

D2.5.1 Fault agnostic and asynchronous algorithms at exascale

The number of parts in HPC systems is set to increase significantly as their performance approaches the Exascale, making fault tolerance an increasingly important aspect of their design. Software-hardware co-design offers one way of addressing these problems; on the software side, this includes the development of fault-tolerant algorithms. In general, this is a difficult problem, especially for faults in which part of the current state of a computation is lost. Other types of fault, however, do not involve such state loss: these include performance faults, where a component (e.g. a processor or network link) does not fail but performs at a slower rate than intended. Such faults are less catastrophic, but may be harder to detect.

Performance faults may not cause the computation to fail, but for many algorithms the synchronisation patterns mean that the whole computation runs at the speed of the slowest component. Asynchronous algorithms, which are often derived from synchronous counterparts by relaxing some or all of the synchronisation requirements, have the potential to be much more tolerant of performance faults, though likely at the expense of poorer convergence rates.

In this deliverable, we select two asynchronous algorithms for the solution of large sparse linear systems (Jacobi and block Jacobi), and, using simulated slow cores and slow links on a real HPC system, quantify their ability to maintain performance in the presence of such faults by comparing them to their synchronous counterparts.
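
For illustration, the sketch below (a minimal C version, not the code used in the deliverable) shows a single synchronous Jacobi sweep for a dense system A x = b; the asynchronous variants studied here relax the requirement that every update be computed from a single, globally synchronised iterate.

    #include <stddef.h>

    /* One synchronous Jacobi sweep for a dense n x n system A x = b.
     * x_old holds the current iterate, x_new receives the update:
     *   x_new[i] = (b[i] - sum_{j != i} A[i][j] * x_old[j]) / A[i][i]
     * An asynchronous variant would let each process keep iterating on
     * whatever remote values of x it has most recently received, instead
     * of waiting for a globally consistent x_old at every sweep. */
    void jacobi_sweep(size_t n, const double *A, const double *b,
                      const double *x_old, double *x_new)
    {
        for (size_t i = 0; i < n; ++i) {
            double sigma = 0.0;
            for (size_t j = 0; j < n; ++j) {
                if (j != i)
                    sigma += A[i * n + j] * x_old[j];
            }
            x_new[i] = (b[i] - sigma) / A[i * n + i];
        }
    }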

Our findings do indeed show that the algorithms have strong resilience to such faults, even when the loss of component performance reaches an order of magnitude. However, in some cases we observe that the asynchronous algorithms can exhibit undesirable convergence behaviour, and that care needs to be taken to avoid this. Finally, we discuss how such algorithms may be of interest in the contexts of alternative uses of fat nodes and power management.

D3.2.2 Adaptive runtime support design document

Subtask 3.2.2 “Hybrid and adaptive runtime systems” is developing an experimental runtime system that will explore the power of adaptive runtime support for exascale applications.

In deliverable D3.1 “State of the art and gap analysis - Development environment”, CRESTA performed an analysis of existing approaches in the field as well as technical boundary conditions and requirements.

The deliverable D3.2.1 provided the design of a runtime system that aims to further develop approaches for dynamically adapting simulation applications to the computer systems they run on, and to extend such approaches to upcoming exascale architectures. It therefore proposed an adaptive runtime-support design in which simulation applications based on a task-oriented programming model with hierarchical tasks are combined with runtime performance analysis and runtime administration, enabling increased efficiency of large-scale numerical simulations.

This updated deliverable, D3.2.2, adds conclusions drawn from the ongoing implementation of the runtime administration and monitoring components. It points out that the overhead of the runtime system in a typical molecular dynamics simulation is expected to be about 5%, small enough to allow noticeable improvements in the overall runtime. A new performance monitoring API has been developed with the aim of allowing IPM to be used within the runtime system with low overhead.

D3.3.2 Performance Analysis Tools design document

This document (“Performance Analysis Design Document”, D3.3.2) is an update of the previous deliverable D3.3.1. It presents possible designs, planned modifications and extensions to the existing application performance analysis tools Score-P and Vampir to address scalability and heterogeneity.

We describe the designs and extensions for the performance monitoring tool Score-P, namely the collection of different kinds of performance counters and their integration into the monitoring system, the reduction of the amount of data collected to address the scalability issues identified in the gap analysis (D3.1), and the extensions that will be made to address applications’ demands with respect to heterogeneity. We then specify the designs and extensions for the scalability and heterogeneity of the performance analysis and visualisation tool Vampir, present how we ensure that any extensions we provide are well tested and suitable for production use, and finally address the state of fault tolerance.
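
For context, the sketch below shows how an application region is typically marked up with Score-P's user instrumentation macros, assuming the standard SCOREP_User.h interface; it is not part of the extensions described here. The application would be built with the scorep compiler wrapper, with user instrumentation enabled, so that the region appears in the profiles and traces that Vampir later visualises.

    /* Minimal sketch of manual Score-P user instrumentation; the macros
     * are only active when the code is built with the scorep wrapper and
     * user instrumentation enabled. */
    #include <scorep/SCOREP_User.h>

    void solver_iteration(void)
    {
        SCOREP_USER_REGION_DEFINE( solve_region )
        SCOREP_USER_REGION_BEGIN( solve_region, "solver_iteration",
                                  SCOREP_USER_REGION_TYPE_COMMON )

        /* ... numerical work to be measured goes here ... */

        SCOREP_USER_REGION_END( solve_region )
    }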

D3.4.2 Debugging design document

This document describes designs, extension steps, and ideas that will allow the debugger Allinea DDT and the automatic runtime correctness tool MUST to adapt to Exascale needs. We use deliverable D3.1 “State of the art and gap analysis” as a roadmap for these extensions. We extend the first version of this document, from project month 10, and refine our designs and plans where we have gained additional knowledge or feedback.

D3.5.2 Compiler support for exascale

A study of the performance of the computational kernels relevant to the Nek5000 CRESTA co-design application was completed last year (Compiler Support for Exascale, CRESTA Project Deliverable D3.5.1).  This included a brief study of CPU performance and a more in-depth study of performance on a single GPU. The GPU study used the PGI compiler suite and OpenACC accelerator directives, coupled with auto-tuning compiler technology from the University of Edinburgh School of Informatics. A standalone benchmark version of the full Nek5000 application, called Nekbone, was subsequently ported to large-scale Cray GPU parallel systems using the Cray OpenACC compiler and then optimised by hand. The design of a CRESTA auto-tuning framework was also developed, and a prototype implementation produced. In this study we draw these three strands together and use the CRESTA auto-tuner on the Nekbone kernels to attempt to produce an optimised accelerated version for Cray hardware whose performance can be compared with the hand-optimised accelerator code. We also perform an in-depth investigation of a similar approach applied to the CPU version of Nekbone to enable comparisons of the auto-tuning procedures and performances achieved between CPU and GPU.
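
For illustration (the Nekbone kernels themselves are not reproduced here), the fragment below shows the style of OpenACC directive-based offload referred to above, applied to a simple matrix-vector product in C; clauses such as vector_length are typical of the parameters an auto-tuning framework would vary.

    /* Simple matrix-vector product y = A * x offloaded with OpenACC.
     * The vector_length clause is the kind of tunable parameter an
     * auto-tuner could sweep to find a good value for a given GPU. */
    void matvec_acc(int n, const double *restrict A,
                    const double *restrict x, double *restrict y)
    {
        #pragma acc parallel loop copyin(A[0:n*n], x[0:n]) copyout(y[0:n]) \
                                  vector_length(128)
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            #pragma acc loop reduction(+:sum)
            for (int j = 0; j < n; ++j)
                sum += A[i * n + j] * x[j];
            y[i] = sum;
        }
    }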

D4.2.1 Prediction Model for identifying limiting Hardware Factors

Hardware is one of the main factors to consider for the efficient use of massively parallel systems. It is also important to understand the main limiting factors that influence the efficiency of existing programs and those under development. To successfully exploit an exascale system, both hardware and software need consideration.

The purpose of this document is to support the further implementation of the library of exascale algorithms and solvers in CRESTA work package 4 (WP4). We have performed many tests on different platforms to determine their differences and their most important limiting factors.

D4.3.1 Initial prototype of exascale algorithms and solvers for project internal validation (Software)

Deliverable D4.3.1 is a software deliverable. This document describes the software, a prototype parallel numerical library targeted at Exascale systems.

As previously discussed in the WP4 deliverables “D4.1.1 Overview of major limiting factors of existing algorithms and libraries” and “D4.2.1 Prediction Model for identifying limiting Hardware Factors”, the Exascale will require an increase in the efficiency, in the sense of scalability and performance, of algorithms because of the very large degree of parallelism involved. As well as efficient algorithms, highly efficient implementations of those algorithms are required. In addition to the increase in the degree of parallelism, Exascale systems are expected to be significantly more complex than current systems, with many different levels of memory and communication hierarchy. This will make it very difficult to optimise codes for Exascale systems, and many codes will require significant rewriting to make the best use of them. The availability of parallel numerical libraries designed for Exascale systems should significantly reduce the development costs of this process. We have evaluated a number of existing numerical libraries that implement linear solvers (such as PETSc and Trilinos); although these are scalable on current hardware, in our opinion they have not achieved the highest possible efficiency (see D4.1.1 and D4.2.1 for more details). In addition, current solver libraries do not properly address key Exascale issues such as the overlap of communication and calculation. Although Fourier transforms are an important part of many simulations and node-local FFT libraries are widely used, most major applications implement their own distributed FFTs using a combination of node-local FFT libraries and explicit MPI communication. We believe this is because the currently available parallel FFT libraries place too many constraints on the data decomposition of the rest of the application.
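
To illustrate what is meant by overlapping communication and calculation (a hedged sketch only, not CEL code; the function name and the simplified three-point update are invented for this example), a solver with a one-dimensional domain decomposition can post non-blocking halo exchanges, update the interior cells that need no remote data, and only then wait for the halos before updating the cells next to the domain boundary:

    #include <mpi.h>

    /* Sketch of communication/computation overlap, 1D decomposition.
     * u[1..n_local] holds local data, u[0] and u[n_local+1] are halo
     * cells; 'left' and 'right' are neighbour ranks (MPI_PROC_NULL at
     * the domain edges). Assumes n_local >= 3. */
    void relax_with_overlap(double *u, double *u_new, int n_local,
                            int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[4];

        /* Post the halo exchange without waiting for it to complete. */
        MPI_Irecv(&u[0],           1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Irecv(&u[n_local + 1], 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
        MPI_Isend(&u[1],           1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
        MPI_Isend(&u[n_local],     1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

        /* Interior update: cells 2..n_local-1 need no remote data. */
        for (int i = 2; i <= n_local - 1; ++i)
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

        /* Complete the exchange, then update the two cells that do. */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        u_new[1]       = 0.5 * (u[0] + u[2]);
        u_new[n_local] = 0.5 * (u[n_local - 1] + u[n_local + 1]);
    }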

For all of the above reasons, we are developing a new library (the CRESTA Exascale Library, or CEL for short) addressing these two important classes of numerical problem: linear solvers and multi-dimensional Fourier transforms. This initial prototype of the library will form the basis for further testing and improvement. Ultimately, the optimised library will be integrated with the CRESTA applications.

D4.4.1 Initial prototype for optimised reduction approaches for project internal validation (Software)

Collective reduction operations, such as the global summation of a collection of floating-point numbers, are important in numerical simulations and are used, for instance, as convergence criteria to control iterative numerical solvers.

The summation of floating-point numbers in particular suffers from inaccuracy due to limited numerical precision and round-off errors. While there are numerical schemes to mitigate these effects, for instance the Kahan summation algorithm, collective summations in an MPI application are beyond the control of the user and may introduce large errors for large numbers of MPI ranks. However, it cannot be determined a priori whether, and to what extent, an application is affected by these numerical inaccuracies. The user needs to verify this, possibly for every input data set.
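
For reference, the following is a minimal C sketch of the Kahan (compensated) summation scheme mentioned above; it illustrates the general technique only and is not the high-precision routine provided by the deliverable.

    #include <stddef.h>

    /* Kahan compensated summation: the running compensation c captures
     * the low-order bits lost when adding each term to the running sum,
     * greatly reducing round-off error compared with naive summation. */
    double kahan_sum(const double *values, size_t n)
    {
        double sum = 0.0;
        double c = 0.0;            /* compensation for lost low-order bits */
        for (size_t i = 0; i < n; ++i) {
            double y = values[i] - c;
            double t = sum + y;
            c = (t - sum) - y;     /* (t - sum) recovers the part of y kept */
            sum = t;
        }
        return sum;
    }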

As we move to Exascale computing, with possibly millions of MPI processes, the number of terms in a summation reduction approaches the limit where numerical errors reach a level that can no longer be disregarded a priori. We have developed a prototype version of a library that allows the user to replace the MPI collective reduction, specifically for summation, with a high-precision version. This can be used to test whether an application or a use case is affected by inaccuracies in the MPI summations.

However, this will only show differences due to inaccuracies in the MPI part, not in the computations done locally. To analyse those, we also provide a set of routines for local summation of a vector of values at high precision. These routines can be used by the application developer in critical sections of the code. It is worth noting that both the high-precision version of the MPI reduction and the local summation routines are slower than their standard counterparts; the user thus needs to trade performance against accuracy on a case-by-case basis.

This deliverable has not evaluated possible support from the networking hardware for summation or other reduction operations. Nonetheless, we recommend adding high-precision arithmetic capabilities to the networking interfaces of future Exascale systems, as well as using high-precision buffers for data transport in reduction operations. With dedicated hardware, the performance impact should be minimal.

D4.5.2 Microbenchmark Suite (Software)

Task 4.5 is concerned, amongst other things, with the optimisation of collective communication operations. Collective operations involve multiple participants, rather than only the two found in point-to-point communication. Examples of collective operations are synchronisation barriers, or reductions over the full computational domain to find the sum, minimum or maximum of a particular quantity. Such operations are very common in most distributed applications.

This document briefly describes the software deliverable Collectives Microbenchmark Suite, which is used within the CRESTA project firstly to assess the progress of the optimisation work on collective operations, and secondly as a tool to analyse the characteristics of collective implementations. For users or developers of parallel applications, the benchmark suite may help in assessing which implementation of a collective should be chosen in a specific use case.
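
To indicate the general approach of such a benchmark (this is an illustrative sketch, not part of the Collectives Microbenchmark Suite itself), a minimal timing loop for MPI_Allreduce might look as follows; the repetition count and the single-element payload are arbitrary choices.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal microbenchmark: time a number of MPI_Allreduce calls and
     * report the mean time per call on rank 0. */
    int main(int argc, char **argv)
    {
        const int reps = 1000;
        double sendbuf = 1.0, recvbuf = 0.0;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);          /* start all ranks together */
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; ++i)
            MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("mean MPI_Allreduce time: %g s\n", (t1 - t0) / reps);

        MPI_Finalize();
        return 0;
    }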

D4.5.3 Non-Blocking Collectives Runtime Library (Software)

Most algorithms in scientific computing involve communication patterns that are performed collectively across a large number of processing elements. Hence, the scalability of many applications is often bound by collective operations, in which even minor load imbalances or other inefficiencies can cause a stall across a significant number of processes. This also holds for most of the CRESTA co-design applications.

In order to scale applications to hundreds of thousands of cores, new approaches to collective communication will be needed. These include, for example, the use of asynchronous algorithms in combination with remote-memory access (also called one-sided) operations, especially when supported by hardware; the use of non-blocking collectives that allow communication overhead to be overlapped with computation; and the optimisation of communication patterns to improve concurrency while avoiding interconnect contention.

This document describes a platform for studying scalability bottlenecks caused by collective operations: the CRESTA Collective Communication Library. It allows an application developer to experiment with various alternative implementations of a particular set of collectives with minimal changes to the application source. In addition to the traditional collective operations of the Message Passing Interface (MPI) library, these implementations include the non-blocking collectives introduced in the most recent version of the MPI standard, collectives implemented with partitioned global address space (PGAS) languages (currently only with Fortran coarrays), and collectives built on the remote-memory access (also referred to as one-sided communication) operations available in the MPI library. Furthermore, the library defines an application programming interface (API) in which the initiation and completion of a collective operation are performed in separate stages, so that other work can be carried out while the collective communication proceeds in the background; we refer to this as the split-phase API. The library itself is free software.
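
The split-phase idea can be illustrated with the standard non-blocking collectives of MPI 3.0 (the sketch below is not the library's own API; the function name is invented for this example): the reduction is initiated, independent work proceeds, and the result is waited for only when it is actually needed.

    #include <mpi.h>

    /* Illustration of split-phase collective communication using the
     * standard MPI 3.0 non-blocking allreduce: initiate the reduction,
     * overlap it with independent computation, then complete it. */
    double overlapped_sum(double local_value, double *independent_result)
    {
        double global_sum = 0.0;
        MPI_Request req;

        /* Initiation stage: start the collective but do not wait for it. */
        MPI_Iallreduce(&local_value, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* ... computation that does not depend on global_sum goes here ... */
        *independent_result += local_value * local_value;

        /* Completion stage: block only when the result is actually needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        return global_sum;
    }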

D5.1.3 Pre-processing: first prototype tools for exascale mesh partitioning and mesh analysis available

This deliverable is a software deliverable, providing a first prototype interface for pre-processing steering, named PPStee. In this document we provide a brief overview of the software. PPStee feeds graph or mesh data, together with the communication costs and workloads of all simulation loop components, into the simulation cycle. It uses state-of-the-art partitioning libraries to balance the overall simulation load and can be extended with further functionality, such as mesh manipulation methods or a connection to a fault tolerance framework.

In this document we sketch the features and properties of PPStee and discuss the advantages and disadvantages of its architecture. We illustrate its integration into a simulation workflow, covering both the data flow through PPStee and the actual implementation, using a basic usage example. We also describe the current status of the software and future work.

D5.1.4 Pre-processing: revision of system, data format and algorithms definition for exascale systems

In CRESTA Deliverable 5.1.1, we analysed the current situation of simulations with regard to pre-processing and gave a system definition: the main aim is a tighter simulation cycle that includes all simulation parts and an improved overall simulation load balance. CRESTA Deliverable 5.1.2 studied the algorithms of the partitioning libraries used for pre-processing so far and identified the basic properties required of the graph data format. These requirements culminated in the development of the prototype pre-processing steering interface PPStee, introduced in CRESTA Deliverable 5.1.3.

Here we review the design of PPStee and collect performance data to evaluate this prototype tool. The integration of PPStee into HemeLB was relatively simple, as intended, and allows for performance tests of HemeLB with various geometries and all three partitioners supported by PPStee: ParMETIS, PTScotch and Zoltan. Runtime measurements with up to 2048 cores on HECToR are presented as first results. PPStee’s runtime overhead is negligible, so PPStee can be used without a priori drawbacks. The configuration using PTScotch performs, in general, slightly worse and reveals scalability issues starting at 512 cores. HemeLB with PPStee using Zoltan suffers from a constant runtime penalty, the reason for which is not yet known. Further investigations will focus, in particular, on graph data conversion, scalability and the use of partitioner-specific routines and parameters to enable a better match to specific simulation data.

Lastly, we address CRESTA’s co-design vehicle OpenFOAM. Simulations using OpenFOAM are not a priori suited to the application of PPStee, because OpenFOAM is a collection of separate solver and utility tools. However, PPStee may be applicable if all phases of a simulation using the OpenFOAM framework are aggregated into one monolithic program. The OpenFOAM co-design team is currently investigating the feasibility of such a monolithic program.

D5.2.3 Post-processing: first prototype tools for exascale interactive data exploration and visualisation

This deliverable is a software deliverable, providing a software prototype for exascale interactive data exploration and visualisation. The main purpose of this associated report is to present and document the prototype software, which has been developed using a co-design process with the HemeLB code.

This software builds on two previous deliverables associated with Task 5.2 within Work Package 5 (D5.2.1 and D5.2.2). These described and studied the challenges, system requirements, system architecture and data structures for exascale data post-processing, and served as a theoretical foundation for the subsequent software development and design.

In accordance with the previous deliverables, this software aims to provide in-situ processing of the simulation data, interactive visualisation for exascale CFD simulations, and computational steering of the ongoing simulation. In-situ post-processing and interactive visualisation give the user the possibility of exploring simulation results on the fly, while computational steering allows the user to change and modify an ongoing simulation by modifying simulation parameters.

In this deliverable, we deliver a software prototype that was co-designed and integrated into HemeLB. This prototype provides the fundamental structure for interactive data post-processing in HemeLB, allowing developers to evaluate the design of our proposed post-processing system and visualisation algorithms. We present an initial attempt to visualise a HemeLB simulation with a newly implemented visualisation and steering client. We also outline future plans and ongoing work on the software implementation.

D5.2.4 Post-processing: revision of system, data format and algorithms definition for exascale systems

This deliverable reviews the system definition and post-processing algorithms proposed in Deliverable D5.2.1. In this work package, we study and evaluate the post-processing system architecture as well as the visualisation algorithms. To evaluate the reliability and compliance of the system, we test the prototype post-processing tools that were co-designed for HemeLB.

Interactive data exploration and visualisation are two major goals of exascale data post-processing. While pre-processing of the simulation focuses on mesh creation and partitioning, post-processing targets visualisations of the simulation outputs, which serve as tools to explore and analyse the simulation results.

In an exascale environment, real-time visualisation of the simulation mesh, its partitioning and the intermediate simulation results is important for an ongoing simulation. It not only makes it possible to analyse intermediate simulation results, but also enables the user to detect and foresee failures in a running simulation. In work package 5, we have focused on developing user tools that provide in-situ and interactive post-processing for analysing running simulations.

In the past months of the project, work package 5 has established a major co-design collaboration with HemeLB, a lattice-Boltzmann-based fluid dynamics simulation code. We developed our post-processing system and algorithms to provide interactive and in-situ visualisation for HemeLB. However, the proposed ideas and algorithms are not limited to this single type of solver; they can also be applied to other fluid simulation solvers, such as OpenFOAM.

D5.3.3 Remote hybrid rendering: first prototype tool

This document accompanies the software delivered as the first prototype of remote hybrid rendering.

Remote hybrid rendering is used to access remote exascale simulations from immersive projection environments over the Internet. The display system may range from a desktop computer to an immersive virtual environment such as a CAVE. The display system forwards user input to the visualisation cluster, which uses highly scalable methods to render images of the post-processed simulation data and returns them to the display system. The display system enriches these with context information before they are shown.

Together with the documentation extracted from the source code in the appendix, this document describes the first prototype for remote hybrid rendering. It has been implemented as plug-ins to the virtual reality renderer OpenCOVER of the visualization system COVISE. The source code of these plug-ins is open and can be retrieved from the CRESTA project subversion repository.

While implementing the prototype, some changes to the protocol draft for remote hybrid rendering became necessary.

Future versions of the tool will be improved regarding bandwidth requirements and scalability.

D5.3.4 Remote hybrid rendering: revision of system, protocol definition for exascale systems

Remote hybrid rendering (RHR) is developed to access remote exascale simulations from immersive projection environments over the Internet. The display system may range from a desktop computer to an immersive virtual environment such as a CAVE. The display system forwards user input to the visualisation cluster, which uses highly scalable methods to render images of the post-processed simulation data and returns them to the display system. The display system enriches these with context information before they are shown. This technique decouples interaction from rendering of large data and is able to cope with growing data set sizes as the amount of data transfer scales with the size of the output images.

Since D5.3.3, a prototype of RHR has been available. This document describes its implementation and the algorithms developed for this prototype, especially for compressing depth images. While implementing the prototype, some changes to the protocol draft in D5.3.2 for RHR also became necessary; this document lists the required revisions. In addition, the performance of the prototype is examined.

Future versions of the RHR tool will be improved regarding bandwidth requirements and scalability.

D6.1.2 Roadmap to exascale (Update 1)

This document contains an update to the initial roadmap for the CRESTA codes described in Deliverable D6.1.1. The main progress and main updates to the original roadmaps for the separate codes are summarized in Section 1.1. Actions related to co-design progress for each application are summarized in Section 1.2.