Deliverables

The following documents are the deliverables from Year 3 of the CRESTA project:

D2.5.2 – Fault-Tolerance and Operating Systems, Programming Models and Integration

As computational performance is raised through increased parallelism, single-core failures in modern supercomputers have become a more frequent and more expensive problem, because the number of components in a system grows with the core count. Since the mean time between failures (MTBF) of each individual component does not improve as fast as the number of components in a supercomputer grows, the overall MTBF of the system shrinks.
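The scaling argument can be illustrated with a toy model: if component failures are independent and exponentially distributed, failure rates add, so the system MTBF is the component MTBF divided by the component count (the numbers below are illustrative, not measurements):

```python
def system_mtbf(component_mtbf_hours: float, n_components: int) -> float:
    """System MTBF under independent exponential failures:
    failure rates add, so the combined MTBF is the component
    MTBF divided by the number of components."""
    return component_mtbf_hours / n_components

# A component MTBF of 10 years (~87,600 h) sounds comfortable,
# but across 100,000 nodes the system fails roughly every hour.
print(system_mtbf(87_600.0, 1))        # 87600.0
print(system_mtbf(87_600.0, 100_000))  # 0.876
```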

Failure prevention is employed in many parts of the machine, but the standard Message Passing Interface (MPI) mechanism for dealing with faults is to abort the entire computation if any of its ranks encounters a failure. These failures are traditionally handled with checkpoint/restart techniques; however, as the overhead of such implementations grows with core count, they become increasingly inefficient at scale.

Fault-tolerant parallel distributed memory models enable a code to recover from failures and continue execution even though parts of the system have been permanently lost. Although not yet part of the MPI standard, there is an active working group on Fault-Tolerant (FT) MPI. The work presented here assesses different developments in fault-tolerant parallel execution models and shows the obstacles that must be resolved before they can be applied to user codes.

Furthermore, current trends in different distributed memory approaches are presented, and a classification of error treatment at different system levels is given. The responsibilities of operating systems, communication runtime environments, communication libraries and application environments are specified. The presented work concentrates on operating systems and communication libraries, with particular attention to applicability to the CRESTA co-design applications.

D2.5.3 – Proactive Fault Tolerance

Due to the massive scale of envisioned exascale computing systems, hardware and software faults are expected to be the rule rather than the exception, making it necessary to improve the resiliency of exascale applications. However, current approaches to tolerate faults in HPC applications are limited to transparent hardware mechanisms that exhibit low overheads such as ECC protection of memories or network links, or global checkpoint/restart mechanisms to deal with errors at the application level. Due to the limited I/O bandwidth of exascale systems, the latter mechanism is expected to be unusable without significant improvements in scalability.

Proactive approaches to fault tolerance are based on predicting future failures and taking preventive action to avoid the impact of these failures on running applications. These methods offer the possibility of reducing the apparent fault rate seen by the application and thus lower the cost of reactive approaches such as checkpoint/restart mechanisms.
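A common first-order model of this cost, not specific to this deliverable, is Young's approximation for the optimal checkpoint interval, τ ≈ √(2δM), where δ is the time to write a checkpoint and M is the (apparent) mean time between failures; raising M through failure prediction directly stretches the interval and reduces checkpoint overhead:

```python
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal
    checkpoint interval: tau = sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# With a 60 s checkpoint cost and a 1 h apparent MTBF:
print(young_interval(60.0, 3600.0))   # ~657 s
# Predicting (and avoiding) 90% of failures raises the
# apparent MTBF tenfold and stretches the interval ~3.2x:
print(young_interval(60.0, 36000.0))  # ~2078 s
```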

Large-scale parallel applications are almost universally composed of many independent processes executing on a distributed-memory machine. This allows all application processes to be evacuated from a node using process migration schemes, preventing the failure of the node from affecting the application. We investigated two different approaches to accomplishing this migration: migrating bare-metal application processes in a heterogeneous system, and using industry-standard virtualisation techniques to migrate complete virtual machines.

Proposed exascale system architectures use heterogeneous mixes of processing elements such as general-purpose CPU cores and GPGPU-style accelerator cores to improve performance and energy efficiency, a trend already manifested in today's high-end HPC systems. Since GPGPU and other thin cores are likely to lack features needed for efficient execution of OS services, programs on such cores generally delegate the task to general-purpose CPU or fat cores. By extending this mechanism it is possible to decouple application processes from the underlying OS kernel and apply different resiliency strategies to both parts, which is the inspiration for the first approach we examined.

Virtualisation techniques, on the other hand, are at the core of the emerging cloud computing infrastructures that already today deploy data centres with 100,000 servers, which is in the range of the expected node count for exascale systems. General-purpose CPU vendors have therefore included effective hardware support for virtualisation in their processor designs, which may be exploitable by HPC systems using the same hardware.

Our experiments indicate that both approaches are feasible and can be carried out without modification to the application code, which protects investments in application software. This flexibility comes at the price of some performance loss, which can be minimised by improvements to the underlying hardware and software. Further performance improvements could be possible through minor modifications to the application code to take problem-specific knowledge into account when migrating application processes. On the other hand, our findings indicate that unrelated application mechanisms, such as load balancing or checkpointing, could also be exploited to implement process migration.

D2.6.3 – Power measurement across algorithms

This deliverable seeks to summarise the need for power and energy measurements on a variety of contemporary architectures. These architectures provide guidance to future exascale system design choices. We demonstrate that even on these architectures, there is now a clear tradeoff between runtime performance and energy or power consumption for different applications and algorithms. This shows that the time is now right for application developers to begin to consider these metrics when selecting algorithms for computation, even if the simple model that we introduce shows that the current savings are not yet significant enough to warrant different charging models for HPC systems.
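The trade-off between runtime performance and energy can be made concrete with a toy model (the numbers below are hypothetical, not measurements from this deliverable): total energy is average power times runtime, so a slower, lower-power configuration can win on energy while losing on time-sensitive metrics such as the energy-delay product.

```python
def energy_j(avg_power_w: float, runtime_s: float) -> float:
    """Total energy consumed: average power x runtime."""
    return avg_power_w * runtime_s

def edp(avg_power_w: float, runtime_s: float) -> float:
    """Energy-delay product: an energy metric that penalises slow runs."""
    return energy_j(avg_power_w, runtime_s) * runtime_s

# Hypothetical configurations of the same computation:
fast = (300.0, 100.0)   # high clock: 300 W average, 100 s
slow = (180.0, 150.0)   # lower clock: 180 W average, 150 s

print(energy_j(*fast), energy_j(*slow))  # 30000.0 27000.0 -- slow saves energy
print(edp(*fast), edp(*slow))            # 3000000.0 4050000.0 -- fast wins on EDP
```

Which configuration is "better" thus depends on the metric chosen, which is exactly why algorithm selection needs to consider energy alongside runtime.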

Whilst overall energy and power consumption are appropriate measurements for single-purpose, standalone benchmark codes, real application developers require energy and power information to be integrated with other application-tracing performance data.

In Section 3, we describe how the Score-P and Vampir performance measurement and presentation tools have been adapted to provide a clear, easy-to-use view of evolving power and energy consumption during application execution.

In Section 4, we focus on two examples of current HPC architectures: Cray XC30 nodes containing either CPUs only or a CPU plus a GPU accelerator. We consider a number of different algorithm and runtime configuration choices and show how energy and power consumption are affected. We also dynamically varied the processor clock speed. We present a number of different metrics by which algorithm choice can be tuned, relevant to different HPC scenarios.

In Section 5, we consider in more detail the power consumption of a multi-core processor. Using an eight-core Sandy Bridge processor, we measured its power consumption when the algorithmic kernel operations ran on up to eight cores at all available frequencies and with data in different levels of the memory hierarchy. We introduce a new approximation formula for the power consumption, which fits the kernel operations considered very well. The results show that the question of energy consumption of both algorithms and hardware is not trivial. We conclude that a more flexible strategy for choosing the optimal frequency and the number of active cores could lead to more efficient energy consumption in supercomputers.
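We do not reproduce the deliverable's approximation formula here, but a generic model of the same flavour illustrates the shape of the problem: a static term plus a per-core dynamic term that grows roughly cubically with frequency (dynamic power scales as C·V²·f, and the voltage itself rises with frequency). The coefficients below are placeholders, not fitted values from the deliverable.

```python
def power_w(n_cores: int, freq_ghz: float,
            p_static: float = 20.0,   # placeholder: package static power (W)
            a: float = 1.5,           # placeholder: per-core base cost (W)
            b: float = 0.8) -> float: # placeholder: dynamic coeff. (W/GHz^3)
    """Generic multi-core power model: a static part plus a
    per-core term with a cubic frequency dependence."""
    return p_static + n_cores * (a + b * freq_ghz ** 3)

print(power_w(8, 2.7))  # eight cores at 2.7 GHz
print(power_w(4, 1.6))  # half the cores at a lower clock
```

Searching over (n_cores, freq) pairs in such a model, subject to a runtime constraint, is the kind of flexible frequency/core selection the deliverable argues for.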

In Section 6, we test the performance and energy usage of the NAS Parallel Benchmarks on low-power technologies that have their origins in the mobile processing world. Using two different quad-core ARM CPUs, one optimised for performance and the other for energy efficiency, we investigate the trade-off between performance (i.e. wall time to solution) and total energy consumed. Although the best absolute performance is always delivered by the more powerful A15 CPU, for certain workloads (and if pure performance is not a constraining factor) the low-power A7 CPU provides a real alternative in terms of overall energy efficiency. Although mobile processors are not a realistic resource for production HPC workloads, the results obtained on the ARM CPUs confirm that it is indeed possible to trade performance for energy efficiency (and vice versa).

We conclude that current CPUs (whether designed for HPC or low-power environments) are still far from satisfying the stringent performance and power budget requirements for an exascale architecture. Nonetheless, we do see that changes in algorithm and runtime configuration do have a marked effect on application energy efficiency of contemporary processors. We therefore feel that the field is now at a point where a meaningful co-design process can begin for the energy-efficient hardware and applications that are needed for an exascale era. For this to work, application developers need detailed feedback on how their changes affect power consumption, and the tools described in this deliverable are a vital step in this direction. Finally, a low-level understanding of power consumption (as we have begun here) will be needed for systemware to provide energy-efficiency supporting interfaces between the hardware and the applications.

D3.6.2 – Domain Specific Language (DSL) for expressing parallel auto-tuning

This document describes a domain-specific language (DSL) that serves as the central component of an autotuning framework for the tuning of parallel applications. We describe what features the DSL was designed to provide, how it fits within a wider autotuning framework and outline the initial implementation.

Our initial approach was to start from scratch, without detailed reference to, or consideration of, existing autotuning technology, instead starting from a specific set of requirements we considered important. One reason for this is that the remit of the CRESTA project is to define a distinct European approach. We also need to be sure that we can control (or define) an environment that will support particular aspects of tuning parallel applications.

The following sections introduce and outline the scope and objectives of WP3 task 3.2.1 and give an overview of what this deliverable addresses. We then consider the objectives for the DSL and move on to describe the specification and how it would be implemented within an autotuning framework.

This document is an update of the initial DSL specification (D3.6.1). Since that work we have incorporated changes based on experience in two areas:

  1. We produced a mock-up implementation as a platform for testing the practicality of the DSL and made it available to CRESTA partners.
  2. The mock-up script was used in the tuning of NekBone (D3.5.2), and new feature and implementation requests arose from this work.

D3.7 – Frameworks for Exascale Applications

This deliverable reports on the development, extensions and modifications to different frameworks that have been developed by CRESTA WP3 to enable efficient execution of parallel applications on exascale machines.

We first describe the development of a new framework, called "targetDP", for expressing thread- and instruction-level parallelism in lattice-based codes. CRESTA's participation in standardization committees, such as the MPI Forum and the OpenACC and OpenMP committees, is briefly described.

We present a first mock-up implementation of the CRESTA DSL specification to enable automatic tuning of OpenACC codes.

We describe the software architecture and performance of two components of the CRESTA run-time system: the runtime administration and monitoring components.

Extensions and modifications to the Score-P and Vampir performance monitoring and analysis tools are presented. To deal effectively with large amounts of data from performance hardware counters, selective instrumentation and monitoring, hierarchical buffer management, runtime event reduction and message matching have been implemented. In addition, we report on how to handle file-system limitations, how to support performance monitoring for new programming systems and for applications using hybrid approaches, and how to monitor energy and network performance hardware counters.

Together with the extensions and modifications to the Allinea DDT and Dresden Technical University MUST debuggers, we describe the integration of the MUST MPI correctness checker into the Allinea DDT parallel debugger.

D3.8 – Final release of adaptive runtime systems

In the CRESTA project, part of the effort was devoted to designing different frameworks for the CRESTA development environment for exascale applications. This is a software deliverable, delivering the final release of the runtime environment. In the associated report, we present the installation instructions for the runtime system environment developed within the CRESTA project.

The runtime environment developed within this project has been shown to benefit applications at medium to large scale and represents a significant step on the software's roadmap towards use with applications at the scale of future exascale platforms.

D3.9 – Final Release of Performance Analysis Tools

This deliverable reports on the availability of, and access to, the final versions of the performance analysis tools developed in WP3 of the CRESTA project. This includes two tools:

  • Score-P and
  • Vampir

Score-P is a highly scalable tool for monitoring parallel applications. It supports a wide range of programming languages and parallel programming paradigms. In addition, Score-P provides different monitoring modes, namely profiling, tracing and online monitoring.

Vampir is a graphical tool to visualize and analyze, post mortem, applications monitored with Score-P. It provides different displays and techniques to visualize the details of highly parallel applications in a scalable and user-friendly way.

D3.10 – Final Release of Debugging Tools

This deliverable reports on the availability of and access to the final versions of the debugging tools developed in WP3 of the CRESTA project. This includes two tools:

  • Allinea DDT and
  • MUST

Allinea DDT is a highly scalable debugger for parallel applications. It supports a wide range of programming languages and parallel programming paradigms. MUST provides automatic correctness checks that detect a wide range of MPI usage errors.

These tools aid efficient development workflows for highly scalable applications. Allinea DDT can be applied to investigate the root cause of faulty application behaviour. Without such a tool, developers can spend large amounts of time finding the root cause of an error. If an error only occurs at extreme scale (say, above 100,000 processes), developers may even give up.

MUST, on the other hand, serves both to analyze the cause of an application crash or an incorrect result (if the issue is MPI related) and to regularly check applications for hidden errors. Without this tool, many common mistakes have a high chance of going unspotted, especially type-matching and communication-buffer errors.

The released versions of the tools include the extensions that the CRESTA proposal describes. This includes functionality extensions, such as novel correctness checks for MUST; usability extensions, such as the Allinea DDT-MUST integration; and scalability improvements, such as the scalable versions of the MUST correctness checks. Further, we monitored the results of WP2 to understand available fault-tolerance mechanisms. However, at the current point in time, it is not clear whether existing checkpoint-restart mechanisms for fault tolerance will suffice, or whether exascale software will require novel techniques. Both Allinea DDT and MUST support checkpoint-restart schemes, and both can often continue to operate, or run to completion, if a fault occurs.

D3.11 – Experiences With Benchmarks and Co-design Applications

This deliverable reports on the experiences gained from applying the methods and tools developed in WP3 to benchmarks and to the co-design CRESTA applications developed in WP6. We describe the experience with benchmarks and applications for each WP3 task ("Programming models", "Compilation and runtime environments", "Performance analysis tools", "Debuggers"). For each framework developed in WP3, a critical review of outstanding issues is performed and future research directions are outlined.

We first describe the experiences gained with the PGAS programming model by developing a Coarray Fortran benchmark suite, using Coarray Fortran in the IFS application to calculate Legendre transforms, and implementing Fast Fourier Transforms in UPC. In addition, we report the first results of using the targetDP programming framework in Ludwig, a lattice Boltzmann application.

We investigate the use of compiler support for GPU programming by porting NekBone, a skeleton version of the Nek5000 code, to multi-GPU systems and present the performance results. We describe the co-design work with OpenACC in GROMACS. We use a first implementation of an auto-tuning system for OpenACC code to tune the OpenACC version of the Nek5000 code. The co-design work involving the development of the adaptive runtime system and Nek5000 is described, and the use of different components of the runtime system in benchmarks is presented.

The new features of Score-P and Vampir (support for new programming systems and new hardware counters, selective monitoring and enhanced scalability) are used in CRESTA applications: Nek5000, OpenFOAM, IFS, HemeLB, Gromacs.

The Allinea DDT and MAP tools and the MUST correctness checker are used in the HemeLB CRESTA application to detect and analyze software errors and check correctness in large-scale HemeLB simulations.

D4.3.2 – Community prototype of exascale algorithms and solver (software)

This deliverable describes a software package that is a prototype for linear solvers and FFTs intended to run on exascale systems. The software is available on a GitLab server at: http://gitlab.excess-project.eu/numlibs/

The CEL library is used within the CRESTA project firstly to support co-design applications in the field of numerical libraries and secondly as a framework for the development and evaluation of some of the other new and promising disruptive technologies that can be used to improve the efficiency of parallel applications.

D4.4.2 – Community prototype for optimized reduction approaches

Collective reduction operations, such as the global summation of a collection of floating-point numbers, are important operations in numerical simulations, used for instance as a convergence criterion to control iterative numerical solvers.

The summation of floating-point numbers in particular suffers from inaccuracy due to limited numerical precision and round-off errors. Numerical schemes to mitigate the effects of round-off errors in user code and MPI reduce operations were discussed in a previous document, D4.4.1 [1].

Based on the work presented in D4.4.1, we have developed the High-Precision Reduction library (libHPR) which offers various algorithms for critical local reductions. Further, we have developed wrappers for MPI collective reductions (libHPRmpi) that allow users to replace the MPI collective reduction, specifically for summation, with a high-precision version.

It is worth noting that both the high-precision optimized version of MPI reduce and the high-precision local summation routines are slower than their standard counterparts. The user thus needs to trade off performance against accuracy on a case-by-case basis.
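As an illustration of the kind of compensated scheme such a library can employ, here is generic Kahan summation in Python (a sketch only; libHPR's actual algorithms and interfaces may differ):

```python
def kahan_sum(values):
    """Kahan compensated summation: a running correction term
    recovers the low-order bits lost in each addition."""
    total = 0.0
    c = 0.0  # compensation for lost low-order bits
    for x in values:
        y = x - c
        t = total + y
        c = (t - total) - y  # what was lost when adding y
        total = t
    return total

# Each 1e-16 is below half an ulp of 1.0, so naive summation
# drops every one of them; the compensated sum retains them.
vals = [1.0] + [1e-16] * 10
print(sum(vals))        # 1.0 -- all small addends lost
print(kahan_sum(vals))  # ~1.000000000000001
```

The extra operations per element are the performance cost the text refers to: accuracy is bought with additional floating-point work.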

This document is organized as follows: Section 2 gives a brief introduction to the topic; Section 2.3 lists the various routines provided by the library and gives details about how to obtain and use the libraries; finally, in Section 4, we briefly summarize and draw conclusions.

D4.5.4 – Non-blocking collectives in one or more of the co-design applications

Deliverable 4.3.2 contains a description of the work done under CRESTA in the field of non-blocking collectives. The experiences and insights gained in this field have been incorporated into two papers, "Benchmarking MPI Collectives" (1) and "MPI collectives at scale" (2), which were presented at the Supercomputing Conference 2014.

The objective of this document is to collect key information about non-blocking collectives and their possible use in High Performance Computing. Detailed information can be obtained in the above-mentioned papers.

In sections 3 and 4 we study the influence of process entry times on some existing collective operations, and the possibility of hiding the parallel overhead caused by the operation by overlapping the communication with computation or other work.

Information about the use of the non-blocking collective operation MPI_Iallreduce in the iterative solvers of NekBone (3) and the CEL library (4) is provided in section 5.
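The canonical usage pattern behind these integrations is: post the reduction, perform independent computation, then wait for the result. The sketch below illustrates this pattern in plain Python, with a background thread standing in for the MPI progress engine; real code would use MPI_Iallreduce and MPI_Wait, and the helper function here is purely hypothetical:

```python
import threading

def overlapped_allreduce(local_value, partner_values, independent_work):
    """Sketch of the MPI_Iallreduce usage pattern: start the
    reduction, overlap it with independent computation, then
    wait for the result. A thread stands in for the MPI
    progress engine."""
    result = {}

    def reduce():  # the "communication", proceeding in the background
        result["sum"] = local_value + sum(partner_values)

    req = threading.Thread(target=reduce)
    req.start()                 # like MPI_Iallreduce: post the reduction
    work = independent_work()   # overlap: compute while it progresses
    req.join()                  # like MPI_Wait: reduction now complete
    return result["sum"], work

total, w = overlapped_allreduce(1.0, [2.0, 3.0], lambda: 42)
print(total, w)  # 6.0 42
```

The benefit in a real solver depends on how much independent work exists between posting the reduction and needing its result, which is exactly what the studies in sections 3 and 4 quantify.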

Section 6 contains information about the integration of non-blocking collectives into a production application, HemeLB (5).

D5.1.5 – Pre-processing: final tools for exascale mesh partitioning and mesh analysis

CRESTA deliverable D5.1.3 [3] introduced PPStee in a first prototype version. The pre-processing interface PPStee is designed to balance the load of the overall simulation. It specifically includes all simulation parts, thereby extending load balancing from the simulation core to pre-processing, post-processing and visualisation tasks. Well-known third-party partitioning libraries are used to calculate the data distribution required for load balance. Deliverable D5.1.4 [4] presented a first analysis of PPStee's features and performance results and marked points where further investigation would be crucial. The most important points were test runs on significantly higher core counts and the integration of PPStee into another CRESTA co-design code. Both are addressed in this document.

This software deliverable, D5.1.5, provides a feature-finalised version of the PPStee software. The PPStee source code and its documentation can be downloaded from the CRESTA Subversion repository (/wp5/preprocessing). We have eliminated a number of bugs and included further software tests to improve compliance with CRESTA's software standards. Based on our observation of HemeLB using PPStee, we spotted parts that needed improvement; hence we implemented Zoltan query functions and fully revised the weights management. Our main focus, however, was on the two investigation points identified by the analysis in D5.1.4. First, we performed test runs of HemeLB with PPStee on higher core counts and larger geometries, giving better insight into scalability. Second, we integrated PPStee into the development version of CRESTA's co-design vehicle Nek5000 that uses p4est [10], a tool for mesh analysis and mesh manipulation. We present here the first performance results.
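The role of the weights is to let the partitioner balance heterogeneous per-item costs across processes. As a purely illustrative toy (PPStee itself delegates partitioning to third-party libraries such as the Zoltan library mentioned above), a greedy weighted partitioner can be sketched as:

```python
import heapq

def greedy_partition(weights, n_parts):
    """Toy weighted partitioner (illustrative only): assign
    each item, heaviest first, to the currently lightest part."""
    heap = [(0.0, p) for p in range(n_parts)]  # (load, part id)
    heapq.heapify(heap)
    assignment = [None] * len(weights)
    for i in sorted(range(len(weights)), key=lambda i: -weights[i]):
        load, p = heapq.heappop(heap)
        assignment[i] = p
        heapq.heappush(heap, (load + weights[i], p))
    return assignment

# Cells with unequal computational cost spread across 2 parts:
print(greedy_partition([5.0, 3.0, 3.0, 2.0, 1.0], 2))  # [0, 1, 1, 0, 1]
```

Real mesh partitioners additionally minimise the communication volume between parts, which a greedy load-only scheme ignores entirely.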

D5.1.6 – Pre-processing: tool evaluation and investigation with application data

Following the analysis and system definition of pre-processing for exascale systems in [1] and a study of data formats and algorithms specifically targeting CRESTA's co-design applications in [2], the pre-processing interface PPStee was developed and introduced in [3]. It is designed to balance the load of the overall simulation, specifically including the communication and computation costs of all simulation parts. [4] and [5] provided first integrations into HemeLB [9] and Nek5000 [14], accompanied by simulation runs with HemeLB on up to 12k cores of HECToR [10] and ARCHER [11].

We continued our effort with HemeLB and investigated the results of weighted decomposition. The results of simulation runs on up to 24k cores of ARCHER with geometries containing up to 5.6 million lattice sites are presented and analysed. Additionally, we provide a new example of a PPStee application. Together with two small scripts and the HemeLB pre-processing tool protopart, this example forms a tool chain that enables a thorough a priori analysis of the partitioning and, ultimately, ensemble simulation runs of HemeLB.

Furthermore, we performed simulation runs of the CRESTA-modified version of Nek5000 that includes adaptive mesh refinement and PPStee partitioning. The results of runs on up to 48k cores of ARCHER confirm the simple integration and general applicability of PPStee and its usefulness for comparing the partitioning quality of the supported partitioning libraries.

D5.2.5: Post-processing tools for interactive data visualization and exploration

In this deliverable, we present the post-processing tools and systems designed to enable data exploration and post-processing towards exascale. Providing run-time inspection of the ongoing simulation and enabling interactive exploration of the simulation are two major challenges for post-processing of large simulations.

While pre-processing of the simulation focuses on mesh creation and partitioning, post-processing is targeted at providing visualisations of the simulation outputs, serving as a tool to explore and analyse the simulation results.

A key concept in data post-processing for exascale simulation is to provide in-situ data inspection and to minimize moving data around. Inspecting simulation results at run time and providing a first visualization allows the simulation experts to monitor the progress of the running simulation and to detect failures early.

In this work package, we present the in-situ monitoring tool and the underlying system designed for HemeLB, which is also applicable to large simulations in general. Such an online monitoring client provides the user with in-situ inspection of the ongoing simulation. It does not require writing data to local storage; instead, a rendered image cache is streamed to the monitoring client, thus minimizing data transfer and keeping maximum data locality on large parallel systems.

In comparison to Deliverable 5.2.4, which presented a review of the system and its algorithms, this deliverable continues with the software development of the online monitoring parts and the front-end interactive applications. Aside from implementing the in-situ monitoring tool, we further integrate the scientific visualization algorithm, which allows further investigation of the HemeLB simulation output. While the in-situ monitoring provides a first step in inspecting run-time simulation results, an integrated visualization system in virtual environments provides the user with an intuitive, explorative perception of each simulation time step. Moreover, we also present the recently initiated co-design work with an ELMFIRE simulation.

D5.2.6: Post-processing: tool evaluation and investigation with application data

In this deliverable, we present the evaluation of our developed in-situ post-processing software infrastructure designed for enabling data exploration and post-processing towards exascale.

As the last deliverable for work package 5.2, we give a detailed review of the software architecture and evaluate our in-situ post-processing algorithms during an ongoing simulation. We focus on evaluating the time needed to prepare the in-situ operations, to extract features, and to render. We investigate the scalability and interactivity of the proposed in-situ processing tools. All proposed methods have been evaluated using HemeLB as the core application for the in-situ analysis.

D5.3.5 – Remote hybrid rendering: final tools

This document accompanies the software delivered as the final tool for remote hybrid rendering (RHR).

RHR is used to access remote exascale simulations from immersive projection environments over the Internet. The display system may range from a desktop computer to an immersive virtual environment such as a CAVE. The display system forwards user input to the visualisation cluster, which uses highly scalable methods to render images of the post-processed simulation data and returns them to the display system. The display system enriches these with context information before they are shown. RHR decouples local interaction from remote rendering and thus guarantees smooth interactivity during exploration of large remote data sets.

Together with the documentation extracted from the source code, this document describes the final tool for remote hybrid rendering. The client has been implemented as a plug-in to the OpenGL-based virtual reality renderer OpenCOVER [5] of the visualization systems COVISE [6] and Vistle [14]. A server is also implemented as a plug-in for OpenCOVER. It can be used together with a compositor plug-in to scale performance with the number of available GPUs. The source code of these plug-ins is open and can be retrieved from the CRESTA project Subversion repository.

In order to scale with the number of nodes on systems that do not provide OpenGL support, a CPU-based data-parallel interactive ray-casting render module for Vistle has been implemented. This renderer also provides a server for remote hybrid rendering. This software is available as part of Vistle from its publicly accessible GitHub repository [14].

Synchronization between the nodes attached to a tiled display naturally happens in the client application, as all data transfer is funnelled through the head nodes of the local and remote systems [3]. On the other hand, reprojection of 2.5D images according to current viewing parameters automatically brings all tiles into a synchronized state.

While implementing the first prototype of RHR for D5.3.3 [4], it became clear that the protocol proposed in D5.3.2 [3] on the basis of the DoW had to be changed: interaction (including multi-touch interaction) and head tracking have to be handled by the client application, as these also affect the handling of the local context information. In D5.3.4 [5] the protocol was updated to reflect this. Hence, the RHR protocol sends only viewing parameters, derived from user interaction and head tracking, from client to server; the server responds with 2.5D images, which are merged with locally rendered content.

This design enables the cooperation of light-weight renderers with display programs that contain most of the application logic and interaction handling. This allows for easy integration of RHR with a multitude of applications that operate on a 3-dimensional domain. The sole requirement is that the application is able to generate colour images together with depth data describing the distance of the visible pixels to the viewer. Within CRESTA, this would allow the use of RHR to be extended from OpenFOAM to all other applications, especially HemeLB, as this already comes with its own integrated image generator.

Early experience gained during development shows that the performance of the system could benefit from further latency reduction through better compression and more overlap between rendering and compression/transmission. Additionally, it seems worthwhile to provide a software framework that enables easy integration of an RHR server into existing rendering software.

D5.3.6 – Remote hybrid rendering: tool evaluation and investigation with application data

Remote hybrid rendering (RHR) is used to access remote exascale simulations from immersive projection environments over the Internet. The display system may range from a desktop computer to an immersive virtual environment such as a CAVE [10]. The display system forwards user input to the visualization cluster, which uses highly scalable methods to render images of the post-processed simulation data and returns them to the display system. The display system enriches these with context information rendered locally, before they are shown. RHR decouples local interaction from remote rendering and thus guarantees smooth interactivity during exploration of large remote data sets.

The protocol for RHR only sends viewing parameters, derived from user interaction and head tracking, from client to server; the server responds with 2.5D images, which are merged with locally rendered content. This design enables the cooperation of lightweight renderers with display programs that contain most of the application logic and interaction handling, and thus allows for easy integration of RHR with a multitude of applications that operate on a 3-dimensional domain. The sole requirement is that the application is able to generate colour images together with depth data describing the distance of the visible pixels to the viewer.
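The depth-based merge at the heart of this protocol can be sketched as follows. This is an illustrative stand-in, not the actual RHR implementation; the function name and flat-list data layout are assumptions for the sake of the example.

```python
# Sketch of a 2.5D image merge: for each pixel, the fragment closer to
# the viewer (smaller depth) wins, so locally rendered context geometry
# can correctly occlude, or be occluded by, the remote image.

def merge_25d(remote_colour, remote_depth, local_colour, local_depth):
    """Composite a remote 2.5D image with locally rendered content.

    Each *_colour argument is a flat list of RGB tuples; each *_depth
    argument is a flat list of per-pixel distances to the viewer.
    """
    merged_colour, merged_depth = [], []
    for rc, rd, lc, ld in zip(remote_colour, remote_depth,
                              local_colour, local_depth):
        if rd <= ld:                 # remote fragment is closer: keep it
            merged_colour.append(rc)
            merged_depth.append(rd)
        else:                        # local context geometry wins
            merged_colour.append(lc)
            merged_depth.append(ld)
    return merged_colour, merged_depth

# Two-pixel example: remote wins on the first pixel, local on the second.
colour, depth = merge_25d(
    remote_colour=[(255, 0, 0), (0, 255, 0)], remote_depth=[0.2, 0.9],
    local_colour=[(0, 0, 255), (255, 255, 0)], local_depth=[0.5, 0.4])
```

Because only colour and depth planes cross the network, any renderer able to export these two buffers can act as an RHR server, which is exactly the "sole requirement" stated above.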

To evaluate RHR, the distributed-memory parallel visualization tool Vistle [2] has been implemented. RHR is combined with a scalable rendering system employing sort-last parallel rendering. With a CPU-based remote ray caster [21], the extraction of isosurfaces and cutting surfaces can be controlled interactively from virtual environments. This system was used successfully on various display systems. Interaction is smooth due to high local display update rates. Cutting surfaces and isosurfaces are generated based on input from within the virtual environment. The goal of decoupling interaction from remote rendering latencies has been achieved.

Compared to classic remote rendering, RHR allows for lightweight rendering server implementations. In a context where the rendering server is replicated many times, e.g. for in situ visualization tasks, this is an advantage.

D6.1.3 – Roadmap to exascale (update 2)

This document contains an update to the two initial roadmaps for the CRESTA codes described in Deliverables D6.1.1 and D6.1.2. The main developments and application performance improvements achieved during the project, as well as an update to the original roadmaps for the separate codes, are summarized in Section 1.1. Activities related to the co-design progress are summarized in Section 1.2 for each application separately.

D6.4 – Exemplar scientific simulations

This document describes exemplar scientific simulations for each CRESTA application. In the final two deliverables of work package 6, i.e., "D6.1.3 Roadmap to exascale (update 2)" and "D6.5 Peta to exascale enabled applications", these simulations will be used to show the success of the software development carried out during the project.

In the following, we give a short overview of the simulations and their scientific impact for each application.

ELMFIRE: Optimising the structure of fusion reactors requires an understanding of turbulent plasma transport. With the exemplar scientific simulations in ELMFIRE, scientists seek to better understand the confinement properties of the fusion plasma. For computations at small scale, an excellent match between experiments and computational results has been observed. For modelling ITER-sized problems, exascale computing is needed.

GROMACS: With GROMACS, two different types of exemplar simulation are described. The first aims at a better understanding of liquid/surface interactions at the atomic level, and the second considers the modelling of state transfers in biomolecular systems. Both types of simulation have direct applications in medicine and in drug design. Exascale computational resources are needed for the modelling of realistic systems up to the continuum limit.

HemeLB: The exemplar simulations in HemeLB consider cerebrovascular blood flow in an authentic geometry. Since some aspects of such flows, such as wall shear stresses, are difficult to obtain or reproduce in a clinical setting, simulations allow an increased understanding and better treatment of conditions that can cause health issues, financial cost and ultimately even the death of a patient. Exascale computational resources are required in order to run the simulations within prespecified clinical time limits.

IFS: Finer grid resolutions combined with additional physical interactions are known to increase the accuracy of numerical weather forecasts. This can help save lives as well as result in capital savings. With the exemplar scientific simulations of IFS, using actual data from hurricane Sandy, the meteorologists aim to understand how better grid resolution and additional physics relate to the accuracy of the numerical model. Another objective is to study whether it is feasible to perform computations with increased resolution and complexity within ECMWF's operational requirement of computing 240 forecast days per day.

Nek5000: The simulation of incompressible fluid flow in pipes with a circular cross-section is important in many applications, such as gas and oil pipelines. However, solving the resulting turbulent flow problems at scale requires exascale computational resources. To this end, Nek5000 has been enabled to exploit GPUs using OpenACC directives.

OpenFOAM: Pump turbines are commonly used in hydro power plants to generate electricity. The exemplar simulation with OpenFOAM models a complete Francis pump turbine. This enables engineers to study the behaviour of the flow vortex and the resulting pressure pulsations in order to improve the turbine design. Since a long time evolution combined with a fine timescale is needed, exascale computational resources are required.

D6.5 – Peta to exascale enabled applications (Software)

This deliverable, "D6.5 Peta to exascale enabled applications (Software)", describes the status of the CRESTA applications at the end of the project. In short, the licensing policy, availability and the performance improvements achieved during CRESTA for each of the applications are:

ELMFIRE: Proprietary license; however, access is available on request. Contact details to use when applying for a license are given in the ELMFIRE section of this document. A domain decomposition version of ELMFIRE has been developed during CRESTA, resulting in a significant reduction in memory consumption.

GROMACS: Licensed under LGPLv2 and available in a public code repository. CRESTA modifications included in the main trunk. Performance and scalability (through ensemble computations) have been significantly improved during CRESTA.

HemeLB: Licensed under GPLv3 and available in a public code repository. CRESTA modifications included in the main trunk. Both performance and scalability have been significantly improved during CRESTA.

IFS: Proprietary license; contact details for applying for a license are given in the IFS section of this document. CRESTA modifications included in the main trunk. Both performance and scalability have been significantly improved during CRESTA.

NEK5000: Licensed under GPLv3 and available in a public code repository. CRESTA modifications for OpenACC included in the main trunk; AMR modifications to be included in the trunk as soon as a scalable pressure preconditioner has been implemented. By using OpenACC to offload computations to GPGPUs, performance was improved during CRESTA.

OpenFOAM: Licensed under GPLv3 and available in a public code repository. Development effort ceased after M24. No improvements made to the code trunk during CRESTA.

Note that this document is closely linked to CRESTA deliverable "D6.1.3 Roadmap to exascale (update 2)". D6.1.3 contains thorough performance analysis and a roadmap to reach exascale performance. This document considers the practical issues, such as how and from where to obtain the source code and how to apply or use the modifications implemented during CRESTA with the application.

 

The following documents are the deliverables from Year 2 of the CRESTA project. Many of these provide updates to the deliverables produced in Year 1.

D2.1.2 Architectural developments towards exascale

High Performance Computing (HPC) is a growing market. It is increasingly seen as vital for a nation's scientific and industrial competitiveness, and more countries are providing funding for research into HPC; China, for instance, has seen such significant growth over the last few years that the fastest machine in the world is now Chinese. The quest to build a supercomputer with Exascale performance requires significant technological advances, particularly given the limited power budget that such a machine will have. Such advances may come from the wider computing market, where the enormous growth in mobile computing is driving research into power-efficient technology, or from research funding specifically for HPC.

To understand what an Exascale machine may look like, it is informative to look at trends in relevant technology. Underlying trends in both semiconductor and communication technology drive advances across the computing landscape, leading to advances in the system building blocks: processors, memory, interconnect and software. Looking at company roadmaps, some trends become clear: first, the growth in heterogeneous systems involving different types of processor, such as a traditional general-purpose CPU combined with a GPU; second, the trend towards integration of components in System-on-Chip (SoC) silicon; and third, the growth in licensing intellectual property, such as processor designs, to other manufacturers.

Several factors are important when considering HPC system architecture trends, including performance, programmability and usability, power usage and efficiency, cost of procurement and cost of ownership. The TOP500 list provides 20 years' worth of data for analysing architecture trends. There has been a move towards using commodity components over custom technology; however, this has seen raw floating-point performance emphasised at the cost of improvements in memory, interconnect and I/O. An example of these architecture trends is provided by the development of the Cray XC30 system.

An Exascale machine is only useful if it has applications capable of using it. The CRESTA co-design applications provide an excellent source of information on the impact of architecture trends on application performance and design. Heterogeneous systems are seen as inevitable; however, they have to be easier for application developers to exploit. This will be achieved by providing better integration, particularly through a single addressable memory space, and more importantly through the provision of standard, well-supported programming models and languages. Highly parallel systems with millions of processors will need a matching high-performance interconnect to allow the system to be fully exploited by applications. Although the wider market may provide advances in power-efficient processor technology, funding for HPC-specific research into interconnect, programming models and application development will be required.

D2.3.1 Operating systems at the extreme scale

Standard commodity operating systems have evolved to serve the needs of desktop users and business application servers, which have very different requirements to HPC systems and applications. In general, commodity operating systems are not fit-for-purpose, even for current petascale machines, without extensive customisation.

The impact of operating system activities on application performance is not fully understood and is hard to predict. Many HPC systems are configured or customised by a trial-and-error approach, dealing with particular performance problems as they occur, rather than by applying a systematic method.

Specialised operating systems, developed specifically for HPC machines, trade rich functionality for high performance. Scalability is achieved by only implementing a subset of “normal” operating system services, which impairs the usability of the system by application programmers.

Design decisions for specialised HPC operating systems are often influenced by, and sometimes compromised by, design decisions for novel HPC hardware. One example is that the BlueGene/L hardware did not provide cache-coherency between the two processing cores in a node, which prevented the operating system from supporting shared-memory.

The desire to make specialised systems more usable encourages the re-introduction of functionality that can have a negative effect on performance and scalability. Thread scheduling was not supported by the BlueGene/P operating system but has been re-introduced in the BlueGene/Q operating system. This increases usability for application programmers but introduces a source of unpredictable load-imbalance that could reduce scalability, especially at extreme scale.

Specialised HPC operating systems have been continuously researched and developed for at least 20 years, driven (at least in part) by emergent trends in hardware design. Current systems demonstrate that excellent operating system scalability up to petascale is achievable. Although it is possible for major advances to be made in operating system development via disruptive technologies, currently there is no consensus on the direction required.

D2.4.1 Alternative use of fat nodes

This report summarises the work that was undertaken in Task 2.4 “Alternative use of fat nodes” as part of CRESTA’s WP2 on “Underpinning and cross-cutting technologies”. More specifically, the report presents research into different ideas for the use of fat nodes on future systems, ranging from practical to more speculative approaches:

  • Co-location of workloads;
  • Offload servers;
  • Background processing of MPI communication;
  • Micro-kernels.

D2.5.1 Fault agnostic and asynchronous algorithms at exascale

The number of components in HPC systems is set to increase significantly as their performance approaches the Exascale, which makes fault tolerance an increasingly important aspect of the design of these systems. Software-hardware co-design can be considered as a solution to these problems; on the software side, this includes the development of fault-tolerant algorithms. In general, this is a difficult problem, especially for faults in which part of the current state of a computation is lost. Other types of fault, however, do not involve such state loss: these include performance faults, where a component (e.g. a processor or network link) does not fail but performs at a slower rate than intended. Such faults are less catastrophic, but may be harder to detect.

Performance faults may not cause the computation to fail, but, for many algorithms, the synchronisation patterns mean that the whole computation runs at the speed of the slowest component. Asynchronous algorithms, which are often derived from synchronous counterparts by relaxing some or all of the synchronisation requirements, can be much more tolerant of performance faults, though likely at the expense of poorer convergence rates.

In this deliverable, we select two asynchronous algorithms for the solution of large sparse linear systems (Jacobi and block Jacobi), and, using simulated slow cores and slow links on a real HPC system, quantify their ability to maintain performance in the presence of such faults by comparing them to their synchronous counterparts.
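The idea behind these experiments can be illustrated with a small sketch (not the deliverable's implementation): a Jacobi iteration in which one "slow" component updates only every few sweeps, standing in for a degraded core or link, while the remaining components proceed using its stale value. The function name and parameters are assumptions made for this example.

```python
# Jacobi iteration for A x = b in which row `slow` lags behind: it
# refreshes its value only every `lag`-th sweep, mimicking a performance
# fault. The other rows keep iterating with whatever (stale) value the
# slow row last published, as in asynchronous/chaotic relaxation.

def jacobi_async(A, b, sweeps=200, slow=0, lag=4):
    n = len(b)
    x = [0.0] * n
    for k in range(sweeps):
        x_new = list(x)
        for i in range(n):
            if i == slow and k % lag != 0:
                continue          # slow component keeps its stale value
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - s) / A[i][i]
        x = x_new
    return x

# Diagonally dominant 3x3 system with exact solution (1, 1, 1); the
# lagging row slows convergence but does not prevent it.
A = [[4.0, 1.0, 1.0], [1.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
b = [6.0, 8.0, 9.0]
x = jacobi_async(A, b)
```

This captures, in miniature, the behaviour reported above: the iteration tolerates a component running at a fraction of full speed, at the price of needing more sweeps than its fully synchronous counterpart.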

Our findings do indeed show that the algorithms have strong resilience to such faults, even when the loss of component performance reaches an order of magnitude. However, in some cases we observe that the asynchronous algorithms can exhibit undesirable convergence behaviour, and that care needs to be taken to avoid this. Finally, we discuss how such algorithms may be of interest in the contexts of alternative uses of fat nodes and power management.

D3.2.2 Adaptive runtime support design document

Subtask 3.2.2 “Hybrid and adaptive runtime systems” is developing an experimental runtime system that will explore the power of adaptive runtime support for exascale applications.

In deliverable D3.1 “State of the art and gap analysis - Development environment”, CRESTA performed an analysis of existing approaches in the field as well as technical boundary conditions and requirements.

The deliverable D3.2.1 provided the design of a runtime system that aims to develop further approaches for dynamically adapting simulation applications to computer systems in the best possible way, and to extend such approaches to upcoming exascale architectures. It therefore proposed an adaptive runtime-support design in which simulation applications based on a task-orientated programming model with hierarchical tasks are combined with runtime performance analysis and runtime administration, enabling increased efficiency of large-scale numerical simulations.

This updated deliverable, D3.2.2, adds conclusions drawn from the ongoing implementation of the runtime administration and monitoring components. It points out that the overhead of the runtime system in a typical molecular dynamics simulation is expected to be about 5%, while still allowing noticeable performance improvements of the overall runtime. A new performance monitoring API has been developed with the aim of allowing the use of IPM in the runtime system with low overhead.

D3.3.2 Performance Analysis Tools design document

This document (“Performance Analysis Design Document, D3.3.2”) is an update of the previous D3.3.1, presenting possible designs, planned modifications and extensions to the existing application performance analysis tools Score-P and Vampir to address scalability and heterogeneity.

We describe the designs and extensions for the performance monitoring tool Score-P, i.e. the collection of different kinds of performance counters and their integration within the monitoring system, the reduction of the amount of data to address the scalability issues identified within the gap analysis (D3.1), and the extensions that will be made to address applications’ demands on heterogeneity. We specify the designs and extensions in terms of scalability and heterogeneity for the performance analysis and visualisation tool Vampir, then present how we ensure that any extensions we provide are well-tested and suitable for productive use, and finally address the state of fault tolerance.

D3.4.2 Debugging design document

This document describes designs, extension steps, and ideas that will allow the debugger Allinea DDT and the automatic runtime correctness tool MUST to adapt towards Exascale needs. We use deliverable D3.1 “State of the art and gap analysis” as a roadmap for these extensions. We extend the first version of this document from project month 10 and refine our designs and plans where we gained additional knowledge or feedback.

D3.5.2 Compiler support for exascale

A study of the performance of the computational kernels relevant to the Nek5000 CRESTA co-design application was completed last year (Compiler Support for Exascale, CRESTA Project Deliverable D3.5.1). This included a brief study of CPU performance and a more in-depth study of performance on a single GPU. The GPU study used the PGI compiler suite and OpenACC accelerator directives, coupled with auto-tuning compiler technology from the University of Edinburgh School of Informatics. A standalone benchmark version of the full Nek5000 application, called Nekbone, was subsequently ported to large-scale Cray GPU parallel systems using the Cray OpenACC compiler and then optimised by hand. The design of a CRESTA auto-tuning framework was also developed, and a prototype implementation produced. In this study we draw these three strands together and use the CRESTA auto-tuner on the Nekbone kernels to attempt to produce an optimised accelerated version for Cray hardware whose performance can be compared with the hand-optimised accelerator code. We also perform an in-depth investigation of a similar approach applied to the CPU version of Nekbone to enable comparisons of the auto-tuning procedures and performances achieved between CPU and GPU.

D4.2.1 Prediction Model for identifying limiting Hardware Factors

Hardware is one of the main factors to consider for the efficient use of massively parallel systems. It is also important to understand the main limiting factors that influence the efficiency of existing and developing programs. To successfully exploit an exascale system, both hardware and software need consideration.

The purpose of this document is to support the further implementation of the “exascale algorithms and solvers” library in CRESTA work package 4 (WP4). We have performed many tests on different platforms to determine their differences and their most important limiting factors.

D4.3.1 Initial prototype of exascale algorithms and solvers for project internal validation (Software)

Deliverable D4.3.1 is a software deliverable. This document describes the software, a prototype parallel numerical library targeted at Exascale systems.

As previously discussed in the WP4 deliverables “D4.1.1 Overview of major limiting factors of existing algorithms and libraries” and “D4.2.1 Prediction Model for identifying limiting Hardware Factors”, the Exascale is going to require an increase in the efficiency, in the sense of scalability and performance, of algorithms, due to the very large degree of parallelism that will be required. As well as efficient algorithms, highly efficient implementations of those algorithms are also required. In addition to the increase in the degree of parallelism, Exascale systems are expected to be significantly more complex than current systems, with many different levels of memory and communication hierarchies. This will make it very difficult to optimize codes for Exascale systems, and many codes will require significant rewriting to make the best use of them. The availability of parallel numerical libraries designed for Exascale systems should significantly reduce the development costs of this process. We have evaluated a number of existing numerical libraries that implement linear solvers (such as PETSc or Trilinos); though these are scalable on current hardware, they have not achieved, in our opinion, the highest possible efficiency (see more details in D4.1.1 and D4.2.1). In addition, current solver libraries do not properly address key issues at the Exascale, such as the overlap of communication and calculation. Though Fourier transforms are an important part of many simulations and node-local FFT libraries are widely used, most major applications implement their own distributed FFTs using a combination of node-local FFT libraries and explicit MPI communications. We believe that this is because the currently available parallel FFT libraries place too many constraints on the data decomposition of the rest of the application.

For all of the above reasons we are developing a new library (the CRESTA Exascale Library, CEL in short) addressing these two important classes of numerical problem: linear solvers and multi-dimensional Fourier transforms. This initial prototype of the library will form the basis for further testing and improvements. Ultimately the optimized library will be integrated with the CRESTA applications.

D4.4.1 Initial prototype for optimised reduction approaches for project internal validation (Software)

Collective reduction operations, such as the global summation of a collection of floating-point numbers, are important in numerical simulations and are used, for instance, as convergence criteria to control iterative numerical solvers.

The summation of floating-point numbers in particular suffers from inaccuracy due to limited numerical precision and round-off errors. While there are numerical schemes to mitigate these effects, such as the Kahan summation algorithm, collective summations in an MPI application are beyond the control of the user and may introduce large errors for large numbers of MPI ranks. However, it cannot be determined a priori whether an application is affected by these numerical inaccuracies, and to what extent; the user needs to verify this, possibly for every input data set.
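For reference, the Kahan summation scheme mentioned above can be sketched as follows. This is a textbook illustration of compensated summation, not the deliverable's library code.

```python
# Kahan (compensated) summation: a running compensation term recovers
# the low-order bits that plain accumulation discards at every addition.

def kahan_sum(values):
    total = 0.0
    comp = 0.0                    # running compensation for lost bits
    for v in values:
        y = v - comp              # re-apply previously lost bits
        t = total + y             # low-order bits of y may be lost here
        comp = (t - total) - y    # ...and are recovered into comp
        total = t
    return total

# One million terms of 0.1: plain accumulation drifts noticeably,
# compensated summation stays within a few ulps of the exact result.
values = [0.1] * 1_000_000
naive = sum(values)
compensated = kahan_sum(values)
```

The same trade-off noted later in this summary applies here: the compensated loop performs roughly four floating-point operations per term instead of one, exchanging speed for accuracy.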

As we move to Exascale computing, with possibly millions of MPI processes, the number of terms in a summation reduction approaches the limit where numerical errors reach a level that can no longer be disregarded a priori. We have developed a prototypical version of a library that allows the user to replace the MPI collective reduction, specifically for summation, with a high-precision version. This can be used to test whether an application or a use case is affected by inaccuracies in the MPI summations.

However, this will only show differences due to inaccuracies in the MPI part, not in the computations done locally. To analyse those, we also provide a set of routines for the local summation of a vector of values at high precision. These routines can be used by the application developer in critical sections of the code. It is worth noting that the high-precision optimized version of MPI reduce, as well as the routines for local summation, are slower than their standard counterparts. The user thus needs to trade off performance for accuracy on a case-by-case basis.

This deliverable has not evaluated possible support in the networking hardware for summation or other reduction operations. Nonetheless, we recommend adding high-precision computing capabilities to the networking interfaces of future Exascale systems, as well as using high-precision buffers for data transport in reduction operations. With dedicated hardware, the performance impact should be minimal.

D4.5.2 Microbenchmark Suite (Software)

Task 4.5 is concerned, amongst other things, with the optimisation of collective communication operations. Collective operations involve multiple participants, rather than only two as in point-to-point communication operations. Examples of collective operations are synchronisation barriers, or reductions over the full computational domain in order to find the sum/min/max of a particular quantity. Such operations are very common in most distributed applications.

This document briefly describes the software deliverable, the Collectives Microbenchmark Suite, which is used within the CRESTA project firstly to assess the progress of the optimisation work on collective operations, and secondly as a tool to analyse the characteristics of collective implementations. For users or developers of parallel applications, the benchmark suite may help in assessing which implementation of a collective should be chosen in a specific use case.
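The basic methodology of such a microbenchmark, repeating an operation many times and keeping the least-perturbed timing, can be sketched as follows. The real suite times MPI collectives across ranks; this illustrative stand-in times local reductions, and the function names are assumptions made for the example.

```python
# Minimal microbenchmark harness: run an operation repeatedly and report
# the minimum wall-clock time, which filters out OS noise and cache
# warm-up effects better than a mean would.

import time

def benchmark(op, repeats=100):
    """Return the minimum wall-clock time of `op` over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        op()
        best = min(best, time.perf_counter() - t0)
    return best

# Compare two reduction-like operations over the same data, the kind of
# side-by-side characterisation a collectives benchmark suite enables.
data = list(range(100_000))
t_sum = benchmark(lambda: sum(data))
t_max = benchmark(lambda: max(data))
```

In the suite itself, the operations under test would be alternative implementations of the same collective (e.g. different allreduce variants), so that their timings can be compared on equal footing for a given message size and rank count.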

D4.5.3 Non-Blocking Collectives Runtime Library (Software)

Most algorithms of scientific computing involve communication patterns that are performed collectively across a large number of processing elements. Hence, the scalability of many applications is often bound by collective operations, in which even minor load imbalances or other inefficiencies can cause a stall across a significant number of processes. This also holds for most of the CRESTA co-design applications.

In order to scale applications to hundreds of thousands of cores, new approaches for collective communication will be needed. These could include, for example, the use of asynchronous algorithms in combination with remote-memory access (also called one-sided) operations, especially when supported by hardware; the utilization of non-blocking collectives that allow the communication overhead to be overlapped with computation; and the optimization of communication patterns to improve concurrency while avoiding interconnect contention.

This document describes a platform for studying scalability bottlenecks caused by collective operations: the CRESTA Collective Communication Library. It allows an application developer to experiment with various alternative implementations of a particular set of collectives with minimal changes to the application source. In addition to the traditional collective operations of the Message Passing Interface (MPI) library, these implementations include the non-blocking collectives introduced in the most recent version of the MPI standard, collectives implemented with partitioned global address space (PGAS) languages (currently only with Fortran coarrays), and collectives implemented with the remote-memory access operations (also referred to as one-sided communication) available in the MPI library. Furthermore, the library defines an application programming interface (API), here referred to as the split-phase API, in which the initiation and finalization of a collective operation are performed in separate stages, allowing other work to be performed while the collective communication occurs in the background. The library itself is free software.
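A split-phase interface in the spirit described above might look as follows. The class and method names are hypothetical, and a background thread stands in for the non-blocking MPI collective that the real library would initiate.

```python
# Sketch of a split-phase collective: `start` launches the operation in
# the background and returns a handle; `wait` finalises it. Between the
# two calls, the caller is free to do useful computation.

import threading

class SplitPhaseSum:
    """Hypothetical split-phase reduction handle (illustration only)."""

    def __init__(self, values):
        self._result = None
        self._thread = threading.Thread(target=self._run, args=(values,))

    def _run(self, values):
        self._result = sum(values)   # the "collective" runs here

    def start(self):                 # initiation stage
        self._thread.start()
        return self

    def wait(self):                  # finalization stage
        self._thread.join()
        return self._result

handle = SplitPhaseSum(range(1_000_000)).start()
overlap = [i * i for i in range(10)]     # work done during the reduction
total = handle.wait()
```

The essential property is that the time between `start` and `wait` is available for computation, which is exactly the communication/computation overlap that non-blocking MPI collectives (e.g. the MPI 3.0 `MPI_Iallreduce`/`MPI_Wait` pair) provide.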

D5.1.3 Pre-processing: first prototype tools for exascale mesh partitioning and mesh analysis available

This deliverable is a software deliverable, providing a first prototype interface for pre-processing steering named PPStee. In this document we provide a brief overview of the software. PPStee feeds graph or mesh data, together with the communication costs and workloads of all simulation loop components, into the simulation cycle. It uses state-of-the-art partitioning libraries to balance the overall simulation load and can be extended with further functionality, such as mesh manipulation methods or a connection to a fault tolerance framework.

In this document we sketch the features and properties of PPStee and show the advantages and disadvantages of its architecture. We illustrate the integration into a simulation work flow, in terms of both the data flow in combination with PPStee and the actual implementation, using a basic usage example. We also point out the current software status and future work.

D5.1.4 Pre-processing: revision of system, data format and algorithms definition for exascale systems

In CRESTA Deliverable 5.1.1, we analysed the current situation of simulations regarding pre-processing and gave a system definition: the main aim is a tighter simulation cycle that includes all simulation parts and an improved overall simulation load balance. CRESTA Deliverable 5.1.2 studied the algorithms of the partitioning libraries used for pre-processing so far and pointed out basic properties required of the graph data format. These requirements culminated in the development of the prototype pre-processing steering interface PPStee, introduced in CRESTA Deliverable 5.1.3.

Here we review the design of PPStee and collect performance data to evaluate this prototype tool. The integration of PPStee into HemeLB was relatively simple, as intended, and allows for performance tests of HemeLB with various geometries and all three partitioners supported by PPStee: ParMETIS, PTScotch and Zoltan. Runtime measurements with up to 2048 cores on HECToR are presented as first results. PPStee’s runtime overhead is negligible, so PPStee can be used without a priori drawbacks. The configuration using PTScotch performs, in general, slightly worse and reveals scalability issues starting at 512 cores. HemeLB with PPStee using Zoltan suffers from a constant loss in runtime; the reason is not yet known. Further investigations will, in particular, focus on graph data conversion, scalability and the usage of partitioner-specific routines and parameters to enable a better match to specific simulation data.

Lastly, we address CRESTA’s co-design vehicle OpenFOAM. Simulations using OpenFOAM are not a priori suited to the application of PPStee, because OpenFOAM is a collection of separate solver and utility tools. However, PPStee may be applicable if each phase of a simulation using the OpenFOAM framework is aggregated into one monolithic program. The OpenFOAM co-design team is currently investigating the feasibility of such a monolithic program.

D5.2.3 Post-processing: first prototype tools for exascale interactive data exploration and visualisation

This deliverable is a software deliverable, providing a software prototype for exascale interactive data exploration and visualisation. The main purpose of this associated report is to present and document the prototype software, which has been developed using a co-design process with the HemeLB code.

This software builds on two previous deliverables associated with Task 5.2 within Work Package 5 (D5.2.1 and D5.2.2). These described and studied the challenges, system requirements, system architecture and data structure for exascale data post-processing. The first two deliverables served as a theoretical foundation for the upcoming software development and design.

In accordance with the previous deliverables, this software aims to provide in-situ processing of the simulation data, interactive visualisation for exascale CFD simulations, and computational steering of the on-going simulation. In-situ post-processing and interactive visualisation give the user the possibility to explore the simulation results on the fly, while computational steering allows the user to change and modify an on-going simulation by modifying simulation parameters.
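The steering idea can be sketched minimally: between time steps, the simulation drains a queue of parameter updates sent by a client. All names below are illustrative assumptions, not the HemeLB steering interface.

```python
# Toy illustration of computational steering: between time steps the
# simulation applies pending parameter updates from a client queue.
# Illustrative only; not the actual HemeLB steering API.
import queue

def run_steered(steps, params, updates):
    """Advance a dummy simulation, applying steering updates between steps."""
    history = []
    for step in range(steps):
        while True:                      # drain pending steering messages
            try:
                key, value = updates.get_nowait()
            except queue.Empty:
                break
            params[key] = value
        history.append((step, params["dt"]))
        # ... compute one time step with the current parameters ...
    return history

updates = queue.Queue()
updates.put(("dt", 0.5))                 # client halves the time step
history = run_steered(3, {"dt": 1.0}, updates)
print(history)  # [(0, 0.5), (1, 0.5), (2, 0.5)]
```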

In this deliverable, we deliver a software prototype which was co-designed and integrated into HemeLB. This prototype provides a fundamental structure of interactive data post-processing for HemeLB, allowing developers to evaluate the design of our proposed post-processing system and visualisation algorithms. We present an initial attempt to visualise a HemeLB simulation with a newly implemented visualisation and steering client. We also outline future plans and on-going work regarding software implementation.

D5.2.4 Post-processing: revision of system, data format and algorithms definition for exascale systems

This deliverable reviews the system definition and post-processing algorithms that were proposed in work package task 5.2.1. We study and evaluate the post-processing system architecture as well as the visualization algorithms. To evaluate the reliability and compliance of the system, we test the prototype post-processing tools that were co-designed for HemeLB.

Interactive data exploration and visualization are two major goals in exascale data post-processing. While pre-processing of the simulation focuses on mesh creation and partitioning, post-processing of the simulation is targeted at providing visualisations of the simulation outputs, which serves as a tool to explore and analyse the simulation results.

In an exascale environment, real-time visualisation of the simulation mesh, its partitioning and intermediate simulation results is important for an on-going simulation. This not only makes it possible to analyse intermediate simulation results, but also enables the user to detect and foresee failures in a running simulation process. In work package 5, we focused on developing user tools which provide in-situ and interactive post-processing for analysing running simulations.

In the past months of the on-going project, work package 5 has established a major co-design collaboration around HemeLB, a lattice-Boltzmann-based fluid dynamics simulation code. We developed our post-processing system and algorithms to provide interactive and in-situ visualization for HemeLB. However, the proposed ideas and algorithms are not limited to this single type of solver; they can also be applied to other fluid simulation solvers such as OpenFOAM.

D5.3.3 Remote hybrid rendering: first prototype tool

This document accompanies the software delivered as the first prototype of remote hybrid rendering.

Remote hybrid rendering is used to access remote exascale simulations from immersive projection environments over the Internet. The display system may range from a desktop computer to an immersive virtual environment such as a CAVE. The display system forwards user input to the visualisation cluster, which uses highly scalable methods to render images of the post-processed simulation data and returns them to the display system. The display system enriches these with context information before they are shown.

Together with the documentation extracted from the source code in the appendix, this document describes the first prototype for remote hybrid rendering. It has been implemented as plug-ins to the virtual reality renderer OpenCOVER of the visualization system COVISE. The source code of these plug-ins is open and can be retrieved from the CRESTA project subversion repository.

While implementing the prototype, some changes to the protocol draft for remote hybrid rendering became necessary.

Future versions of the tool will be improved regarding bandwidth requirements and scalability.

D5.3.4 Remote hybrid rendering: revision of system, protocol definition for exascale systems

Remote hybrid rendering (RHR) is developed to access remote exascale simulations from immersive projection environments over the Internet. The display system may range from a desktop computer to an immersive virtual environment such as a CAVE. The display system forwards user input to the visualisation cluster, which uses highly scalable methods to render images of the post-processed simulation data and returns them to the display system. The display system enriches these with context information before they are shown. This technique decouples interaction from rendering of large data and is able to cope with growing data set sizes as the amount of data transfer scales with the size of the output images.

Since D5.3.3, a prototype of RHR is available. This document describes its implementation and the algorithms developed for this prototype, especially for compressing depth images. Also, while implementing the prototype, some changes to the protocol draft in D5.3.2 for RHR became necessary. This document lists the necessary revisions. In addition, the performance of the prototype is examined.
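As a hedged illustration of why depth images compress well, note that composited depth buffers typically contain long runs of identical far-plane background values. The run-length coder below is a minimal stand-in for that observation only; the prototype's actual depth compression algorithm described in this deliverable is more sophisticated.

```python
# Minimal run-length encoding of a quantised depth scanline. Depth
# images contain long runs of far-plane background, which RLE-style
# schemes exploit. Illustrative stand-in, not the RHR prototype's
# actual depth compression.

def rle_encode(depths):
    """Encode a flat list of depth values as (value, run_length) pairs."""
    runs = []
    for d in depths:
        if runs and runs[-1][0] == d:
            runs[-1][1] += 1
        else:
            runs.append([d, 1])
    return [tuple(r) for r in runs]

def rle_decode(runs):
    return [d for d, n in runs for _ in range(n)]

# Mostly-background scanline (255 = far plane):
scanline = [255] * 6 + [17, 17, 42] + [255] * 3
encoded = rle_encode(scanline)
print(encoded)            # [(255, 6), (17, 2), (42, 1), (255, 3)]
assert rle_decode(encoded) == scanline
```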

Future versions of the RHR tool will be improved regarding bandwidth requirements and scalability.

D6.1.2 Roadmap to exascale (Update 1)

This document contains an update to the initial roadmap for the CRESTA codes described in Deliverable D6.1.1. The main progress and main updates to the original roadmaps for the separate codes are summarized in Section 1.1. Actions related to co-design progress for each application are summarized in Section 1.2.

D2.1.1 Architectural developments towards exascale

Technological evolution can be thought of as a combination of incremental and disruptive changes. Predicting future evolution is hard, but can be attempted by a combination of extrapolating the incremental trends and identifying potentially disruptive technologies.

Evolution of hardware architectures to the Exascale is likely to be dominated by power consumption.

Power considerations will limit clock speed, so Exascale performance will only be achievable via an increase in parallelism rather than by any significant increase in the speed of individual operations. This is a key concern given the difficulty that many current applications have in achieving good parallel scaling on current Petascale systems.

Memory performance is a key factor in determining the performance of applications on current systems. Though there are promising developments in memory technology that might go some way towards addressing the memory wall, memory will continue to be one of the key system parameters at the Exascale. Memory is expected to contribute an increased fraction of the total power costs, and so the ratio of memory capacity to computational capability is expected to be much lower than in current systems.

The electrical communications between different parts of a node are expected to be a more significant fraction of the overall time and energy costs. As a result node architectures are expected to become more hierarchical and memory access times within a node are expected to become significantly more non-uniform so applications will not only need to exhibit a high degree of parallelism but also a high degree of locality to make good use of these systems.

Inter node communications will have to utilise optical technology to achieve acceptable performance within a reasonable power budget.

D2.2.1 Simulating and modeling exascale technology

Simulation and modeling are important tools in the development of exascale systems. There are very few other mechanisms for evaluating our designs for exascale hardware and software other than developing models of their behavior and simulating these models in a computer. The behavior of both hardware and software needs to be modeled.

In the early stages of the design process these models need to be quite simple and abstract. This allows us to develop and evolve our designs quickly and efficiently. If we attempt to use overly complex models in these early design stages then we will waste time and resources performing overly detailed simulations of design choices that will be abandoned before the final system is built.

D2.6.1 CRESTA benchmark suite

Benchmarks are widely used in High Performance Computing (HPC). They are used to measure system performance, either to get a general indication of the system’s technical capabilities, or to gauge its suitability to a particular application. They are also used to assess the performance of individual applications across a range of systems and HPC architectures. HPC benchmarks therefore cover a wide range of measurements, from low-level computation and communication operations, through computational kernels up to full applications.

In the context of the CRESTA project, benchmarks are useful in two different areas. The first of these is in understanding the impact of changes to the system-ware on application performance. This requires an assessment of the impact of the changes on the area targeted by them, for example reducing communication latency. However, key to CRESTA are the six co-design applications; therefore it is also important to understand the impact of system-ware changes on the application performance itself. Secondly, benchmarks are useful to the application developers. They provide a means to assess the impact of changes to the applications, and may also be used to inform design changes in the quest for exascale performance, for instance algorithm choice or the use of different programming models.

D2.6.2 Best practise in performance analysis and optimisation

This document describes the best practices in performance analysis and optimisation defined in Task 2.6.2 of WP 2 of the CRESTA project. It should guide application developers in the process of tuning and optimising their codes for performance. It focuses on application performance optimisation and analysis, and describes which application performance monitoring techniques should be used in which situation, which performance issues may occur, how these issues can be detected, and which tools should be used in which order to accomplish common performance analysis tasks.

Furthermore, this document presents the application performance analysis tools of the CRESTA project, Score-P and Vampir. Scalasca, a profile analysis tool that works on Score-P measurements, is also presented to provide a complete workflow of performance analysis tools.

In general, application performance optimisation and analysis starts with lightweight monitoring of the application, either by a job-monitoring tool or by coarse-grained sample-based profiling, to identify potential problem areas such as communication, memory, or I/O. Afterwards, more detailed call-path profiling should be used to identify phases and functions of interest, and to locate the main performance problems. These functions and regions of interest can then be analysed in more detail using selective event tracing.
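The call-path profiling stage of this workflow can be illustrated with a toy profiler that attributes call counts to call paths. Score-P, Vampir and Scalasca do this far more efficiently and record timings rather than mere counts; the sketch below only demonstrates the principle, and all names in it are illustrative.

```python
# Toy call-path profiler: count how often each call path is entered,
# to locate functions of interest. Illustrative only -- real tools
# (Score-P etc.) measure time and operate with far lower overhead.
import sys
from collections import Counter

def profile_calls(func, *args):
    counts, stack = Counter(), []
    def tracer(frame, event, arg):
        if event == "call":
            stack.append(frame.f_code.co_name)
            counts["/".join(stack)] += 1
        elif event == "return" and stack:
            stack.pop()
    sys.setprofile(tracer)
    try:
        func(*args)
    finally:
        sys.setprofile(None)
    return counts

def inner(x):
    return x * x

def outer(n):
    total = 0
    for i in range(n):
        total += inner(i)
    return total

counts = profile_calls(outer, 4)
print(counts["outer"], counts["outer/inner"])  # 1 4
```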

D3.1 State of the art and gap analysis

Development environments provide the tools to ease the implementation of scientific algorithms in computer codes, enable applications to run efficiently on parallel supercomputers, allow fast and non-invasive performance monitoring and analysis, and permit prompt detection of code errors. This document presents the state of the art for scientific programming development environments, discusses requirements for exascale computing, and outlines the future work of the CRESTA development environment work-package to enable the CRESTA co-design applications to achieve exascale performance.

D3.2.1 Adaptive runtime support design document

This deliverable provides the design of a runtime system that aims to adapt simulation applications dynamically, in the best way, to computer systems, and to extend such approaches to upcoming exascale architectures. It therefore proposes an adaptive runtime-support design in which simulation applications based on a task-oriented programming model with hierarchical tasks are combined with runtime-supporting performance analysis and runtime administration, enabling increased efficiency of large-scale numerical simulations.

D3.3.1 Performance Analysis Tools design document

In this document we present possible designs, planned modifications and extensions to the existing application performance analysis tools Score-P [1] and Vampir [2] to address scalability and heterogeneity.

We organised the document as follows: in Section 2 we describe the designs and extensions for the performance monitoring tool Score-P, i.e. the collection of different kinds of performance counters and their integration within the monitoring system, the reduction of the amount of data to address the scalability issues identified within the gap analysis (D3.1), and the extensions that will be made to address applications' demands on heterogeneity. In Section 3 we specify the designs and extensions, in terms of scalability and heterogeneity, of the performance analysis and visualisation tool Vampir. In Section 4 we present how we ensure that any extensions we provide are well-tested and suitable for productive use. Finally, in Section 5 we address the state of fault-tolerance.

D3.4.1 Debugging design document

This document describes designs, extension steps, and ideas that will allow the debugger Allinea DDT and the automatic runtime correctness tool MUST to cope with exascale needs. We use deliverable D3.1 “State of the art and gap analysis” as a roadmap for these extensions. We will provide a second version of this document in project month 24 to extend this document and add detail or new knowledge where necessary.

D3.5.1 Compiler support for exascale

A survey of compiler technologies relevant to exascale was performed in a previous CRESTA deliverable [1]. Two particular issues that were identified were the growing requirement for compilers to support GPU accelerators, and the possible advantages of auto-tuning to produce better-performing code on today's increasingly complex and heterogeneous processors. Motivated by this, we study key kernels from one of the CRESTA co-design applications, Nek5000. We investigate to what extent user-level source modifications affect performance at different levels of optimisation across a range of compilers. We then study new GPU-enabled versions of the kernels, written for this study using the new OpenACC standard for accelerator directives, to explore current compiler capabilities for heterogeneous architectures. Finally, we attempt to optimise the performance of OpenACC using auto-tuning technology developed at the University of Edinburgh.

D3.6.1 Domain Specific Language (DSL) for expressing parallel auto-tuning specification

This document describes a domain-specific language (DSL) that serves as the central component of an autotuning framework for the tuning of parallel applications. We describe what features the DSL was designed to provide, how it fits within a wider autotuning framework and outline the initial implementation.

D4.1.1 Overview of major limiting factors of existing algorithms and libraries

In this deliverable, we give an overview of the main external libraries that are of importance for the CRESTA Co-design applications. This document presents the state of the art for the use of external libraries in Co-design applications, and presents new methods for their use. The problems faced today and in the future are described. A list and associated brief explanation are provided for algorithms that are currently being used or are under development. Some of the implementations have already been tested on most modern platforms. For this purpose, Cray XE6 and NEC Nehalem platforms have been utilised. The deliverable shows how the algorithms can be described mathematically to study their suitability for future exascale systems. Statistics have been collected for the verification of such models, which can be expanded at any time, depending on the requirements of future models. Finally there is a description of the CRESTA SVN repository for WP4, which should unite the software that we are going to use and develop in the CRESTA project.

D5.1.1 Pre-processing: analysis and system definition for exascale systems

As a first step towards developing proper pre-processing for massively parallel simulations, one has to evaluate the current status of various pre-processing tasks and how they perform on, and interact with, a possible exascale system. Only with this information at hand is it possible to effectively derive a road map of how and where further effort has to be expended to achieve exascale-ready pre-processing techniques by the end of the CRESTA project.

We outline pre-processing, covering initialisation, disk I/O and mesh partitioning. Until now, most work has gone into mesh partitioning, where multilevel methods currently dominate due to their computational efficiency and are thus omnipresent. All partitioning tools under closer inspection, ParMETIS, PTScotch and Zoltan, use multilevel methods; therefore all these libraries share the uncertainty of whether they scale to more than several thousand cores. Consequently, an improvement of these methods is reflected in the requirements on future pre-processing to achieve load balancing in an exascale environment.

D5.1.2 Pre-processing: data format (hierarchical, regions of interest) and algorithms definition

With exascale computing, pre-processing becomes ever more important in order to increase overall performance and thus to lower costs. In this deliverable we focus on the usage and usability of partitioners in an exascale environment. Specifically, the data structures used, the underlying algorithms and coupling to other parts of simulation codes are tackled.

D5.2.1 Post-processing: analysis and system definition for exascale systems

This deliverable gives an overview of how post-processing can be designed and implemented to support interactive data exploration and visualisation in exascale environments. Post-processing is concerned with extracting visualisation primitives from solver results, for example computing a stream surface given a flow field. The related workpackage task 5.3 handles the subsequent rendering of these primitives to produce a screen representation (cf. [1]). Within workpackage task 5.1, methods for pre-processing are explored which partition an input data set to minimise communication in the solver phase (cf. [2]). The post-processing algorithms then have to operate on data in the given partitioning.

While the performance of extreme-scale simulations is a key aspect, pre- and post-processing are additional important steps. Mesh creation and partitioning define the accuracy of the simulation results, and visualisation is used to finally analyse the simulated phenomena.

Furthermore, real-time visualisation of the simulation mesh, its partitioning and intermediate simulation results is also important during a simulation run. This not only makes it possible to analyse intermediate results, but also to detect and solve problems arising during a simulation, avoiding the waste of large amounts of CPU time on inappropriate parameter configurations.

Interactive visualisation during runtime, so-called online monitoring, comes with additional challenges and requires a combination of the different solutions described in this document.

D5.2.2 Post-processing: data format (hierarchical, multi-resolution) and algorithms definition

This deliverable focuses on the data structures and algorithms in post-processing exascale simulation data. We also present our co-design work with workpackage 6.

A general overview of data structures and post-processing algorithms as applied to exascale data sets is presented. We elaborate and discuss the advantages and disadvantages of each algorithm before applying them, as an exemplar, to the HemeLB application from workpackage 6.

D5.3.1 Remote hybrid rendering: analysis and system definition for exascale systems

In this work, we introduce remote hybrid rendering strategies to make exascale resources available for interactive visualisation of large-scale numerical simulation data. We use the CRESTA co-design vehicles OpenFOAM [14] and HemeLB [13] to work out system requirements for a remote rendering solution in large-scale cluster environments. We compare different rendering strategies based on the Sort-Last algorithm according to their suitability for compositing in large-scale cluster environments on the road to exascale systems. Sort-Last was selected since rendering of large-scale distributed data generated by distributed post-processing systems is inherently supported by the algorithm, with adequate communication overhead between the partitions. There are multiple possibilities for compositing of rendered images in the Sort-Last algorithm. The Binary Swap algorithm and its derivative, 2-3 Swap, were identified as promising candidates for Sort-Last compositing of interactive renderings in large-scale clusters. We propose various additions to existing remote rendering architectures to minimise the network bandwidth needed in the system during the compositing phase.
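The per-pixel operation that Sort-Last compositing performs, and that Binary Swap and 2-3 Swap distribute across ranks, can be sketched serially as a depth test per pixel. This is an illustrative sketch of the compositing operator only, not the distributed exchange pattern itself.

```python
# Serial sketch of the Sort-Last compositing operator: for each pixel,
# keep the fragment with the smallest depth across all ranks' partial
# renderings. Binary Swap / 2-3 Swap distribute exactly this per-pixel
# reduction across ranks; here it is shown directly for clarity.

def depth_composite(images):
    """images: list of (colour, depth) pairs, one per rank; equal lengths."""
    width = len(images[0][0])
    colour_out, depth_out = [None] * width, [float("inf")] * width
    for colour, depth in images:
        for i in range(width):
            if depth[i] < depth_out[i]:      # nearer fragment wins
                colour_out[i], depth_out[i] = colour[i], depth[i]
    return colour_out, depth_out

# Two ranks, three pixels each; inf = background (nothing rendered):
rank0 = (["red", "bg", "blue"], [0.3, float("inf"), 0.7])
rank1 = (["bg", "green", "cyan"], [float("inf"), 0.5, 0.2])
colour, depth = depth_composite([rank0, rank1])
print(colour)  # ['red', 'green', 'cyan']
```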

Concluding this deliverable, a software architecture for remote parallel rendering in large-scale cluster environments is defined, consisting of a parallel rendering component and a parallel compositor component. Different modes of communication between these two components are sketched that have to be supported for interactive, low-latency post-processing and rendering. These modes of communication lead to the definition of an interface between these two components, along with the definition of an interface between the post-processing environment (see WP 5.2) and the remote hybrid rendering solution that is going to be developed in WP 5.3.

D5.3.2 Remote hybrid rendering: protocol definition for exascale systems

In this work we define a protocol for remote hybrid rendering. Remote hybrid rendering is used to access remote exascale simulations from immersive projection environments over the Internet. This protocol is used for the flow of communication between the head node of a visualisation cluster and the head node of a display system. The display system may range from a desktop computer to an immersive virtual environment such as a CAVE (Cave Automatic Virtual Environment). The display system forwards user input to the visualisation cluster, which uses highly scalable methods to render images of the post-processed simulation data and returns them to the display system. The display system enriches these with context information before they are shown.

D6.1.1 Roadmap to exascale (Initial release)

The ‘Roadmap to Exascale’ contains an overview of how the CRESTA application codes may be developed such that they can take advantage of future computers with computational capacity in the exaflop/second realm. The summaries for the codes are as follows:

ELMFIRE: needs to be developed so that it includes a true domain decomposition scheme. With the present algorithm it does not seem likely that the code can run efficiently on exascale systems, as it would require an unrealistic amount of memory. The data handling also needs to be parallelised in order to cope with exascale data.

GROMACS: is going to be developed towards exascale according to three different strategies: (i) improving wall-clock time per iteration, (ii) soft-scaling improvements for large simulations, and (iii) an ‘ensemble’ approach for efficient simulation of systems with large fluctuations, for which large amounts of statistics and/or optimisation are needed. These approaches include efficient implementation of GP-GPU algorithms, O(N) FFT and parallel I/O.

OpenFOAM: At EPCC the development will focus on the most recent version of OpenFOAM from the OpenFOAM foundation [1].

At the Institute of Fluid Mechanics and Hydraulic Machinery, University of Stuttgart, version OpenFOAM-extend-1.6 [2] will be used. The code has been extended with additional features, including the General Grid Interface (GGI). Through these extensions OpenFOAM has become a powerful open-source CFD software package specialised in turbomachinery. The interest here is to simulate a whole hydraulic machine on exascale architectures with the OpenFOAM version mentioned above.

NEK5000: will be developed towards exascale scalability by implementing new theoretical solutions for parallelism such as adaptive refinement, alternative discretisations and hybrid parallelisation. Extra care will be taken with respect to exascale boundary conditions, data handling and load balancing.

IFS: will be developed by utilizing Fortran co-arrays to overlap calculations and communication for Legendre transforms and semi-Lagrangian halo calculations. Load balancing for Fourier transforms is another area that will be optimized.

HemeLB: is going to need an improved visualization scheme for handling exascale data sets. Also the meshing procedure will be improved.

D6.2 Needs analysis

This ‘Needs analysis’ contains an overview of what the CRESTA applications codes need in order to be developed towards the exaflop/second realm. The summaries of the requirements for the co-design codes are as follows:

ELMFIRE: would benefit from real-time visualisation, automated hardware failure recovery, an exascale linear solver and file I/O, exascale profiling and debugging. These aspects are being dealt with together with other work packages within the project.

GROMACS: The main challenge for Gromacs is to improve scaling for lattice summation electrostatic interactions through more efficient FFT libraries, or by using completely different algorithms that do not involve all-to-all communication. For the remaining part of the code, the requirements are more focused on improving parallelism on all levels, including tasks over CPUs/cores/GPUs, and improving IO data handling for extremely large systems. For small systems, the exascale needs will have to be addressed with ensemble techniques that do not require synchronization between all nodes for each time step.

OpenFOAM: OpenFOAM® is considered to scale well to thousands of cores. Some important use cases need the OpenFOAM®-extend versions, including the GGI (General Grid Interface). The GGI, a possible bottleneck, seems to work well after maintenance carried out a few months ago by the OpenFOAM®-extend community. OpenFOAM®-extend, however, still has to be prepared for exascale architectures.

NEK5000: needs integration of the pre- and post-processing codes, p-type mesh refinement, hybrid parallelism, parallel file I/O, and an investigation of the exascale performance of BLAS, among other things. These issues are being addressed together with the other WPs.

IFS: needs to take advantage of Fortran co-arrays. For a proper exascale implementation, the code needs profiling and debugging tools that can handle this type of data structure. Furthermore, Fourier transforms and multi-grid methods will be developed in cooperation with other WPs.

HEMELB: would benefit from, e.g., exascale visualization, debugging and profiling tools, fault-tolerant MPI and dynamic domain decomposition. These issues will be investigated in cooperation with other WPs.