Invited Speakers

Mark Parsons, CRESTA

Title: CRESTA: Collaborative Research into Exascale Systemware, Tools and Applications

The CRESTA project uses a novel approach to exascale system co-design which focuses on the use of a small, representative set of applications to inform and guide software and systemware developments. By limiting our work to a small set of representative applications, we are developing key insights into the necessary changes to applications and system software required to compute at this scale.  In CRESTA, we recognise that incremental improvements are simply not enough and we need to look at disruptive changes to the HPC software stack from the operating system, through tools and libraries, to the applications themselves. In this talk I will present an overview of our work in this area.

Dana Knoll, LANL

Title: CoCoMANS: Computational Co-design for Multi-scale Applications in the Natural Sciences

CoCoMANS is a three-year project at Los Alamos National Laboratory (LANL) intended to advance LANL's understanding of useful computational co-design practices and processes. The exascale future will bring many challenges to current practices and processes in general large-scale computational science. The CoCoMANS project is meeting these challenges by forging a qualitatively new predictive-science capability that exploits evolving high-performance computer architectures for multiple national-security-critical application areas, including materials, plasmas, and climate, by simultaneously evolving the four corners of science, methods, software, and hardware in an integrated computational co-design process. We are developing new applications-based, self-consistent, two-way, scale-bridging methods that have broad applicability to the targeted science. These algorithms will map well to emerging heterogeneous computing models (while concurrently guiding hardware and software to maturity), and provide the algorithmic acceleration necessary to probe new scientific challenges at unprecedented scales. Expected outcomes of the CoCoMANS project include 1) demonstrating a paradigm shift in the pursuit of world-class science at LANL by bringing a wide spectrum of pertinent expertise in the computer and natural sciences into play; 2) delivering a documented computational co-design knowledge base, built upon an evolving software infrastructure; 3) fostering broad, long-term research collaborations with appropriate hardware partners; and 4) deepening our understanding of, and experience base with, consistent, two-way, scale-bridging algorithms. In this talk we will provide an overview of the CoCoMANS project and present some results from the project to date.

Giovanni Lapenta, KU Leuven

Title: Codesign effort centred on space weather

Space weather, the study of the highly dynamic conditions in the coupled system formed by the Sun, the Earth and interplanetary space, provides an exciting challenge for computer science. Space weather brings together the modeling of multiple physical processes happening in a system with hugely varying local scales and with a multiplicity of concurrent processes. The high variety of methods and processes provides a testing ground for new and upcoming computer architectures. Our approach is based on taking one powerful and widely used approach, the particle-in-cell method, as our central reference point. We then investigate how it performs on existing and upcoming platforms, and we reconsider its design and implementation practice with an eye towards innovative (and even revolutionary) formulations centered around co-design: can we rethink the method so that the needs, the requirements and the possibilities of the hardware, of the software middle layer and of the physics itself are combined in the best way? Insisting on algorithms and mathematical-physics models developed in the 1950s is probably not the best use of petascale and exascale resources. We report on our recent experience built on the best combination of algorithms (using implicit formulations), software implementations and programming languages, and the best perspectives in upcoming hardware. The work reported here is partly completed as part of the EC-FP7 networks DEEP (deep-project.eu) and SWIFF (swiff.eu) and of the Intel Exascience Lab (exascience.com).
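
For readers less familiar with the particle-in-cell method mentioned above, the following minimal sketch (our own illustrative 1D electrostatic example in normalised units, not code from DEEP, SWIFF or the Exascience Lab) shows the characteristic deposit-solve-gather-push cycle whose memory-access and communication patterns drive the co-design questions raised in this talk.

/* Minimal 1D electrostatic particle-in-cell cycle (normalised units). */
#include <stdio.h>
#include <math.h>

#define NG 64          /* grid cells */
#define NP 10000       /* particles */
#define NSTEPS 200     /* time steps */
#define L 1.0          /* periodic domain length */
#define DT 0.05        /* time step */
#define QM (-1.0)      /* electron charge-to-mass ratio (normalised) */

int main(void) {
    double dx = L / NG, x[NP], v[NP], rho[NG], E[NG];
    double qp = -L / NP;                      /* particle charge: mean density -1 */

    /* two cold counter-streaming beams, uniformly spaced */
    for (int p = 0; p < NP; p++) {
        x[p] = L * (p + 0.5) / NP;
        v[p] = (p % 2 == 0) ? 0.2 : -0.2;
    }

    for (int step = 0; step < NSTEPS; step++) {
        /* 1. deposit: cloud-in-cell weighting of charge to the grid,
              on top of a fixed neutralising ion background */
        for (int i = 0; i < NG; i++) rho[i] = 1.0;
        for (int p = 0; p < NP; p++) {
            int i = (int)(x[p] / dx);
            double w = x[p] / dx - i;
            rho[i]            += qp * (1.0 - w) / dx;
            rho[(i + 1) % NG] += qp * w / dx;
        }
        /* 2. solve: integrate dE/dx = rho on the periodic grid, remove mean E */
        E[0] = 0.0;
        for (int i = 1; i < NG; i++) E[i] = E[i - 1] + rho[i - 1] * dx;
        double Emean = 0.0;
        for (int i = 0; i < NG; i++) Emean += E[i] / NG;
        for (int i = 0; i < NG; i++) E[i] -= Emean;
        /* 3. gather + push: interpolate E to particles, leapfrog update, wrap */
        for (int p = 0; p < NP; p++) {
            int i = (int)(x[p] / dx);
            double w = x[p] / dx - i;
            double Ep = (1.0 - w) * E[i] + w * E[(i + 1) % NG];
            v[p] += QM * Ep * DT;
            x[p] += v[p] * DT;
            x[p] = fmod(fmod(x[p], L) + L, L);
        }
    }
    printf("particle 0 after %d steps: x = %f, v = %f\n", NSTEPS, x[0], v[0]);
    return 0;
}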

Accepted Talks

Title: InfiniBand CORE-Direct and Dynamically Connected Transport Service: A Hardware-Application Co-Design Case Study

Authors: Noam Bloch, Richard Graham, Gilad Shainer, Todd Wilde, Mellanox Technologies

The challenges facing the High Performance Computing community as it moves towards Exascale computing encompass the full system, from the hardware up through the application software. These challenges require application software to exploit unprecedented levels of parallelism on systems in constant flux, while the cost of moving data at such scale within the energy budget poses a large challenge for hardware designers. Given the enormity of these challenges, it is essential that application development and hardware design be done cooperatively, to help close the gap between today's programming practices and the design constraints facing hardware designers, as well as to enable applications to take full advantage of the hardware capabilities. Over the past several years, Mellanox Technologies has been working in close cooperation with application developers to develop new communication technologies; CORE-Direct and the newly developed Dynamically Connected Transport are outcomes of these co-design efforts. This talk will describe these capabilities and discuss future co-design plans.

The CORE-Direct technology has been developed to address scalability issues faced by applications using collective communication, including the effects of system noise. A large fraction of scientific simulations use such functionality, and the performance of collective communication is often a limiting factor for application scalability. With collective algorithms being used to guide hardware and software development, the CORE-Direct functionality offloads collective communication progression to the Host Channel Adapter (HCA), leaving the Central Processing Unit free to perform other work as the collective communication progresses. Such an implementation provides hardware support for asynchronous non-blocking collective communication, as well as a means of addressing some of the system-noise problems. This functionality has been available in Mellanox HCAs since ConnectX-2, and has been shown to provide both good absolute performance and effective asynchronous communication support.
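
To make the overlap pattern concrete, the sketch below (an illustration of ours using the standard MPI-3 non-blocking collective interface, not Mellanox code) shows the kind of asynchronous collective that hardware offload such as CORE-Direct is designed to progress while the CPU keeps computing.

/* Overlap an MPI_Iallreduce with independent computation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank, global = 0.0, work = 0.0;
    MPI_Request req;

    /* Start the reduction; with hardware offload the collective can
       progress on the HCA while the CPU continues with other work. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* Independent computation overlapped with the collective. */
    for (int i = 0; i < 1000000; i++)
        work += 1.0 / (i + 1.0);

    /* Complete the collective before its result is used. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0)
        printf("sum of ranks = %g (overlapped work = %g)\n", global, work);

    MPI_Finalize();
    return 0;
}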

In addition to addressing the scalability of InfiniBand's collective communication support, hardware support has been added to the new Connect-IB for scalable point-to-point communications. A new transport, called Dynamically Connected Transport (DC), has been added. With this transport, the hardware creates reliable connections dynamically, with the number of connections required scaling with application communication characteristics and single-host communication capabilities. As such, it forms the basis for a reliable, scalable transport substrate aimed at supporting application needs at the Exascale. This talk will describe the co-design principles used in developing both the CORE-Direct and the Dynamically Connected Transport capabilities. Detailed results from experiments performed using the CORE-Direct functionality will be presented, along with very early results from using the DC transport. Lessons learned and plans for continued application-hardware co-design work will also be discussed.

Title: Enabling In-situ Pre- and Post-Processing for Exascale Hemodynamic Simulations - A Co-Design Study with the Sparse Geometry Lattice Boltzmann Code HemeLB

Authors: Fang Chen*, Markus Flatken*, Achim Basermann*, Andreas Gerndt*, James Hetherington⁺, Timm Krüger⁺, Gregor Matura⁺, Rupert Nash⁺

*German Aerospace Center (DLR), ⁺University College London

Today's fluid simulations deal with complex geometries and numerical data on an extreme scale. As computation approaches the exascale, it will no longer be possible to write and store the full-sized data set. In-situ data analysis and scientific visualization provide feasible solutions to the analysis of complex, large-scale CFD simulations. To bring pre- and post-processing to the exascale we must consider modifications to data structures and memory layout, and address latency and error resiliency. In this respect, a particular challenge is the exascale data processing for the sparse-geometry lattice Boltzmann code HemeLB, intended for hemodynamic simulations.

In this paper, we assess the needs and challenges of HemeLB users and sketch a co-design infrastructure and system architecture for pre- and post-processing the simulation data. To enable in-situ data visualization and analysis during a running simulation, post-processing needs to work on a reduced subset of the original data. Particular choices of data structure and visualization techniques need to be co-designed with the application scientists in order to achieve efficient and interactive data processing and analysis. In this work, we focus on the hierarchical data structure and suitable visualization techniques which provide possible solutions to interactive in-situ data processing at exascale.
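
As an illustration of the kind of hierarchical reduction meant here (a simplified sketch of our own on a dense cubic field, not the actual HemeLB sparse-geometry data structure), repeated 2x2x2 block averaging builds a multi-resolution pyramid from which an in-situ post-processor can select the coarsest level that still supports the analysis at hand.

/* Coarsen a cubic scalar field of edge length n (n even) by averaging
   2x2x2 blocks; applying this repeatedly yields a multi-resolution pyramid. */
#include <stdio.h>
#include <stdlib.h>

static void coarsen(const double *fine, int n, double *coarse) {
    int m = n / 2;
    for (int k = 0; k < m; k++)
        for (int j = 0; j < m; j++)
            for (int i = 0; i < m; i++) {
                double sum = 0.0;
                for (int dk = 0; dk < 2; dk++)
                    for (int dj = 0; dj < 2; dj++)
                        for (int di = 0; di < 2; di++)
                            sum += fine[(2*k + dk)*n*n + (2*j + dj)*n + (2*i + di)];
                coarse[k*m*m + j*m + i] = sum / 8.0;   /* block average */
            }
}

int main(void) {
    int n = 64;                                        /* fine-level edge length */
    double *fine = malloc((size_t)n*n*n * sizeof *fine);
    double *coarse = malloc((size_t)(n/2)*(n/2)*(n/2) * sizeof *coarse);
    for (int i = 0; i < n*n*n; i++) fine[i] = (double)(i % 17);   /* dummy field */
    coarsen(fine, n, coarse);
    printf("coarse[0] = %f\n", coarse[0]);
    free(fine);
    free(coarse);
    return 0;
}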

Title: Communication Performance Analysis of CRESTA’s Co-Design Application NEK5000

Authors: Michael Schliephake and Erwin Laure, SeRC (Swedish e-Science Research Center) and PDC, Royal Institute of Technology, Sweden

The EU FP7 project CRESTA addresses the exascale challenge, which requires new solutions with respect to algorithms, programming models, and system software, among many other areas. CRESTA has chosen a co-design approach in which important HPC applications with proven high performance are developed jointly with system software that supports high application efficiency.

We present results from one of the ongoing co-design development efforts. They exemplify CRESTA’s approach to co-design in general, and in particular the challenges application developers face in the design of MPI communication on current and upcoming large-scale systems. This co-design effort started with the analysis of the CRESTA application NEK5000, which represents important classes of numerical simulation codes.

NEK5000 is an open-source solver for computational fluid dynamics calculations and scales well to more than 250,000 cores. An analysis of the performance of its communication infrastructure is presented. It turns out that its implementation, based on an assumed hypergraph network topology, shows very good performance on different system architectures.

Finally, we discuss conclusions drawn from the application analysis with respect to the application's further development in order to make efficient use of larger systems in the future. The knowledge gained from the performance analysis will also be applied in the implementation of run-time services that support dynamic load balancing. This will close the co-design circle: application needs inspire system software developments that in turn improve the application significantly.

Title: A PGAS implementation by co-design of the ECMWF Integrated Forecasting System (IFS)

Authors: George Mozdzynski*, Mats Hamrud*, Nils Wedi*, Jens Doleschal, TU Dresden (TUD), and Harvey Richardson, Cray

* ECMWF

ECMWF is a partner in the Collaborative Research into Exascale Systemware, Tools and Applications (CRESTA) project, funded by a recent EU call for projects in Exa-scale computing, software and simulation (ICT-2011.9.13). The significance of the research carried out within the CRESTA project is that it will demonstrate techniques required to scale the current generation of petascale simulation codes towards the performance levels required for running on future exascale systems.

Within CRESTA, ECMWF is exploring the use of Fortran 2008 coarrays; in particular, this is possibly the first time that coarrays have been used in a world-leading production application within the context of OpenMP parallel regions. The purpose of these optimizations is primarily to allow the overlap of computation and communication and, in the case of the semi-Lagrangian optimization, to additionally reduce the volume of data communicated by removing the need for a constant-width halo for computing the trajectories of particles of air backwards in time. The importance of this research is such that, if these developments are successful, the IFS model can continue to use the spectral method until 2025 and beyond for the currently planned model resolutions on an exascale-sized system. This research is further significant in that the techniques used should be applicable to other hybrid MPI/OpenMP codes with the potential to overlap computation and communication.
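
Fortran 2008 coarrays provide one-sided, PGAS-style access to remote data. As a rough MPI analogue of that idea (our illustration; the IFS work itself uses coarrays), the sketch below fetches only the individual remote values a computation needs via one-sided MPI communication, rather than pre-exchanging a constant-width halo.

/* Fetch a single remote value on demand with MPI one-sided communication. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1000   /* grid-point values owned by each task (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose the locally owned field through an RMA window (the PGAS-like step). */
    double field[NLOCAL], remote_value = 0.0;
    for (int i = 0; i < NLOCAL; i++) field[i] = rank * 1000.0 + i;
    MPI_Win win;
    MPI_Win_create(field, NLOCAL * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Fetch one needed value from a neighbour, instead of pre-communicating
       a constant-width halo around every subdomain. */
    int target = (rank + 1) % size;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(&remote_value, 1, MPI_DOUBLE, target, 42 /* displacement */,
            1, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);   /* completes the get */

    printf("rank %d fetched %g from rank %d\n", rank, remote_value, target);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}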

In a nutshell, IFS is a spectral, semi-implicit, semi-Lagrangian weather prediction code, where model data exists in three spaces, namely, grid-point, Fourier and spectral space. In a single time-step data is transposed between these spaces so that the respective grid-point, Fourier and spectral computations are independent over two of the three co-ordinate directions in each space. Fourier transforms are performed between grid-point and Fourier space, and Legendre transforms are performed between Fourier and spectral space.
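
For completeness, the two transforms can be written in the usual spherical-harmonic notation (our summary, not part of the original abstract) as

f_m(\mu) = \frac{1}{2\pi} \int_0^{2\pi} f(\lambda,\mu)\, e^{-im\lambda}\, d\lambda, \qquad f_n^m = \int_{-1}^{1} f_m(\mu)\, \bar{P}_n^m(\mu)\, d\mu,

where \lambda is longitude, \mu is the sine of latitude and \bar{P}_n^m are the normalised associated Legendre functions. In practice the first integral is evaluated with FFTs along complete latitude circles and the second with Gaussian quadrature along complete meridians, which is why the data are transposed between the spaces so that each task holds the full extent of the direction currently being transformed.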

At ECMWF, this same model is used in an Ensemble Prediction System (EPS) suite in which, today, 51 ensemble members are run at lower resolution with perturbed input conditions to provide probabilistic information that complements the accuracy of the high-resolution deterministic forecast. The EPS suite is a perfect candidate to run on future exascale systems, with each ensemble member being independent of the others. By increasing the number of members and their resolution, we could trivially fill an exascale system. However, there will always be a need for a high-resolution deterministic forecast, which is more challenging to scale and is the reason for ECMWF’s focus in the CRESTA project.

Today ECMWF uses a 16 km global grid for its operational deterministic model, and plans to scale up to a 10 km grid in 2014-15, followed by a 5 km grid in 2020-21, and a 2.5 km grid in 2025-26. These planned resolution increases will require IFS to run efficiently on about a million cores by 2025.

The current status of the coarray scalability developments to IFS will be presented in this talk, including an outline of planned developments.