Acceleration of low-latency gravitational wave searches using Maxwell-microarchitecture GPUs [IMA]

Low-latency detections of gravitational waves (GWs) are crucial to enable prompt follow-up observations of astrophysical transients by conventional telescopes. We have developed a low-latency pipeline using a technique called Summed Parallel Infinite Impulse Response (SPIIR) filtering, realized on a Graphics Processing Unit (GPU). In this paper, we exploit the new Maxwell memory access architecture in NVIDIA GPUs, namely the read-only data cache, warp-shuffle, and cross-warp atomic techniques. We report a 3-fold speed-up over our previous implementation of this filtering technique. To tackle SPIIR with relatively few filters, we develop a new GPU thread configuration with a nearly 10-fold speed-up. In addition, we implement a multi-rate scheme of SPIIR filtering using Maxwell GPUs. We achieve more than 100-fold speed-up over a single-core CPU for the multi-rate filtering scheme. This results in an overall 21-fold reduction in CPU usage for the entire SPIIR pipeline.

X. Guo, Q. Chu, S. Chung, et al.
Thu, 9 Feb 17

Comments: N/A

OpenCluster: A Flexible Distributed Computing Framework for Astronomical Data Processing [IMA]

The volume of data generated by modern astronomical telescopes is extremely large and rapidly growing. However, current high-performance data processing architectures and frameworks are not well suited to astronomers because of their limitations and programming difficulties. In this paper, we therefore present OpenCluster, an open-source distributed computing framework that supports the rapid development of high-performance processing pipelines for astronomical big data. We first detail the OpenCluster design principles and implementations and present the APIs provided by the framework. We then demonstrate a case in which OpenCluster is used to resolve complex data processing problems in developing a pipeline for the Mingantu Ultrawide Spectral Radioheliograph. Finally, we present our OpenCluster performance evaluation. Overall, OpenCluster provides not only high fault tolerance and simple programming interfaces, but also a flexible means of scaling up the number of interacting entities. OpenCluster thereby provides an easily integrated distributed computing framework for quickly developing a high-performance data processing system for astronomical telescopes and for significantly reducing software development expenses.

S. Wei, F. Wang, H. Deng, et al.
Thu, 19 Jan 17

Comments: N/A

Performance Optimisation of Smoothed Particle Hydrodynamics Algorithms for Multi/Many-Core Architectures [CL]

We describe a strategy for code modernisation of Gadget, a widely used community code for computational astrophysics. The focus of this work is on node-level performance optimisation, targeting current multi/many-core Intel architectures. We identify and isolate a sample code kernel, which is representative of a typical Smoothed Particle Hydrodynamics (SPH) algorithm. The code modifications include threading parallelism optimisation, change of the data layout into Structure of Arrays (SoA), auto-vectorisation and algorithmic improvements in the particle sorting. We measure lower execution time and improved threading scalability on both Intel Xeon (2.6× on Ivy Bridge) and Xeon Phi (13.7× on Knights Corner) systems. First tests on second generation Xeon Phi (Knights Landing) demonstrate the portability of the devised optimisation solutions to upcoming architectures.
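The data-layout change mentioned in this abstract, from an Array of Structures (AoS) to a Structure of Arrays (SoA), can be illustrated with a minimal sketch. This is not Gadget code: the `Particle` and `ParticlesSoA` names and their two fields are hypothetical, and Python is used only to show the layout idea.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Particle:
    """AoS element (hypothetical): all fields of one particle live together."""
    x: float
    mass: float


class ParticlesSoA:
    """SoA container (hypothetical): one contiguous array per field."""

    def __init__(self, particles):
        self.x = array("d", (p.x for p in particles))
        self.mass = array("d", (p.mass for p in particles))

    def __len__(self):
        return len(self.x)


def total_mass_aos(particles):
    # Each iteration visits a whole Particle object just to read one field.
    return sum(p.mass for p in particles)


def total_mass_soa(soa):
    # The loop streams through a single contiguous array of doubles.
    return sum(soa.mass)


aos = [Particle(x=float(i), mass=0.5) for i in range(8)]
soa = ParticlesSoA(aos)
print(total_mass_aos(aos), total_mass_soa(soa))
```

In a compiled language, the per-field contiguous arrays of the SoA form give unit-stride memory access in loops over one field, which is what makes such loops easier for a compiler to auto-vectorise. The Python version only demonstrates the layout, not the speed-up.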

F. Baruffa, L. Iapichino, N. Hammer, et al.
Tue, 20 Dec 16

Comments: 18 pages, 5 figures, submitted

Learning an Astronomical Catalog of the Visible Universe through Scalable Bayesian Inference [CL]

Celeste is a procedure for inferring astronomical catalogs that attains state-of-the-art scientific results. To date, Celeste has been scaled to at most hundreds of megabytes of astronomical images: Bayesian posterior inference is notoriously demanding computationally. In this paper, we report on a scalable, parallel version of Celeste, suitable for learning catalogs from modern large-scale astronomical datasets. Our algorithmic innovations include a fast numerical optimization routine for Bayesian posterior inference and a statistically efficient scheme for decomposing astronomical optimization problems into subproblems.
Our scalable implementation is written entirely in Julia, a new high-level dynamic programming language designed for scientific and numerical computing. We use Julia’s high-level constructs for shared and distributed memory parallelism, and demonstrate effective load balancing and efficient scaling on up to 8192 Xeon cores on the NERSC Cori supercomputer.

J. Regier, K. Pamnany, R. Giordano, et al.
Fri, 11 Nov 16

Comments: submitting to IPDPS’17

A Survey of High Level Frameworks in Block-Structured Adaptive Mesh Refinement Packages [CL]

Over the last decade, block-structured adaptive mesh refinement (SAMR) has found increasing use in large, publicly available codes and frameworks. SAMR frameworks have evolved along different paths. Some have stayed focused on specific domain areas; others have pursued a more general functionality, providing the building blocks for a larger variety of applications. In this survey paper we examine a representative set of SAMR packages and SAMR-based codes that have been in existence for half a decade or more, have a reasonably sized and active user base outside of their home institutions, and are publicly available. The set consists of a mix of SAMR packages and application codes that cover a broad range of scientific domains. We look at their high-level frameworks and their approach to dealing with the advent of radical changes in hardware architecture. The codes included in this survey are BoxLib, Cactus, Chombo, Enzo, FLASH, and Uintah.

A. Dubey, A. Almgren, J. Bell, et al.
Fri, 28 Oct 16

Comments: N/A

Extreme Scale-out SuperMUC Phase 2 – lessons learned [CL]

In spring 2015, the Leibniz Supercomputing Centre (Leibniz-Rechenzentrum, LRZ) installed its new Peta-Scale System SuperMUC Phase 2. Selected users were invited to a 28-day extreme scale-out block operation during which they were allowed to use the full system for their applications. The following projects participated in the extreme scale-out workshop: BQCD (Quantum Physics), SeisSol (Geophysics, Seismics), GPI-2/GASPI (Toolkit for HPC), Seven-League Hydro (Astrophysics), ILBDC (Lattice Boltzmann CFD), Iphigenie (Molecular Dynamics), FLASH (Astrophysics), GADGET (Cosmological Dynamics), PSC (Plasma Physics), waLBerla (Lattice Boltzmann CFD), Musubi (Lattice Boltzmann CFD), Vertex3D (Stellar Astrophysics), CIAO (Combustion CFD), and LS1-Mardyn (Material Science). The projects were allowed to use the machine exclusively during the 28-day period, which corresponds to a total of 63.4 million core-hours, of which 43.8 million core-hours were used by the applications, resulting in a utilization of 69%. The top 3 users used 15.2, 6.4, and 4.7 million core-hours, respectively.
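The utilization figure quoted in this abstract follows directly from the two core-hour totals. A short sketch of that arithmetic is given here, using only the numbers stated in the abstract; the variable names are illustrative, and the top-3 share is a derived quantity that the abstract itself does not report.

```python
# Figures taken from the abstract, in millions of core-hours.
TOTAL_CORE_HOURS_M = 63.4   # available during the 28-day block operation
USED_CORE_HOURS_M = 43.8    # actually consumed by the applications
TOP_USERS_M = (15.2, 6.4, 4.7)

# Utilization: fraction of the available core-hours that were used.
utilization = USED_CORE_HOURS_M / TOTAL_CORE_HOURS_M

# Derived here (not stated in the abstract): share of the used
# core-hours that went to the three largest users combined.
top3_share = sum(TOP_USERS_M) / USED_CORE_HOURS_M

print(f"utilization: {utilization:.0%}")
print(f"top-3 share of used core-hours: {top3_share:.0%}")
```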

N. Hammer, F. Jamitzky, H. Satzger, et al.
Wed, 7 Sep 16

Comments: 10 pages, 5 figures, presented at ParCo2015 – Advances in Parallel Computing, held in Edinburgh, September 2015. The final publication is available from IOS Press.

SpECTRE: A Task-based Discontinuous Galerkin Code for Relativistic Astrophysics [HEAP]

We introduce a new relativistic astrophysics code, SpECTRE, that combines a discontinuous Galerkin method with a task-based parallelism model. SpECTRE’s goal is to achieve more accurate solutions for challenging relativistic astrophysics problems such as core-collapse supernovae and binary neutron star mergers. The robustness of the discontinuous Galerkin method allows for the use of high-resolution shock capturing methods in regions where (relativistic) shocks are found, while exploiting high-order accuracy in smooth regions. A task-based parallelism model allows efficient use of the largest supercomputers for problems with a heterogeneous workload over disparate spatial and temporal scales. We argue that the locality and algorithmic structure of discontinuous Galerkin methods will exhibit good scalability within a task-based parallelism framework. We demonstrate the code on a wide variety of challenging benchmark problems in (non)-relativistic (magneto)-hydrodynamics. We demonstrate the code’s scalability including its strong scaling on the NCSA Blue Waters supercomputer up to the machine’s full capacity of 22,380 nodes using 671,400 threads.

L. Kidder, S. Field, F. Foucart, et al.
Fri, 2 Sep 16

Comments: 39 pages, 13 figures, and 7 tables