# Multi-GPU maximum entropy image synthesis for radio astronomy [IMA]

The maximum entropy method (MEM) is a well known deconvolution technique in radio-interferometry. This method solves a non-linear optimization problem with an entropy regularization term. Other heuristics such as CLEAN are faster but highly user dependent. Nevertheless, MEM has the following advantages: it is unsupervised, it has an statistical basis, it has a better resolution and better image quality under certain conditions. This work presents a high performance GPU version of non-gridded MEM, which is tested using interferometric and simulated data. We propose a single-GPU and a multi-GPU implementation for single and multi-spectral data, respectively. We also make use of the Peer-to-Peer and Unified Virtual Addressing features of newer GPUs which allows to exploit transparently and efficiently multiple GPUs. Several ALMA data sets are used to demonstrate the effectiveness in imaging and to evaluate GPU performance. The results show that a speedup from 1000 to 5000 times faster than a sequential version can be achieved, depending on data and image size. This has allowed us to reconstruct the HD142527 CO(6-5) short baseline data set in 2.1 minutes, instead of the 2.5 days that takes on CPU.

M. Carcamo, P. Roman, S. Casassus, et. al.
Thu, 9 Mar 17
36/54

|

# Acceleration of low-latency gravitational wave searches using Maxwell-microarchitecture GPUs [IMA]

Low-latency detections of gravitational waves (GWs) are crucial to enable prompt follow-up observations to astrophysical transients by conventional telescopes. We have developed a low-latency pipeline using a technique called Summed Parallel Infinite Impulse Response (SPIIR) filtering, realized by a Graphic Processing Unit (GPU). In this paper, we exploit the new \textit{Maxwell} memory access architecture in NVIDIA GPUs, namely the read-only data cache, warp-shuffle, and cross-warp atomic techniques. We report a 3-fold speed-up over our previous implementation of this filtering technique. To tackle SPIIR with relatively few filters, we develop a new GPU thread configuration with a nearly 10-fold speedup. In addition, we implement a multi-rate scheme of SPIIR filtering using Maxwell GPUs. We achieve more than 100-fold speed-up over a single core CPU for the multi-rate filtering scheme. This results in an overall of 21-fold CPU usage reduction for the entire SPIIR pipeline.

X. Guo, Q. Chu, S. Chung, et. al.
Thu, 9 Feb 17
38/67

|

# OpenCluster: A Flexible Distributed Computing Framework for Astronomical Data Processing [IMA]

The volume of data generated by modern astronomical telescopes is extremely large and rapidly growing. However, current high-performance data processing architectures/frameworks are not well suited for astronomers because of their limitations and programming difficulties. In this paper, we therefore present OpenCluster, an open-source distributed computing framework to support rapidly developing high-performance processing pipelines of astronomical big data. We first detail the OpenCluster design principles and implementations and present the APIs facilitated by the framework. We then demonstrate a case in which OpenCluster is used to resolve complex data processing problems for developing a pipeline for the Mingantu Ultrawide Spectral Radioheliograph. Finally, we present our OpenCluster performance evaluation. Overall, OpenCluster provides not only high fault tolerance and simple programming interfaces, but also a flexible means of scaling up the number of interacting entities. OpenCluster thereby provides an easily integrated distributed computing framework for quickly developing a high-performance data processing system of astronomical telescopes and for significantly reducing software development expenses.

S. Wei, F. Wang, H. Deng, et. al.
Thu, 19 Jan 17
3/42

|

# Performance Optimisation of Smoothed Particle Hydrodynamics Algorithms for Multi/Many-Core Architectures [CL]

We describe a strategy for code modernisation of Gadget, a widely used community code for computational astrophysics. The focus of this work is on node-level performance optimisation, targeting current multi/many-core Intel architectures. We identify and isolate a sample code kernel, which is representative of a typical Smoothed Particle Hydrodynamics (SPH) algorithm. The code modifications include threading parallelism optimisation, change of the data layout into Structure of Arrays (SoA), auto-vectorisation and algorithmic improvements in the particle sorting. We measure lower execution time and improved threading scalability both on Intel Xeon ($2.6 \times$ on Ivy Bridge) and Xeon Phi ($13.7 \times$ on Knights Corner) systems. First tests on second generation Xeon Phi (Knights Landing) demonstrate the portability of the devised optimisation solutions to upcoming architectures.

F. Baruffa, L. Iapichino, N. Hammer, et. al.
Tue, 20 Dec 16
85/88

Comments: 18 pages, 5 figures, submitted

# Learning an Astronomical Catalog of the Visible Universe through Scalable Bayesian Inference [CL]

Celeste is a procedure for inferring astronomical catalogs that attains state-of-the-art scientific results. To date, Celeste has been scaled to at most hundreds of megabytes of astronomical images: Bayesian posterior inference is notoriously demanding computationally. In this paper, we report on a scalable, parallel version of Celeste, suitable for learning catalogs from modern large-scale astronomical datasets. Our algorithmic innovations include a fast numerical optimization routine for Bayesian posterior inference and a statistically efficient scheme for decomposing astronomical optimization problems into subproblems.
Our scalable implementation is written entirely in Julia, a new high-level dynamic programming language designed for scientific and numerical computing. We use Julia’s high-level constructs for shared and distributed memory parallelism, and demonstrate effective load balancing and efficient scaling on up to 8192 Xeon cores on the NERSC Cori supercomputer.

J. Regier, K. Pamnany, R. Giordano, et. al.
Fri, 11 Nov 16
11/40

# A Survey of High Level Frameworks in Block-Structured Adaptive Mesh Refinement Packages [CL]

Over the last decade block-structured adaptive mesh refinement (SAMR) has found increasing use in large, publicly available codes and frameworks. SAMR frameworks have evolved along different paths. Some have stayed focused on specific domain areas, others have pursued a more general functionality, providing the building blocks for a larger variety of applications. In this survey paper we examine a representative set of SAMR packages and SAMR-based codes that have been in existence for half a decade or more, have a reasonably sized and active user base outside of their home institutions, and are publicly available. The set consists of a mix of SAMR packages and application codes that cover a broad range of scientific domains. We look at their high-level frameworks, and their approach to dealing with the advent of radical changes in hardware architecture. The codes included in this survey are BoxLib, Cactus, Chombo, Enzo, FLASH, and Uintah.

A. Dubey, A. Almgren, J. Bell, et. al.
Fri, 28 Oct 16
37/73

# Extreme Scale-out SuperMUC Phase 2 – lessons learned [CL]

In spring 2015, the Leibniz Supercomputing Centre (Leibniz-Rechenzentrum, LRZ), installed their new Peta-Scale System SuperMUC Phase2. Selected users were invited for a 28 day extreme scale-out block operation during which they were allowed to use the full system for their applications. The following projects participated in the extreme scale-out workshop: BQCD (Quantum Physics), SeisSol (Geophysics, Seismics), GPI-2/GASPI (Toolkit for HPC), Seven-League Hydro (Astrophysics), ILBDC (Lattice Boltzmann CFD), Iphigenie (Molecular Dynamic), FLASH (Astrophysics), GADGET (Cosmological Dynamics), PSC (Plasma Physics), waLBerla (Lattice Boltzmann CFD), Musubi (Lattice Boltzmann CFD), Vertex3D (Stellar Astrophysics), CIAO (Combustion CFD), and LS1-Mardyn (Material Science). The projects were allowed to use the machine exclusively during the 28 day period, which corresponds to a total of 63.4 million core-hours, of which 43.8 million core-hours were used by the applications, resulting in a utilization of 69%. The top 3 users were using 15.2, 6.4, and 4.7 million core-hours, respectively.

N. Hammer, F. Jamitzky, H. Satzger, et. al.
Wed, 7 Sep 16
46/61

Comments: 10 pages, 5 figures, presented at ParCo2015 – Advances in Parallel Computing, held in Edinburgh, September 2015. The final publication is available at IOS Press through this http URL

# SpECTRE: A Task-based Discontinuous Galerkin Code for Relativistic Astrophysics [HEAP]

We introduce a new relativistic astrophysics code, SpECTRE, that combines a discontinuous Galerkin method with a task-based parallelism model. SpECTRE’s goal is to achieve more accurate solutions for challenging relativistic astrophysics problems such as core-collapse supernovae and binary neutron star mergers. The robustness of the discontinuous Galerkin method allows for the use of high-resolution shock capturing methods in regions where (relativistic) shocks are found, while exploiting high-order accuracy in smooth regions. A task-based parallelism model allows efficient use of the largest supercomputers for problems with a heterogeneous workload over disparate spatial and temporal scales. We argue that the locality and algorithmic structure of discontinuous Galerkin methods will exhibit good scalability within a task-based parallelism framework. We demonstrate the code on a wide variety of challenging benchmark problems in (non)-relativistic (magneto)-hydrodynamics. We demonstrate the code’s scalability including its strong scaling on the NCSA Blue Waters supercomputer up to the machine’s full capacity of 22,380 nodes using 671,400 threads.

L. Kidder, S. Field, F. Foucart, et. al.
Fri, 2 Sep 16
6/49

Comments: 39 pages, 13 figures, and 7 tables

# A Communication Efficient and Scalable Distributed Data Mining for the Astronomical Data [IMA]

Fri, 24 Jun 16
46/47

Comments: Accepted in Astronomy and Computing, 2016, 20 Pages, 19 Figures

|

# Mathematical Foundations of the GraphBLAS [CL]

The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix based graph algorithms to the broadest possible audience. Mathematically the Graph- BLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of the GraphBLAS. Graphs represent connections between vertices with edges. Matrices can represent a wide range of graphs using adjacency matrices or incidence matrices. Adjacency matrices are often easier to analyze while incidence matrices are often better for representing data. Fortunately, the two are easily connected by matrix mul- tiplication. A key feature of matrix mathematics is that a very small number of matrix operations can be used to manipulate a very wide range of graphs. This composability of small number of operations is the foundation of the GraphBLAS. A standard such as the GraphBLAS can only be effective if it has low performance overhead. Performance measurements of prototype GraphBLAS implementations indicate that the overhead is low.

J. Kepner, P. Aaltonen, D. Bader, et. al.
Tue, 21 Jun 16
72/75

Comments: 9 pages; 11 figures; accepted to IEEE High Performance Extreme Computing (HPEC) conference 2016

# Splotch: porting and optimizing for the Xeon Phi [CL]

With the increasing size and complexity of data produced by large scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogenous High Performance Computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel’s coprocessor based upon the new Many Integrated Core architecture. We discuss steps taken to offload data to the coprocessor and algorithmic modifications to aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi. Finally performance is compared against results achieved with the GPU implementation of Splotch.

T. Dykes, C. Gheller, M. Rivi, et. al.
Wed, 15 Jun 16
20/54

Comments: Version 1, 11 pages, 14 figures. Accepted for publication in International Journal of High Performance Computing Applications (IJHPCA)

# SWIFT: Using task-based parallelism, fully asynchronous communication, and graph partition-based domain decomposition for strong scaling on more than 100,000 cores [CL]

M. Schaller, P. Gonnet, A. Chalk, et. al.
Fri, 10 Jun 16
33/54

Comments: 9 pages, 7 figures. Code, scripts and examples available at this http URL

# The Latin American Giant Observatory: a successful collaboration in Latin America based on Cosmic Rays and computer science domains [IMA]

In this work the strategy of the Latin American Giant Observatory (LAGO) to build a Latin American collaboration is presented. Installing Cosmic Rays detectors settled all around the Continent, from Mexico to the Antarctica, this collaboration is forming a community that embraces both high energy physicist and computer scientists. This is so because the data that are measured must be analytical processed and due to the fact that \textit{a priori} and \textit{a posteriori} simulations representing the effects of the radiation must be performed. To perform the calculi, customized codes have been implemented by the collaboration. With regard to the huge amount of data emerging from this network of sensors and from the computational simulations performed in a diversity of computing architectures and e-infrastructures, an effort is being carried out to catalog and preserve a vast amount of data produced by the water-Cherenkov Detector network and the complete LAGO simulation workflow that characterize each site. Metadata, Permanent Identifiers and the facilities from the LAGO Data Repository are described in this work jointly with the simulation codes used. These initiatives allow researchers to produce and find data and to directly use them in a code running by means of a Science Gateway that provides access to different clusters, Grid and Cloud infrastructures worldwide.

H. Asorey, R. Mayo-Garcia, L. Nunez, et. al.
Tue, 31 May 16
4/70

Comments: to be published in Proccedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing

|

# Fine tuning consensus optimization for distributed radio interferometric calibration [IMA]

We recently proposed the use of consensus optimization as a viable and effective way to improve the quality of calibration of radio interferometric data. We showed that it is possible to obtain far more accurate calibration solutions and also to distribute the compute load across a network of computers by using this technique. A crucial aspect in any consensus optimization problem is the selection of the penalty parameter used in the alternating direction method of multipliers (ADMM) iterations. This affects the convergence speed as well as the accuracy. In this paper, we use the Hessian of the cost function used in calibration to appropriately select this penalty. We extend our results to a multi-directional calibration setting, where we propose to use a penalty scaled by the squared intensity of each direction.

S. Yatawatta
Tue, 31 May 16
67/70

Comments: Draft, to be published in the Proceedings of the 24th European Signal Processing Conference (EUSIPCO-2016) in 2016, published by EURASIP

|

# Convection in Oblate Solar-Type Stars [SSA]

We present the first global 3D simulations of thermal convection in the oblate envelopes of rapidly-rotating solar-type stars. This has been achieved by exploiting the capabilities of the new Compressible High-ORder Unstructured Spectral difference (CHORUS) code. We consider rotation rates up to 85\% of the critical (breakup) rotation rate, which yields an equatorial radius that is up to 17\% larger than the polar radius. This substantial oblateness enhances the disparity between polar and equatorial modes of convection. We find that the convection redistributes the heat flux emitted from the outer surface, leading to an enhancement of the heat flux in the polar and equatorial regions. This finding implies that lower-mass stars with convective envelopes may not have darker equators as predicted by classical gravity darkening arguments. The vigorous high-latitude convection also establishes elongated axisymmetric circulation cells and zonal jets in the polar regions. Though the overall amplitude of the surface differential rotation, $\Delta \Omega$, is insensitive to the oblateness, the oblateness does limit the fractional kinetic energy contained in the differential rotation to no more than 61\%. Furthermore, we argue that this level of differential rotation is not enough to have a significant impact on the oblateness of the star.

J. Wang, M. Miesch and C. Liang
Fri, 18 Mar 16
4/53

# PageRank Pipeline Benchmark: Proposal for a Holistic System Benchmark for Big-Data Platforms [CL]

The rise of big data systems has created a need for benchmarks to measure and compare the capabilities of these systems. Big data benchmarks present unique scalability challenges. The supercomputing community has wrestled with these challenges for decades and developed methodologies for creating rigorous scalable benchmarks (e.g., HPC Challenge). The proposed PageRank pipeline benchmark employs supercomputing benchmarking methodologies to create a scalable benchmark that is reflective of many real-world big data processing systems. The PageRank pipeline benchmark builds on existing prior scalable benchmarks (Graph500, Sort, and PageRank) to create a holistic benchmark with multiple integrated kernels that can be run together or independently. Each kernel is well defined mathematically and can be implemented in any programming environment. The linear algebraic nature of PageRank makes it well suited to being implemented using the GraphBLAS standard. The computations are simple enough that performance predictions can be made based on simple computing hardware models. The surrounding kernels provide the context for each kernel that allows rigorous definition of both the input and the output for each kernel. Furthermore, since the proposed PageRank pipeline benchmark is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present day and future systems. Serial implementations in C++, Python, Python with Pandas, Matlab, Octave, and Julia have been implemented and their single threaded performance has been measured.

P. Dreher, C. Byun, C. Hill, et. al.
Tue, 8 Mar 16
82/83

Comments: 9 pages, 7 figures, to appear in IPDPS 2016 Graph Algorithms Building Blocks (GABB) workshop

# The Matsu Wheel: A Cloud-based Framework for Efficient Analysis and Reanalysis of Earth Satellite Imagery [CL]

Project Matsu is a collaboration between the Open Commons Consortium and NASA focused on developing open source technology for the cloud-based processing of Earth satellite imagery. A particular focus is the development of applications for detecting fires and floods to help support natural disaster detection and relief. Project Matsu has developed an open source cloud-based infrastructure to process, analyze, and reanalyze large collections of hyperspectral satellite image data using OpenStack, Hadoop, MapReduce, Storm and related technologies.
We describe a framework for efficient analysis of large amounts of data called the Matsu “Wheel.” The Matsu Wheel is currently used to process incoming hyperspectral satellite data produced daily by NASA’s Earth Observing-1 (EO-1) satellite. The framework is designed to be able to support scanning queries using cloud computing applications, such as Hadoop and Accumulo. A scanning query processes all, or most of the data, in a database or data repository.
We also describe our preliminary Wheel analytics, including an anomaly detector for rare spectral signatures or thermal anomalies in hyperspectral data and a land cover classifier that can be used for water and flood detection. Each of these analytics can generate visual reports accessible via the web for the public and interested decision makers. The resultant products of the analytics are also made accessible through an Open Geospatial Compliant (OGC)-compliant Web Map Service (WMS) for further distribution. The Matsu Wheel allows many shared data services to be performed together to efficiently use resources for processing hyperspectral satellite image data and other, e.g., large environmental datasets that may be analyzed for many purposes.

M. Patterson, N. Anderson, C. Bennett, et. al.
Tue, 23 Feb 16
75/78

Comments: 10 pages, accepted for presentation to IEEE BigDataService 2016

# Gravitational wave astrophysics, data analysis and multimessenger astronomy [IMA]

This paper reviews gravitational wave sources and their detection. One of the most exciting potential sources of gravitational waves are coalescing binary black hole systems. They can occur on all mass scales and be formed in numerous ways, many of which are not understood. They are generally invisible in electromagnetic waves, and they provide opportunities for deep investigation of Einstein’s general theory of relativity. Sect. 1 of this paper considers ways that binary black holes can be created in the universe, and includes the prediction that binary black hole coalescence events are likely to be the first gravitational wave sources to be detected. The next parts of this paper address the detection of chirp waveforms from coalescence events in noisy data. Such analysis is computationally intensive. Sect. 2 reviews a new and powerful method of signal detection based on the GPU-implemented summed parallel infinite impulse response filters. Such filters are intrinsically real time alorithms, that can be used to rapidly detect and localise signals. Sect. 3 of the paper reviews the use of GPU processors for rapid searching for gravitational wave bursts that can arise from black hole births and coalescences. In sect. 4 the use of GPU processors to enable fast efficient statistical significance testing of gravitational wave event candidates is reviewed. Sect. 5 of this paper addresses the method of multimessenger astronomy where the discovery of electromagnetic counterparts of gravitational wave events can be used to identify sources, understand their nature and obtain much greater science outcomes from each identified event.

H. Lee, E. Bigot, Z. Du, et. al.
Fri, 19 Feb 16
42/50

|

# Auto-Tuning Dedispersion for Many-Core Accelerators [CL]

In this paper, we study the parallelization of the dedispersion algorithm on many-core accelerators, including GPUs from AMD and NVIDIA, and the Intel Xeon Phi. An important contribution is the computational analysis of the algorithm, from which we conclude that dedispersion is inherently memory-bound in any realistic scenario, in contrast to earlier reports. We also provide empirical proof that, even in unrealistic scenarios, hardware limitations keep the arithmetic intensity low, thus limiting performance. We exploit auto-tuning to adapt the algorithm, not only to different accelerators, but also to different observations, and even telescopes. Our experiments show how the algorithm is tuned automatically for different scenarios and how it exploits and highlights the underlying specificities of the hardware: in some observations, the tuner automatically optimizes device occupancy, while in others it optimizes memory bandwidth. We quantitatively analyze the problem space, and by comparing the results of optimal auto-tuned versions against the best performing fixed codes, we show the impact that auto-tuning has on performance, and conclude that it is statistically relevant.

A. Sclocco, H. Bal, J. Hessels, et. al.
Wed, 20 Jan 16
5/58

Comments: 10 pages, published in the proceedings of IPDPS 2014

# A polyphase filter for many-core architectures [IMA]

In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards, on dual Intel Xeon CPUs and the Intel Xeon Phi (Knights Corner) platforms. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this, the first makes use of L1/Texture cache, the second uses shared memory. We discuss the usability of each of our implementations along with their behaviours. We measure performance in execution time, which is a critical factor for real-time systems, we also present results in terms of bandwidth (GB/s), compute (GFlop/s) and type conversions (GTc/s). We include a presentation of our results in terms of the sample rate which can be processed in real-time by a chosen platform, which more intuitively describes the expected performance in a signal processing setting. Our findings show that, for the GPUs considered, the performance of our polyphase filter when using lower precision input data is limited by type conversions rather than device bandwidth. We compare these results to an implementation on the Xeon Phi. We show that our Xeon Phi implementation has a performance that is 1.47x to 1.95x greater than our CPU implementation, however is not insufficient to compete with the performance of GPUs. We conclude with a comparison of our best performing code to two other implementations of the polyphase filter, showing that our implementation is faster in nearly all cases. This work forms part of the Astro-Accelerate project, a many-core accelerated real-time data processing library for digital signal processing of time-domain radio astronomy data.

K. Adamek, J. Novotny and W. Armour
Thu, 12 Nov 15
37/61

Comments: 19 pages, 20 figures, 5 tables

|

# SWIFT: task-based hydrodynamics and gravity for cosmological simulations [IMA]

Simulations of galaxy formation follow the gravitational and hydrodynamical interactions between gas, stars and dark matter through cosmic time. The huge dynamic range of such calculations severely limits strong scaling behaviour of the community codes in use, with load-imbalance, cache inefficiencies and poor vectorisation limiting performance. The new swift code exploits task-based parallelism designed for many-core compute nodes interacting via MPI using asynchronous communication to improve speed and scaling. A graph-based domain decomposition schedules interdependent tasks over available resources. Strong scaling tests on realistic particle distributions yield excellent parallel efficiency, and efficient cache usage provides a large speed-up compared to current codes even on a single core. SWIFT is designed to be easy to use by shielding the astronomer from computational details such as the construction of the tasks or MPI communication. The techniques and algorithms used in SWIFT may benefit other computational physics areas as well, for example that of compressible hydrodynamics. For details of this open-source project, see http://www.swiftsim.com

T. Theuns, A. Chalk, M. Schaller, et. al.
Tue, 4 Aug 15
26/54

Comments: Proceedings of the EASC 2015 conference, Edinburgh, UK, April 21-23, 2015

|

# From Thread to Transcontinental Computer: Disturbing Lessons in Distributed Supercomputing [IMA]

We describe the political and technical complications encountered during the astronomical CosmoGrid project. CosmoGrid is a numerical study on the formation of large scale structure in the universe. The simulations are challenging due to the enormous dynamic range in spatial and temporal coordinates, as well as the enormous computer resources required. In CosmoGrid we dealt with the computational requirements by connecting up to four supercomputers via an optical network and make them operate as a single machine. This was challenging, if only for the fact that the supercomputers of our choice are separated by half the planet, as three of them are located scattered across Europe and fourth one is in Tokyo. The co-scheduling of multiple computers and the ‘gridification’ of the code enabled us to achieve an efficiency of up to $93\%$ for this distributed intercontinental supercomputer. In this work, we find that high-performance computing on a grid can be done much more effectively if the sites involved are willing to be flexible about their user policies, and that having facilities to provide such flexibility could be key to strengthening the position of the HPC community in an increasingly Cloud-dominated computing landscape. Given that smaller computer clusters owned by research groups or university departments usually have flexible user policies, we argue that it could be easier to instead realize distributed supercomputing by combining tens, hundreds or even thousands of these resources.

D. Groen and S. Zwart
Tue, 7 Jul 15
26/65

Comments: Accepted for publication in IEEE conference on ERRORs

|

# Separable projection integrals for higher-order correlators of the cosmic microwave sky: Acceleration by factors exceeding 100 [CL]

We study the optimisation and porting of the “Modal” code on Intel(R) Xeon(R) processors and/or Intel(R) Xeon Phi(TM) coprocessors using methods which should be applicable to more general compute bound codes. “Modal” is used by the Planck satellite experiment for constraining general non-Gaussian models of the early universe via the bispectrum of the cosmic microwave background. We focus on the hot-spot of the code which is the projection of bispectra from the end of inflation to spherical shell at decoupling which defines the CMB we observe. This code involves a three-dimensional inner product between two functions, one of which requires an integral, on a non-rectangular sparse domain. We show that by employing separable methods this calculation can be reduced to a one dimensional summation plus two integrations reducing the dimensionality from four to three. The introduction of separable functions also solves the issue of the domain allowing efficient vectorisation and load balancing. This method becomes unstable in certain cases and so we present a discussion of the optimisation of both approaches. By making bispectrum calculations competitive with those for the power spectrum we are now able to consider joint analysis for cosmological science exploitation of new data. We demonstrate speed-ups of over 100x, arising from a combination of algorithmic improvements and architecture-aware optimizations targeted at improving thread and vectorization behaviour. The resulting MPI/OpenMP code is capable of executing on clusters containing Intel(R) Xeon(R) processors and/or Intel(R) Xeon Phi(TM) coprocessors, with strong-scaling efficiency of 98.6% on up to 16 nodes. We find that a single coprocessor outperforms two processor sockets by a factor of 1.3x and that running the same code across a combination of processors and coprocessors improves performance-per-node by a factor of 3.38x.

J. Briggs, J. Jaykka, J. Fergusson, et. al.
Fri, 3 Apr 15
1/43

# StratOS: A Big Data Framework for Scientific Computing [IMA]

We introduce StratOS, a Big Data platform for general computing that allows a datacenter to be treated as a single computer. With StratOS, the process of writing a massively parallel program for a datacenter is no more complicated than writing a Python script for a desktop computer. Users can run pre-existing analysis software on data distributed over thousands of machines with just a few keystrokes. This greatly reduces the time required to develop distributed data analysis pipelines. The platform is built upon industry-standard, open-source Big Data technologies, from which it inherits fast data throughput and fault tolerance. StratOS enhances these technologies by adding an intuitive user interface, automated task monitoring, and other usability features.

N. Stickley and M. Aragon-Calvo
Tue, 10 Mar 15
4/77

|

# Building a scalable global data processing pipeline for large astronomical photometric datasets [IMA]

Astronomical photometry is the science of measuring the flux of a celestial object. Since its introduction, the CCD has been the principle method of measuring flux to calculate the apparent magnitude of an object. Each CCD image taken must go through a process of cleaning and calibration prior to its use. As the number of research telescopes increases the overall computing resources required for image processing also increases. Existing processing techniques are primarily sequential in nature, requiring increasingly powerful servers, faster disks and faster networks to process data. Existing High Performance Computing solutions involving high capacity data centres are complex in design and expensive to maintain, while providing resources primarily to high profile science projects. This research describes three distributed pipeline architectures, a virtualised cloud based IRAF, the Astronomical Compute Node (ACN), a private cloud based pipeline, and NIMBUS, a globally distributed system. The ACN pipeline processed data at a rate of 4 Terabytes per day demonstrating data compression and upload to a central cloud storage service at a rate faster than data generation. The primary contribution of this research is NIMBUS, which is rapidly scalable, resilient to failure and capable of processing CCD image data at a rate of hundreds of Terabytes per day. This pipeline is implemented using a decentralised web queue to control the compression of data, uploading of data to distributed web servers, and creating web messages to identify the location of the data. Using distributed web queue messages, images are downloaded by computing resources distributed around the globe. Rigorous experimental evidence is presented verifying the horizontal scalability of the system which has demonstrated a processing rate of 192 Terabytes per day with clear indications that higher processing rates are possible.

P. Doyle
Wed, 11 Feb 15
18/72

Comments: PhD Thesis, Dublin Institute of Technology

|

# Distributed Radio Interferometric Calibration [IMA]

Increasing data volumes delivered by a new generation of radio interferometers require computationally efficient and robust calibration algorithms. In this paper, we propose distributed calibration as a way of improving both computational cost as well as robustness in calibration. We exploit the data parallelism across frequency that is inherent in radio astronomical observations that are recorded as multiple channels at different frequencies. Moreover, we also exploit the smoothness of the variation of calibration parameters across frequency. Data parallelism enables us to distribute the computing load across a network of compute agents. Smoothness in frequency enables us reformulate calibration as a consensus optimization problem. With this formulation, we enable flow of information between compute agents calibrating data at different frequencies, without actually passing the data, and thereby improving robustness. We present simulation results to show the feasibility as well as the advantages of distributed calibration as opposed to conventional calibration.

S. Yatawatta
Wed, 4 Feb 15
48/59

Comments: submitted to MNRAS, low resolution figures

|

# Architecture, implementation and parallelization of the software to search for periodic gravitational wave signals [CL]

The parallelization, design and scalability of the \sky code to search for periodic gravitational waves from rotating neutron stars is discussed. The code is based on an efficient implementation of the F-statistic using the Fast Fourier Transform algorithm. To perform an analysis of data from the advanced LIGO and Virgo gravitational wave detectors’ network, which will start operating in 2015, hundreds of millions of CPU hours will be required – the code utilizing the potential of massively parallel supercomputers is therefore mandatory. We have parallelized the code using the Message Passing Interface standard, implemented a mechanism for combining the searches at different sky-positions and frequency bands into one extremely scalable program. The parallel I/O interface is used to escape bottlenecks, when writing the generated data into file system. This allowed to develop a highly scalable computation code, which would enable the data analysis at large scales on acceptable time scales. Benchmarking of the code on a Cray XE6 system was performed to show efficiency of our parallelization concept and to demonstrate scaling up to 50 thousand cores in parallel.

G. Poghosyan, S. Matta, A. Streit, et. al.
Tue, 3 Feb 15
37/80

Comments: 11 pages, 9 figures. Submitted to Computer Physics Communications

# Montblanc: GPU accelerated Radio Interferometer Measurement Equations in support of Bayesian Inference for Radio Observations [CL]

We present Montblanc, a GPU implementation of the Radio interferometer measurement equation (RIME) in support of the Bayesian inference for radio observations (BIRO) technique. BIRO uses Bayesian inference to select sky models that best match the visibilities observed by a radio interferometer. To accomplish this, BIRO evaluates the RIME multiple times, varying sky model parameters to produce multiple model visibilities. Chi-squared values computed from the model and observed visibilities are used as likelihood values to drive the Bayesian sampling process and select the best sky model.
As most of the elements of the RIME and chi-squared calculation are independent of one another, they are highly amenable to parallel computation. Additionally, Montblanc caters for iterative RIME evaluation to produce multiple chi-squared values. Only modified model parameters are transferred to the GPU between each iteration.
We implemented Montblanc as a Python package based upon NVIDIA’s CUDA architecture. As such, it is easy to extend and implement different pipelines. At present, Montblanc supports point and Gaussian morphologies, but is designed for easy addition of new source profiles. Montblanc’s RIME implementation is performant: On an NVIDIA K40, it is approximately 250 times faster than MeqTrees on a dual hexacore Intel E5{2620v2 CPU. Compared to the OSKAR simulator’s GPU-implemented RIME components it is 7.7 and 12 times faster on the same K40 for single and double-precision oating point respectively. However, OSKAR’s RIME implementation is more general than Montblanc’s BIRO-tailored RIME.
Theoretical analysis of Montblanc’s dominant CUDA kernel suggests that it is memory bound. In practice, profiling shows that is balanced between compute and memory, as much of the data required by the problem is retained in L1 and L2 cache.

S. Perkins, P. Maraism, J. Zwart, et. al.
Mon, 2 Feb 15
21/49

Comments: Submitted to Astronomy and Computing (this http URL). The code is available online at this https URL 26 pages long, with 13 figures, 6 tables and 3 algorithms

# Delivering SKA Science [IMA]

The SKA will be capable of producing a stream of science data products that are Exa-scale in terms of their storage and processing requirements. This Google-scale enterprise is attracting considerable international interest and excitement from within the industrial and academic communities. In this chapter we examine the data flow, storage and processing requirements of a number of key SKA survey science projects to be executed on the baseline SKA1 configuration. Based on a set of conservative assumptions about trends for HPC and storage costs, and the data flow process within the SKA Observatory, it is apparent that survey projects of the scale proposed will potentially drive construction and operations costs beyond the current anticipated SKA1 budget. This implies a sharing of the resources and costs to deliver SKA science between the community and what is contained within the SKA Observatory. A similar situation was apparent to the designers of the LHC more than 10 years ago. We propose that it is time for the SKA project and community to consider the effort and process needed to design and implement a distributed SKA science data system that leans on the lessons of other projects and looks to recent developments in Cloud technologies to ensure an affordable, effective and global achievement of SKA science goals.

P. Quinn, T. Axelrod, I. Bird, et. al.
Fri, 23 Jan 15
19/65

Comments: 27 pages, 14 figures, Conference: Advancing Astrophysics with the Square Kilometre Array June 8-13, 2014 Giardini Naxos, Italy

|

# 24.77 Pflops on a Gravitational Tree-Code to Simulate the Milky Way Galaxy with 18600 GPUs [GA]

We have simulated, for the first time, the long term evolution of the Milky Way Galaxy using 51 billion particles on the Swiss Piz Daint supercomputer with our $N$-body gravitational tree-code Bonsai. Herein, we describe the scientific motivation and numerical algorithms. The Milky Way model was simulated for 6 billion years, during which the bar structure and spiral arms were fully formed. This improves upon previous simulations by using 1000 times more particles, and provides a wealth of new data that can be directly compared with observations. We also report the scalability on both the Swiss Piz Daint and the US ORNL Titan. On Piz Daint the parallel efficiency of Bonsai was above 95%. The highest performance was achieved with a 242 billion particle Milky Way model using 18600 GPUs on Titan, thereby reaching a sustained GPU and application performance of 33.49 Pflops and 24.77 Pflops respectively.

J. Bedorf, E. Gaburov, M. Fujii, et. al.
Wed, 3 Dec 14
31/60

Comments: 12 pages, 4 figures, Published in: ‘Proceeding SC ’14 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis’. Gordon Bell Prize 2014 finalist

# Accelerating Cosmic Microwave Background map-making procedure through preconditioning [CEA]

Estimation of the sky signal from sequences of time ordered data is one of the key steps in Cosmic Microwave Background (CMB) data analysis, commonly referred to as the map-making problem. Some of the most popular and general methods proposed for this problem involve solving generalised least squares (GLS) equations with non-diagonal noise weights given by a block-diagonal matrix with Toeplitz blocks. In this work we study new map-making solvers potentially suitable for applications to the largest anticipated data sets. They are based on iterative conjugate gradient (CG) approaches enhanced with novel, parallel, two-level preconditioners. We apply the proposed solvers to examples of simulated non-polarised and polarised CMB observations, and a set of idealised scanning strategies with sky coverage ranging from nearly a full sky down to small sky patches. We discuss in detail their implementation for massively parallel computational platforms and their performance for a broad range of parameters characterising the simulated data sets. We find that our best new solver can outperform carefully-optimised standard solvers used today by a factor of as much as 5 in terms of the convergence rate and a factor of up to $4$ in terms of the time to solution, and to do so without significantly increasing the memory consumption and the volume of inter-processor communication. The performance of the new algorithms is also found to be more stable and robust, and less dependent on specific characteristics of the analysed data set. We therefore conclude that the proposed approaches are well suited to address successfully challenges posed by new and forthcoming CMB data sets.

M. Szydlarski, L. Grigori and R. Stompor
Thu, 14 Aug 14
36/54

|

# Optimizing performance per watt on GPUs in High Performance Computing: temperature, frequency and voltage effects [IMA]

The magnitude of the real-time digital signal processing challenge attached to large radio astronomical antenna arrays motivates use of high performance computing (HPC) systems. The need for high power efficiency (performance per watt) at remote observatory sites parallels that in HPC broadly, where efficiency is an emerging critical metric. We investigate how the performance per watt of graphics processing units (GPUs) is affected by temperature, core clock frequency and voltage. Our results highlight how the underlying physical processes that govern transistor operation affect power efficiency. In particular, we show experimentally that GPU power consumption grows non-linearly with both temperature and supply voltage, as predicted by physical transistor models. We show lowering GPU supply voltage and increasing clock frequency while maintaining a low die temperature increases the power efficiency of an NVIDIA K20 GPU by up to 37-48% over default settings when running xGPU, a compute-bound code used in radio astronomy. We discuss how temperature-aware power models could be used to reduce power consumption for future HPC installations. Automatic temperature-aware and application-dependent voltage and frequency scaling (T-DVFS and A-DVFS) may provide a mechanism to achieve better power efficiency for a wider range of codes running on GPUs

D. Price, M. Clark, B. Barsdell, et. al.
Thu, 31 Jul 14
12/52

Comments: submitted to Computer Physics Communications

|

# A Framework for HI Spectral Source Finding Using Distributed-Memory Supercomputing [IMA]

The latest generation of radio astronomy interferometers will conduct all sky surveys with data products consisting of petabytes of spectral line data. Traditional approaches to identifying and parameterising the astrophysical sources within this data will not scale to datasets of this magnitude, since the performance of workstations will not keep up with the real-time generation of data. For this reason, it is necessary to employ high performance computing systems consisting of a large number of processors connected by a high-bandwidth network. In order to make use of such supercomputers substantial modifications must be made to serial source finding code. To ease the transition, this work presents the Scalable Source Finder Framework, a framework providing storage access, networking communication and data composition functionality, which can support a wide range of source finding algorithms provided they can be applied to subsets of the entire image. Additionally, the Parallel Gaussian Source Finder was implemented using SSoFF, utilising Gaussian filters, thresholding, and local statistics. PGSF was able to search on a 256GB simulated dataset in under 24 minutes, significantly less than the 8 to 12 hour observation that would generate such a dataset.

S. Westerlund and C. Harris
Mon, 21 Jul 14
54/55

|

# D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database [CL]

Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M)[this http URL] provides a uniform mathematical framework based on associative arrays that encompasses both traditional (i.e., SQL) and non-traditional databases. For non-traditional databases D4M naturally leads to a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset. The D4M 2.0 Schema has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data. The D4M 2.0 Schema is simple, requires minimal parsing, and achieves the highest published Accumulo ingest rates. The benefits of the D4M 2.0 Schema are independent of the D4M interface. Any interface to Accumulo can achieve these benefits by using the D4M 2.0 Schema

J. Kepner, C. Anderson, W. Arcand, et. al.
Wed, 16 Jul 14
41/48

Comments: 6 pages; IEEE HPEC 2013

# GPU accelerated Hybrid Tree Algorithm for Collision-less N-body Simulations [IMA]

We propose a hybrid tree algorithm for reducing calculation and communication cost of collision-less N-body simulations. The concept of our algorithm is that we split interaction force into two parts: hard-force from neighbor particles and soft-force from distant particles, and applying different time integration for the forces. For hard-force calculation, we can efficiently reduce the calculation and communication cost of the parallel tree code because we only need data of neighbor particles for this part. We implement the algorithm on GPU clusters to accelerate force calculation for both hard and soft force. As the result of implementing the algorithm on GPU clusters, we were able to reduce the communication cost and the total execution time to 40% and 80% of that of a normal tree algorithm, respectively. In addition, the reduction factor relative the normal tree algorithm is smaller for large number of processes, and we expect that the execution time can be ultimately reduced down to about 70% of the normal tree algorithm.

T. Watanabe and N. Nakasato
Wed, 25 Jun 14
61/67

Comments: Paper presented at Fifth International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2014)

|

# Achieving 100,000,000 database inserts per second using Accumulo and D4M [CL]

The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications. Accumulo is designed to deliver high performance on unstructured data such as graphs of network data. This paper tests the performance of Accumulo using data from the Graph500 benchmark. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a 216-node cluster running the MIT SuperCloud software stack. A peak performance of over 100,000,000 database inserts per second was achieved which is 100x larger than the highest previously published value for any other database. The performance scales linearly with the number of ingest clients, number of database servers, and data size. The performance was achieved by adapting several supercomputing techniques to this application: distributed arrays, domain decomposition, adaptive load balancing, and single-program-multiple-data programming.

J. Kepner, W. Arcand, D. Bestor, et. al.
Fri, 20 Jun 14
2/48

Comments: 6 pages; to appear in IEEE High Performance Extreme Computing (HPEC) 2014

# Efficient and Scalable Algorithms for Smoothed Particle Hydrodynamics on Hybrid Shared/Distributed-Memory Architectures [CL]

This paper describes a new fast and implicitly parallel approach to neighbour-finding in multi-resolution Smoothed Particle Hydrodynamics (SPH) simulations. This new approach is based on hierarchical cell decompositions and sorted interactions, within a task-based formulation. It is shown to be faster than traditional tree-based codes, and to scale better than domain decomposition-based approaches on hybrid shared/distributed-memory parallel architectures, e.g. clusters of multi-cores, achieving a $40\times$ speedup over the Gadget-2 simulation code.

P. Gonnet
Thu, 10 Apr 14
36/57

# Creating A Galactic Plane Atlas With Amazon Web Services [IMA]

This paper describes by example how astronomers can use cloud-computing resources offered by Amazon Web Services (AWS) to create new datasets at scale. We have created from existing surveys an atlas of the Galactic Plane at 16 wavelengths from 1 {\mu}m to 24 {\mu}m with pixels co-registered at spatial sampling of 1 arcsec. We explain how open source tools support management and operation of a virtual cluster on AWS platforms to process data at scale, and describe the technical issues that users will need to consider, such as optimization of resources, resource costs, and management of virtual machine instances.

Wed, 25 Dec 13
11/23

|

# Calculation of Stochastic Heating and Emissivity of Cosmic Dust Grains with Optimization for the Intel Many Integrated Core Architecture [IMA]

Cosmic dust particles effectively attenuate starlight. Their absorption of starlight produces emission spectra from the near- to far-infrared, which depends on the sizes and properties of the dust grains, and spectrum of the heating radiation field. The near- to mid-infrared is dominated by the emissions by very small grains. Modeling the absorption of starlight by these particles is, however, computationally expensive and a significant bottleneck for self-consistent radiation transport codes treating the heating of dust by stars. In this paper, we summarize the formalism for computing the stochastic emissivity of cosmic dust, which was developed in earlier works, and present a new library HEATCODE implementing this formalism for the calculation for arbitrary grain properties and heating radiation fields. Our library is highly optimized for general-purpose processors with multiple cores and vector instructions, with hierarchical memory cache structure. The HEATCODE library also efficiently runs on co-processor cards implementing the Intel Many Integrated Core (Intel MIC) architecture. We discuss in detail the optimization steps that we took in order to optimize for the Intel MIC architecture, which also significantly benefited the performance of the code on general-purpose processors, and provide code samples and performance benchmarks for each step. The HEATCODE library performance on a single Intel Xeon Phi coprocessor (Intel MIC architecture) is approximately 2 times a general-purpose two-socket multicore processor system with approximately the same nominal power consumption. The library supports heterogeneous calculations employing host processors simultaneously with multiple coprocessors, and can be easily incorporated into existing radiation transport codes.

Wed, 20 Nov 13
18/54

|

# 2HOT: An Improved Parallel Hashed Oct-Tree N-Body Algorithm for Cosmological Simulation [IMA]

We report on improvements made over the past two decades to our adaptive treecode N-body method (HOT). A mathematical and computational approach to the cosmological N-body problem is described, with performance and scalability measured up to 256k ($2^{18}$) processors. We present error analysis and scientific application results from a series of more than ten 69 billion ($4096^3$) particle cosmological simulations, accounting for $4 \times 10^{20}$ floating point operations. These results include the first simulations using the new constraints on the standard model of cosmology from the Planck satellite. Our simulations set a new standard for accuracy and scientific throughput, while meeting or exceeding the computational efficiency of the latest generation of hybrid TreePM N-body methods.

Date added: Fri, 18 Oct 13

|

# Porting Large HPC Applications to GPU Clusters: The Codes GENE and VERTEX

We have developed GPU versions for two major high-performance-computing (HPC) applications originating from two different scientific domains. GENE is a plasma microturbulence code which is employed for simulations of nuclear fusion plasmas. VERTEX is a neutrino-radiation hydrodynamics code for “first principles”-simulations of core-collapse supernova explosions. The codes are considered state of the art in their respective scientific domains, both concerning their scientific scope and functionality as well as the achievable compute performance, in particular parallel scalability on all relevant HPC platforms. GENE and VERTEX were ported by us to HPC cluster architectures with two NVidia Kepler GPUs mounted in each node in addition to two Intel Xeon CPUs of the Sandy Bridge family. On such platforms we achieve up to twofold gains in the overall application performance in the sense of a reduction of the time to solution for a given setup with respect to a pure CPU cluster. The paper describes our basic porting strategies and benchmarking methodology, and details the main algorithmic and technical challenges we faced on the new, heterogeneous architecture.

Date added: Tue, 8 Oct 13