ACM Gordon Bell Finalist · Gordon Bell Prize Finalist #1

A Fast Scalable Implicit Solver for Nonlinear Time-Evolution Earthquake City Problem on Low-Ordered Unstructured Finite Elements with Artificial Intelligence and Transprecision Computing
Tsuyoshi Ichimura, Kohei Fujita, and Takuma Yamaguchi (University of Tokyo); Akira Naruse (Nvidia Corporation); Jack C. Wells (Oak Ridge National Laboratory); Thomas C. Schulthess (Swiss National Supercomputing Centre); Tjerk P. Straatsma and Christopher J. Zimmer (Oak Ridge National Laboratory); Maxime Martinasso (Swiss National Supercomputing Centre); and Kengo Nakajima, Muneo Hori, and Lalith Maddegedara (University of Tokyo)
Abstract: To address problems that occur due to earthquakes in urban areas, we propose a method that utilizes artificial intelligence (AI) and transprecision computing to accelerate a nonlinear dynamic low-order unstructured finite-element solver. The AI is used to improve the convergence of the iterative solver, leading to a 5.56-fold reduction in arithmetic count compared with a standard solver, and FP16-FP21-FP32-FP64 computing is used to accelerate the sparse matrix-vector product kernel, which demonstrated 71.4% peak FP64 performance on Summit. This is 25.3 times faster than a standard solver and 3.99 times faster than the state-of-the-art SC14 Gordon Bell Finalist solver. Furthermore, the proposed solver demonstrated high scalability (88.8% on the K computer and 89.5% on Piz Daint), leading to 14.7% peak FP64 performance on 4,096 nodes of Summit. The proposed approach utilizing AI and FP16 arithmetic has implications for accelerating other implicit solvers used for earthquake city simulations as well as in various other fields.

167-PFlops Deep Learning for Electron Microscopy: From Learning Physics to Atomic Manipulation
Robert M. Patton, J. Travis Johnston, Steven R. Young, Catherine D. Schuman, Don D. March, Thomas E. Potok, Derek C. Rose, Seung-Hwan Lim, Thomas P. Karnowski, Maxim A. Ziatdinov, and Sergei V. Kalinin (Oak Ridge National Laboratory)
Abstract: An artificial intelligence system called MENNDL, which used 25,200 Nvidia Volta GPUs on Oak Ridge National Laboratory's Summit machine, automatically designed an optimal deep learning network to extract structural information from raw atomic-resolution microscopy data. In a few hours, MENNDL creates and evaluates millions of networks using a scalable, parallel, asynchronous genetic algorithm augmented with a support vector machine, automatically finding a deep learning network topology and hyper-parameter set superior to what a human expert can find in months. For electron microscopy, the system furthers the goals of improving our understanding of electron-beam-matter interactions and of real-time image-based feedback, a significant step beyond human capacity toward automated nanofabrication of materials. MENNDL has been scaled to the 4,200 available nodes of Summit, achieving a measured 152.5 PFlops, with an estimated sustained performance of 167 PFlops when the entire machine is available.
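
As a rough, hedged illustration of the kind of evolutionary search loop described in the MENNDL abstract, the Python sketch below evolves a small population of hyper-parameter sets against a toy fitness function. It is a minimal serial stand-in under invented assumptions (the search space, the fitness function, and all names are made up), not MENNDL's parallel, asynchronous, SVM-augmented algorithm.

    import random

    # Hypothetical search space: each individual is one hyper-parameter set.
    SPACE = {"layers": range(2, 12), "filters": [16, 32, 64, 128], "lr": [1e-2, 1e-3, 1e-4]}

    def random_individual():
        return {k: random.choice(list(v)) for k, v in SPACE.items()}

    def fitness(ind):
        # Placeholder: a real system would train and score a candidate network here.
        return -abs(ind["layers"] - 8) - abs(ind["filters"] - 64) / 64 - ind["lr"]

    def mutate(ind):
        child = dict(ind)
        key = random.choice(list(SPACE))
        child[key] = random.choice(list(SPACE[key]))
        return child

    population = [random_individual() for _ in range(20)]
    for _ in range(50):
        population.sort(key=fitness, reverse=True)
        parents = population[:5]                                  # keep the fittest
        population = parents + [mutate(random.choice(parents)) for _ in range(15)]

    print("best hyper-parameters:", max(population, key=fitness))

In a production setting the fitness evaluation (training and scoring a network) dominates the cost, which is why the abstract's massively parallel, asynchronous evaluation of millions of candidates matters.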

Exascale Deep Learning for Climate Analytics
Thorsten Kurth (Lawrence Berkeley National Laboratory), Sean Treichler and Joshua Romero (Nvidia Corporation), Mayur Mudigonda (Lawrence Berkeley National Laboratory), Nathan Luehr and Everett Phillips (Nvidia Corporation), Ankur Mahesh (Lawrence Berkeley National Laboratory), Michael Matheson (Oak Ridge National Laboratory), Jack Deslippe (Lawrence Berkeley National Laboratory), Massimiliano Fatica (Nvidia Corporation), Prabhat (Lawrence Berkeley National Laboratory), and Michael Houston (Nvidia Corporation)
Abstract: We extract pixel-level masks of extreme weather patterns using variants of the Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5,300 P100 GPUs with a sustained throughput of 21.0 PF/s and a parallel efficiency of 79.0%. DeepLabv3+ scales up to 27,360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version of the DeepLabv3+ network achieves a peak and sustained throughput of 1.13 EF/s and 999.0 PF/s, respectively.

ACM Gordon Bell Finalist · Gordon Bell Prize Finalist #2

Simulating the Weak Death of the Neutron in a Femtoscale Universe with Near-Exascale Computing
Evan Berkowitz (Forschungszentrum Juelich); M.A. Clark (Nvidia Corporation); Arjun Gambhir (Lawrence Livermore National Laboratory, Lawrence Berkeley National Laboratory); Ken McElvain (University of California, Berkeley; Lawrence Berkeley National Laboratory); Amy Nicholson (University of North Carolina); Enrico Rinaldi (RIKEN BNL Research Center, Lawrence Berkeley National Laboratory); Pavlos Vranas (Lawrence Livermore National Laboratory, Lawrence Berkeley National Laboratory); André Walker-Loud (Lawrence Berkeley National Laboratory, Lawrence Livermore National Laboratory); Chia Cheng Chang (Lawrence Berkeley National Laboratory, RIKEN); Bálint Joó (Thomas Jefferson National Accelerator Facility); Thorsten Kurth (Lawrence Berkeley National Laboratory); and Kostas Orginos (College of William & Mary, Thomas Jefferson National Accelerator Facility)
Abstract: The fundamental particle theory called Quantum Chromodynamics (QCD) dictates everything about protons and neutrons, from their intrinsic properties to the interactions that bind them into atomic nuclei. Quantities that cannot be fully resolved through experiment, such as the neutron lifetime (whose precise value is important for the existence of the light atomic elements that make the sun shine and life possible), may be understood through numerical solutions to QCD. We directly solve QCD using Lattice Gauge Theory and calculate nuclear observables such as the neutron lifetime. We have developed an improved algorithm that exponentially decreases the time-to-solution, and we applied it on the new CORAL supercomputers, Sierra and Summit. We use run-time autotuning to distribute GPU resources, achieving 20% of peak performance at low node counts. We also developed optimal application mapping through a job manager, which allows CPU and GPU jobs to be interleaved, yielding 15% of peak performance when deployed across large fractions of CORAL.
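
The scaling figures quoted in the two abstracts above (parallel efficiency, sustained throughput, fraction of peak) follow standard definitions. The hedged Python sketch below shows how a weak-scaling parallel efficiency is typically computed from measured throughputs; the numbers in it are invented and are not taken from either paper.

    def parallel_efficiency(base_throughput, base_gpus, large_throughput, large_gpus):
        """Weak-scaling efficiency: achieved throughput over ideal (linear) scaling."""
        ideal = base_throughput * (large_gpus / base_gpus)
        return large_throughput / ideal

    # Illustrative numbers only.
    base = 0.05     # PF/s sustained on a 64-GPU reference run
    large = 2.9     # PF/s sustained on 4,096 GPUs
    print(f"parallel efficiency: {parallel_efficiency(base, 64, large, 4096):.1%}")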

ShenTu: Processing Multi-Trillion Edge Graphs on Millions of Cores in Seconds
Heng Lin (Tsinghua University, Fma Technology); Xiaowei Zhu (Tsinghua University, Qatar Computing Research Institute); Bowen Yu (Tsinghua University); Xiongchao Tang (Tsinghua University, Qatar Computing Research Institute); Wei Xue and Wenguang Chen (Tsinghua University); Lufei Zhang (State Key Laboratory of Mathematical Engineering and Advanced Computing); Torsten Hoefler (ETH Zurich); Xiaosong Ma (Qatar Computing Research Institute); Xin Liu (National Research Centre of Parallel Computer Engineering and Technology); Weimin Zheng (Tsinghua University); and Jingfang Xu (Beijing Sogou Technology Development Company)
Abstract: Graphs are an important abstraction used in many scientific fields. With the magnitude of graph-structured data constantly increasing, effective data analytics requires efficient and scalable graph processing systems. Although HPC systems have long been used for scientific computing, people have only recently started to assess their potential for graph processing, a workload with inherent load imbalance, lack of locality, and access irregularity. We propose ShenTu, the first general-purpose graph processing framework that can efficiently utilize an entire petascale system to process multi-trillion-edge graphs in seconds. ShenTu embodies four key innovations: hardware specializing, supernode routing, on-chip sorting, and degree-aware messaging, which together enable its unprecedented performance and scalability. It can traverse an unprecedented 70-trillion-edge graph in seconds. Furthermore, ShenTu enables the processing of a spam detection problem on a 12-trillion-edge Internet graph, making it possible to identify trustworthy and spam web pages directly at the fine-grained page level.

Attacking the Opioid Epidemic: Determining the Epistatic and Pleiotropic Genetic Architectures for Chronic Pain and Opioid Addiction
Wayne Joubert (Oak Ridge National Laboratory); Deborah Weighill (Oak Ridge National Laboratory, University of Tennessee); David Kainer (Oak Ridge National Laboratory); Sharlee Climer (University of Missouri, St Louis); Amy Justice (Yale University, US Department of Veterans Affairs); Kjiersten Fagnan (Lawrence Berkeley National Laboratory, US Department of Energy Joint Genome Institute); and Daniel Jacobson (Oak Ridge National Laboratory)
Abstract: We describe the CoMet application for large-scale epistatic Genome-Wide Association Studies (eGWAS) and pleiotropy studies. High performance is attained by transforming the underlying vector comparison methods into highly performant generalized distributed dense linear algebra operations. The 2-way and 3-way Proportional Similarity metric and the Custom Correlation Coefficient are implemented using native or adapted GEMM kernels optimized for GPU architectures. By aggressively overlapping communication, data transfer, and computation, high efficiency with respect to single-GPU kernel performance is maintained up to the full Titan and Summit systems. Nearly 300 quadrillion element comparisons per second and over 2.3 mixed-precision ExaOps are reached on Summit through the use of Tensor Core hardware on the Nvidia Volta GPUs. Performance is four to five orders of magnitude beyond the comparable state of the art. CoMet is currently being used in projects ranging from bioenergy to clinical genomics, including work on the genetics of chronic pain and opioid addiction.
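
The core idea behind CoMet's performance, recasting all-pairs vector comparisons as dense matrix products so they map onto GEMM and Tensor Core hardware, can be shown in miniature. The hedged Python sketch below computes all pairwise co-occurrence counts for a few random binary vectors with a single matrix multiply and derives a simple Dice-style similarity from them; it illustrates only the general principle, not CoMet's actual 2-way/3-way Proportional Similarity or Custom Correlation Coefficient kernels.

    import numpy as np

    # Toy data: 6 binary vectors (e.g., allele indicators) of length 1000.
    rng = np.random.default_rng(0)
    X = (rng.random((6, 1000)) < 0.3).astype(np.float32)

    # A single GEMM yields every pairwise co-occurrence count: C[i, j] is the
    # number of positions where vectors i and j are both 1.
    C = X @ X.T
    totals = X.sum(axis=1)

    # A Dice-style similarity derived element-wise from the GEMM result.
    similarity = 2.0 * C / (totals[:, None] + totals[None, :])
    print(similarity.round(3))

At scale, expressing the comparison work as large dense products is what lets it overlap with communication and run near the hardware's arithmetic peak, as the abstract describes.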

Paper · Architectures, Data Analytics, Networks · Next-Generation Networking

Exploiting Idle Resources in a High-Radix Switch for Supplemental Storage (Best Paper Finalist)
Matthias A. Blumrich, Nan Jiang, and Larry R. Dennison (Nvidia Corporation)
Abstract: A general-purpose switch for a high-performance network is usually designed with symmetric ports providing credit-based flow control and error recovery via link-level retransmission. Because port buffers must be sized for the longest links and modern asymmetric network topologies have a wide range of link lengths, we observe that there can be a significant amount of unused buffer memory, particularly in edge switches. We also observe that the tiled architecture used in many high-radix switches contains an abundance of internal bandwidth. We combine these observations to create a new switch architecture that allows ports to stash packets in unused buffers on other ports, accessible via excess internal bandwidth in the tiled switch. We explore this architecture through two use cases: end-to-end resilience and congestion mitigation. We find that stashing is highly effective and does not negatively impact network performance.

Fine-Grained, Multi-Domain Network Resource Abstraction as a Fundamental Primitive to Enable High-Performance, Collaborative Data Sciences
Qiao Xiang (Yale University); J. Jensen Zhang, X. Tony Wang, and Y. Jace Liu (Tongji University); Chin Guok (Lawrence Berkeley National Laboratory); Franck Le (IBM); John MacAuley (Lawrence Berkeley National Laboratory); Harvey Newman (California Institute of Technology); and Y. Richard Yang (Yale University)
Abstract: Multi-domain network resource reservation systems are being deployed, driven by the demand for, and substantial benefits of, providing predictable network resources. However, a major limitation of existing systems is their coarse granularity, due to the participating networks' concern about revealing sensitive information, which can result in substantial inefficiencies. This paper presents Mercator, a novel multi-domain network resource discovery system that provides fine-grained, global network resource information for collaborative sciences. The foundation of Mercator is a resource abstraction through algebraic-expression enumeration (i.e., linear inequalities and equations) as a compact representation of the available bandwidth in multi-domain networks. In addition, we develop an obfuscating protocol to address privacy concerns by ensuring that no participant can associate the algebraic expressions with the corresponding member networks. We also introduce a superset projection technique to increase Mercator's scalability. Finally, we implement Mercator and demonstrate both its efficiency and efficacy through extensive experiments using real topologies and traces.

Light-Weight Protocols for Wire-Speed Ordering
Hans Eberle and Larry Dennison (Nvidia Corporation)
Abstract: We describe light-weight protocols for selective packet ordering in out-of-order networks that carry memory traffic. The protocols are designed for heterogeneous high-performance systems, in particular accelerated systems with endpoints that have few resources available for interfacing with the network.
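
To make the notion of packet ordering at the endpoint concrete, the hedged Python sketch below reorders packets of a single flow at the receive side using sequence numbers and a small reorder buffer. It illustrates only the generic reordering idea that such protocols build on; it is not one of the paper's wire protocols, and the packet contents are invented.

    # Receive-side reordering for one flow: packets may arrive out of order, but
    # are delivered in sequence using sequence numbers and a reorder buffer.
    arrivals = [(2, "C"), (0, "A"), (1, "B"), (4, "E"), (3, "D")]  # (seq, payload)

    def deliver_in_order(arrivals):
        expected, buffer, delivered = 0, {}, []
        for seq, payload in arrivals:
            buffer[seq] = payload
            while expected in buffer:            # drain any contiguous run
                delivered.append(buffer.pop(expected))
                expected += 1
        return delivered

    print("".join(deliver_in_order(arrivals)))   # ABCDE

Selective ordering means paying this buffering cost only for the traffic that actually requires ordering, which matters for accelerator endpoints with few spare resources.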

Paper · Clouds and Distributed Computing, File Systems, I/O, Storage · Data and Storage

SP-Cache: Load-Balanced, Redundancy-Free Cluster Caching with Selective Partition
Yinghao Yu, Renfei Huang, Wei Wang, Jun Zhang, and Khaled Ben Letaief (Hong Kong University of Science and Technology)
Abstract: Data-intensive clusters increasingly employ in-memory solutions to improve I/O performance. However, the routinely observed file popularity skew and load imbalance create hotspots, which significantly degrade the benefits of in-memory solutions. Common approaches to taming load imbalance include copying multiple replicas of hot files and creating parity chunks using storage codes. Yet these techniques either suffer from high memory redundancy or incur non-trivial encoding/decoding overhead. In this paper, we propose a different approach that achieves load balancing without memory redundancy or encoding/decoding overhead. Our solution, termed SP-Cache, selectively partitions files based on their popularity and evenly caches those partitions across the cluster. We develop an efficient algorithm to determine the optimal number of partitions for hot files: too few partitions are incapable of mitigating hotspots, while too many are susceptible to stragglers. EC2 deployment and trace-driven simulations show that, compared with existing solutions, SP-Cache reduces read latencies by up to 40%.

BESPOKV: Application Tailored Scale-Out Key-Value Stores
Ali Anwar (IBM), Yue Cheng (George Mason University), Hai Huang (IBM), Jingoo Han (Virginia Tech), Hyogi Sim (Oak Ridge National Laboratory), Dongyoon Lee (Virginia Tech), Fred Douglis (Perspecta Labs), and Ali R. Butt (Virginia Tech)
Abstract: Enterprise KV stores are not well suited to HPC applications, and meeting HPC application needs entails customization and cumbersome end-to-end KV store design. In this paper we present BESPOKV, an adaptive, extensible, and scale-out KV store framework. BESPOKV decouples the KV store design into a control plane for distributed management and a data plane for the local data store. BESPOKV takes as input a single-server KV store, called a datalet, and transparently enables a scalable and fault-tolerant distributed KV store service. The resulting distributed stores are also adaptive to consistency or topology requirement changes and can be easily extended for new types of services. Experiments show that BESPOKV-enabled distributed KV stores scale horizontally to a large number of nodes and perform comparably to, and sometimes better than, state-of-the-art systems.

Scaling Embedded In Situ Indexing with DeltaFS
Qing Zheng, Charles D. Cranor, Danhao Guo, Gregory R. Ganger, George Amvrosiadis, and Garth A. Gibson (Carnegie Mellon University) and Bradley W. Settlemyer, Gary Grider, and Fan Guo (Los Alamos National Laboratory)
Abstract: Analysis of large-scale simulation output is a core element of scientific inquiry, but analysis queries may experience significant I/O overhead when the data is not structured for efficient retrieval. While in situ processing allows improved time-to-insight for many applications, scaling in situ frameworks to hundreds of thousands of cores can be difficult in practice. The DeltaFS in-situ indexing is a new approach for in situ processing of massive amounts of data to achieve efficient point and small-range queries.
This paper describes the challenges and lessons learned in scaling this in situ processing function to hundreds of thousands of cores. We propose techniques for memory- and bandwidth-efficient scalable all-to-all communication, concurrent indexing, and specialized LSM-Tree formats. Combining these techniques allows DeltaFS to control the cost of in situ processing while maintaining a three-order-of-magnitude query speedup when scaling alongside the popular VPIC particle-in-cell code to 131,072 cores.

Paper · GPUs, Resiliency, State of the Practice, System Software · Resilience

GPU Age-Aware Scheduling to Improve the Reliability of Leadership Jobs on Titan
Christopher Zimmer, Don Maxwell, Stephen McNally, Scott Atchley, and Sudharshan S. Vazhkudai (Oak Ridge National Laboratory)
Abstract: The increasing rate of failures on the Oak Ridge Leadership Computing Facility's (OLCF) Titan supercomputer resulted in the replacement of 50% of its GPUs between 2015 and 2017. The largest jobs, also known as "leadership jobs", continued to experience increased application failures. These jobs contained significant numbers of both low-failure-rate and high-failure-rate GPUs. The impacts of these failures were felt more by leadership jobs due to longer wait times, longer runtimes, and higher charge rates. In this work, we have designed techniques to increase the use of low-failure GPUs in leadership jobs through targeted resource allocation. Our approach employs two complementary techniques, updating both the system ordering and the allocation mechanisms. In simulation, the application of these techniques resulted in a 33% increase in low-failure GPU hours assigned to leadership jobs. Our GPU age-aware scheduling has been used in production on Titan since July 2017.

FlipTracker: Understanding Natural Error Resilience in HPC Applications
Luanzheng Guo and Dong Li (University of California, Merced); Ignacio Laguna (Lawrence Livermore National Laboratory); and Martin Schulz (Technical University of Munich)
Abstract: As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently, applications running on HPC systems need to exhibit resilience to such errors. Previous work has found that, for certain codes, this resilience can come for free, i.e., some applications are naturally resilient, but few works have shown the code patterns (combinations or sequences of computations) that make an application naturally resilient. In this paper, we present FlipTracker, a framework designed to extract these patterns using fine-grained tracking of error propagation and resilience properties, and we use it to present a set of computation patterns that are responsible for making representative HPC applications naturally resilient to errors. This not only enables a deeper understanding of the resilience properties of these codes, but can also guide future application designs toward patterns with natural resilience.

Doomsday: Predicting Which Node Will Fail When on Supercomputers (Best Student Paper Finalist)
Anwesha Das and Frank Mueller (North Carolina State University) and Paul Hargrove, Eric Roman, and Scott Baden (Lawrence Berkeley National Laboratory)
Abstract: Predicting which node will fail and how soon remains a challenge for HPC resilience, yet may pave the way to exploiting proactive remedies before jobs fail.
Distilling anomalous events from noisy raw logs requires substantial effort, not only for scaling up to exascale systems but even on contemporary supercomputer architectures. To this end, we propose a novel phrase extraction mechanism called TBP (time-based phrases) to pinpoint node failures. Our study, based on real system data and statistical machine learning, demonstrates the feasibility of predicting which specific node will fail in Cray systems. TBP achieves recall rates of no less than 83% with lead times of up to 2 minutes. This opens the door to enhancing prediction lead times for supercomputing systems in general, thereby facilitating efficient usage of both computing capacity and power in large-scale production systems.

Paper · Algorithms, Applications, Computational Biology, Scientific Computing · Biology Applications

Extreme Scale De Novo Metagenome Assembly (Best Paper Finalist)
Evangelos Georganas (Intel Corporation) and Rob Egan, Steven Hofmeyr, Eugene Goltsman, Bill Arndt, Andrew Tritt, Aydin Buluc, Leonid Oliker, and Katherine Yelick (Lawrence Berkeley National Laboratory)
Abstract: Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into an accurate representation of the underlying microbiomes' genomes. State-of-the-art tools require large shared-memory machines and cannot handle contemporary metagenome datasets that exceed terabytes in size. In this paper, we introduce the MetaHipMer pipeline, a high-quality and high-performance metagenome assembler that employs an iterative de Bruijn graph approach. MetaHipMer leverages a specialized scaffolding algorithm that produces long scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is end-to-end parallelized using the Unified Parallel C language and therefore can run seamlessly on shared- and distributed-memory systems. Experimental results show that MetaHipMer matches or outperforms state-of-the-art tools in terms of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and is able to assemble previously intractable grand-challenge metagenomes.

Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting
Tony C. Pan (Georgia Institute of Technology, School of Computational Science and Engineering); Sanchit Misra (Intel Corporation, Parallel Computing Lab); and Srinivas Aluru (Georgia Institute of Technology, School of Computational Science and Engineering)
Abstract: High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of fixed-length substrings of DNA sequences called k-mers. Counting k-mers is often accomplished via hashing, and distributed-memory k-mer counting algorithms for large datasets are memory-access and network-communication bound. In this work, we present two optimized distributed parallel hash table techniques that utilize cache-friendly algorithms for local hashing, overlapped communication and computation to hide communication costs, and vectorized hash functions specialized for k-mer and other short key indices.
On 4,096 cores of the NERSC Cori supercomputer, our implementation completed index construction and query on an approximately 1 TB human genome dataset in just 11.8 seconds and 5.8 seconds, demonstrating speedups of 2.06x and 3.7x, respectively, over the previous state-of-the-art distributed-memory k-mer counter.

Redesigning LAMMPS for Petascale and Hundred-Billion-Atom Simulation on Sunway TaihuLight
Xiaohui Duan, Ping Gao, Tingjian Zhang, Meng Zhang, and Weiguo Liu (Shandong University); Wusheng Zhang, Wei Xue, Haohuan Fu, Lin Gan, and Dexun Chen (Tsinghua University); Xiangxu Meng (Shandong University); and Guangwen Yang (Tsinghua University)
Abstract: Large-scale molecular dynamics (MD) simulations on supercomputers play an increasingly important role in many research areas. In this paper, we present our efforts on redesigning the widely used LAMMPS MD simulator for the Sunway TaihuLight supercomputer and its ShenWei many-core architecture (SW26010). The memory constraints of the SW26010 bring a number of new challenges for achieving an efficient MD implementation. To overcome these constraints, we employ four levels of optimization: (1) a hybrid memory update strategy; (2) a software cache strategy; (3) customized transcendental math functions; and (4) full pipeline acceleration. Furthermore, we redesign the code to enable all possible vectorization. Experiments show that our redesigned software on a single SW26010 processor can outperform over 100 E5-2650 cores when running the latest stable release (11Aug17) of LAMMPS. We also achieve a performance of over 2.43 PFlops for a Tersoff simulation when using 16,384 nodes of Sunway TaihuLight.

Paper · Algorithms, Architectures, Data Analytics, Deep Learning, Networks, Scientific Computing, Visualization · Large-Scale Algorithms

Large-Scale Hierarchical K-Means for Heterogeneous Many-Core Supercomputers
Liandeng Li (Tsinghua University; National Supercomputing Center, Wuxi); Teng Yu (University of St Andrews); Wenlai Zhao and Haohuan Fu (Tsinghua University; National Supercomputing Center, Wuxi); Chenyu Wang (University of St Andrews; National Supercomputing Center, Wuxi); Li Tan (Beijing Technology and Business University); Guangwen Yang (Tsinghua University; National Supercomputing Center, Wuxi); and John Thomson (University of St Andrews)
Abstract: This paper presents a novel design and implementation of the k-means clustering algorithm targeting the Sunway TaihuLight supercomputer. We introduce a multi-level parallel partition approach that partitions not only by dataflow and centroid but also by dimension. Our multi-level (nkd) approach unlocks the potential of the hierarchical parallelism in the SW26010 heterogeneous many-core processor and the system architecture of the supercomputer.

TriCore: Parallel Triangle Counting on GPUs
Yang Hu (George Washington University); Hang Liu (University of Massachusetts, Lowell); and H. Howie Huang (George Washington University)
Abstract: Triangle counting enumerates the triangles in a graph by identifying the common neighbors of the two endpoints of every edge. In this work, we present TriCore, a new GPU-based high-performance and scalable triangle counting system that consists of three main techniques. First, we design a binary-search-based counting algorithm that greatly increases both thread parallelism and memory performance.
Second, TriCore exploits a 2-D partition method to distribute the CSR representation across multiple GPUs, combined with a new streaming buffer to load the edge list from outside the GPUs. Third, we develop a dynamic workload management technique to balance the workload across multiple GPUs. Our evaluation demonstrates that TriCore is 22× faster than state-of-the-art parallel triangle counting systems. In addition, TriCore can not only process big graphs that are significantly larger than the memory of one GPU but also achieve a 24× speedup when scaling to 32 GPUs.

Distributed-Memory Hierarchical Compression of Dense SPD Matrices (Best Student Paper Finalist)
Chenhan D. Yu (University of Texas), Severin Reiz (Technical University of Munich), and George Biros (University of Texas)
Abstract: We present a distributed-memory algorithm for the hierarchical compression of SPD matrices. Our method is based on GOFMM, an algorithm that appeared in doi:10.1145/3126908.3126921.

Paper · OpenMP, Performance, Power, Tools · Performance and Energy Analysis

A Parallelism Profiler with What-If Analyses for OpenMP Programs
Nader Boushehrinejadmoradi, Adarsh Yoga, and Santosh Nagarakatte (Rutgers University)
Abstract: This paper proposes OMP-WHIP, a profiler that measures the inherent parallelism of a program for a given input and provides what-if analyses to estimate improvements in parallelism. We propose a novel OpenMP series-parallel graph representation (OSPG) that precisely captures the series-parallel relations induced by various directives between different fragments of the dynamic execution. OMP-WHIP constructs the OSPG and measures the computation performed by each dynamic fragment using hardware performance counters. This series-parallel representation, along with the fine-grained measurement of computation, is a performance model of the program for a given input, which enables computation of inherent parallelism. This novel performance model also enables what-if analyses in which a programmer can estimate improvements in parallelism when bottlenecks are parallelized. We have used OMP-WHIP to identify parallelism bottlenecks in more than forty applications and then designed strategies to improve the speedup in seven applications.

Energy Efficiency Modeling of Parallel Applications
Mark Endrei, Chao Jin, Minh Ngoc Dinh, and David Abramson (University of Queensland); Heidi Poxon and Luiz DeRose (Cray Inc); and Bronis R. de Supinski (Lawrence Livermore National Laboratory)
Abstract: Energy efficiency has become increasingly important in high performance computing (HPC) as power constraints and costs escalate. Workload and system characteristics form a complex optimization search space in which optimal settings for energy efficiency and performance often diverge. Thus, we must identify trade-off options to find the desired balance. We present an innovative statistical model that accurately predicts the Pareto-optimal trade-off options using only user-controllable parameters. Our approach can also tolerate both measurement and model errors. We study model training and validation using several HPC kernels, then with more complex workloads, including AMG and LAMMPS. We can calibrate an accurate model from as few as 12 runs, with a prediction error of less than 10%. Our results identify trade-off options allowing up to a 40% energy efficiency improvement at the cost of under 20% performance loss. For AMG, we reduce the required sample measurement time from 13 hours to 74 minutes.
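
The energy-efficiency study above is about finding Pareto-optimal trade-offs between runtime and energy. As a small, hedged illustration of that concept only (not the paper's statistical prediction model), the Python sketch below filters a set of made-up (runtime, energy) measurements down to their Pareto front.

    # Invented measurements: (threads, frequency_GHz) -> (runtime_s, energy_J).
    configs = {
        (8, 2.0):  (120.0,  9000.0),
        (8, 3.0):  ( 95.0,  9800.0),
        (16, 2.0): ( 70.0,  9500.0),
        (16, 3.0): ( 60.0, 11800.0),
        (32, 3.0): ( 58.0, 15500.0),
    }

    def pareto_front(points):
        """Keep configurations that no other configuration beats on both metrics."""
        front = {}
        for cfg, (t, e) in points.items():
            dominated = any(t2 <= t and e2 <= e and (t2, e2) != (t, e)
                            for t2, e2 in points.values())
            if not dominated:
                front[cfg] = (t, e)
        return front

    for (threads, freq), (t, e) in sorted(pareto_front(configs).items()):
        print(f"threads={threads:2d} freq={freq} GHz  runtime={t:6.1f} s  energy={e:7.0f} J")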

HPL and DGEMM Performance Variability on the Xeon Platinum 8160 Processor
John D. McCalpin (University of Texas, Texas Advanced Computing Center)
Abstract: During initial testing of a large cluster equipped with Xeon Platinum 8160 processors, we observed infrequent but significant performance drops in HPL benchmark results. The variability was seen in both single-node and multi-node runs, with approximately 0.4% of results more than 10% slower than the median. We were able to reproduce this behavior with a single-socket (24-core) DGEMM benchmark. Performance counter analysis of several thousand DGEMM runs showed that increased DRAM read traffic is the primary driver of increased execution time. Increased DRAM traffic in this benchmark is primarily generated by dramatically elevated snoop filter evictions, which arise from the interaction of high-order (physical) address bits with the hash used to map addresses across the 24 coherence agents on the processor. These conflicts (and the associated performance variability) were effectively eliminated for both DGEMM and HPL by using 1 GiB large pages.

Paper · Algorithms, Graph Algorithms, Linear Algebra, Machine Learning, Sparse Computation · Algorithms on Sparse Data

HiCOO: Hierarchical Storage of Sparse Tensors (Best Student Paper Finalist)
Jiajia Li, Jimeng Sun, and Richard Vuduc (Georgia Institute of Technology)
Abstract: This paper proposes a new storage format for sparse tensors, called Hierarchical COOrdinate (HiCOO; pronounced “haiku”). It derives from the coordinate (COO) format, arguably the de facto standard for general sparse tensor storage. HiCOO improves upon COO by compressing the indices in units of sparse tensor blocks, with the goals of preserving the “mode-agnostic” simplicity of COO while reducing the bytes needed to represent the tensor and promoting data locality. We evaluate HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm. This MTTKRP implementation achieves up to 23.0× (6.8× on average) speedup over the COO format and up to 15.6× (3.1× on average) speedup over another state-of-the-art format, compressed sparse fiber (CSF), while using less or comparable storage. When used within CPD, we also observe speedups against COO- and CSF-based implementations.

Distributed Memory Sparse Inverse Covariance Matrix Estimation on High-Performance Computing Architectures
Aryan Eftekhari (University of Lugano), Matthias Bollhöfer (Braunschweig University of Technology), and Olaf Schenk (University of Lugano)
Abstract: We consider the problem of estimating sparse inverse covariance matrices for high-dimensional datasets using the l1-regularized Gaussian maximum likelihood method. This task is particularly challenging as the required computational resources increase superlinearly with the dimensionality of the dataset. We introduce a performant and scalable algorithm that builds on current advances in second-order maximum likelihood methods. The routine leverages the intrinsic parallelism in the linear algebra operations and exploits the underlying sparsity of the problem. The computational bottlenecks are identified and the respective subroutines are parallelized using an MPI-OpenMP approach.
Experiments conducted on a Cray XC50 system at the Swiss National Supercomputing Centre show that, in comparison to state-of-the-art algorithms, the proposed routine provides significant strong-scaling speedup, with ideal scalability up to 128 nodes. The developed framework is used to estimate the sparse inverse covariance matrix of both synthetic and real-world datasets with up to 10 million dimensions.

PruneJuice: Pruning Trillion-Edge Graphs to a Precise Pattern-Matching Solution
Tahsin Reza, Matei Ripeanu, and Nicolas Tripoul (University of British Columbia) and Geoffrey Sanders and Roger Pearce (Lawrence Livermore National Laboratory)
Abstract: Pattern matching is a powerful graph analysis tool. Unfortunately, existing solutions have limited scalability, support only a limited set of search patterns, and/or focus on only a subset of the real-world problems associated with pattern matching. This paper presents a new algorithmic pipeline that: (i) enables highly scalable pattern matching on labeled graphs, (ii) supports arbitrary patterns, (iii) enables trade-offs between precision and time-to-solution (while always selecting all vertices and edges that participate in matches, thus offering 100% recall), and (iv) supports a set of popular data analytics scenarios. We implement our approach on top of HavoqGT and demonstrate its advantages through strong and weak scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) graphs, respectively, at scales (1,024 nodes / 36,864 cores) orders of magnitude larger than used in the past for similar problems.

Paper · Data Analytics, Performance, Programming Systems, Storage, Tools, Visualization · Performance Optimization Studies

Many-Core Graph Workload Analysis
Stijn Eyerman, Wim Heirman, Kristof Du Bois, Joshua B. Fryman, and Ibrahim Hur (Intel Corporation)
Abstract: Graph applications have specific characteristics that are not common in other application domains. In this paper, we analyze multiple graph applications on current multi- and many-core processors and provide conclusions and recommendations for future designs. We provide new insights on executing graph applications on many-core processors.

Lessons Learned from Analyzing Dynamic Promotion for User-Level Threading
Shintaro Iwasaki (University of Tokyo), Abdelhalim Amer (Argonne National Laboratory), Kenjiro Taura (University of Tokyo), and Pavan Balaji (Argonne National Laboratory)
Abstract: A performance vs. practicality trade-off exists between user-level threading techniques. The community has settled mostly on a black-and-white perspective: fully fledged threads assume that suspension is imminent and incur overheads when suspension does not take place, while run-to-completion threads are more lightweight but less practical since they cannot suspend. Gray areas exist, however, whereby threads can start with minimal capabilities and then be dynamically promoted to acquire additional capabilities when needed. This paper investigates the full spectrum of threading techniques from a performance vs. practicality trade-off perspective on modern multicore and many-core systems.
Our results indicate that achieving the best trade-off depends strongly on the suspension likelihood: dynamic promotion is more appropriate when suspension is unlikely and is a solid replacement for run-to-completion, thanks to its lower programming constraints, while fully fledged threads remain the technique of choice when the suspension likelihood is high.

Topology-Aware Space-Shared Co-Analysis of Large-Scale Molecular Dynamics Simulations
Preeti Malakar (Indian Institute of Technology Kanpur); Todd Munson, Christopher Knight, and Venkatram Vishwanath (Argonne National Laboratory); and Michael E. Papka (Argonne National Laboratory, Northern Illinois University)
Abstract: Analysis of scientific simulation data can be executed concurrently with the simulation in either time- or space-shared mode, which mitigates the I/O bottleneck. However, it results in either stalling the simulation to perform the analysis or transferring data for analysis. In this paper, we improve the throughput of space-shared in situ analysis of large-scale simulations through topology-aware mapping and optimal process decomposition. We propose node-interconnect topology-aware process placement for simulation and analysis to reduce data movement time. We also present an integer linear program for optimal 3D decompositions of simulation and analysis processes. We demonstrate our approach using molecular dynamics simulations on the Mira, Cori, and Theta supercomputers. Our mapping schemes, combined with optimal 3D process decomposition and code optimizations, resulted in up to 30% lower execution times for space-shared in situ analysis than the default approach. Our mappings also reduce MPI collective I/O times by 10-40%.

Paper · Networks, Resource Management, Scheduling, State of the Practice, System Software · Resource Management and Interference

RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management
Maxime Martinasso, Miguel Gila, Mauro Bianco, Sadaf R. Alam, Colin McMurtrie, and Thomas C. Schulthess (Swiss National Supercomputing Centre)
Abstract: Leading hybrid and heterogeneous supercomputing systems process hundreds of thousands of jobs using complex scheduling algorithms and parameters. The centers operating these systems aim to achieve higher levels of resource utilization while being restricted by compliance with policy constraints. There is a critical need for a high-fidelity, high-performance tool with familiar interfaces that allows not only tuning and optimization of the operational job scheduler but also exploration of new resource management algorithms. We propose a new methodology and a tool called RM-Replay, which is not a simulator but a fast replay engine for production workloads. Slurm is used as a platform to demonstrate the capabilities of our replay engine.

Evaluation of an Interference-Free Node Allocation Policy on Fat-Tree Clusters
Samuel D. Pollard (University of Oregon) and Nikhil Jain, Stephen Herbein, and Abhinav Bhatele (Lawrence Livermore National Laboratory)
Abstract: Interference between jobs competing for network bandwidth on a fat-tree cluster can cause significant variability and degradation in performance. These performance issues can be mitigated or completely eliminated if the resource allocation policy takes the network topology into account when allocating nodes to jobs.
We implement a fat-tree network topology-aware node allocation policy that allocates isolated partitions to jobs in order to eliminate inter-job interference. We compare the impact of this node allocation policy to a topology-oblivious policy with respect to the execution time of individual jobs with different communication patterns. We also evaluate the cluster's quality of service under both policies using metrics such as system utilization, schedule makespan, and job wait time. The results obtained for production workloads indicate that topology-aware node allocation can provide interference-free execution without negatively impacting the cluster's quality of service.

Mitigating Inter-Job Interference Using Adaptive Flow-Aware Routing (Best Student Paper Finalist)
Staci A. Smith, Clara E. Cromey, and David K. Lowenthal (University of Arizona); Jens Domke (Tokyo Institute of Technology); and Nikhil Jain, Jayaraman J. Thiagarajan, and Abhinav Bhatele (Lawrence Livermore National Laboratory)
Abstract: On most high performance computing platforms, applications share network resources with other jobs running concurrently on the system. Inter-job network interference can have a significant impact on the performance of communication-intensive applications, and no satisfactory solutions yet exist for mitigating this degradation.

Paper · Algorithms, Architectures, Memory, Networks, Parallel Programming Languages, Libraries, and Models, Power, Programming Systems, Scheduling · Task-Based Programming

Dynamic Tracing: Memoization of Task Graphs for Dynamic Task-Based Runtimes
Wonchan Lee (Stanford University), Elliott Slaughter (SLAC National Accelerator Laboratory), Michael Bauer and Sean Treichler (Nvidia Corporation), Todd Warszawski (Stanford University), Michael Garland (Nvidia Corporation), and Alex Aiken (Stanford University)
Abstract: Many recent programming systems for both supercomputing and data center workloads generate task graphs to express computations that run on parallel and distributed machines. Due to the overhead associated with constructing these graphs, the dependence analysis that generates them is often statically computed and memoized, and the resulting graph is executed repeatedly at runtime. However, many applications require a dynamic dependence analysis due to data-dependent behavior, which raises new challenges in capturing and re-executing task graphs at runtime. In this work, we introduce dynamic tracing, a technique to capture a dynamic dependence analysis of a trace that generates a task graph, and replay it. We show that an implementation of dynamic tracing improves strong scaling by an average of 4.9X and up to 7.0X on a suite of already optimized benchmarks.

Runtime-Assisted Cache Coherence Deactivation in Task Parallel Programs
Paul Caheny (Barcelona Supercomputing Center, Polytechnic University of Catalonia); Lluc Alvarez (Barcelona Supercomputing Center); Mateo Valero and Miquel Moretó (Barcelona Supercomputing Center, Polytechnic University of Catalonia); and Marc Casas (Barcelona Supercomputing Center)
Abstract: With increasing core counts, the scalability of directory-based cache coherence has become a challenging problem. To reduce the area and power needs of the directory, recent proposals reduce its size by classifying data as private or shared and disabling coherence for private data. However, existing classification methods suffer from inaccuracies and require complex hardware support with limited scalability.
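
The private/shared classification that the coherence-deactivation abstract above relies on can be illustrated in a few lines. The hedged Python sketch below classifies pages as private (touched by a single task, so coherence tracking could be disabled for them) or shared, from an invented access trace; it shows only the classification idea, not the paper's runtime-assisted hardware mechanism.

    from collections import defaultdict

    # Invented access trace: (task_id, page_address). In a task-parallel runtime,
    # data-dependence annotations would supply this information without tracing.
    accesses = [
        ("T0", 0x1000), ("T0", 0x2000), ("T1", 0x2000),
        ("T1", 0x3000), ("T2", 0x3000), ("T2", 0x4000),
    ]

    tasks_per_page = defaultdict(set)
    for task, page in accesses:
        tasks_per_page[page].add(task)

    # Pages touched by a single task are classified private; directory entries
    # (coherence tracking) are only needed for the shared pages.
    private = {p for p, ts in tasks_per_page.items() if len(ts) == 1}
    shared = set(tasks_per_page) - private
    print("private pages:", sorted(hex(p) for p in private))
    print("shared pages: ", sorted(hex(p) for p in shared))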

A Divide and Conquer Algorithm for DAG Scheduling Under Power Constraints
Gökalp Demirci, Ivana Marincic, and Henry Hoffmann (University of Chicago)
Abstract: We consider the problem of scheduling a parallel computation, represented as a directed acyclic graph (DAG), on a distributed parallel system with a global resource constraint (specifically, a global power budget) and configurable resources that allow a range of different power/performance trade-offs. There is a rich body of literature on the independent problems of (1) scheduling DAGs and (2) scheduling independent applications under resource constraints. Very little, however, is known about the combined problem of scheduling DAGs under resource constraints. We present a novel approximation algorithm using a divide-and-conquer method for minimizing application execution time. We prove that the length of the schedule returned by our algorithm is always within an O(log n) factor of the optimum that can be achieved with selection of configurations for the tasks. We implement and test our algorithm on simulations of real application DAGs. We find that our divide-and-conquer method improves performance by up to 75% compared to greedy scheduling algorithms.

Paper · Architectures, MPI, Networks, Performance, Programming Systems, State of the Practice · MPI Optimization and Characterization

Cooperative Rendezvous Protocols for Improved Performance and Overlap (Best Student Paper Finalist)
S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni, and D. K. Panda (Ohio State University)
Abstract: With the emergence of larger multi-/many-core clusters, the performance of large-message communication is becoming more important. MPI libraries use different rendezvous protocols to perform large-message communication. However, existing rendezvous protocols do not consider the overall communication pattern and do not make optimal use of the sender and receiver CPUs. In this work, we propose a cooperative rendezvous protocol that can provide up to a 2x improvement in intra-node bandwidth and latency for large messages. We also propose a scheme to dynamically choose the best rendezvous protocol for each message based on the communication pattern. Finally, we show how these improvements can increase the overlap of computation with intra-node and inter-node communication, leading to application-level benefits. We evaluate the proposed designs on three different architectures, including Intel Xeon, Knights Landing, and OpenPOWER, with different HPC applications and obtain benefits of up to 19% with Graph500, 16% with CoMD, and 10% with MiniGhost.

Framework for Scalable Intra-Node Collective Operations Using Shared Memory
Surabhi Jain, Rashid Kaleem, Marc Gamell Balmana, Akhil Langer, Dmitry Durnov, Alexander Sannikov, and Maria Garzaran (Intel Corporation)
Abstract: Collective operations are used in MPI programs to express common communication patterns, collective computations, or synchronization. In many collectives, such as barrier or allreduce, the intra-node component of the collective is on the critical path, as the inter-node communication cannot start until the intra-node component has been executed. Thus, with increasing core counts in each node, intra-node optimizations that leverage intra-node shared memory become increasingly important.
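
To illustrate why the intra-node component discussed above benefits from shared memory, the hedged Python sketch below performs a toy two-level allreduce-style sum: threads standing in for ranks on one node combine their values in node-shared state before a per-node leader would take part in the (omitted) inter-node exchange. It is a conceptual stand-in, not an MPI implementation, and all sizes are invented.

    import threading

    # Toy two-level reduction: 8 "ranks" on one node accumulate into node-shared
    # state; after a barrier, each rank reads the node-local sum that a per-node
    # leader would then exchange inter-node (that step is omitted here).
    NODE_RANKS = 8
    node_sum = [0.0]                      # stands in for a shared-memory segment
    lock = threading.Lock()
    barrier = threading.Barrier(NODE_RANKS)

    def rank(value, results):
        with lock:                        # intra-node reduction through shared state
            node_sum[0] += value
        barrier.wait()                    # every local rank has contributed
        results.append(node_sum[0])       # each rank sees the node-local result

    results = []
    threads = [threading.Thread(target=rank, args=(float(i), results))
               for i in range(NODE_RANKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("node-local sum seen by each rank:", results)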

Characterization of MPI Usage on a Production Supercomputer
Sudheer Chunduri, Scott Parker, Pavan Balaji, Kevin Harms, and Kalyan Kumaran (Argonne National Laboratory)
Abstract: MPI is the most prominent programming model used in scientific computing today. Despite its importance, however, how scientific applications use it in production is not well understood, due to the lack of low-overhead profiling tools. We used a lightweight profiling tool, called autoperf, to log the MPI usage characteristics of production applications on a large supercomputing system (Mira) and its corresponding development system (Cetus). Autoperf limits the amount of information that it records in order to keep the overhead to a minimum while still storing enough data to derive useful insights. MPI usage statistics were collected and analyzed for over 100K jobs run within a two-year period. The analysis is intended to provide useful insights for MPI developers and network hardware developers planning their next generation of improvements, and for supercomputing center operators planning their next system procurements.

Paper · GPUs, Memory, NVRAM, Performance, System Software, Tools · Non-Volatile Memory

Runtime Data Management on Non-Volatile Memory-Based Heterogeneous Memory for Task-Parallel Programs
Kai Wu, Jie Ren, and Dong Li (University of California, Merced)
Abstract: Non-volatile memory (NVM) provides a scalable solution to replace DRAM as main memory. Because of the relatively high latency and low bandwidth of NVM (compared with DRAM), NVM is often paired with DRAM to build a heterogeneous main memory system (HMS). Deciding data placement on NVM-based HMS is critical to enabling future NVM-based HPC. In this paper, we study task-parallel programs and introduce a runtime system to address the data placement problem on NVM-based HMS. Leveraging the semantics and execution mode of task-parallel programs, we efficiently characterize the memory access patterns of tasks and reduce data movement overhead. We also introduce a performance model to predict the performance of tasks with various data placements on HMS. Evaluating with a set of HPC benchmarks, we show that our runtime system achieves higher performance than a conventional HMS-oblivious runtime (24% improvement on average) and two state-of-the-art HMS-aware solutions (16% and 11% improvement on average, respectively).

DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access
Pak Markthub (Tokyo Institute of Technology); Mehmet E. Belviranli, Seyong Lee, and Jeffrey S. Vetter (Oak Ridge National Laboratory); and Satoshi Matsuoka (RIKEN, Tokyo Institute of Technology)
Abstract: Heterogeneous computing with accelerators is growing in importance in high performance computing (HPC). Recently, application datasets have expanded beyond the memory capacity of these accelerators, and often beyond the capacity of their hosts. Meanwhile, non-volatile memory (NVM) storage has emerged as a pervasive component in HPC systems because NVM provides massive amounts of memory capacity at affordable cost. Currently, for accelerator applications to use NVM, they must manually orchestrate data movement across multiple memories, and this approach only performs well for applications with simple access behaviors. To address this issue, we developed DRAGON, a solution that enables all classes of GP-GPU applications to transparently compute on terabyte datasets residing in NVM.
DRAGON leverages the page-faulting mechanism of recent NVIDIA GPUs by extending the capabilities of CUDA Unified Memory (UM). Our experimental results show that DRAGON transparently expands memory capacity and obtains additional speedups via automated overlapping of I/O and data transfers.

Siena: Exploring the Design Space of Heterogeneous Memory Systems
Ivy B. Peng and Jeffrey S. Vetter (Oak Ridge National Laboratory)
Abstract: Memory systems are crucial to the performance, power, and cost of high-performance computing systems. Recently, multiple factors have been driving the need for more complex, deep memory hierarchies. However, architects and customers are struggling to design memory systems that effectively balance multiple, often competing, factors in this large, multidimensional, and fast-moving design space. In this paper, we systematically explore the organization of heterogeneous memory systems using a framework called Siena. Siena facilitates quick exploration of memory architectures with flexible configurations of memory systems and realistic memory workloads. We perform a design space exploration of 22 proposed memory systems using eight relevant workloads. Our results show that horizontal organizations of memories can achieve higher performance than vertical organizations when the distribution of memory traffic balances the performance gap between memories. However, coupling effects through shared resources and application behaviors can negate the advantage of high-performance memory in horizontal organizations.

Paper · Algorithms, Applications, Computational Physics, Scientific Computing · Physics and Tensor Applications

Simulating the Wenchuan Earthquake with Accurate Surface Topography on Sunway TaihuLight
Bingwei Chen, Haohuan Fu, Yanwen Wei, and Conghui He (Tsinghua University; National Supercomputing Center, Wuxi); Wenqiang Zhang (University of Science and Technology of China); Yuxuan Li (Tsinghua University; National Supercomputing Center, Wuxi); Wubin Wan and Wei Zhang (National Supercomputing Center, Wuxi); Lin Gan (Tsinghua University; National Supercomputing Center, Wuxi); Wei Zhang and Zhenguo Zhang (Southern University of Science and Technology, China); Guangwen Yang (Tsinghua University; National Supercomputing Center, Wuxi); and Xiaofei Chen (Southern University of Science and Technology, China)
Abstract: This paper reports our efforts on performing a 50-m resolution simulation of the Wenchuan earthquake (Ms 8.0, China) on Sunway TaihuLight. To accurately capture the surface topography, we adopt a curvilinear grid finite-difference method with a traction-image free-surface implementation and redesign the algorithm to reduce memory access costs on heterogeneous many-core architectures. We then derive a performance model of our algorithm to guide and drive further optimization, tuning various parameters using a genetic algorithm. A data layout transformation is also proposed to further improve direct memory access (DMA) efficiency. Our efforts improve the simulation efficiency from 0.05% to 7.6%, achieve a sustained performance of 9.07 PFlops using the entire Sunway TaihuLight machine (over 10 million cores), and enable a large-scale simulation of the Wenchuan earthquake with accurate surface topography and improved coda-wave effects.
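
For readers unfamiliar with the finite-difference kernels at the heart of such earthquake codes, the hedged Python sketch below advances a toy 1-D acoustic wave equation with a second-order explicit stencil. It is a textbook illustration only, far removed from the paper's curvilinear-grid, topography-aware 3-D method, and every parameter in it is invented.

    import numpy as np

    # Toy 1-D acoustic wave equation u_tt = c^2 u_xx, second order in space and time.
    nx, nt = 400, 800
    dx, c = 10.0, 3000.0            # grid spacing (m) and wave speed (m/s), invented
    dt = 0.4 * dx / c               # CFL-stable time step
    coef = (c * dt / dx) ** 2

    u_prev = np.zeros(nx)
    u = np.zeros(nx)
    u[nx // 2] = 1.0                # point disturbance in the middle of the domain

    for _ in range(nt):
        u_next = np.zeros(nx)       # boundaries stay at zero (rigid ends)
        u_next[1:-1] = (2 * u[1:-1] - u_prev[1:-1]
                        + coef * (u[2:] - 2 * u[1:-1] + u[:-2]))
        u_prev, u = u, u_next

    print("peak amplitude after", nt, "steps:", float(np.abs(u).max()))

The production codes discussed above apply the same update pattern over billions of grid points per time step, which is why memory layout and DMA efficiency dominate their optimization story.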

Accelerating Quantum Chemistry with Vectorized and Batched Integrals
Hua Huang and Edmond Chow (Georgia Institute of Technology)
Abstract: This paper presents the first quantum chemistry calculations using a recently developed vectorized library for computing electron repulsion integrals. To lengthen the SIMD loop and thus improve SIMD utilization, the approach used in this paper is to batch together the computation of multiple integrals that have the same code path. The standard approach is to compute integrals one at a time, and thus a batching procedure had to be developed. This paper shows a proof of concept and demonstrates the performance gains possible when the batched approach is used. Batching also enables certain optimizations when the integrals are used to compute the Fock matrix. We further describe several other optimizations that were needed to obtain up to a 270% speedup over the non-batched version of the code, making a compelling case for adopting the presented techniques in quantum chemistry software.

High-Performance Dense Tucker Decomposition on GPU Clusters
Jee Choi (IBM), Xing Liu (Intel Corporation), and Venkatesan Chakaravarthy (IBM)
Abstract: The Tucker decomposition method is one of the most popular algorithms for analyzing and compressing data with multi-way relationships. Its execution time is typically dominated by dense matrix multiplication, which makes it well suited for GPU acceleration. State-of-the-art distributed dense Tucker implementations for CPU clusters adopt multi-dimensional partitioning that optimizes for storage and communication. This, however, leads to smaller matrix dimensions that result in under-utilization of the GPU.

Paper · Clouds and Distributed Computing, Resource Management, Scheduling · Clouds and Distributed Computing

A Reference Architecture for Datacenter Scheduling: Design, Validation, and Experiments
Georgios Andreadis (Delft University of Technology, Vrije University Amsterdam); Laurens Versluis (Vrije University Amsterdam); Fabian Mastenbroek (Delft University of Technology); and Alexandru Iosup (Vrije University Amsterdam, Delft University of Technology)
Abstract: Datacenters act as cloud infrastructure for stakeholders across industry, government, and academia. To meet growing demand yet operate efficiently, datacenter operators employ increasingly sophisticated scheduling systems, mechanisms, and policies. Although many scheduling techniques already exist, relatively little research has gone into the abstraction of the scheduling process itself, hampering the design, tuning, and comparison of existing techniques. In this work, we propose a reference architecture for datacenter schedulers. The architecture follows five design principles: components with clearly distinct responsibilities, grouping of related components where possible, separation of mechanism from policy, scheduling as a complex workflow, and a hierarchical multi-scheduler structure. To demonstrate the validity of the reference architecture, we map state-of-the-art datacenter schedulers to it. We find that scheduler stages are commonly underspecified in peer-reviewed publications. Through trace-based simulation and real-world experiments, we show that underspecification of scheduler stages can lead to significant variations in performance.
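
One of the reference-architecture principles listed above, separation of mechanism from policy, is easy to show in miniature. The hedged Python sketch below keeps the scheduling mechanism fixed while the placement policy is a pluggable function; the host and task structures are invented and are not part of the paper's architecture.

    from dataclasses import dataclass

    # Mechanism vs. policy: `schedule` (the mechanism) is fixed, while the placement
    # policy is any function mapping (task, hosts) to a chosen host or None.
    @dataclass
    class Host:
        name: str
        free_cores: int

    @dataclass
    class Task:
        name: str
        cores: int

    def first_fit(task, hosts):
        return next((h for h in hosts if h.free_cores >= task.cores), None)

    def worst_fit(task, hosts):
        fitting = [h for h in hosts if h.free_cores >= task.cores]
        return max(fitting, key=lambda h: h.free_cores, default=None)

    def schedule(tasks, hosts, policy):
        """Mechanism: walk the queue, ask the policy for a placement, commit it."""
        placements = {}
        for task in tasks:
            host = policy(task, hosts)
            if host is not None:
                host.free_cores -= task.cores
                placements[task.name] = host.name
        return placements

    hosts = [Host("h0", 16), Host("h1", 32)]
    tasks = [Task("t0", 8), Task("t1", 24), Task("t2", 8)]
    print(schedule(tasks, hosts, policy=worst_fit))   # swap in first_fit to change behavior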
Dynamically Negotiating Capacity Between On-Demand and Batch Clusters Feng Liu (University of Minnesota), Kate Keahey (Argonne National Laboratory), Pierre Riteau (University of Chicago), and Jon Weissman (University of Minnesota) Abstract Abstract In the era of rapid experimental expansion, data analysis needs are rapidly outpacing the capabilities of small institutional clusters, and researchers are looking to integrate HPC resources into their workflows. We propose one way of reconciling the on-demand needs of experimental analytics with batch-managed HPC resources within a system that dynamically moves nodes between an on-demand cluster configured with cloud technology (OpenStack) and a traditional HPC cluster managed by a batch scheduler (Torque). We evaluate this system experimentally both with real-life traces representing two years of a specific institutional need and with synthetic traces that capture generalized characteristics of potential batch and on-demand workloads. Our results for the real-life scenario show that our approach could reduce the current investment in on-demand infrastructure by 82% while at the same time improving the mean batch wait time by almost an order of magnitude (8x). A Lightweight Model for Right-Sizing Master-Worker Applications Nathaniel Kremer-Herman, Benjamin Tovar, and Douglas Thain (University of Notre Dame) Abstract Abstract When running a parallel application at scale, a resource provisioning policy should minimize over-commitment (idle resources) and under-commitment (resource contention). However, users seldom know the quantity of resources needed to appropriately execute their application. Even with such knowledge, over- and under-commitment of resources may still occur because the application does not run in isolation. It shares resources such as the network and filesystems. Paper · Performance, Resiliency, Tools, Tech Program Reg Pass Resilience II Lessons Learned from Memory Errors Observed Over the Lifetime of Cielo Scott Levy and Kurt B. Ferreira (Sandia National Laboratories), Nathan DeBardeleben (Los Alamos National Laboratory), Taniya Siddiqua and Vilas Sridharan (Advanced Micro Devices Inc), and Elisabeth Baseman (Los Alamos National Laboratory) Abstract Abstract Maintaining the performance of high-performance computing (HPC) applications as failures increase is a major challenge for next-generation extreme-scale systems. Recent research demonstrates that hardware failures are expected to become more common due to increased component counts, reduced device-feature sizes, and tightly constrained power budgets. Few existing studies, however, have examined failures in the context of the entire lifetime of a single platform. In this paper, we analyze failure data collected over the entire lifetime of Cielo, a leadership-class HPC system. Our analysis reveals several key findings, including: (i) Cielo’s memory (DRAM and SRAM) exhibited no discernible aging effects; (ii) correctable memory faults are not predictive of future uncorrectable memory faults; (iii) developing more comprehensive logging facilities will improve failure analysis on future machines; and (iv) continued advances will be required to ensure current failure mitigation techniques remain a viable option for future platforms.
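The capacity-negotiation abstract above (Liu et al.) describes dynamically moving nodes between an OpenStack-managed on-demand pool and a Torque-managed batch pool. The toy loop below is only a hypothetical threshold policy written to make that idea concrete; it is not the paper's algorithm, and the pool sizes, load samples, and policy logic are invented.

```python
# Hypothetical illustration of dynamic capacity negotiation between an
# on-demand pool and a batch pool. The policy and all numbers are invented.
def rebalance(on_demand_idle, on_demand_queued, batch_queued,
              on_demand_nodes, batch_nodes, step=1):
    """Return new (on_demand_nodes, batch_nodes) after one negotiation step."""
    if on_demand_queued > 0 and batch_nodes > step:
        # On-demand work is waiting: borrow nodes from the batch pool.
        return on_demand_nodes + step, batch_nodes - step
    if on_demand_idle >= step and batch_queued > 0:
        # On-demand nodes sit idle while batch jobs wait: give nodes back.
        return on_demand_nodes - step, batch_nodes + step
    return on_demand_nodes, batch_nodes

# A few simulated hourly load samples: (idle on-demand nodes,
# queued on-demand requests, queued batch jobs).
on_demand, batch = 8, 92
for hour, (idle, od_q, b_q) in enumerate([(0, 3, 5), (0, 1, 9), (4, 0, 12), (6, 0, 2)]):
    on_demand, batch = rebalance(idle, od_q, b_q, on_demand, batch)
    print(f"hour {hour}: on-demand={on_demand} batch={batch}")
```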
Partial Redundancy in HPC Systems with Non-Uniform Node Reliabilities Zaeem Hussain, Taieb Znati, and Rami Melhem (University of Pittsburgh) Abstract Abstract We study the usefulness of partial redundancy in HPC message passing systems where individual node failure distributions are not identical. Prior research on fault tolerance has generally assumed identical failure distributions for the nodes of the system. In such settings, partial replication has never been shown to outperform the two extremes (full replication and no replication) for any significant range of node counts. We argue that partial redundancy may provide the best performance under the more realistic assumption of non-identical node failure distributions. We provide theoretical results on arranging nodes with different reliability values among replicas such that system reliability is maximized. Moreover, using system reliability to compute the MTTI (mean-time-to-interrupt) and expected completion time of a partially replicated system, we numerically determine the optimal partial replication degree. Our results indicate that partial replication can be a more efficient alternative to full replication at system scales where Checkpoint/Restart alone is not sufficient. Evaluating and Accelerating High-Fidelity Error Injection for HPC Chun-Kai Chang, Sangkug Lym, and Nicholas Kelly (University of Texas); Michael B. Sullivan (Nvidia Corporation); and Mattan Erez (University of Texas) Abstract Abstract We address two important concerns in the analysis of the behavior of applications in the presence of hardware errors: (1) when it is important to model how hardware faults lead to erroneous values (instruction-level errors) with high fidelity, as opposed to using simple bit-flipping models, and (2) how to enable fast high-fidelity error injection campaigns, in particular when error detectors are employed. We present and verify a new nested Monte Carlo methodology for evaluating high-fidelity gate-level fault models and error-detector coverage, which is orders of magnitude faster than current approaches. We use that methodology to demonstrate that, without detectors, simple error models suffice for evaluating errors in 9 HPC benchmarks. Paper · Algorithms, Applications, Architectures, Compiler Analysis and Optimization, Floating Point, Performance, Precision, Programming Systems, Tools, Tech Program Reg Pass Arithmetic and Optimization Associative Instruction Reordering to Alleviate Register Pressure Prashant Singh Rawat, Aravind Sukumaran-Rajam, and Atanas Rountev (Ohio State University); Fabrice Rastello (French Institute for Research in Computer Science and Automation (INRIA)); Louis-Noel Pouchet (Colorado State University); and P. Sadayappan (Ohio State University) Abstract Abstract Register allocation is generally considered a practically solved problem. For most applications, the register allocation strategies in production compilers are very effective in controlling the number of loads/stores and register spills. However, existing register allocation strategies are not effective and result in excessive register spilling for computation patterns with a high degree of many-to-many data reuse, e.g., high-order stencils and tensor contractions. We develop a source-to-source instruction reordering strategy that exploits the flexibility of reordering associative operations to alleviate register pressure.
The developed transformation module implements an adaptable strategy that can appropriately control the degree of instruction-level parallelism while relieving register pressure. The effectiveness of the approach is demonstrated through experimental results using multiple production compilers (GCC, Clang/LLVM) and target platforms (Intel Xeon Phi and Intel x86 multi-core). Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers Azzam Haidar (University of Tennessee, Innovative Computing Laboratory); Stan Tomov and Jack Dongarra (University of Tennessee); and Nicholas Higham (University of Manchester, School of Mathematics) Abstract Abstract The use of low-precision arithmetic in computing methods has been a powerful tool to accelerate numerous scientific computing applications, including artificial intelligence. We present an investigation showing that other HPC applications can harness this power too, and in particular, the general HPC problem of solving Ax = b, where A is a large dense matrix and the solution is needed in FP64 accuracy. Our approach is based on the mixed-precision (FP16->FP64) iterative refinement technique: we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations. We show how the use of FP16-TC (Tensor Core) arithmetic can provide up to a 4X speedup and improve energy consumption by a factor of 5, achieving 74 Gflops/Watt. This is due to the performance boost that the FP16 Tensor Cores provide and to their better accuracy, which outperforms classical FP16. ADAPT: Algorithmic Differentiation Applied to Floating-Point Precision Tuning Harshitha Menon (Lawrence Livermore National Laboratory); Michael O. Lam (James Madison University, Lawrence Livermore National Laboratory); and Daniel Osei-Kuffuor, Markus Schordan, Scott Lloyd, Kathryn Mohror, and Jeffrey Hittinger (Lawrence Livermore National Laboratory) Abstract Abstract HPC applications extensively use floating-point arithmetic operations to solve computational problems in various domains. Mixed-precision computing, the use of the lowest-precision data type sufficient to achieve a desired accuracy, has been explored to improve performance and to reduce power consumption and data movement. Manually optimizing a program to use mixed precision is challenging. In this work, we present ADAPT, an approach for mixed-precision analysis of HPC workloads that provides guarantees about the final output error. Our approach uses algorithmic differentiation to accurately estimate the output error of a mixed-precision configuration. ADAPT provides the floating-point precision sensitivity of a program, which highlights regions of the code that can potentially be converted to lower precision and can be used to make algorithmic choices and develop mixed-precision configurations. We evaluate ADAPT on six benchmarks and a proxy application and show that we are able to achieve a speedup of 1.2x on the proxy application, LULESH. Paper · Architectures, Networks, Performance, Scientific Computing, State of the Practice, Tools, Tech Program Reg Pass Large Scale System Deployments The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems Sudharshan S. Vazhkudai (Oak Ridge National Laboratory); Bronis R. de Supinski (Lawrence Livermore National Laboratory); Arthur S. Bland and Al Geist (Oak Ridge National Laboratory); James Sexton and Jim Kahle (IBM); Christopher J. Zimmer, Scott Atchley, Sarp H.
Oral, Don E. Maxwell, and Veronica G. Vergara Larrea (Oak Ridge National Laboratory); Adam Bertsch and Robin Goldstone (Lawrence Livermore National Laboratory); Wayne Joubert (Oak Ridge National Laboratory); Chris Chambreau (Lawrence Livermore National Laboratory); David Appelhans and Robert Blackmore (IBM); Ben Casses (Lawrence Livermore National Laboratory); George Chochia and Gene Davison (IBM); Matthew A. Ezell (Oak Ridge National Laboratory); Tom Gooding (IBM); Elsa Gonsiorowski (Lawrence Livermore National Laboratory); Leopold Grinberg, Bill Hanson, and Bill Hartner (IBM); Ian Karlin and Matthew L. Leininger (Lawrence Livermore National Laboratory); Dustin Leverman (Oak Ridge National Laboratory); Chris Marroquin (IBM); Adam Moody (Lawrence Livermore National Laboratory); Martin Ohmacht (IBM); Ramesh Pankajakshan (Lawrence Livermore National Laboratory); Fernando Pizzano (IBM); James H. Rogers (Oak Ridge National Laboratory); Bryan Rosenburg (IBM); Drew Schmidt, Mallikarjun Shankar, and Feiyi Wang (Oak Ridge National Laboratory); Py Watson (Lawrence Livermore National Laboratory); Bob Walkup (IBM); Lance D. Weems (Lawrence Livermore National Laboratory); and Junqi Yin (Oak Ridge National Laboratory) Abstract Abstract CORAL, the Collaboration of Oak Ridge, Argonne and Livermore, is fielding two similar IBM systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and Sequoia systems. Summit and Sierra are currently ranked No. 1 and No. 3, respectively, on the Top500 list. We discuss the design and key differences of the systems. Our evaluation of the systems highlights the following. Applications that fit in HBM see the most benefit and may prefer more GPUs; however, for some applications, the CPU-GPU bandwidth is more important than the number of GPUs. The node-local burst buffer scales linearly and can achieve a 4X improvement over the parallel file system for large jobs; smaller jobs, however, may benefit from writing directly to the PFS. Finally, several CPU-, network-, and memory-bound analytics codes and GPU-bound deep learning codes achieve up to 11X and 79X speedups per node, respectively, over Titan. Best Practices and Lessons from Deploying and Operating a Sustained-Petascale System: The Blue Waters Experience Gregory H. Bauer, Brett Bode, Jeremy Enos, William T. Kramer, Scott Lathrop, Celso L. Mendes, and Roberto R. Sisneros (University of Illinois, National Center for Supercomputing Applications) Abstract Abstract Building and operating versatile extreme-scale computing systems that work productively for a range of frontier research domains presents many challenges and opportunities. Solutions created, experiences acquired, and lessons learned, while rarely published, could drive the development of new methods and practices and raise the bar for all organizations supporting research, scholarship, and education. This paper describes the methods and procedures developed for deploying, supporting, and continuously improving the Blue Waters system and its services during the last five years. As the first US sustained-petascale computing platform available to the open-science community, the Blue Waters project pioneered various unique practices that we are sharing to be adopted and further improved by the community. We present our support and service methodologies, and the leadership practices employed for ensuring that the system stays highly efficient and productive.
We also provide return-on-investment summaries related to deploying and operating the system. Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu (Tohoku University); Shintaro Momose, Yoko Isobe, Osamu Watanabe, and Akihiro Musa (Tohoku University, NEC Corporation); Mitsuo Yokokawa (Kobe University, NEC Corporation); Toshikazu Aoyama (NEC Corporation); and Masayuki Sato and Hiroaki Kobayashi (Tohoku University) Abstract Abstract A new SX-Aurora TSUBASA vector supercomputer has been released with a new system architecture and a new execution model to achieve high sustained performance, especially for memory-intensive applications. In SX-Aurora TSUBASA, the vector host (VH) of a standard x86 Linux node is attached to the vector engine (VE) of a newly developed vector processor. An application is executed on the VE, and only system calls are offloaded to the VH. This new execution model can avoid redundant data transfers between a VH and a VE that can easily become a bottleneck in the conventional execution model. This paper examines the potential of SX-Aurora TSUBASA. First, the basic performance of SX-Aurora TSUBASA is clarified by evaluating benchmark programs. Then, the effectiveness of the new execution model is examined by using a microbenchmark. Finally, the high potential of SX-Aurora TSUBASA is clarified through evaluations of practical applications. Paper · Applications, Graph Algorithms, Security, Tech Program Reg Pass Graph Algorithms and Systems iSpan: Parallel Identification of Strongly Connected Components with Spanning Trees Yuede Ji (George Washington University); Hang Liu (University of Massachusetts, Lowell); and H. Howie Huang (George Washington University) Abstract Abstract Detecting strongly connected components (SCCs) in a directed graph is crucial for understanding the structure of graphs. Most real-world graphs have one large SCC that contains the majority of the vertices and many small SCCs whose sizes are inversely proportional to the frequency of their occurrence. For both types of SCCs, current approaches that rely on depth- or breadth-first search (DFS or BFS) face the challenges of strict synchronization requirements and high computation costs. In this paper, we advocate a new paradigm of identifying SCCs with simple spanning trees, since SCC detection requires only the knowledge of connectivity among the vertices. We have developed a prototype called iSpan, which consists of parallel, relaxed-synchronization construction of spanning trees for detecting the large and small SCCs. The evaluations show that iSpan is able to significantly outperform current state-of-the-art DFS- and BFS-based methods by an average of 18× and 4×, respectively. Adaptive Anonymization of Data with b-Edge Covers Arif Khan (Pacific Northwest National Laboratory), Krzysztof Choromanski (Google LLC), Alex Pothen and S M Ferdous (Purdue University), and Mahantesh Halappanavar and Antonino Tumeo (Pacific Northwest National Laboratory) Abstract Abstract We explore the problem of sharing data that pertains to individuals with anonymity guarantees, where each user requires a desired level of privacy. We propose the first shared-memory and distributed-memory parallel algorithms for the adaptive anonymity problem that achieve this goal and produce high-quality anonymized datasets.
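Returning to the mixed-precision iterative refinement abstract above (Haidar et al.), the NumPy sketch below shows the FP16->FP64 refinement loop in its simplest form. NumPy has no half-precision solver, so low precision is only emulated here by rounding the matrix to float16 before a float32 solve; the paper's Tensor Core LU factorizations are not attempted.

```python
import numpy as np

def mixed_precision_refine(A, b, iters=10):
    """Iterative refinement: low-precision solves corrected with FP64 residuals.

    Low precision is emulated by rounding A to float16 (NumPy has no FP16 solver);
    the rounded matrix plays the role of the low-precision factorization.
    """
    A64 = A.astype(np.float64)
    b64 = b.astype(np.float64)
    A_lo = A64.astype(np.float16).astype(np.float32)    # emulated FP16 storage
    x = np.linalg.solve(A_lo, b64.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b64 - A64 @ x                                # residual in FP64
        d = np.linalg.solve(A_lo, r.astype(np.float32))  # correction in low precision
        x = x + d.astype(np.float64)
    return x

rng = np.random.default_rng(1)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)          # well-conditioned test matrix
x_true = rng.standard_normal(n)
b = A @ x_true
x = mixed_precision_refine(A, b)
print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

A real implementation factorizes the matrix once in low precision and reuses those factors for every correction solve; re-solving from scratch here merely keeps the sketch short.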
faimGraph: High Performance Management of Fully-Dynamic Graphs Under Tight Memory Constraints on the GPU Martin Winter and Daniel Mlakar (Graz University of Technology); Rhaleb Zayer and Hans-Peter Seidel (Max Planck Institute for Informatics); and Markus Steinberger (Graz University of Technology, Max Planck Institute for Informatics) Abstract Abstract In this paper, we present a fully-dynamic graph data structure for the Graphics Processing Unit (GPU). It delivers high update rates while keeping a low memory footprint using autonomous memory management directly on the GPU. The data structure is fully-dynamic, allowing not only for edge but also vertex updates. Performing the memory management on the GPU allows for fast initialization times and efficient update procedures without additional intervention or reallocation procedures from the host. faimGraph is the first GPU graph framework that fully reclaims unused memory, permitting long-running applications with highly changing graph structures. Performance evaluations show that our approach outperforms the previous state-of-the-art for all types of graph updates. Furthermore, we evaluate algorithmic performance using a PageRank and a Static Triangle Counting (STC) implementation, demonstrating the suitability of the framework even for memory-access-intensive algorithms. Paper · Linear Algebra, Memory, MPI, OpenMP, Programming Systems, Tools, Tech Program Reg Pass Programming Systems Tools Dynamic Data Race Detection for OpenMP Programs Yizi Gu and John Mellor-Crummey (Rice University) Abstract Abstract Two concurrent accesses to a shared variable that are unordered by synchronization are said to be a data race if at least one access is a write. Data races cause shared memory parallel programs to behave unpredictably. This paper describes ROMP -- a tool for detecting data races in executions of scalable parallel applications that employ OpenMP for node-level parallelism. The complexity of OpenMP, which includes primitives for managing data environments, SPMD and SIMD parallelism, work sharing, tasking, mutual exclusion, and ordering, presents a formidable challenge for data race detection. ROMP is a hybrid data race detector that tracks accesses, access orderings, and mutual exclusion. Unlike other OpenMP race detectors, ROMP detects races with respect to logical parallelism rather than implementation threads. Experiments show that ROMP yields precise race reports for a broader set of OpenMP constructs than prior state-of-the-art race detectors. ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism Kazem Cheshmi (University of Toronto), Shoaib Kamil (Adobe Research), Michelle Mills Strout (University of Arizona), and Maryam Mehri Dehnavi (University of Toronto) Abstract Abstract In this work, we describe ParSy, a framework that uses a novel inspection strategy along with a simple code transformation to optimize parallel sparse algorithms for shared memory processors. Unlike existing approaches that can suffer from load imbalance and excessive synchronization, ParSy uses a novel task coarsening strategy to create well-balanced tasks that can execute in parallel, while maintaining locality of memory accesses. Code using the ParSy inspector and transformation outperforms existing highly-optimized sparse matrix algorithms such as Cholesky factorization on multi-core processors, with speedups of 2.8× and 3.1× over the MKL Pardiso and PaStiX libraries, respectively.
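The faimGraph abstract above uses PageRank as one of its evaluation workloads. As a point of reference for what that workload computes (a dense power-iteration PageRank in NumPy, not faimGraph's dynamic GPU data structure), here is a minimal version; real frameworks of course operate on sparse, mutable adjacency structures.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dense adjacency matrix (adj[i, j] = edge i->j)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Column-stochastic transition matrix; dangling nodes spread rank uniformly.
    P = np.where(out_deg[:, None] > 0,
                 adj / np.maximum(out_deg[:, None], 1),
                 1.0 / n).T
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = damping * (P @ r) + (1 - damping) / n
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r

# Small directed example: 0->1, 0->2, 1->2, 2->0, 3->2.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 0],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj).round(3))
```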
Detecting MPI Usage Anomalies via Partial Program Symbolic Execution Fangke Ye, Jisheng Zhao, and Vivek Sarkar (Georgia Institute of Technology) Abstract Abstract MPI is a message-passing-based programming model for distributed-memory parallelism that has been widely used for programming supercomputers for over 25 years. However, debugging and verification of MPI programs are widely recognized to be a deep technical challenge. This challenge is further exacerbated by a recent increase in the use of nonblocking MPI operations that bring new classes of bugs related to data races. Paper · Algorithms, Architectures, GPUs, Linear Algebra, Networks, Resiliency, Tech Program Reg Pass Resilience III: GPUs Optimizing Software-Directed Instruction Replication for GPU Error Detection Abdulrahman Mahmoud (University of Illinois) and Siva Kumar Sastry Hari, Michael B. Sullivan, Timothy Tsai, and Stephen W. Keckler (Nvidia Corporation) Abstract Abstract Application execution on safety-critical and high-performance computer systems must be resilient to transient errors. As GPUs become more pervasive in such systems, they must supplement ECC/parity for major storage structures with reliability techniques that cover more of the GPU hardware logic. Instruction duplication has been explored for CPU resilience; however, it has never been studied in the context of GPUs, and it is unclear whether the performance and design choices it presents make it a feasible GPU solution. This paper describes a practical methodology to employ instruction duplication for GPUs and identifies implementation challenges that can incur high overheads (69% on average). It explores GPU-specific software optimizations that trade fine-grained recoverability for performance. It also proposes simple ISA extensions with limited hardware changes and area costs to further improve performance, cutting the runtime overheads by more than half to an average of 30%. Fault Tolerant One-Sided Matrix Decompositions on Heterogeneous Systems with GPUs Jieyang Chen, Hongbo Li, Sihuan Li, and Xin Liang (University of California, Riverside); Panruo Wu (University of Houston); Dingwen Tao (University of Alabama); Kaiming Ouyang, Yuanlai Liu, and Kai Zhao (University of California, Riverside); Qiang Guan (Kent State University); and Zizhong Chen (University of California, Riverside) Abstract Abstract Current algorithm-based fault tolerance (ABFT) approaches for one-sided matrix decompositions on heterogeneous systems with GPUs have the following limitations: (1) they do not provide sufficient protection, as most of them maintain checksums in only one dimension; (2) their checking scheme is not efficient due to redundant checksum verifications; (3) they fail to protect PCIe communication; and (4) the checksum calculation, based on a special type of matrix multiplication, is far from efficient. By overcoming the above limitations, we design an efficient ABFT approach providing stronger protection for one-sided matrix decomposition methods on heterogeneous systems. First, we provide full matrix protection by using checksums in two dimensions. Second, our checking scheme is more efficient, prioritizing checksum verification according to the sensitivity of matrix operations to soft errors. Third, we protect PCIe communication by reordering checksum verifications and decomposition steps. Fourth, we accelerate the checksum calculation by 1.7x by better utilizing GPUs.
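The ABFT abstract above protects matrix decompositions with checksums in two dimensions. The NumPy sketch below shows the classic checksum idea in its simplest form: append row and column sums, corrupt one entry to emulate a soft error, then use the two checksum residuals to locate and correct it. This is the textbook two-dimensional checksum scheme, not the paper's treatment of one-sided factorizations or PCIe protection.

```python
import numpy as np

def encode(A):
    """Append a row of column sums and a column of row sums (2-D checksums)."""
    col_ck = A.sum(axis=0, keepdims=True)
    full = np.vstack([A, col_ck])
    row_ck = full.sum(axis=1, keepdims=True)
    return np.hstack([full, row_ck])

def detect_and_correct(E, n, m, tol=1e-8):
    """Locate and fix a single corrupted entry of the n x m data block in place."""
    row_res = E[:n, :m].sum(axis=1) - E[:n, m]     # row-checksum residuals
    col_res = E[:n, :m].sum(axis=0) - E[n, :m]     # column-checksum residuals
    bad_rows = np.flatnonzero(np.abs(row_res) > tol)
    bad_cols = np.flatnonzero(np.abs(col_res) > tol)
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        E[i, j] -= row_res[i]                      # subtract the injected error
        return (i, j)
    return None

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 5))
E = encode(A)
E[3, 2] += 7.0                                     # emulate a single soft error
print("corrected at", detect_and_correct(E, 6, 5))
print("recovered:", np.allclose(E[:6, :5], A))
```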
PRISM: Predicting Resilience of GPU Applications Using Statistical Methods Charu Kalra, Fritz Previlon, and Xiangyu Li (Northeastern University); Norman Rubin (Nvidia Corporation); and David Kaeli (Northeastern University) Abstract Abstract As Graphics Processing Units (GPUs) become more pervasive in HPC and safety-critical domains, ensuring that GPU applications can be protected from data corruption grows in importance. Despite prior efforts to mitigate errors, we still lack a clear understanding of how resilient these applications are in the presence of transient faults. Due to the random nature of these faults, predicting whether they will alter the program output is a challenging problem. In this paper, we build a framework named PRISM, which uses a systematic approach to predict failures in GPU programs. PRISM extracts microarchitecture-agnostic features to characterize program resiliency, which serve as predictors in our statistical model. PRISM enables us to predict failures in applications without running exhaustive fault-injection campaigns on a GPU, thereby reducing the error estimation effort. PRISM can also be used to gain insight into potential architectural support required to improve the reliability of GPU applications. Paper · Applications, Cosmology, Data Analytics, Deep Learning, Machine Learning, Programming Systems, Storage, Visualization, Tech Program Reg Pass Deep Learning Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines Randall Pittman, Hui Guan, and Xipeng Shen (North Carolina State University) and Seung-Hwan Lim and Robert M. Patton (Oak Ridge National Laboratory) Abstract Abstract Parallel training of a Deep Neural Network (DNN) ensemble on a cluster of nodes is a common practice for training multiple models in order to construct a model with higher prediction accuracy. Existing ensemble training pipelines can perform many redundant operations, resulting in unnecessary CPU usage or even poor pipeline performance. In order to remove these redundancies, we need pipelines with more communication flexibility than existing DNN frameworks provide. CosmoFlow: Using Deep Learning to Learn the Universe at Scale Amrita Mathuriya (Intel Corporation); Deborah Bard (National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory); Pete Mendygral (Cray Inc); Lawrence Meadows (Intel Corporation); James Arnemann (University of California, Berkeley); Lei Shao (Intel Corporation); Siyu He (Carnegie Mellon University); Tuomas Karna (Intel Corporation); Diana Moise (Cray Inc); Simon J. Pennycook (Intel Corporation); Kristyn Maschhoff (Cray Inc); Jason Sewall and Nalini Kumar (Intel Corporation); Shirley Ho (Lawrence Berkeley National Laboratory, Carnegie Mellon University); Michael F. Ringenburg (Cray Inc); Mr Prabhat (Lawrence Berkeley National Laboratory, National Energy Research Scientific Computing Center (NERSC)); and Victor Lee (Intel Corporation) Abstract Abstract Deep learning is a promising tool to determine the physical model that describes our universe. To handle the considerable computational cost of this problem, we present CosmoFlow: a highly scalable deep learning application built on top of the TensorFlow framework.
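PRISM (above) reasons statistically about whether transient faults alter program output. The small helper below shows, in NumPy, how such a single-bit transient fault is commonly modeled in injection studies: flip one randomly chosen bit of a float64 value and observe the effect. It is a generic illustration of the fault model, not PRISM's feature extraction or prediction method.

```python
import numpy as np

def flip_bit(value, bit):
    """Return float64 `value` with one bit of its IEEE-754 representation flipped."""
    bits = np.array([value], dtype=np.float64).view(np.uint64)
    bits ^= np.uint64(1) << np.uint64(bit)
    return float(bits.view(np.float64)[0])

rng = np.random.default_rng(3)
x = 3.14159
for _ in range(5):
    bit = rng.integers(0, 64)           # uniformly random bit position
    y = flip_bit(x, bit)
    # Low mantissa bits barely matter; exponent or sign bits change the value wildly.
    print(f"bit {bit:2d}: {x} -> {y}")
```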
Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, and Alexander Heinecke (Intel Corporation) Abstract Abstract Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs), which provide state-of-the-art results for tasks like image recognition, neural machine translation, and speech recognition. The computationally expensive nature of a convolution operation has led to the proliferation of implementations, including matrix-matrix multiplication formulations and direct convolution, primarily targeting GPUs. In this paper, we introduce direct convolution kernels for x86 architectures, in particular for Xeon and Xeon Phi systems, which are implemented via a dynamic compilation approach. Our JIT-based implementation shows close to theoretical peak performance, depending on the setting and the CPU architecture at hand. We additionally demonstrate how these JIT-optimized kernels can be integrated into a light-weight multi-node graph execution model. This illustrates that single- and multi-node runs yield high efficiencies and high image throughput when executing state-of-the-art image recognition tasks on CPUs. Paper · Algorithms, Applications, Computational Physics, Scientific Computing, Tech Program Reg Pass Astrophysics Applications Phase Asynchronous AMR Execution for Productive and Performant Astrophysical Flows Muhammad Nufail Farooqi (Koc University); Tan Nguyen, Weiqun Zhang, Ann S. Almgren, and John Shalf (Lawrence Berkeley National Laboratory); and Didem Unat (Koc University) Abstract Abstract Adaptive Mesh Refinement (AMR) is an approach to solving PDEs that reduces the computational and memory requirements at the expense of increased communication. Although adopting asynchronous execution can overcome communication issues, manually restructuring an AMR application to realize asynchrony is extremely complicated and hinders readability and long-term maintainability. To balance performance against productivity, we design a user-friendly API and adopt a phase-asynchronous execution model in which all subgrids at an AMR level can be computed asynchronously. Computing Planetary Interior Normal Modes with a Highly Parallel Polynomial Filtering Eigensolver Jia Shi (Rice University), Ruipeng Li (Lawrence Livermore National Laboratory), Yuanzhe Xi and Yousef Saad (University of Minnesota), and Maarten V. de Hoop (Rice University) Abstract Abstract A highly parallel algorithm has been developed and exploited to compute the planetary normal modes of the elastic-gravitational system, which is approximated via the mixed finite element method on unstructured tetrahedral meshes. The eigenmodes of the relevant generalized eigenvalue problem were extracted by a Lanczos approach combined with polynomial filtering. In contrast with the standard shift-and-invert and the full-mode coupling algorithms, the polynomial filtering technique is ideally suited for solving large-scale 3-D interior eigenvalue problems since it significantly enhances the memory and computational efficiency without loss of accuracy. The parallel efficiency and scalability of this approach are demonstrated on Stampede2 at the Texas Advanced Computing Center.
To our knowledge, this is the first time that the direct calculation of the normal modes of 3-D strongly heterogeneous planets, in particular Earth and Mars, has been made feasible via a combination of multiple matrix-free methods and a separation of the essential spectra. Paper · Architectures, Data Management, File Systems, Networks, State of the Practice, System Software, Workflows, Tech Program Reg Pass File Systems: Data Movement and Provenance Dac-Man: Data Change Management for Scientific Datasets on HPC Systems Devarshi Ghoshal, Lavanya Ramakrishnan, and Deborah Agarwal (Lawrence Berkeley National Laboratory) Abstract Abstract Scientific data is growing rapidly and often changes due to instrument configurations, software updates, or quality assessments. These changes in datasets can result in significant waste of compute and storage resources on HPC systems as downstream pipelines are reprocessed. Data changes need to be detected, tracked, and analyzed to understand the impact of data change, manage data provenance, and make efficient and effective decisions about reprocessing and the use of HPC resources. Existing methods for identifying and capturing change are often manual, domain-specific, and error-prone and do not scale to large scientific datasets. In this paper, we describe the design and implementation of the Dac-Man framework, which identifies, captures, and manages change in large scientific datasets, and enables plug-in of domain-specific change analysis with minimal user effort. Our evaluations show that it can retrieve file changes from directories containing millions of files and terabytes of data in less than a minute. Stacker: An Autonomic Data Movement Engine for Extreme-Scale Data Staging-Based In Situ Workflows Pradeep Subedi, Philip Davis, and Shaohua Duan (Rutgers University); Scott Klasky (Oak Ridge National Laboratory); Hemanth Kolla (Sandia National Laboratories); and Manish Parashar (Rutgers University) Abstract Abstract Data staging and in situ workflows are being explored extensively as an approach to address data-related costs at very large scales. However, the impact of emerging storage architectures (e.g., deep memory hierarchies and burst buffers) upon data staging solutions remains a challenge. In this paper, we investigate how burst buffers can be effectively used by data staging solutions, for example, as a persistent storage tier of the memory hierarchy. Furthermore, we use machine-learning-based prefetching techniques to move data between the storage levels in an autonomous manner. We also present Stacker, a prototype of the proposed solutions implemented within the DataSpaces data staging service, and experimentally evaluate its performance and scalability using the S3D combustion workflow on current leadership-class platforms. Our experiments demonstrate that Stacker achieves low-latency, high-volume data staging with low overhead compared to in-memory staging services for production scientific workflows. A Year in the Life of a Parallel File System Glenn K. Lockwood (Lawrence Berkeley National Laboratory), Shane Snyder (Argonne National Laboratory), Teng Wang and Suren Byna (Lawrence Berkeley National Laboratory), Philip Carns (Argonne National Laboratory), and Nicholas J. Wright (Lawrence Berkeley National Laboratory) Abstract Abstract I/O performance is a critical aspect of data-intensive scientific computing.
We seek to advance the state of the practice in understanding and diagnosing I/O performance issues through investigation of a comprehensive I/O performance data set that captures a full year of production storage activity at two leadership-scale computing facilities. We demonstrate techniques to identify regions of interest, perform focused investigations of both long-term trends and transient anomalies, and uncover the contributing factors that lead to performance fluctuation.
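The year-long file system study above looks for long-term trends and transient anomalies in production I/O telemetry. As a minimal, hedged illustration of the "transient anomaly" part only (not the authors' methodology or data), the snippet below flags days whose throughput deviates strongly from a trailing median, using a median-absolute-deviation threshold on a synthetic series.

```python
import numpy as np

def flag_anomalies(series, window=30, k=5.0):
    """Flag points deviating more than k * MAD from the trailing median."""
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        hist = series[i - window:i]
        med = np.median(hist)
        mad = np.median(np.abs(hist - med)) + 1e-12   # avoid division by zero
        flags[i] = abs(series[i] - med) > k * mad
    return flags

# Synthetic daily throughput (GiB/s) with a slow upward trend and two injected dips.
rng = np.random.default_rng(4)
days = 365
gib_per_s = 50 + 0.02 * np.arange(days) + rng.normal(0, 1.5, days)
gib_per_s[[120, 300]] -= 25                            # transient slowdowns
print("anomalous days:", np.flatnonzero(flag_anomalies(gib_per_s)))
```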