SC18 Proceedings

Tuesday, November 13th

10:30am-12:00pm

Data and Storage

C146

SP-Cache: Load-Balanced, Redundancy-Free Cluster Caching with Selective Partition

BESPOKV: Application Tailored Scale-Out Key-Value Stores

Scaling Embedded In Situ Indexing with DeltaFS

Paper

Clouds and Distributed Computing, File Systems, I/O, Storage, Tech Program Reg Pass

Next-Generation Networking

C140/142

Exploiting Idle Resources in a High-Radix Switch for Supplemental Storage

Fine-Grained, Multi-Domain Network Resource Abstraction as a Fundamental Primitive to Enable High-Performance, Collaborative Data Sciences

Light-Weight Protocols for Wire-Speed Ordering

Paper

Architectures, Data Analytics, Networks, Tech Program Reg Pass

Resilience

C141/143/149

GPU Age-Aware Scheduling to Improve the Reliability of Leadership Jobs on Titan

FlipTracker: Understanding Natural Error Resilience in HPC Applications

Doomsday: Predicting Which Node Will Fail When on Supercomputers

Paper

GPUs, Resiliency, State of the Practice, System Software, Tech Program Reg Pass

1:30pm-3:00pm

Biology Applications

C140/142

Extreme Scale De Novo Metagenome Assembly

Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting

Redesigning LAMMPS for Petascale and Hundred-Billion-Atom Simulation on Sunway TaihuLight

Paper

Algorithms, Applications, Computational Biology, Scientific Computing, Tech Program Reg Pass

Large-Scale Algorithms

C146

Large-Scale Hierarchical K-Means for Heterogeneous Many-Core Supercomputers

TriCore: Parallel Triangle Counting on GPUs

Distributed-Memory Hierarchical Compression of Dense SPD Matrices

Paper

Algorithms, Architectures, Data Analytics, Deep Learning, Networks, Scientific Computing, Visualization, Tech Program Reg Pass

Performance and Energy Analysis

C141/143/149

A Parallelism Profiler with What-If Analyses for OpenMP Programs

Energy Efficiency Modeling of Parallel Applications

HPL and DGEMM Performance Variability on the Xeon Platinum 8160 Processor

Paper

OpenMP, Performance, Power, Tools, Tech Program Reg Pass

3:30pm-5:00pm

Algorithms on Sparse Data

C141/143/149

HiCOO: Hierarchical Storage of Sparse Tensors

Distributed Memory Sparse Inverse Covariance Matrix Estimation on High-Performance Computing Architectures

PruneJuice: Pruning Trillion-Edge Graphs to a Precise Pattern-Matching Solution

Paper

Algorithms, Graph Algorithms, Linear Algebra, Machine Learning, Sparse Computation, Tech Program Reg Pass

Performance Optimization Studies

C146

Many-Core Graph Workload Analysis

Lessons Learned from Analyzing Dynamic Promotion for User-Level Threading

Topology-Aware Space-Shared Co-Analysis of Large-Scale Molecular Dynamics Simulations

Paper

Data Analytics, Performance, Programming Systems, Storage, Tools, Visualization, Tech Program Reg Pass

Resource Management and Interference

C140/142

RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management

Evaluation of an Interference-Free Node Allocation Policy on Fat-Tree Clusters

Mitigating Inter-Job Interference Using Adaptive Flow-Aware Routing

Paper

Networks, Resource Management, Scheduling, State of the Practice, System Software, Tech Program Reg Pass

Wednesday, November 14th

10:30am-12:00pm

MPI Optimization and Characterization

C140/142

Cooperative Rendezvous Protocols for Improved Performance and Overlap

Framework for Scalable Intra-Node Collective Operations Using Shared Memory

Characterization of MPI Usage on a Production Supercomputer

Paper

Architectures, MPI, Networks, Performance, Programming Systems, State of the Practice, Tech Program Reg Pass

Non-Volatile Memory

C141/143/149

Runtime Data Management on Non-Volatile Memory-Based Heterogeneous Memory for Task-Parallel Programs

DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access

Siena: Exploring the Design Space of Heterogeneous Memory Systems

Paper

GPUs, Memory, NVRAM, Performance, System Software, Tools, Tech Program Reg Pass

Task-Based Programming

C146

Dynamic Tracing: Memoization of Task Graphs for Dynamic Task-Based Runtimes

Runtime-Assisted Cache Coherence Deactivation in Task Parallel Programs

A Divide and Conquer Algorithm for DAG Scheduling Under Power Constraints

Paper

Algorithms, Architectures, Memory, Networks, Parallel Programming Languages, Libraries, and Models, Power, Programming Systems, Scheduling, Tech Program Reg Pass

1:30pm-3:00pm

Clouds and Distributed Computing

C141/143/149

A Reference Architecture for Datacenter Scheduling: Design, Validation, and Experiments

Dynamically Negotiating Capacity Between On-Demand and Batch Clusters

A Lightweight Model for Right-Sizing Master-Worker Applications

Paper

Clouds and Distributed Computing, Resource Management, Scheduling, Tech Program Reg Pass

Physics and Tensor Applications

C140/142

Simulating the Wenchuan Earthquake with Accurate Surface Topography on Sunway TaihuLight

Accelerating Quantum Chemistry with Vectorized and Batched Integrals

High-Performance Dense Tucker Decomposition on GPU Clusters

Paper

Algorithms, Applications, Computational Physics, Scientific Computing, Tech Program Reg Pass

Resilience II

C146

Lessons Learned from Memory Errors Observed Over the Lifetime of Cielo

Partial Redundancy in HPC Systems with Non-Uniform Node Reliabilities

Evaluating and Accelerating High-Fidelity Error Injection for HPC

Paper

Performance, Resiliency, Tools, Tech Program Reg Pass

3:30pm-5:00pm

Arithmetic and Optimization

C141/143/149

Associative Instruction Reordering to Alleviate Register Pressure

Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed Up Mixed-Precision Iterative Refinement Solvers

ADAPT: Algorithmic Differentiation Applied to Floating-Point Precision Tuning

Paper

Algorithms, Applications, Architectures, Compiler Analysis and Optimization, Floating Point, Performance, Precision, Programming Systems, Tools, Tech Program Reg Pass

Gordon Bell Prize Finalist #1

A2 Ballroom

A Fast Scalable Implicit Solver for Nonlinear Time-Evolution Earthquake City Problem on Low-Ordered Unstructured Finite Elements with Artificial Intelligence and Transprecision Computing

167-PFlops Deep Learning for Electron Microscopy: From Learning Physics to Atomic Manipulation

Exascale Deep Learning for Climate Analytics

ACM Gordon Bell Finalist

Large Scale System Deployments

C140/142

The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems

Sudharshan S. Vazhkudai (Oak Ridge National Laboratory); Bronis R. de Supinski (Lawrence Livermore National Laboratory); Arthur S. Bland and Al Geist (Oak Ridge National Laboratory); James Sexton and Jim Kahle (IBM); Christopher J. Zimmer, Scott Atchley, Sarp H. Oral, Don E. Maxwell, and Veronica G. Vergara Larrea (Oak Ridge National Laboratory); Adam Bertsch and Robin Goldstone (Lawrence Livermore National Laboratory); Wayne Joubert (Oak Ridge National Laboratory); Chris Chambreau (Lawrence Livermore National Laboratory); David Appelhans and Robert Blackmore (IBM); Ben Casses (Lawrence Livermore National Laboratory); George Chochia and Gene Davison (IBM); Matthew A. Ezell (Oak Ridge National Laboratory); Tom Gooding (IBM); Elsa Gonsiorowski (Lawrence Livermore National Laboratory); Leopold Grinberg, Bill Hanson, and Bill Hartner (IBM); Ian Karlin and Matthew L. Leininger (Lawrence Livermore National Laboratory); Dustin Leverman (Oak Ridge National Laboratory); Chris Marroquin (IBM); Adam Moody (Lawrence Livermore National Laboratory); Martin Ohmacht (IBM); Ramesh Pankajakshan (Lawrence Livermore National Laboratory); Fernando Pizzano (IBM); James H. Rogers (Oak Ridge National Laboratory); Bryan Rosenburg (IBM); Drew Schmidt, Mallikarjun Shankar, and Feiyi Wang (Oak Ridge National Laboratory); Py Watson (Lawrence Livermore National Laboratory); Bob Walkup (IBM); Lance D. Weems (Lawrence Livermore National Laboratory); and Junqi Yin (Oak Ridge National Laboratory)

Abstract

CORAL, the Collaboration of Oak Ridge, Argonne and Livermore, is fielding two similar IBM systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and Sequoia systems. Summit and Sierra are currently ranked No. 1 and No. 3, respectively, on the Top500 list. We discuss the design and key differences of the systems. Our evaluation of the systems highlights the following. Applications that fit in HBM see the most benefit and may prefer more GPUs; however, for some applications, the CPU-GPU bandwidth is more important than the number of GPUs. The node-local burst buffer scales linearly and can achieve a 4X improvement over the parallel file system (PFS) for large jobs; smaller jobs, however, may benefit from writing directly to the PFS. Finally, several CPU-, network-, and memory-bound analytics codes and GPU-bound deep learning codes achieve up to an 11X and 79X speedup per node, respectively, over Titan.

Best Practices and Lessons from Deploying and Operating a Sustained-Petascale System: The Blue Waters Experience

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA

Paper

Architectures, Networks, Performance, Scientific Computing, State of the Practice, Tools, Tech Program Reg Pass

Thursday, November 15th

10:30am-12:00pm

Gordon Bell Prize Finalist #2

A2 Ballroom

Simulating the Weak Death of the Neutron in a Femtoscale Universe with Near-Exascale Computing

Evan Berkowitz (Forschungszentrum Juelich); M.A. Clark (Nvidia Corporation); Arjun Gambhir (Lawrence Livermore National Laboratory, Lawrence Berkeley National Laboratory); Ken McElvain (University of California, Berkeley; Lawrence Berkeley National Laboratory); Amy Nicholson (University of North Carolina); Enrico Rinaldi (RIKEN BNL Research Center, Lawrence Berkeley National Laboratory); Pavlos Vranas (Lawrence Livermore National Laboratory, Lawrence Berkeley National Laboratory); André Walker-Loud (Lawrence Berkeley National Laboratory, Lawrence Livermore National Laboratory); Chia Cheng Chang (Lawrence Berkeley National Laboratory, RIKEN); Bálint Joó (Thomas Jefferson National Accelerator Facility); Thorsten Kurth (Lawrence Berkeley National Laboratory); and Kostas Orginos (College of William & Mary, Thomas Jefferson National Accelerator Facility)

Abstract

The fundamental particle theory called Quantum Chromodynamics (QCD) dictates everything about protons and neutrons, from their intrinsic properties to interactions that bind them into atomic nuclei. Quantities that cannot be fully resolved through experiment, such as the neutron lifetime (whose precise value is important for the existence of light-atomic elements that make the sun shine and life possible), may be understood through numerical solutions to QCD. We directly solve QCD using Lattice Gauge Theory and calculate nuclear observables such as the neutron lifetime. We have developed an improved algorithm that exponentially decreases the time-to-solution and applied it on the new CORAL supercomputers, Sierra and Summit. We use run-time autotuning to distribute GPU resources, achieving 20% performance at low node count. We also developed optimal application mapping through a job manager, which allows CPU and GPU jobs to be interleaved, yielding 15% of peak performance when deployed across large fractions of CORAL.

ShenTu: Processing Multi-Trillion Edge Graphs on Millions of Cores in Seconds

Attacking the Opioid Epidemic: Determining the Epistatic and Pleiotropic Genetic Architectures for Chronic Pain and Opioid Addiction

ACM Gordon Bell Finalist

Graph Algorithms and Systems

C140/142

iSpan: Parallel Identification of Strongly Connected Components with Spanning Trees

Adaptive Anonymization of Data with b-Edge Covers

faimGraph: High Performance Management of Fully-Dynamic Graphs Under Tight Memory Constraints on the GPU

Paper

Applications, Graph Algorithms, Security, Tech Program Reg Pass

Programming Systems Tools

C141/143/149

Dynamic Data Race Detection for OpenMP Programs

ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism

Detecting MPI Usage Anomalies via Partial Program Symbolic Execution

Paper

Linear Algebra, Memory, MPI, OpenMP, Programming Systems, Tools, Tech Program Reg Pass

1:30pm-3:00pm

Deep Learning

C140/142

Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines

CosmoFlow: Using Deep Learning to Learn the Universe at Scale

Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures

Paper

Applications, Cosmology, Data Analytics, Deep Learning, Machine Learning, Programming Systems, Storage, Visualization, Tech Program Reg Pass

Resilience III: GPUs

C141/143/149

Optimizing Software-Directed Instruction Replication for GPU Error Detection

Fault Tolerant One-Sided Matrix Decompositions on Heterogeneous Systems with GPUs

PRISM: Predicting Resilience of GPU Applications Using Statistical Methods

Paper

Algorithms, Architectures, GPUs, Linear Algebra, Networks, Resiliency, Tech Program Reg Pass

3:30pm-5:00pm

Astrophysics Applications

C140/142

Phase Asynchronous AMR Execution for Productive and Performant Astrophysical Flows

Computing Planetary Interior Normal Modes with a Highly Parallel Polynomial Filtering Eigensolver

Paper

Algorithms, Applications, Computational Physics, Scientific Computing, Tech Program Reg Pass

File Systems: Data Movement and Provenance

C141/143/149

Dac-Man: Data Change Management for Scientific Datasets on HPC Systems

Stacker: An Autonomic Data Movement Engine for Extreme-Scale Data Staging-Based In Situ Workflows

A Year in the Life of a Parallel File System

Paper

Architectures, Data Management, File Systems, Networks, State of the Practice, System Software, Workflows, Tech Program Reg Pass