oneAPI DevSummit at SC 2021

November 14, 2021 | 9 a.m.–6:30 p.m. CT

Join us for hands-on tutorials, tech talks, and panels spanning the oneAPI programming model, AI analytics, performance analysis tools, and libraries, with global industry experts from Berkeley, Argonne, NASA, Codeplay, the University of Lisbon, the University of Edinburgh, and more. Get the latest information on the Intel® oneAPI Toolkits since their initial production release in late 2020.

oneAPI Developer Summit at SC21

Agenda

DPC++
AI Analytics / FPGA
Libraries

DPC++

9:00 - 9:15 AM CT

Introduction/Opening

9:15 - 10:00 AM CT

Global Experts on eXtreme Performance Panel

The Intel eXtreme Performance Users Group (IXPUG) will present a brief overview of the organization and its activities, followed by a panel discussion focused on the expected adoption, support, and application of oneAPI at various computing sites around the world. Experts from these sites will discuss ongoing work with oneAPI and plans for its support and application at their sites, in addition to elaborating on the expected impact on their user communities. The session will close with a brief overview of opportunities for involvement in upcoming IXPUG activities.

 

Download Presentation Deck – Thomas Steinke

Download Presentation Deck – David Martin

Presenting

10:30 - 10:45 AM CT Break

10:45 - 11:45 AM CT

Developing for Nvidia GPUs using SYCL with oneAPI

Community support for SYCL is growing, with some of the most powerful supercomputers in the world (including Aurora, Perlmutter, and Frontier) adopting the programming model for cutting-edge research. By migrating your code from CUDA to SYCL you can still target Nvidia GPUs while also gaining the ability to deploy to a wider set of GPUs from other vendors, including Intel and AMD. This hands-on workshop will introduce the basics of setting up your development environment to target Nvidia GPUs with SYCL using oneAPI, and what you need to know to migrate your code from CUDA to SYCL. Find out how to port incrementally by using interoperability with native CUDA kernel code and libraries, and learn the fundamentals needed to get full performance with SYCL. In addition, learn how you can call CUDA libraries such as cuDNN or cuBLAS directly, or via existing SYCL libraries such as oneDNN, using oneAPI for CUDA.
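
For readers new to SYCL, here is a minimal sketch (not part of the workshop material) of a basic SYCL kernel that can also be built for Nvidia GPUs with the DPC++ CUDA backend; the compile command in the comment is illustrative only, and the exact flags depend on your toolchain version.

// Minimal SYCL vector add. With a DPC++ build that includes the CUDA backend it can be
// compiled for Nvidia GPUs, e.g.: clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda vadd.cpp
// (illustrative flags; older toolchains use <CL/sycl.hpp> instead of <sycl/sycl.hpp>).
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
  sycl::queue q{sycl::gpu_selector_v};           // SYCL 2020 selector; older releases use sycl::gpu_selector{}
  constexpr size_t n = 1 << 20;
  float *a = sycl::malloc_shared<float>(n, q);   // USM shared allocations
  float *b = sycl::malloc_shared<float>(n, q);
  float *c = sycl::malloc_shared<float>(n, q);
  for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
     c[i] = a[i] + b[i];                         // same body a CUDA __global__ kernel would have
   }).wait();

  std::printf("c[0] = %f on %s\n", c[0],
              q.get_device().get_info<sycl::info::device::name>().c_str());
  sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
}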

 

Download Presentation Deck

Presenting

11:45 - 12:45 PM CT Lunch

12:45 - 1:15 PM CT

Experience in Moving CUDA-Optimized FUN3D Kernels to Intel GPUs Using Intel oneAPI

This presentation provides an overview of recent efforts to port existing CUDA kernels relevant to unstructured-grid computational fluid dynamics to the oneAPI framework for execution on Intel GPUs. Differences between the programming models are examined and ongoing challenges are discussed.

 

Download Presentation Deck

Presenting

1:15 - 1:45 PM CT

Acceleration of Integrated Circuit Simulation using SYCL and oneAPI

Simulation of integrated circuits consists of solving matrix-based equations. As modern circuits grow in size, the computation time and resources required for a simulation increase significantly. Recent progress in heterogeneous hardware platforms has created an opportunity to increase the efficiency of these simulations. In this project, we demonstrate the acceleration of LU decomposition, the core algorithm in circuit solving, using SYCL and oneAPI on CPUs and GPUs. We present results and discuss the development experience and future opportunities in using SYCL and oneAPI.
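
A rough illustration of the pattern (not the presenters' implementation): a right-looking LU factorization without pivoting can be expressed in SYCL as a loop of data-parallel kernels over the trailing submatrix.

// Sketch only: naive LU without pivoting, run on whatever device the default queue selects.
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
  constexpr size_t n = 512;
  sycl::queue q;
  float *A = sycl::malloc_shared<float>(n * n, q);
  // Diagonally dominant matrix so the factorization is stable without pivoting.
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < n; ++j)
      A[i * n + j] = (i == j) ? float(n) : 1.0f;

  for (size_t k = 0; k < n - 1; ++k) {
    // Scale the k-th column below the diagonal to form the multipliers (the L factor).
    q.parallel_for(sycl::range<1>{n - k - 1}, [=](sycl::id<1> idx) {
       size_t i = k + 1 + idx[0];
       A[i * n + k] /= A[k * n + k];
     }).wait();
    // Rank-1 update of the trailing submatrix (the bulk of the parallel work).
    q.parallel_for(sycl::range<2>{n - k - 1, n - k - 1}, [=](sycl::id<2> idx) {
       size_t i = k + 1 + idx[0], j = k + 1 + idx[1];
       A[i * n + j] -= A[i * n + k] * A[k * n + j];
     }).wait();
  }
  std::printf("U(0,0) = %f\n", A[0]);
  sycl::free(A, q);
}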

 

Download Presentation Deck

Presenting

1:45 - 2:15 PM CT Break

2:15 - 2:45 PM CT

Performance of DPC++ on Representative Structured/Unstructured Mesh

In this session we will give an overview of the performance achieved with DPC++ on Intel server CPUs for MG-CFD, an unstructured-mesh CFD mini-app, and OpenSBLI, a structured-mesh academic CFD code. We will contrast the results with OpenMP implementations and explore key differences and bottlenecks based on VTune and Advisor feedback.

 

Download Presentation Deck

Presenting

2:45 - 3:15 PM CT

Enabling NAMD for Intel Xe

NAMD is a prominent parallel molecular dynamics application designed for high performance computing of large biomolecular systems. This session focuses on the development of NAMD for Intel GPUs using oneAPI/DPC++ by porting the efficient NAMD CUDA implementation and improving it with flexible vectorization for portable performance. We will also discuss the implementation in NAMD of relative debugging techniques across architectures and programming languages.

 

Download Presentation Deck

Presenting

3:15 - 3:30 PM CT Break

3:30 - 4:00 PM CT

Performance portability and evaluation of heterogeneous components of SeisSol targeted at upcoming Intel HPC GPUs

We will present our recent results from integrating the oneAPI programming model into SeisSol, a software package for simulating seismic waves and earthquake dynamics. During the talk, we will demonstrate comparisons of various SeisSol-specific benchmarks compiled and executed with oneAPI, hipSYCL, and CUDA. Finally, we will present the performance of the whole application obtained with two Nvidia RTX 3090 Turbo GPUs using oneAPI compiled with the CUDA backend.

 

Download Presentation Deck

Presenting

4:00 - 4:30 PM CT

Enhancing Online Planning on low-power CPU-GPU SoCs via Bloom Filter Based Memory

This work proposes a new design for online planning for intelligent agents modelled as POMDPs. We introduce an online planner enhanced with a Bloom filter memory, which we implement and evaluate on a low-power CPU+GPU SoC. By using the DPC++ parallel execution model for the most compute-intensive kernel of our Bloom filter implementation, we reduce overall planning time by 3.5x to 7.5x on three representative benchmarks from the POMDP literature. Our preliminary results open new opportunities for using POMDP agents on low-power mobile platforms and in real-time use cases.
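
As a hedged sketch of the general idea (not the authors' code), a Bloom-filter membership test maps naturally onto a single DPC++ parallel_for; the two hash functions below are arbitrary placeholders.

// Sketch: data-parallel Bloom-filter queries. Bit array stored one byte per bit for simplicity.
#include <sycl/sycl.hpp>
#include <cstdint>
#include <cstdio>

static inline uint32_t hash1(uint32_t x) { x ^= x >> 16; x *= 0x7feb352dU; x ^= x >> 15; return x; }
static inline uint32_t hash2(uint32_t x) { x ^= x >> 15; x *= 0x846ca68bU; x ^= x >> 16; return x; }

int main() {
  constexpr uint32_t nbits = 1u << 20;
  constexpr size_t nqueries = 1 << 16;
  sycl::queue q;
  uint8_t  *bits = sycl::malloc_shared<uint8_t>(nbits, q);
  uint32_t *keys = sycl::malloc_shared<uint32_t>(nqueries, q);
  uint8_t  *hit  = sycl::malloc_shared<uint8_t>(nqueries, q);

  for (uint32_t i = 0; i < nbits; ++i) bits[i] = 0;
  for (uint32_t v = 0; v < 2 * nqueries; v += 2) {   // insert the even keys on the host
    bits[hash1(v) % nbits] = 1;
    bits[hash2(v) % nbits] = 1;
  }
  for (size_t i = 0; i < nqueries; ++i) keys[i] = static_cast<uint32_t>(i);

  // The membership test for all query keys runs as one data-parallel kernel.
  q.parallel_for(sycl::range<1>{nqueries}, [=](sycl::id<1> i) {
     uint32_t v = keys[i];
     hit[i] = bits[hash1(v) % nbits] & bits[hash2(v) % nbits];
   }).wait();

  size_t positives = 0;
  for (size_t i = 0; i < nqueries; ++i) positives += hit[i];
  std::printf("%zu of %zu keys possibly present\n", positives, nqueries);
  sycl::free(bits, q); sycl::free(keys, q); sycl::free(hit, q);
}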

 

Download Presentation Deck

Presenting

4:30 - 5:30 PM CT

The oneAPI Software Abstraction for Heterogeneous Computing

oneAPI is a cross-industry, open, standards-based unified programming model. The oneAPI specification extends existing developer programming models to enable a diverse set of hardware through language, a set of library APIs, and a low-level hardware interface to support cross-architecture programming. It builds upon industry standards and provides an open, cross-platform developer stack to improve productivity and innovation. At the core of oneAPI is the DPC++ programming language, which builds on the ISO C++ and Khronos SYCL standards. DPC++ provides explicit parallel constructs and offload interfaces to support a broad range of accelerators. In addition to DPC++, oneAPI also provides libraries for compute- and data-intensive domains, e.g., deep learning, scientific computing, video analytics, and media processing. Finally, a low-level hardware interface defines a set of capabilities and services to allow a language runtime system to effectively utilize a hardware accelerator.
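
A minimal sketch of those constructs, using the SYCL buffer/accessor offload interface (device selection and the reported device name will vary with the installed runtime):

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>

int main() {
  std::vector<int> data(1024, 1);
  sycl::queue q;                                       // default device: CPU, GPU, or other accelerator
  {
    sycl::buffer<int, 1> buf(data.data(), sycl::range<1>(data.size()));
    q.submit([&](sycl::handler &h) {
      sycl::accessor acc(buf, h, sycl::read_write);    // declares the kernel's data dependency
      h.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> i) {
        acc[i] *= 2;                                   // explicit data-parallel construct
      });
    });
  }                                                    // buffer destruction copies results back to 'data'
  std::cout << "data[0] = " << data[0] << " on "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
}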

Presenting

5:30 - 6:30 PM CT

Happy Hour

AI Analytics / FPGA

9:00 - 9:15 AM CT

Introduction/Opening

9:15 - 10:00 AM CT

Global Experts on eXtreme Performance Panel

The Intel eXtreme Performance Users Group (IXPUG) will present a brief overview of the organization and its activities, followed by a panel discussion focused on the expected adoption, support, and application of oneAPI at various computing sites around the world. Experts from these sites will discuss ongoing work with oneAPI and plans for its support and application at their sites, in addition to elaborating on the expected impact on their user communities. The session will close with a brief overview of opportunities for involvement in upcoming IXPUG activities.

 

Download Presentation Deck – Thomas Steinke

Download Presentation Deck – David Martin

Presenting

10:30 - 10:45 AM CT Break

11:45 - 12:45 PM CT Lunch

12:45 - 1:15 PM CT

Spatial DPC++ constructs for algorithm acceleration with FPGAs

Field programmable gate arrays (FPGAs) have gained increasing mindshare as an architecture through which workloads can be accelerated in a power-efficient way, particularly when existing accelerators aren't tuned for or well matched with a workload of interest. They allow a custom architecture to be built for the algorithm of interest without resorting to costly ASIC design, and therefore bridge the gap between a flexible ISA-based architecture that isn't quite the right fit for the workload and a custom ASIC designed specifically for it.
As a reconfigurable spatial architecture, FPGAs allow algorithms to be implemented in a fundamentally different way from instruction set architecture (ISA) accelerators. They can be thought of as implementing an algorithm at the same level of abstraction at which an ISA machine would itself be designed and constructed. To enable productivity compared with conventional chip design languages and tools, high-level languages including SYCL (and DPC++, which extends SYCL) have been made available to program FPGAs.
This talk will give an overview of the most significant language features on top of SYCL that simplify the expression of spatial constructs, and will depict how these constructs map to a device. Specifically, parallel execution of kernels with pairwise independent forward progress will be described, as will the variety of communication mechanisms enabled by the pipes feature. Memory-system tuning controls will be covered, and the recommended report-driven development methodology will be described.

The audience will leave this talk with an overview of the spatial language constructs exposed in SYCL and DPC++, and how to reason about them. Attending will make it easier to get started adapting algorithms for spatial accelerator implementation, both in terms of initial implementation and optimization.
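
A minimal pipes sketch (assuming the Intel FPGA extensions that ship with the oneAPI FPGA add-on; header and selector names vary slightly between releases, and the pipe capacity here is arbitrary):

// Two single_task kernels connected by a pipe; intended for the FPGA emulator or FPGA hardware.
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <cstdio>

constexpr int kCount = 16;
// A pipe is identified by a type; the capacity equals kCount so the producer never blocks.
using DataPipe = sycl::ext::intel::pipe<class DataPipeID, int, kCount>;

int main() {
  // Recent releases expose sycl::ext::intel::fpga_emulator_selector_v; substitute the
  // selector name your oneAPI version provides.
  sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};
  int *out = sycl::malloc_shared<int>(kCount, q);

  q.single_task<class Producer>([=] {                  // streams values into the pipe
    for (int i = 0; i < kCount; ++i) DataPipe::write(i);
  });
  q.single_task<class Consumer>([=] {                  // reads them back; kernels progress independently
    for (int i = 0; i < kCount; ++i) out[i] = DataPipe::read() * 2;
  });
  q.wait();

  std::printf("out[%d] = %d\n", kCount - 1, out[kCount - 1]);
  sycl::free(out, q);
}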

 

Download Presentation Deck

Presenting

1:15 - 1:45 PM CT

oneAPI AI Analytics – End to End

Learn how to use an end-to-end machine learning platform to build and deploy Intel AI models at scale. Bridge science and engineering teams in a clear and collaborative machine learning management environment in which they can communicate and reproduce results with interactive workspaces, dashboards, dataset organization, experiment tracking and visualization, a model repository, and an API to consume them. All of this is made possible by cnvrg.io, a unique open-source platform, and the Intel AI Analytics Toolkit, both of which we will illustrate during the session.

 

Download Presentation Deck

Presenting

1:45 - 2:15 PM CT Break

2:15 - 2:45 PM CT

Using the Arhat framework with the Intel® oneDNN library and OpenVINO™ toolkit for object detection applications

Arhat is a cross-platform deep learning framework that converts neural network descriptions into lean, standalone executable code. This approach offers significant benefits thanks to a simple and straightforward deployment process.
Arhat is integrated with the Intel oneAPI deep learning libraries. The Arhat backend for Intel hardware generates C++ code that directly calls the oneDNN API. Furthermore, Arhat provides a module that consumes models produced by the OpenVINO Model Optimizer.
I will present recent case studies on using Arhat to build object detection applications on Intel CPU and GPU hardware. These studies cover models from the OpenVINO Model Zoo as well as models from the Detectron2 library.

 

Download Presentation Deck

Presenting

2:45 - 3:15 PM CT

The Great CEED Bake-off: DPC++ Edition

The CEED Bake-off Problems are a collection of benchmarks representing important compute-intensive kernels and solvers relevant to high-order finite and spectral element methods, such as those used in the Nek5000 CFD code. In this talk we present a DPC++ implementation of the CEED Bake-off Problems. Benchmark results are given for Intel CPUs and GPUs. Intel Advisor is used to conduct cache-aware roofline analysis and understand performance. Finally, batched routines from the Intel oneMKL BLAS-like extensions are explored as a replacement for certain directly programmed DPC++ kernels.

 

Download Presentation Deck

Presenting

3:15 - 3:30 PM CT Break

3:30 - 3:45 PM CT

Accelerating Deep Learning with Intel Extension for PyTorch: a MedMNIST Classification Decathlon example

We showcase how to use the Intel Extension for PyTorch (IPEX) for training and inference on the MedMNIST datasets, a collection of 10 MNIST-like open datasets covering various medical imaging classification tasks such as pathology images, chest X-rays, and OCT images. The demo runs on Ice Lake processors in the Intel DevCloud for oneAPI. We compare performance with stock PyTorch and observe the performance gains that the Intel Extension for PyTorch offers.

 

Download Presentation Deck

Presenting

3:45 - 4:00 PM CT

Inference with ArrayFire and oneAPI​

This session will demonstrate a simple ML inference pipeline using the OpenCL interoperability of oneAPI. The ArrayFire library and the derivative Flashlight project will be introduced and used as motivating examples. Data will flow from the oneAPI Video Processing Library into these existing libraries as an example of integrating oneAPI with existing GPU codebases.

Download Presentation Deck

Presenting

4:00 - 4:30 PM CT

Edge Intelligence and Its Application in CAVs

The proliferation of the Internet of Things and the success of rich cloud services have pushed the horizon of a new computing paradigm, edge computing, which calls for processing data at the edge of the network. Edge computing has the potential to address concerns around response-time requirements, battery-life constraints, bandwidth cost savings, and data safety and privacy. In this talk, Prof. Weisong Shi will discuss the state of the art of edge computing, followed by the emergence of edge intelligence (EI) and how EI applies to connected and autonomous vehicles (CAVs).
Meanwhile, real-world applications usually call for multiple DNN models to collaborate on heterogeneous edge platforms to complete complicated tasks with outstanding performance. However, due to the explosive growth in computational requirements, model sizes, the number of models involved, and the number of participating devices, a fundamental question underlying most practical applications urgently needs to be answered: how can these collaborative models be deployed and executed concurrently and efficiently on heterogeneous edge devices with different deployment constraints? In this talk, Yongtao Yao will also introduce solutions for scheduling multiple collaborative DNNs on a group of heterogeneous edge devices with the goal of reducing overall latency.

 

Download Presentation Deck

Presenting

4:30 - 5:30 PM CT

The oneAPI Software Abstraction for Heterogeneous Computing

oneAPI is a cross-industry, open, standards-based unified programming model. The oneAPI specification extends existing developer programming models to enable a diverse set of hardware through language, a set of library APIs, and a low-level hardware interface to support cross-architecture programming. It builds upon industry standards and provides an open, cross-platform developer stack to improve productivity and innovation. At the core of oneAPI is the DPC++ programming language, which builds on the ISO C++ and Khronos SYCL standards. DPC++ provides explicit parallel constructs and offload interfaces to support a broad range of accelerators. In addition to DPC++, oneAPI also provides libraries for compute- and data-intensive domains, e.g., deep learning, scientific computing, video analytics, and media processing. Finally, a low-level hardware interface defines a set of capabilities and services to allow a language runtime system to effectively utilize a hardware accelerator.

Presenting

Libraries

9:00 - 9:15 AM CT

Introduction/Opening

9:15 - 10:00 AM CT

Global Experts on eXtreme Performance Panel

The Intel eXtreme Performance Users Group (IXPUG) will present a brief overview of the organization and its activities, followed by a panel discussion focused on the expected adoption, support, and application of oneAPI at various computing sites around the world. Experts from these sites will discuss ongoing work with oneAPI and plans for its support and application at their sites, in addition to elaborating on the expected impact on their user communities. The session will close with a brief overview of opportunities for involvement in upcoming IXPUG activities.

 

Download Presentation Deck – Thomas Steinke

Download Presentation Deck – David Martin

Presenting

10:30 - 10:45 AM CT Break

10:45 - 11:45 AM CT

Multi-GPU Programming - Scale-Up and Scale-Out Made Easy Using the Intel MPI Library

For shared-memory programming of GPGPU systems, users either have to manually map their domain decomposition onto the available GPUs and GPU tiles, or leverage implicit scaling mechanisms that transparently scale their offload code across multiple GPU tiles. The former approach can be cumbersome, and the latter is not always the best performing one.
The Intel MPI Library can take that burden off users by enabling them to program for just a single GPU or tile and leave the distribution to the library. This can make HPC/GPU programming much easier.
To that end, Intel MPI not only allows individual MPI ranks to be pinned to individual GPUs or tiles, but also lets users pass GPU memory pointers directly to the library.
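
A hedged sketch of what that looks like in practice, assuming an Intel MPI installation with GPU support enabled (the I_MPI_OFFLOAD=1 run-time setting in the comment is an assumption about the configuration, not something stated in the abstract):

// Rank 0 fills a device-resident buffer on its GPU and hands the GPU pointer straight to MPI.
// Assumes GPU buffer support is enabled in Intel MPI (e.g. I_MPI_OFFLOAD=1); this is an assumption.
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  sycl::queue q{sycl::gpu_selector_v};
  constexpr int n = 1024;
  int *buf = sycl::malloc_device<int>(n, q);           // GPU memory, never staged through the host here

  if (rank == 0) {
    q.fill(buf, 42, n).wait();                         // produce data on the GPU
    MPI_Send(buf, n, MPI_INT, 1, 0, MPI_COMM_WORLD);   // device pointer passed directly to the library
  } else if (rank == 1) {
    MPI_Recv(buf, n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    int first = 0;
    q.memcpy(&first, buf, sizeof(int)).wait();         // fetch one element for a quick sanity check
    std::printf("rank 1 received %d\n", first);
  }

  sycl::free(buf, q);
  MPI_Finalize();
}

Launched, for example, with mpirun -n 2 on a node where each rank can see a GPU.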

 

Download Presentation Deck

Presenting

11:45 - 12:45 PM CT Lunch

3:30 - 4:00 PM CT

Accelerating epistasis detection on Intel CPUs and discrete GPUs with Intel® Advisor

In this tutorial, we will introduce the Cache-Aware Roofline Model (CARM) and present its basic principles for modelling the performance upper bounds of Intel CPU and GPU devices. For this purpose, we will rely on epistasis detection, an important application in bioinformatics, as a case study. Using DPC++ to deploy the application on the Intel Iris Xe MAX (DG1) GPU, we will show how the Intel® Advisor CARM can be used to detect execution bottlenecks and provide useful hints on which types of optimizations to apply in order to fully exploit both CPU and GPU capabilities, and how it is also actionable for designing hybrid CPU+GPU codes.

Using oneAPI tools and Intel Advisor, we can obtain essential insights to improve our DPC++ bioinformatics application and achieve speedups on Intel devices.

 

Download Presentation Deck

Presenting

1:15 - 1:45 PM CT

Visual Analysis Challenges in the Age of Data

Ninety percent of all data in the world has been created in the past two years alone, at a rate of exabytes per day. New data of all kinds — structured, unstructured, quantitative, qualitative, spatial, and temporal — is growing exponentially and in every way. Given the vast amount of data being produced, one of our greatest scientific challenges is to effectively understand and make use of it. In this talk, I will present recent visual analysis research and applications in science, engineering, and medicine from the oneAPI Center of Excellence at the Scientific Computing and Imaging Institute and discuss current and future visualization research challenges.

Download Presentation Deck

Presenting

1:45 - 2:15 PM CT Break

2:15 - 2:45 PM CT

A Synergistic Approach for Abstracting Hardware Heterogeneity and Reducing Algorithmic Complexity: Powering HiCMA with oneAPI for HPC Scientific Applications

We improve the performance of HPC scientific applications using tile low-rank matrix computations. The idea consists of revisiting tile algorithms with low-rank matrix approximations, exploiting the data sparsity of the dense operators arising in computational astronomy, seismic imaging, and climate/weather prediction applications. We rely on the HiCMA software library to provide the sequential numerical kernels and on a oneAPI runtime system to orchestrate the resulting computational tasks on parallel systems. We demonstrate performance superiority against state-of-the-art numerical libraries while keeping high productivity in mind.

 

Download Presentation Deck

Presenting

12:45 - 1:15 PM CT

Getting Ready for the Aurora Exascale Supercomputer Using Intel Advisor Roofline on Intel CPUs and GPUs

Aurora at Argonne National Laboratory is one of the US DOE's exascale supercomputers and will be deployed in 2022. oneAPI provides all the essential components for porting applications to Aurora with optimal performance. The Intel Advisor roofline features in oneAPI provide intuitive performance analysis results on Intel GPUs and useful insights into performance bottlenecks for further optimization. We present Advisor use cases from our workloads in the MD (molecular dynamics) and CFD (computational fluid dynamics) areas.

 

Download Presentation Deck

Presenting

3:15 - 3:30 PM CT Break

2:45 - 3:15 PM CT

Driving a New Era of Accelerated Computing using OpenMP* with Intel® oneAPI Compilers

You are already deeply invested in OpenMP for multicore, so just a few additions will launch your code into the xPU era!
OpenMP* is a popular, portable, and widely supported programming model. OpenMP provides capabilities for threaded and task-based parallelism on multicore processors, data-parallel programming using Single Instruction Multiple Data (SIMD) for vector architectures, and, most recently, a programming model for offloading to accelerators on heterogeneous architectures (xPUs). In this session we will demonstrate how you can use the OpenMP offload model with the latest Intel® oneAPI compilers to drive a new era of accelerated computing with your applications.
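
As a small illustration of those "few additions" (not taken from the session), a multicore OpenMP loop becomes an offloaded one with a single target directive; the compile flags in the comment are indicative only.

// SAXPY offloaded with OpenMP target directives. Indicative Intel compile line:
//   icpx -fiopenmp -fopenmp-targets=spir64 saxpy.cpp   (flags may differ by compiler version)
#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 20;
  std::vector<float> x(n, 1.0f), y(n, 2.0f);
  const float a = 3.0f;
  float *px = x.data(), *py = y.data();

  // The multicore version would simply be: #pragma omp parallel for
  // The offload version maps the arrays to the device and distributes the loop across teams.
  #pragma omp target teams distribute parallel for map(to: px[0:n]) map(tofrom: py[0:n])
  for (int i = 0; i < n; ++i)
    py[i] = a * px[i] + py[i];

  std::printf("y[0] = %f\n", py[0]);
}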

 

Download Presentation Deck

Presenting

4:00 - 4:30 PM CT

Exploiting Heterogeneous Computing with Intel® oneAPI Threading Building Blocks (oneTBB)

This session will discuss how to utilize Intel® oneAPI Threading Building Blocks (oneTBB) to balance workloads across heterogeneous compute resources. As xPU programming grows, applications should be able to utilize the CPU plus other devices to maximize throughput.
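
One possible shape of such CPU-plus-device work sharing is sketched below; the static 50/50 split and the SYCL queue used for the offload side are illustrative assumptions, not the presenters' approach.

// Half of the array is processed on a SYCL device while oneTBB spreads the other half across CPU cores.
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/blocked_range.h>
#include <sycl/sycl.hpp>
#include <vector>
#include <cstdio>

int main() {
  constexpr size_t n = 1 << 22;
  std::vector<float> data(n, 1.0f);
  sycl::queue q{sycl::property::queue::in_order()};    // in-order: copy, kernel, copy back run in sequence
  const size_t split = n / 2;                          // first half to the device, second half to the CPU

  float *dev = sycl::malloc_device<float>(split, q);
  q.memcpy(dev, data.data(), split * sizeof(float));
  q.parallel_for(sycl::range<1>{split}, [=](sycl::id<1> i) {
    dev[i] = dev[i] * 2.0f + 1.0f;
  });

  // The CPU half runs on all cores via oneTBB while the device half is in flight.
  oneapi::tbb::parallel_for(oneapi::tbb::blocked_range<size_t>(split, n),
    [&](const oneapi::tbb::blocked_range<size_t> &r) {
      for (size_t i = r.begin(); i != r.end(); ++i)
        data[i] = data[i] * 2.0f + 1.0f;
    });

  q.memcpy(data.data(), dev, split * sizeof(float)).wait();  // completes after the kernel (in-order queue)
  sycl::free(dev, q);
  std::printf("data[0] = %f, data[n-1] = %f\n", data[0], data[n - 1]);
}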

 

Download Presentation Deck

Presenting

4:30 - 5:30 PM CT

The oneAPI Software Abstraction for Heterogeneous Computing

oneAPI is a cross-industry, open, standards-based unified programming model. The oneAPI specification extends existing developer programming models to enable a diverse set of hardware through language, a set of library APIs, and a low-level hardware interface to support cross-architecture programming. It builds upon industry standards and provides an open, cross-platform developer stack to improve productivity and innovation. At the core of oneAPI is the DPC++ programming language, which builds on the ISO C++ and Khronos SYCL standards. DPC++ provides explicit parallel constructs and offload interfaces to support a broad range of accelerators. In addition to DPC++, oneAPI also provides libraries for compute- and data-intensive domains, e.g., deep learning, scientific computing, video analytics, and media processing. Finally, a low-level hardware interface defines a set of capabilities and services to allow a language runtime system to effectively utilize a hardware accelerator.

Presenting