oneAPI DevSummit at ISC 2021

June 22, 2021

Join us for the oneAPI Developer Summit at ISC, focused on oneAPI and Data Parallel C++ for accelerated computing across XPU architectures (CPU, GPU, FPGA, and other accelerators). In this two-day live virtual conference, you will learn from leading industry and academia speakers working on innovative cross-platform, multi-vendor oneAPI solutions.

– Collaborate with fellow developers and connect with other innovators.
– Dive into hands-on sessions where you will learn and apply optimizations to fully exploit device capabilities on CPUs and GPUs.
– Join a vibrant community supporting each other using oneAPI and Data Parallel C++.

At ISC21, join Intel to learn how our unique, XPU-centric portfolio and oneAPI are helping users redefine the limits of what's possible. In addition to the oneAPI DevSummit, don't forget to check out our new Intel® HPC + AI Pavilion: with over 20 tech talks, fireside chats, and demos, plus two executive keynotes, you'll get the latest news on 3rd Gen Intel® Xeon® Scalable processors, Distributed Asynchronous Object Storage (DAOS), Intel® Optane™ technology, Xe-HPC (Ponte Vecchio), and much more.

Agenda

Schedule: June 22

10:00 - 10:10 CET

Introduction

10:10 - 10:50 CET KEYNOTE

Experiences with adding SYCL support to GROMACS

GROMACS is an open-source, high-performance molecular dynamics (MD) package primarily used for biomolecular simulations, accounting for ~5% of HPC utilization worldwide. Due to the extreme computing needs of MD, significant efforts are invested in improving the performance and scalability of simulations. Target hardware ranges from supercomputers to laptops of individual researchers and volunteers of distributed computing projects such as Folding@Home. The code has been designed both for portability and performance by explicitly adapting algorithms to SIMD and data-parallel processors. A SIMD intrinsic abstraction layer provides high CPU performance. Explicit GPU acceleration has long used CUDA to target NVIDIA devices and OpenCL for AMD/Intel devices.

In this talk, we discuss the experiences and challenges of adding support for the SYCL platform into the established GROMACS codebase and share experiences and considerations in porting and optimization. While OpenCL offers the benefits of using the same code to target different hardware, it suffers from several drawbacks that add significant development friction. Its separate-source model leads to code duplication and makes changes complicated. The need to use C99 for kernels, while the rest of the codebase uses C++17, exacerbates these issues. Another problem is that OpenCL, while supported by most GPU vendors, is never the main framework and thus is not getting the primary support or tuning efforts. SYCL alleviates many of these issues, employing a single-source model based on the modern C++ standard. In addition to being the primary platform for Intel GPUs, the possibility to target AMD and NVIDIA GPUs through other implementations (e.g., hipSYCL) might make it possible to reduce the number of separate GPU ports that have to be maintained.

Some design differences from OpenCL, such as flow directed acyclic graphs (DAGs) instead of in-order queues, made it necessary to reconsider GROMACS's task-scheduling approach and the architectural choices in the GPU backend. Additionally, supporting multiple GPU platforms presents the challenge of balancing performance (low-level, hardware-specific code) against maintainability (more generalization and code reuse). We will discuss the limitations of the existing codebase and interoperability layers with regard to adding the new platform; compute performance and latency comparisons; code quality considerations; and the issues we encountered with the SYCL implementations tested. Finally, we will discuss our goals for the next release cycle for the SYCL backend and the overall architecture of GPU acceleration code in GROMACS.

10:50 - 11:20 CET TECH TALK

AdePT project - Experience with porting particle transport simulation to oneAPI

The AdePT project is an R&D activity led by CERN aiming to speed up the simulation of particle propagation in the detector regions dominated by electromagnetic physics. The project goals are to implement technical solutions to run particle transport on GPUs and to understand related portability/implementation/optimization issues. While the project currently targets only the NVIDIA platform, code portability and support for different hardware accelerators are long-term goals. In that context, the oneAdePT project has been started to investigate oneAPI as a possible way forward.

11:20 - 11:30 CET BREAK

11:30 - 1:00 CET HANDS ON SESSION

Single source heterogeneous programming with Data Parallel C++ and SYCL 2020 features

We will introduce oneAPI and Data Parallel C++ for heterogeneous programming. We begin by introducing this technology as an extension to standard C++ that incorporates parallelism directly into the language using the SYCL specification. We will look at SYCL 2020 features such as Unified Shared Memory, sub-groups, and reductions. We will work through some hands-on coding samples and see how DPC++ increases productivity and helps achieve performance.

1:00 - 1:20 CET LUNCH

1:20 - 2:50 CET HANDS ON SESSION

Porting NAMD to oneAPI DPC++

The NAMD parallel molecular dynamics code is designed for high-performance simulation of large biomolecular systems; it is used for tackling biomedically relevant challenges, such as the coronavirus, by providing insight at the atomic level of detail. NAMD is an important application for the upcoming Aurora supercomputer at Argonne National Laboratory, which will be accelerated by Intel Ponte Vecchio Xe-HPC GPUs. This hands-on session will discuss in some detail the work being undertaken to port the CUDA kernels in NAMD to oneAPI DPC++ in preparation for running on Aurora.

2:50 - 3:00 CET BREAK

3:00 - 3:40 CET VENDOR UPDATE

Experiences in Using oneAPI

3:40 - 4:10 CET TECH TALK

Evaluating CUDA Portability with HIPCL and DPCT

HIPCL expands the scope of the CUDA portability route from AMD platforms to any OpenCL platform. Meanwhile, the Intel DPC++ Compatibility Tool (DPCT) migrates CUDA programs to Data Parallel C++ (DPC++) programs. Toward the goal of portability enhancement, we evaluate the performance of CUDA applications from Rodinia, SHOC, and proxy applications ported to Intel GPUs using HIPCL and DPCT. After profiling the ported programs, we aim to understand their performance gaps and optimize the codes converted by DPCT to improve their performance. The open-source repository of the CUDA, HIP, and DPCT programs will be useful for the development of translators.

4:10 - 4:40 CET TECH TALK

Design, Development and Validation of DPC++ backend for OCCA

OCCA—an open-source, portable, and vendor-neutral framework for programming parallel architectures—is used by the U.S. Department of Energy and Shell in major scientific and engineering applications. This talk will provide insight into the development of a DPC++ backend for OCCA. Integral to this effort is the DPC++ Unified Shared Memory (USM) model. Factors influencing choices related to kernel translation and launching will also be discussed. The functional accuracy of an initial implementation of the OCCA DPC++ backend is validated on Intel GPU hardware. Finally, ongoing validation and performance analysis efforts will be outlined, along with plans for future development.

4:40 - 5:10 CET TECH TALK

Porting oneAPI DPC++ on Xilinx FPGA & Versal ACAP CGRA

Many accelerators come with programming environments suitable for electrical engineers or usable from machine-learning frameworks, but they remain difficult to use in an HPC context. Fortunately, SYCL 2020 can bring direct programming to various accelerators through the concept of generic back-ends. We are porting the open-source oneAPI DPC++ implementation to Xilinx Alveo FPGA cards and also targeting our Versal AIE CGRA with 400 VLIW vector processors. We extend SYCL with collaborative operations that use the distributed memory shared by the 2D processor neighborhood, which is useful for stencil code.

5:10 - 5:20 CET BREAK

5:20 - 5:50 CET KEYNOTE

TensorFlow and oneDNN in Partnership

Rapid growth in AI and machine learning innovations and workloads necessitates constant developments in both software and hardware infrastructure. TensorFlow, Google’s end-to-end open-source machine learning framework, and oneDNN have been collaborating closely to ensure users can fully utilize new hardware features and accelerators, with a focus on x86 architectures. This talk will cover recent projects such as int8 (AVX512_VNNI) and bfloat16 (AVX512_BF16) vectorization support, bringing custom oneDNN operations to vanilla TensorFlow, and the upcoming Intel XPU device plug-in for TensorFlow.

5:50 - 6:00 CET

Conclusion

6:00 - 7:00 CET HAPPY HOUR

Happy Hour

We will open the Happy Hour with a fun Jeopardy game where you can show off your oneAPI and DPC++ knowledge and win some fun prizes, including a book on DPC++. Then we will test your creativity with an exciting game of Skribbl.io.

Schedule: June 23

10:00 - 10:10 CET

Introduction

10:10 - 10:30 CET KEYNOTE

Porting Boris Particle Pusher to DPC++. Performance Analysis and Optimization on Intel CPUs and GPUs

The talk reports the results of porting one of the key computational kernels of the High-Intensity Collisions and Interactions (Hi-Chi) open-source numerical code to the DPC++ programming language. When planning to port the entire Hi-Chi code to DPC++, we started with the Boris particle pusher to assess the effort required for porting and the expected performance on state-of-the-art Intel CPUs and GPUs. We found that the optimized parallel C++ code can be ported relatively easily to DPC++, while performance on a Xeon Platinum 8260L differs only slightly (~10% on average) from the baseline C++ code, which is a reasonable price to pay for portability. It turned out that the code originally optimized for CPUs can now run on Intel GPUs (P630 and Iris Xe Max), demonstrating the expected performance, which can be improved further. In this talk, we will describe the porting experience, optimization techniques, and the current results of code optimization for Intel GPUs.

10:30 - 11:00 CET TECH TALK

Bringing SYCL to pre-exascale supercomputing with DPC++ for CUDA 

Perlmutter is a pre-exascale supercomputer for NERSC at Lawrence Berkeley National Laboratory consisting of more than 6,000 NVIDIA A100 GPUs. Codeplay is working in partnership with NERSC, LBNL, and Argonne National Laboratory to enable developers using this supercomputer and ALCF's ThetaGPU machine to write highly accelerated applications using the SYCL programming model.

This presentation will show how this open-source work will bring benefits to the whole community, enabling developers to port their CUDA code to SYCL while still being able to obtain comparable performance on Nvidia GPUs.

11:00 - 11:10 CET BREAK

11:10 - 12:35 CET HANDS ON SESSION

Application optimization with Cache-aware Roofline Model and Intel oneAPI tools

In this tutorial, we will introduce the Cache-aware Roofline Model (CARM) and expose its principles when modelling the performance of Intel CPU and GPU devices. We will also showcase how the CARM implementation in Intel® Advisor can be used to drive application optimization. For this purpose, we will rely on epistasis detection, an important application in bioinformatics, as a case study. For Intel GPUs, we will show how CARM can be used to detect execution bottlenecks and provide useful hints on which types of optimization to apply to fully exploit device capabilities. The guidelines provided by CARM were fundamental to achieving speedups of more than 20x compared to the baseline code.

12:35 - 12:55 CET LUNCH

12:55 - 2:20 CET HANDS ON SESSION

Word-Count with MapReduce on FPGA, A DPC++ Example

Many workloads have inherent data parallelism which can be leveraged to achieve optimal performance. However, it is challenging to design data-parallel programs and map them to different hardware targets. Intel's Data Parallel C++ is an open alternative for cross-architecture development that aims to address this challenge. In this talk, we cover MapReduce, a widely used programming model for processing large datasets in a distributed fashion, and how to adopt it for processing a large dataset on an FPGA. We use the word-count problem to explain how to design a data-parallel algorithm with the MapReduce paradigm, and we describe a variant of MapReduce suited to the unique characteristics of FPGAs. Finally, we demonstrate the design flow of DPC++ programming on Intel DevCloud and the relevant tools for performance analysis and optimization.

2:20 - 2:30 CET BREAK

2:30 - 3:00 CET TECH TALK

Performance-Portable Distributed k-Nearest Neighbors using Locality-Sensitive Hashing and SYCL

In the age of AI, algorithms must efficiently cope with vast data sets. We propose a performance-portable implementation of Locality-Sensitive Hashing (LSH), an approximate k-nearest neighbors algorithm, to speed up classification on heterogeneous hardware. Our new library provides a hardware-independent, yet efficient and distributed, implementation of the LSH algorithm using SYCL and MPI. The results show that our library scales across multiple GPUs, achieving a speedup of up to 7.6 on 8 GPUs. It supports different SYCL implementations—ComputeCpp, hipSYCL, DPC++—to target different hardware.

3:00 - 3:30 CET TECH TALK

Visualization of human-scale blood flow simulation using Intel OSPRay Studio on SuperMUC-NG

HemeLB is a highly scalable, 3D blood flow solver capable of generating high-resolution simulations of blood flow through human-scale vasculatures. Post-processing such simulations is a significant challenge, particularly due to the volume of data generated. We have utilized Intel OSPRay Studio to visualize generated data timeseries directly on the production machine — SuperMUC-NG at LRZ. We use a custom input plugin to efficiently map the simulation data whilst eliminating complex pre-processing. This allows the full domain simulated by HemeLB to be quickly observed by users for assessment.

3:30 - 4:00 CET TECH TALK

Ginkgo - An Open Source Math Library in the oneAPI Ecosystem

Ginkgo is an open-source math library designed for GPU-accelerated supercomputers. In this talk, we will present the path we took to prepare Ginkgo for Intel GPUs. We will start by reporting our experiences in porting the NVIDIA-focused software stack to Intel's DPC++ environment and the obstacles we encountered when using automated code conversion. We will then present the functionality Ginkgo currently provides for Intel GPUs, and the performance we achieve for key linear algebra building blocks on recent Intel GPUs. We will conclude by demonstrating how Ginkgo's DPC++ backend can be used to prepare scientific applications for the oneAPI ecosystem.

4:00 - 4:40 CET TECH TALK

HPC changing the paradigm of film animation as Tangent revolutionizes creative storytelling

Jeff Bell, CEO of Tangent Labs, will present a unique studio customer story highlighting the evolution of the digital creative workflow, leveraging HPC in the cloud. With a strong focus on the creative community, the company is embracing Blender for production filmmaking, including high-fidelity animated films for Netflix and others built with Intel's oneAPI Rendering Toolkit. The result is a shorter time to market: functions that once took days now take hours, with predictable rendering times, allowing Tangent to deliver films on budget and at the highest fidelity. Tangent and Intel's close collaboration also influences what Blender brings to market as an open-source solution for the film animation industry.

In late 2020, Tangent Labs released their fresh take on a cloud-based digital production pipeline, LoUPE, initially developed for use by sister company Tangent Animation to address challenges the community has faced for years, delivered on the AWS Studio Cloud. Remote collaboration has never been a more timely topic, and LoUPE advances the traditional on-premises production workflow by seamlessly facilitating collaboration between studio locations and freelancers around the globe. Data is securely stored in the cloud, with intelligent transit between locations ensuring synchronization.

Tangent Labs and Intel share a vision of providing improvements for creators, with LoUPE focused on the creative workflow and close collaboration with Intel on advancements in visualization development and the oneAPI Rendering Toolkit, which will be discussed.

4:40 - 5:10 CET TECH TALK

Integrating Arhat deep learning framework with Intel® oneDNN library and OpenVINO™ toolkit

Arhat is a specialized deep learning framework that converts neural network descriptions into lean, standalone executable code which can be directly deployed on various platforms. The Arhat backend for Intel platforms generates C++ code that directly calls oneDNN API functions and can run on any modern Intel CPU or GPU. Arhat includes an OpenVINO interoperability layer that consumes models produced by the OpenVINO Model Optimizer. We have used Arhat for a series of experiments benchmarking the Tiger Lake i7-1185G7E against the NVIDIA Jetson Xavier NX with various object detection neural networks. The benchmarking results demonstrate that Tiger Lake is a powerful competitor in the embedded application domain.

5:10 - 5:15 CET BREAK

5:15 - 6:00 CET KEYNOTE

oneAPI, SYCL and Standard C++: Where Do We Need To Go From Here?

I first discovered the joys of C++ in 1988 and joined its standardization effort two decades later to represent developers. Working at Argonne on the oneAPI/DPC++/SYCL backend for Kokkos, I see much of what HPC and heterogeneous computing developers need, both as one myself and as someone supporting others, and I believe that standardization is the way to address those needs long term.

I’ll be presenting my journey so far, ways I’d like to see oneAPI evolve, and how we are helping to guide those improvements into SYCL and ultimately standard C++ to handle our development needs for decades to come.

6:00 - 6:10 CET

Conclusion

6:10 - 7:00 CET HAPPY HOUR

Happy Hour

We will open the Happy Hour with a fun Jeopardy game where you can show off your oneAPI and DPC++ knowledge and win some fun prizes, including a book on DPC++. Then we will test your creativity with an exciting game of Skribbl.io.
