An international alliance of companies and researchers gears up to enable cross-architecture programming adoption for a common user experience
The oneAPI industry initiative is making steady progress toward its ultimate objective: to advance a cross-industry, open, standards-based unified programming model that delivers a common developer experience across general purpose (CPU) and accelerator architectures—for faster application performance, more productivity, and greater innovation.
Industry leaders participating in the oneAPI industry initiative are actively collaborating on the oneAPI specification and compatible implementations across the ecosystem. This blog explores the diverse ways that they are implementing oneAPI cross-architecture programming on AMD CPUs and GPUs, Nvidia GPUs, Arm CPUs, and Huawei Ascend AI chips.
oneAPI Ecosystem Adoption Highlights Values of Open
Codeplay: Nvidia backend – NERSC, Berkeley, Argonne, Oak Ridge national laboratories – As a driving force in the oneAPI initiative and the SYCL community since its inception, Codeplay CEO Andrew Richards and his team collaborated with industry colleagues to define the SYCL standard. In 2020, as part of its work within the initiative, Codeplay announced its contribution to oneAPI by supporting SYCL, the oneAPI Deep Neural Network Library (oneDNN) and the oneAPI Math Kernel Library (oneMKL) for Nvidia GPUs.
“When Intel’s implementation of SYCL – known as DPC++ with extensions for CPUs, GPUs, and FPGAs – became available, it offered us an opportunity to fully support Nvidia GPUs and to integrate them into the LLVM compiler,” said Richards.
This meant developers could more easily code for Nvidia GPUs without using OpenCL. The codebase for this implementation lives in the main Data Parallel C++ (DPC++) LLVM Compiler project. The Codeplay implementation uses the native CUDA interface, enabling developers to gain portability and performance. The interface is available for both DPC++ and the MKL-BLAS linear algebra math library, which is part of the oneMKL math framework, as well as for oneDNN.
“Accelerating devices in this way, by using standards that can run on numerous platforms and devices frees developers to make accelerator choices based on what works best for the overall solution,” said Richards. The benefits of using a DPC++ compiler, which offers a standards-based, open-source model, go beyond reduced development costs for heterogeneous programming. The compiler also enables faster application performance, more productivity, and greater innovation.
In February 2021, the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory, in collaboration with the Argonne Leadership Computing Facility at Argonne National Laboratory (ANL), signed a contract with Codeplay to enhance the LLVM SYCL GPU compiler capabilities for the Nvidia A100 GPUs that power NERSC’s next-generation supercomputer, Perlmutter.
In June 2021, Codeplay announced a similar agreement with ANL in collaboration with Oak Ridge National Laboratory (ORNL). The contract covers implementing support in the oneAPI DPC++ compiler for AMD GPU-based high-performance computing (HPC) supercomputers.
Argonne’s exascale supercomputer, Aurora, is based on Intel® Xeon® Scalable processors (CPUs) and Intel Xe GPUs and includes SYCL as a primary programming model, while the Oak Ridge exascale supercomputer, Frontier, features AMD GPUs. Argonne and Lawrence Berkeley are also deploying SYCL on the Polaris and Perlmutter pre-exascale supercomputers. Exascale supercomputers process 10^18, or 1 quintillion, calculations per second, which is 1,000 petaflops, making them the world’s highest performing computers. Both Argonne and Oak Ridge host U.S. Department of Energy (DOE) Office of Science user facilities.
Fugaku: Fujitsu/RIKEN oneDNN on Arm – Fujitsu and RIKEN ported the oneDNN deep learning library, which continues to be developed as open-source software, to the Armv8-A instruction set and optimized it to run at high speed on the Fugaku supercomputer. Fugaku, developed jointly by RIKEN and Fujitsu, was delivered to Port Island, located off the coast of Kobe. By making full use of the Arm SVE architecture, Fujitsu improved performance by 9X in training and 7.8X in inference. Using the open-source version of oneDNN, Fujitsu achieved the best CPU-based performance in MLPerf HPC v0.7.
Fujitsu and RIKEN took first place in the MLPerf HPC benchmark with Fugaku, claiming the world’s fastest performance for the number of deep learning models trained per unit of time on CosmoFlow. By applying technology that reduces the mutual interference of communication between CPUs to the programs used on the system, they trained at a rate of 1.29 deep learning models per minute, approximately 1.77X faster than other systems. In developing TensorFlow and Mesh TensorFlow implementations for the cosmological parameter prediction benchmark, Fujitsu and RIKEN customized TensorFlow and optimized oneDNN as the backend. oneDNN specifically uses the JIT assembler Xbyak_aarch64 to exploit the performance of the A64FX processor.
Huawei: DPC++ on Ascend AI Chips – Wilson Feng, Rasool Maghareh, and Amy Wang at Huawei’s Heterogeneous Compiler Lab improved the performance of Huawei products by applying oneAPI open-source specifications. They shared their results with other industry members at IWOCL & SYCLCON 2021.
The lab researched and developed compilers, including work on language runtimes, system-level exploitation of machine learning/artificial intelligence frameworks, and concurrent/distributed systems. The lab used oneAPI DPC++, built on SYCL, to drive heterogeneous computing across multiple types of processors and accelerators.
hipSYCL on AMD GPU – hipSYCL is one of the four major SYCL implementations, with a particular focus on aggregating the multivendor hardware support provided by existing toolchains within a single framework. Recently, hipSYCL adopted DPC++/SYCL 2020 features such as unified shared memory, reductions, and more to increase code portability on AMD GPUs. Aksel Alpay, a researcher and software engineer at Heidelberg University, heads up much of the oneAPI work related to AMD GPUs.
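To make the two SYCL 2020 features named above more concrete, here is a minimal sketch of unified shared memory and a reduction working together. It is an illustration written for this post, not code from hipSYCL or DPC++: the helper name `sum_of_indices` is hypothetical, and the serial fallback is an assumption added so the snippet still compiles where no SYCL toolchain is installed.

```cpp
#include <cassert>
#include <cstddef>

// Use SYCL when the toolchain provides the header; otherwise fall back to a
// serial loop (an assumption of this sketch, not part of SYCL itself).
#if defined(__has_include)
#  if __has_include(<sycl/sycl.hpp>)
#    include <sycl/sycl.hpp>
#    define SKETCH_HAS_SYCL 1
#  endif
#endif
#ifndef SKETCH_HAS_SYCL
#  define SKETCH_HAS_SYCL 0
#endif

// Hypothetical helper: returns 0 + 1 + ... + (n - 1).
long long sum_of_indices(std::size_t n) {
  if (n == 0) return 0;
#if SKETCH_HAS_SYCL
  sycl::queue q;  // default device selection: whatever the runtime picks

  // Unified shared memory: one pointer usable on both host and device.
  long long* data = sycl::malloc_shared<long long>(n, q);
  long long* total = sycl::malloc_shared<long long>(1, q);
  assert(data != nullptr && total != nullptr);

  for (std::size_t i = 0; i < n; ++i) data[i] = static_cast<long long>(i);
  *total = 0;  // the reduction combines into this initial value

  // SYCL 2020 reduction: each work-item adds its element into `sum`,
  // and the runtime combines the partial results into *total.
  q.parallel_for(sycl::range<1>{n},
                 sycl::reduction(total, sycl::plus<long long>()),
                 [=](sycl::id<1> i, auto& sum) { sum += data[i]; })
      .wait();

  const long long result = *total;
  sycl::free(data, q);
  sycl::free(total, q);
  return result;
#else
  // Serial fallback so the sketch builds without a SYCL compiler.
  long long result = 0;
  for (std::size_t i = 0; i < n; ++i) result += static_cast<long long>(i);
  return result;
#endif
}
```

Calling `sum_of_indices(1000)` should give the same answer on any backend. The kernel source stays the same across devices; which device actually runs it depends on the SYCL implementation and the targets it was built for.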
Chinese Academy of Sciences – The Chinese Academy of Sciences is extending oneAPI with support for Chinese-developed hardware and collaborating with Intel to build a oneAPI Center of Excellence.
Get Involved: Review & Collaborate on the oneAPI Specification
Learn about the latest oneAPI updates, the industry initiative, and news. Check out our videos and podcasts. Visit our GitHub repo to review the spec and give feedback, or join the conversation happening now on our Discord channel. Then get inspired, network with peers, and participate in oneAPI events.