Using the oneAPI Construction Kit to enable open standards programming for the Metis AIPU

The necessity of dedicated AI hardware accelerators

AI applications have an endless hunger for computational power. So far, increasing model sizes and parameter counts has not reached the point of diminishing returns, so ever-growing models still perform better than their predecessors.

At the same time, new areas for application of AI tools are explored and discovered almost daily. Hence, building dedicated AI hardware accelerators is extremely attractive. In some situations it is even a necessity, as it enables running more powerful AI applications while using less energy on cheaper hardware.

Welcome to the hardware jungle

Such specialized accelerator hardware poses great challenges to software developers, as it instantly transforms a regular computer into a heterogeneous supercomputer, where the accelerator is distinctly different from the host processor. Moreover, each accelerator is different in its own way and must be programmed appropriately to actually reap the potential performance and efficiency benefits.

In his 2011 article 1, Herb Sutter heralded this age with the words “welcome to the hardware jungle”. And since he wrote this article, a thick jungle it has indeed become, with multiple specialized hardware accelerators now being commonplace across all device categories ranging from low-end phones to high-end servers.

So what’s the machete that developers can use to make their way through this jungle without getting lost?

Why custom accelerator interfaces are a bad idea

The answer lies in the creation of a suitable programming interface for those accelerators. Creating a custom interface that is completely tailored for a new accelerator silicon could let a developer exploit every little feature that the hardware has to offer to achieve maximum performance.

However, upon closer inspection, this is a bad idea for a variety of reasons. Firstly, while there might be the possibility of achieving peak performance with a custom interface, it would require expertise that is already hard to come by for existing devices and even rarer for new devices. The necessary developer training is time-intensive and costly.

Even more importantly, using a different bespoke interface to program each accelerator can also result in vendor lock-in if the created software completely relies on such a custom interface, making it highly challenging and significantly more expensive to switch to a different hardware accelerator. The choice of programming interface is thus crucial not only from a technical perspective, but also from a business standpoint. At Axelera, we therefore believe that the answer to the question of how to best bushwhack through the accelerator jungle is to embrace open standards, such as OpenCL 2 and SYCL 3.

Open standards for open interaction

OpenCL and SYCL are open standards defined by the Khronos Group. They define an application programming interface (API) for interacting with all kinds of devices, as well as programming languages for implementing compute kernels to run on these devices.

SYCL provides high-level programming concepts for heterogeneous computing architectures, together with the ability to maintain code for host and device inside a shared source file.

But providing a standard-conformant implementation of such open standards poses a daunting challenge for creators of new hardware accelerators. The OpenCL API consists of more than 100 functions and OpenCL C specifies over 10000 built-in functions that compute kernels can use. It would be great if these open standards were also accompanied by high-quality open-source implementations that are easy to port to new silicon. Fortunately, for OpenCL and SYCL, such implementations exist.

Increased developer productivity

Open standards such as OpenCL and SYCL promise portability across different hardware devices and also foster collaboration and code reuse. After all, it becomes possible and worthwhile to create optimized libraries that are usable on many devices, which ultimately increases developer productivity.

Axelera is a member of the UXL Foundation 4, a group that governs optimized libraries implemented using SYCL. These libraries are compatible with this software stack, offering math and AI operations through standard APIs.

Conquering the jungle with the oneAPI Construction Kit

The open-source oneAPI Construction Kit from Codeplay is a collection of high-quality implementations of open standards, such as OpenCL and Vulkan Compute, that are designed from the ground up to be easily portable to new hardware targets. We want to share our experiences using the Construction Kit to unlock OpenCL and SYCL for our Metis AI Processing Unit (AIPU) 5.

Prerequisites for deployment

In order to enable porting an existing OpenCL implementation to a new device, two prerequisites must be fulfilled:

  1. There must be a compiler backend able to generate code for the device’s compute units. As the oneAPI Construction Kit, like virtually all OpenCL implementations, is based on the LLVM compiler framework, this means having an LLVM code generator backend for the target instruction set architecture (ISA). As our Metis AIPU’s compute units are based on the RISC-V ISA, we could just use the RISC-V backend that’s part of the upstream LLVM distribution to get us started. If the accelerator uses a non-standard ISA, an adapted version of LLVM with a custom backend can of course be used with the Construction Kit as well.
  2. There must be some way for low-level interaction with the device, to perform actions like reading or writing device memory, or triggering the execution of a newly loaded piece of machine code. As we already supported another API before looking into OpenCL, such a fundamental library was already in place. In our case, it was a kernel driver exposing the minimal needed functionality to user space (essentially handling interrupts and providing access to device memory), accompanied by a very thin user space library wrapping those exposed primitives.

Implementing HAL

With these prerequisites being met, we started following the Construction Kit’s documentation 6. The first thing to do is implementing what the Construction Kit calls the “hardware abstraction layer” (HAL). The HAL comprises a minimal interface that covers the second item of the above list and consists of just eight functions: allocating/freeing device memory, reading/writing device memory, loading/freeing programs on the device, and finding/executing a kernel contained in an already loaded program.
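To make the shape of such an interface concrete, here is a minimal sketch in C++ of the eight operations listed above, together with a host-memory stand-in for a device. Every name and signature in it (`Hal`, `SimulatedDevice`, `run_on_device`, the handle types, and the built-in `double_i32` kernel) is an assumption made for illustration; this is not the Construction Kit's actual HAL API, which is defined in its documentation 6.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Hypothetical handle types; a real HAL defines its own.
using DeviceAddr = std::uint64_t;
using ProgramId = std::uint32_t;
using KernelId = std::uint32_t;
constexpr DeviceAddr kNullAddr = 0;
constexpr std::uint32_t kInvalidId = 0;
constexpr KernelId kOnlyKernel = 1;

// The eight operations as an abstract interface (illustrative signatures).
struct Hal {
  virtual ~Hal() = default;
  virtual DeviceAddr mem_alloc(std::size_t size) = 0;
  virtual bool mem_free(DeviceAddr addr) = 0;
  virtual bool mem_read(void* dst, DeviceAddr src, std::size_t size) = 0;
  virtual bool mem_write(DeviceAddr dst, const void* src, std::size_t size) = 0;
  virtual ProgramId program_load(const std::string& image) = 0;
  virtual bool program_free(ProgramId program) = 0;
  virtual KernelId kernel_find(ProgramId program, const std::string& name) = 0;
  virtual bool kernel_exec(ProgramId program, KernelId kernel,
                           DeviceAddr buffer, std::size_t count) = 0;
};

// Host-memory stand-in for a device, used only to exercise the interface.
class SimulatedDevice final : public Hal {
 public:
  DeviceAddr mem_alloc(std::size_t size) override {
    if (size == 0) return kNullAddr;
    const DeviceAddr addr = next_addr_;
    next_addr_ += size;
    buffers_[addr] = std::vector<std::uint8_t>(size);
    return addr;
  }
  bool mem_free(DeviceAddr addr) override { return buffers_.erase(addr) == 1; }
  bool mem_read(void* dst, DeviceAddr src, std::size_t size) override {
    auto it = buffers_.find(src);
    if (it == buffers_.end() || size > it->second.size()) return false;
    std::memcpy(dst, it->second.data(), size);
    return true;
  }
  bool mem_write(DeviceAddr dst, const void* src, std::size_t size) override {
    auto it = buffers_.find(dst);
    if (it == buffers_.end() || size > it->second.size()) return false;
    std::memcpy(it->second.data(), src, size);
    return true;
  }
  // In this simulation a "program image" is just the name of one built-in kernel.
  ProgramId program_load(const std::string& image) override {
    if (image != "double_i32") return kInvalidId;
    programs_[next_program_] = image;
    return next_program_++;
  }
  bool program_free(ProgramId program) override {
    return programs_.erase(program) == 1;
  }
  KernelId kernel_find(ProgramId program, const std::string& name) override {
    auto it = programs_.find(program);
    return (it != programs_.end() && it->second == name) ? kOnlyKernel : kInvalidId;
  }
  // The built-in kernel doubles `count` 32-bit integers in place.
  bool kernel_exec(ProgramId program, KernelId kernel, DeviceAddr buffer,
                   std::size_t count) override {
    auto it = buffers_.find(buffer);
    if (programs_.count(program) == 0 || kernel != kOnlyKernel) return false;
    if (it == buffers_.end()) return false;
    if (count * sizeof(std::int32_t) > it->second.size()) return false;
    for (std::size_t i = 0; i < count; ++i) {
      std::int32_t v;
      std::memcpy(&v, it->second.data() + i * sizeof v, sizeof v);
      v *= 2;
      std::memcpy(it->second.data() + i * sizeof v, &v, sizeof v);
    }
    return true;
  }

 private:
  std::map<DeviceAddr, std::vector<std::uint8_t>> buffers_;
  std::map<ProgramId, std::string> programs_;
  DeviceAddr next_addr_ = 0x1000;
  ProgramId next_program_ = 1;
};

// Drives all eight operations: allocate, load, find, write, execute, read, free.
// Returns the processed data, or an empty vector if any step failed.
std::vector<std::int32_t> run_on_device(Hal& hal, std::vector<std::int32_t> data) {
  const std::size_t bytes = data.size() * sizeof(std::int32_t);
  const DeviceAddr buf = hal.mem_alloc(bytes);
  const ProgramId prog = hal.program_load("double_i32");
  const KernelId kern = hal.kernel_find(prog, "double_i32");
  const bool ok = buf != kNullAddr && prog != kInvalidId && kern != kInvalidId &&
                  hal.mem_write(buf, data.data(), bytes) &&
                  hal.kernel_exec(prog, kern, buf, data.size()) &&
                  hal.mem_read(data.data(), buf, bytes);
  hal.program_free(prog);
  hal.mem_free(buf);
  if (!ok) data.clear();
  return data;
}
```

Calling `run_on_device` with a `SimulatedDevice` and the values 1, -2, 3 returns 2, -4, 6. The point of the sketch is that code written against the abstract interface never touches the device directly, so a real implementation could be swapped in behind the same eight calls.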

In order to avoid having to deal with the full complexity of OpenCL from the get-go, a smaller helper library called “clik” is provided by the Construction Kit to implement the HAL. This library is essentially a severely stripped-down version of OpenCL, with some especially complex parts like on-line kernel compilation being completely absent. Hence, the clik library serves as a stepping stone for getting the HAL implemented function by function, and provides matching test cases to ensure that the HAL implementation fulfills the contract expected by the Construction Kit. After all tests pass, this scaffolding can be removed, and the resulting HAL implementation can be used to bring up a full OpenCL implementation.

In our case, implementing the HAL was straightforward. The tests enabled a quick development cycle: more tests started passing every time some new functionality was added, and failing tests pointed out where the HAL implementation didn’t meet the Construction Kit’s expectations. In total, it took about two weeks of full-time work by one developer without prior Construction Kit knowledge to go from starting the work to passing all clik tests.

Configuring a complete OpenCL stack

After gaining confidence that the Metis HAL implementation was functional, we could continue with the next step and bring up a complete OpenCL stack 7. This, too, was surprisingly quick, taking roughly another two person-weeks of developer time. The Construction Kit again provides an extensive unit test suite, whose tests can be used to guide development by pointing out specific areas that aren’t working yet.

Testing our Metis OpenCL implementation

All bring-up work was initially performed using an internal simulator environment, but after passing all tests there, we could quickly move to working on actual silicon (see 8). As the first real-world litmus test for our Metis OpenCL implementation, we picked an OpenCL C kernel that is currently used for preprocessing as part of our production vision pipeline. By default, the kernel is offloaded to the host’s GPU. However, with Metis now being a possible offloading target for OpenCL workloads as well, we pointed the existing host application at our Metis OpenCL library and gave it a try. We were very happy to see that without any modifications to the host application, we were able to run the vision pipeline while offloading the computations to Metis instead of the host GPU.

In total, with the transition to actual silicon taking another week of developer time, it took us around five person-weeks of development effort to go from having no OpenCL support to having a prototype implementation capable of offloading an existing OpenCL C kernel used in a production setting to our accelerator.

Hence, in our experience, OpenCL and the oneAPI Construction Kit fully delivered on the promises of easy portability and avoiding vendor lock-in.

Opening up possibilities

Having a functional OpenCL implementation is also an important building block that opens up many other possibilities. OpenCL can be used as a backend for the DPC++ SYCL implementation 9, which enables a more modern single-source style for programming accelerators.

Even more importantly, a SYCL implementation makes it possible to tap into the wider SYCL ecosystem. This includes optimized libraries, such as portBLAS 10 providing linear algebra routines and portDNN 11 providing neural-network-related routines, but also brings the potential to support the UXL Foundation libraries including oneMKL 12, oneDPL 13, and oneDNN 14. Alongside these libraries it also includes tools like SYCLomatic 15, which assists with migrating existing CUDA codebases to SYCL. Thus, it offers an important migration path to escape from vendor lock-in.

Why oneAPI simplifies AI accelerator implementation

The best way to bushwhack through the accelerator jungle and enable heterogeneous computing is to embrace open standards. Open standards play a crucial role in the evolution and adoption of heterogeneous computing by addressing some of the fundamental challenges associated with developing for diverse hardware architectures. They provide standardized programming models and APIs, such as oneAPI, that allow software to communicate with various hardware components, including CPUs, GPUs, DSPs, and FPGAs, irrespective of the vendor.

Overall, we found the oneAPI Construction Kit to be key for unlocking access to open standards. Through the use of oneAPI, the integration of AI accelerators can be significantly simplified and made more efficient and future-proof. That’s because oneAPI enables seamless, hardware-agnostic interoperation between tools and libraries. This accelerates the development process and ensures that applications can leverage the latest advancements in AI hardware and software technologies, and remain compatible with future hardware innovations, reducing the need for costly rewrites or optimizations. At Axelera AI, we are excited to continue on this path.

References

  1. H. Sutter, “Welcome to the Jungle,” 2011. [Online]. Available: https://herbsutter.com/welcome-to-the-jungle/.
  2. The Khronos Group, “OpenCL Overview,” [Online]. Available: https://www.khronos.org/opencl/.
  3. The Khronos Group, “SYCL Overview,” [Online]. Available: https://www.khronos.org/sycl/.
  4. UXL Foundation, “UXL Foundation: Unified Acceleration,” [Online]. Available: https://uxlfoundation.org/.
  5. Axelera AI, “Metis AIPU Product Page,” [Online]. Available: https://www.axelera.ai/metis-aipu.
  6. Codeplay Software Ltd, “Guide: Creating a new HAL,” [Online]. Available: https://developer.codeplay.com/products/oneapi/construction-kit/3.0.0/guides/overview/tutorials/creating-a-new-hal.
  7. Codeplay Software Ltd, “Guide: Creating a new ComputeMux Target,” [Online]. Available: https://developer.codeplay.com/products/oneapi/construction-kit/3.0.0/guides/overview/tutorials/creating-a-new-mux-target.
  8. Axelera AI, “First Customers Receive World’s Most Powerful Edge AI Solutions from Axelera AI,” 12 September 2023. [Online]. Available: https://www.axelera.ai/news/first-customers-receive-worlds-most-powerful-edge-ai-solutions-from-axelera-ai.
  9. Intel Corporation, “Intel® oneAPI DPC++/C++ Compiler,” [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler.html.
  10. Codeplay Software Ltd, “portBLAS: Basic Linear Algebra Subroutines using SYCL,” [Online]. Available: https://github.com/codeplaysoftware/portBLAS.
  11. Codeplay Software Ltd, “portDNN: neural network acceleration library using SYCL,” [Online]. Available: https://github.com/codeplaysoftware/portDNN.
  12. Intel Corporation, “Intel® oneAPI Math Kernel Library (oneMKL),” [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html.
  13. Intel Corporation, “Intel® oneAPI DPC++ Library,” [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-library.html.
  14. Intel Corporation, “Intel® oneAPI Deep Neural Network Library,” [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onednn.html.
  15. Intel Corporation, “SYCLomatic: CUDA to SYCL migration tool,” [Online]. Available: https://github.com/oneapi-src/SYCLomatic.