AI and machine learning innovation has grown rapidly in recent years, and high-compute workloads demand constant developments in both software and hardware infrastructure. Application engineers need a solution that takes full advantage of the underlying architecture to extract maximum performance from CPUs and GPUs.
Enter Intel-Optimized TensorFlow. The teams behind TensorFlow, Google’s end-to-end open source machine learning framework, and Intel’s oneDNN library have been collaborating closely since 2017 to ensure users can fully utilize new hardware features and accelerators, with a focus on x86 architectures. Penporn Koanantakool, senior software engineer at Google, spoke about the partnership, its results, and the ongoing collaboration during the oneAPI Developer Summit at ISC. Koanantakool highlighted int8 (AVX512_VNNI) and bfloat16 (AVX512_BF16) vectorization support, discussed custom oneDNN operations in vanilla TensorFlow, and previewed the upcoming Intel XPU (CPUs, GPUs, and FPGAs) device plug-in for TensorFlow.
Here are the highlights of the exciting work coming out of the TensorFlow and oneDNN partnership.
What is Intel-optimized TensorFlow?
TensorFlow is a widely used deep learning framework that most AI developers are quite familiar with. Intel-Optimized TensorFlow is a separate build of TensorFlow maintained by Intel. It behaves like stock TensorFlow, but replaces some of the most compute-intensive operations with custom operations backed by oneAPI Deep Neural Network Library (oneDNN) primitives, so that new x86 hardware features and accelerators can be fully utilized as soon as they are available.
The oneDNN library provides the building blocks to optimize AI applications and is used in Google Cloud DL VMs, containers, notebooks, and more. oneDNN replaces operations such as MatMul, 2D and 3D convolutions, and other kinds of tensor contractions.
Key users of oneDNN include:
- Google Health’s DeepVariant
- Tencent’s 3D digital face reconstruction
- Laika’s stop motion animation
Many optimizations have been added over the years; two of the most exciting are:
Int8 (AVX512_VNNI) support
With AVX512_VNNI, models quantized using the Intel® Low Precision Optimization Tool achieve up to 4x speedups over fp32 while maintaining high accuracy. Read how to enable the Intel® DL Boost capabilities on 2nd Generation Intel® Xeon® Scalable processors.
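To give a feel for how such a quantization run is set up, the Low Precision Optimization Tool is driven by a YAML configuration. The fragment below is only an illustrative sketch; the field names and values are assumptions, so consult the tool’s documentation for the authoritative schema.

```yaml
# Illustrative sketch of a quantization config (field names are assumptions,
# not the authoritative schema): quantize a TensorFlow model to int8 while
# tolerating at most 1% relative accuracy loss against the fp32 baseline.
model:
  name: resnet50_v1          # hypothetical model name
  framework: tensorflow
quantization:
  approach: post_training_static_quant
tuning:
  accuracy_criterion:
    relative: 0.01           # allow up to 1% relative accuracy drop
```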
Bfloat16 (AVX512_BF16) support
bfloat16 (AVX512_BF16) can be used through the Keras mixed precision API for around 2x speedups over fp32 for both mixed precision training and inference. Read how to accelerate AI performance on 3rd Gen Intel® Xeon® Scalable processors with TensorFlow and bfloat16.
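As a minimal sketch of the Keras API involved (a toy model, not Intel’s benchmark setup), switching to bfloat16 mixed precision is a one-line policy change:

```python
import tensorflow as tf

# Compute in bfloat16 while keeping variables in float32. On CPUs this path
# benefits from AVX512_BF16 where available; elsewhere it still runs,
# just without the hardware speedup.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(64, activation="relu"),
    # Keep the output layer in float32 for numeric stability.
    tf.keras.layers.Dense(10, dtype="float32"),
])

print(model.layers[0].compute_dtype)   # computations run in bfloat16
print(model.layers[0].variable_dtype)  # weights stay in float32
```

The float32 output layer is the pattern the Keras mixed precision guide recommends, so losses and softmax outputs are not computed in reduced precision.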
The Future of Vanilla TensorFlow and oneDNN
Historically, vanilla TensorFlow and oneDNN have not mixed well because oneDNN uses OpenMP, which does not coordinate with TensorFlow’s thread pool. This forced developers to use workarounds in the past, such as the single-threaded MatMul that Koanantakool used to get a 29% speedup. With the 2.5 release of TensorFlow, however, oneDNN operations are now available in vanilla TensorFlow. (Note: they are disabled by default but can be enabled by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1.) These improvements allow for up to 3x speedups in inference and 2.4x in training. Additionally, for the first time, users get access to mixed precision computation on CPUs (previously this was only available on GPUs).
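For example (a minimal sketch), the flag is consulted when TensorFlow is imported, so it must already be in the environment at that point:

```python
import os

# The oneDNN ops are opt-in in stock TensorFlow 2.5: TF_ENABLE_ONEDNN_OPTS
# must be in the environment *before* `import tensorflow` to take effect --
# either `export TF_ENABLE_ONEDNN_OPTS=1` in the shell, or from Python:
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"
```

Setting the variable after TensorFlow has been imported has no effect, which is an easy mistake to make in notebooks where the import often happens in the first cell.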
Device Plugins for TensorFlow
TensorFlow device support has also improved with version 2.5. Before this release, device integration required changes to core TensorFlow. This meant complex build dependencies and compiler toolchains, slow development times, and a fragile end product: the integration code touched so many parts of TensorFlow that seemingly unrelated changes could break it.
Version 2.5 looks to put an end to those troubles with a mechanism called PluggableDevice, a scalable device integration that requires no device-specific changes in the TensorFlow code. It relies on C APIs to communicate with the TensorFlow binary in a stable manner. PluggableDevice builds upon Modular TensorFlow, which splits TensorFlow into multiple self-contained repositories that talk to each other through these APIs. As a result, TensorFlow’s build dependencies, toolchains, and test process are unaffected, and the integration is far less fragile than before: the only changes that can affect a plug-in are those made to the C APIs or the PluggableDevice components themselves. PluggableDevice is designed, developed, and driven by the TensorFlow community, with Intel as a leading contributor.
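As a quick illustration (assuming a stock TensorFlow install), devices contributed by installed plug-in packages surface through the same device-listing API as the built-in ones, with no special-case code:

```python
import tensorflow as tf

# Built-in devices (CPU, and GPU on CUDA builds) and any PluggableDevice
# contributed by an installed plug-in package appear in the same list.
for dev in tf.config.list_physical_devices():
    print(dev.device_type, dev.name)
```

From the user’s perspective, installing a plug-in is just a pip install; the new device type then becomes usable with the usual `tf.device(...)` placement.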
oneDNN has put together some exciting work, and there’s still more to come. Work is underway to add support for AMX on Sapphire Rapids and to build an Intel extension for TensorFlow: a pluggable device supporting Intel XPU devices (CPUs, GPUs, and FPGAs). There is also work to integrate Tensor Processing Primitives into MLIR, the multi-level compiler infrastructure used underneath TensorFlow. Stay tuned for more developments.