Intel recently showcased the latest developments of the oneAPI specification, and how its implementations are being leveraged in performant, cross-architecture programming for heterogeneous computing.
Held on December 6th and 7th, the oneAPI DevSummit for AI and HPC was a free virtual community conference that hosted researchers, data scientists, and developers from around the world. Industry experts from Argonne National Laboratory, UC Berkeley, San Jose State University, and Codeplay took a deep dive into cross-architecture software development, covering the Intel® AI Analytics and HPC Toolkits, advancements in the oneAPI specification, overviews of key performance analysis tools, and hands-on workshops with TensorFlow and PyTorch.
Attendees found opportunities to:
- Connect with industry experts working on cross-architecture software development for AI and HPC workloads.
- Learn about AI analytics using frameworks and tools performance-optimized by oneAPI.
- Discover new oneAPI innovations, including the Intel® AI Analytics and HPC Toolkits.
- Listen to tech talks concerning key performance analysis tools and how to use them.
- Participate in hands-on workshops with TensorFlow and PyTorch.
- Attend panel discussions hosted by renowned experts in their respective fields.
DAY ONE FEATURED SESSIONS
In this hands-on tutorial, Intel AI Software Solutions Engineer Pramod Pai showcased optimizations for Intel XPUs in PyTorch Upstream and Intel® Extension for PyTorch. He demonstrated best-known methods that help developers get the most out of these optimizations when deploying on Intel products.
After guiding session participants through an overview of Intel optimizations for PyTorch, Pai outlined the three major categories of optimization implemented in the extension for PyTorch: operator, graph, and runtime optimizations.
- Operator optimization: Uses vectorization and parallelization to make the most efficient use of CPU capabilities. Intel engineers also optimized memory-related operations and low-precision data types.
- Graph optimization: PyTorch by default runs in Eager Mode to provide ease-of-use and flexibility. However, by switching to TorchScript Mode, the whole topology can be converted into a graph, whereby fusions can be applied to the graph to improve performance.
- Runtime extension: Overhead can be avoided by setting thread affinity and tuning memory allocation, which maximizes inference throughput.
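To make the graph-optimization idea concrete, here is a toy sketch in plain Python with no PyTorch involved. It assumes a hypothetical chain of three elementwise operations (scale, shift, ReLU), and `run_unfused`, `fuse`, and `run_fused` are illustrative names, not PyTorch or Intel Extension for PyTorch APIs. It only shows why knowing the whole chain ahead of time, as a graph does, allows it to be collapsed into a single pass over the data with no intermediate buffers.

```python
from functools import reduce

# Hypothetical elementwise ops standing in for nodes of a small model graph.
def scale(x):
    return x * 2.0

def shift(x):
    return x + 3.0

def relu(x):
    return x if x > 0.0 else 0.0

OPS = [scale, shift, relu]

def run_unfused(ops, data):
    """Eager-style execution: every op makes its own pass and its own buffer."""
    passes = 0
    for op in ops:
        data = [op(v) for v in data]  # one full intermediate list per op
        passes += 1
    return data, passes

def fuse(ops):
    """Compose a known chain of elementwise ops into a single function."""
    return lambda v: reduce(lambda acc, op: op(acc), ops, v)

def run_fused(ops, data):
    """Graph-style execution: one pass over the data, no intermediate lists."""
    fused = fuse(ops)
    return [fused(v) for v in data], 1

if __name__ == "__main__":
    inputs = [-4.0, -1.0, 0.5, 2.0]
    eager_out, eager_passes = run_unfused(OPS, inputs)
    fused_out, fused_passes = run_fused(OPS, inputs)
    print("same result:", eager_out == fused_out)
    print("passes over data:", eager_passes, "vs", fused_passes)
```

Real graph compilers fuse at the kernel level rather than composing Python functions, so this is an analogy for the memory-traffic saving, not a model of how TorchScript performs fusion.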
The session continued with Pai describing how Intel has been contributing source code to PyTorch Upstream since 2018. Intel engineers first aimed their optimization efforts at Skylake servers with the float32 data type, using oneDNN as the computation backend to accelerate operations. They went on to use the VNNI instruction set on Cascade Lake Xeon Scalable Processors, bfloat16 on Cooper Lake processors, and now AMX and bfloat16 on Sapphire Rapids processors.
A hands-on lab concluded the presentation. Participants generated synthetic data for inference with sample computer vision and NLP workloads, used stock PyTorch models to generate predictions, and then, with minimal code changes, used the Intel Extension for PyTorch (IPEX) to achieve acceleration gains over stock PyTorch on Intel hardware. Participants were also shown how quantization features from IPEX can reduce a model's inference time, working through practical examples of the optimizations covered earlier in the session.
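For readers unfamiliar with quantization, the following is a minimal sketch of the underlying arithmetic, assuming symmetric per-tensor int8 quantization. It is plain Python, it is not IPEX's implementation, and `quantize_int8` and `dequantize` are hypothetical names chosen for illustration. The idea is that floating-point values are mapped onto a small integer range through a single scale factor, trading a bounded rounding error for smaller data and cheaper integer arithmetic.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization onto the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]

if __name__ == "__main__":
    weights = [0.02, -1.27, 0.64, 0.0]  # made-up example values
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("int8 values:", q)
    print("scale:", scale)
    print("worst round-trip error:", worst)
```

With this scheme, the round-trip error of any value is at most half the scale, which is why quantization tends to work well when a tensor's values are spread fairly evenly across its range.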
In this hands-on training, Intel Deep Learning Software Engineer, Sachin Muradi, demonstrated the ease of use and performance benefits of default Intel optimizations when used in Transfer Learning with TensorFlow. Muradi provided participants the opportunity to use the platform on a common transfer learning use case.
Using a cloud instance powered by the latest Intel Xeon Scalable Processor and the latest official TensorFlow release (2.10), participants walked through a common transfer learning use case built on an image classification model from the TensorFlow Hub repository. Participants then ran inference on the re-trained model, optimized the model for best latency, and finally deployed it using TensorFlow Serving on a 3rd Gen Intel® Xeon® Scalable Processor with AI acceleration.
Key takeaways participants learned from this hands-on session include:
- oneDNN delivers faster training and inference.
- In using the bfloat16 data type, changing just one line of code can yield significant performance enhancements.
- The upcoming 4th Gen Intel Xeon processor will deliver even more performance gains.
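The bfloat16 takeaway is easier to appreciate with the format itself in view. bfloat16 keeps float32's sign bit and 8 exponent bits but only 7 of its 23 mantissa bits, so it covers the same numeric range in half the storage, with less precision. The sketch below is plain Python using only `struct`; the function names are illustrative, and NaN handling is deliberately left out. The session summary does not say which line of code was changed. In Keras, a mixed-precision policy call such as `tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")` is one plausible candidate, but that is an assumption.

```python
import math
import struct

def to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for x, rounding to nearest even.

    NaN handling is omitted to keep the sketch short.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # Rounding bias: 0x7FFF plus the lowest bit that will be kept.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF

def from_bfloat16_bits(b):
    """Expand a bfloat16 pattern back to a Python float via float32."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

def round_trip(x):
    """Value of x after being stored as bfloat16."""
    return from_bfloat16_bits(to_bfloat16_bits(x))

if __name__ == "__main__":
    for value in (1.0, math.pi, 1e-3, 65504.0):
        stored = round_trip(value)
        print(f"{value!r} -> {stored!r} (abs error {abs(value - stored):.3g})")
```

Because the exponent field matches float32, conversion is essentially a rounding of the mantissa, which is part of why switching a model to bfloat16 can be a small code change.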
Abhishek Nandy led this fascinating Tech Talk on implementation of the oneAPI AI toolkit in Medical Imaging. Nandy has a Bachelor of Technology degree and is an Intel Black Belt Developer. He is also co-founder of Dynopii, an organization that helps companies create personalized customer experiences with best-in-class AI-powered digital and voice automation tools.
Nandy focused on how the oneAPI AI Analytics Toolkit helps developers write performant code quickly and accurately for spatial single cell analysis in medical imaging use cases. Single cell analysis enables scientists and medical professionals to study cell-to-cell variation within a cell population (e.g. organ, tissue, and cell culture). During the Tech Talk, Nandy showed participants how to port Squidpy (a Python framework that provides infrastructure and numerous analysis methods for efficiently storing, manipulating, and interactively visualizing single cell spatial molecular data) into the oneAPI AI Toolkit to facilitate the exploration of human tissue images.
The integration of Squidpy and the oneAPI AI Analytics Toolkit enables medical researchers to find abnormalities in gene expression and identify disease progression by, among other capabilities, creating spatial graphs from huge data sets to perform image analysis. Researchers can then extract an image and its summary features to generate cluster annotations, which are used to compute neighborhood enrichment. That, in turn, identifies clusters of spots (nodes) that share a common neighborhood structure across the tissue.
This process helps researchers accomplish two important tasks:
- Transcriptional Profiling: a process used to identify a rare disease condition as well as how the body responds to a specific treatment for the presenting disease.
- RNA Sequencing: a technique that applies high-throughput, massively parallel sequencing approaches to reveal the presence and quantity of RNA in a biological sample at a given moment, allowing researchers to analyze the continuously changing cellular transcriptome.
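The neighborhood enrichment step mentioned above can be illustrated without any bioinformatics libraries. Squidpy exposes this kind of analysis through functions such as `sq.gr.spatial_neighbors` and `sq.gr.nhood_enrichment`; the sketch below does not call them and is not their implementation. It is a plain-Python approximation of the idea, with hypothetical names and made-up data: count how often two cluster labels sit next to each other in a spatial graph, then compare that count with what shuffled labels would give.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def pair_counts(edges, labels):
    """Count graph edges by the unordered pair of cluster labels they connect."""
    counts = Counter()
    for a, b in edges:
        counts[tuple(sorted((labels[a], labels[b])))] += 1
    return counts

def enrichment_z(edges, labels, pair, n_perm=200, seed=0):
    """z-score of the observed count for `pair` against shuffled labels.

    `pair` must be given in sorted order, e.g. ("A", "B").
    """
    rng = random.Random(seed)
    observed = pair_counts(edges, labels)[pair]
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between position and label
        null.append(pair_counts(edges, shuffled)[pair])
    spread = pstdev(null)
    return (observed - mean(null)) / spread if spread > 0 else 0.0

if __name__ == "__main__":
    # Six spots in a row; neighboring spots are connected by an edge.
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
    labels = ["A", "A", "A", "B", "B", "B"]
    print(dict(pair_counts(edges, labels)))
    print("z-score for A next to A:", enrichment_z(edges, labels, ("A", "A")))
```

A positive z-score suggests the two clusters neighbor each other more often than chance would predict, and a negative one suggests they avoid each other.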
DAY TWO KEYNOTE
On day two of the DevSummit, Andrew Richards, CEO and co-founder of Codeplay–a software development company recently acquired by Intel–led this much anticipated keynote, where he discussed some of the most successful approaches to achieving productivity, performance and portability, as well as the trade-offs required to deliver this developer wish list.
Richards rooted the discussion in what developers want most nowadays: to write code once and run it fast, on lots of different processors–without having to optimize it differently for every different platform. While this is not completely possible yet, there are “practical things we’re doing to take us as close as possible as we can get to writing once and getting high performance on a very wide variety of processors,” Richards said.
Richards discussed two main approaches to solving the problem of writing software for rapidly evolving processors. The first is C++, which is already widely used. Richards described the programming language as the proven practical solution that allows developers to write high performance software on lots of different hardware. He illustrated how developers have used C++ to enable performance portability in video game development, writing code that works on different platforms, from the latest PCs to various game consoles. Interestingly, Richards made the point that developers in AI and HPC are using the same performance portability techniques that are used in fields like gaming.
According to Richards, C++ has three key concepts that enable it to support development of very large, very high performance software:
- Zero-cost abstractions: Developers can write code in terms of high-level abstractions, and because the compiler resolves and removes those abstractions at compile time, they do not slow the program down at runtime.
- Separation of concerns: When developers design software, it can be done in a way in which different experts can work on different parts of the software together (e.g. developers who are good at making the software easy to use, developers who can make the software run very fast, and others who are domain experts in the medical systems or the molecular dynamics, for example). Following this concept of separation of concerns results in more freedom for simplification and maintenance of code, as well as more freedom for module upgrade, reuse, and independent development.
- Composability: This last concept allows different developers to combine different features of the software into one. For example, if one developer writes an abstraction that makes something go very fast on a processor, and another developer writes an abstraction that makes it easier to write molecular dynamics, they can compose the two abstractions together.
The second approach to solving the problem of writing software for rapidly evolving processors is SYCL, a royalty-free, vendor-neutral, industry-standard C++ programming model for parallel software and accelerator processors. Intel’s oneAPI programming language is a combination of SYCL and a set of SYCL extensions. SYCL takes proven C++ performance ideas and super-charges them for a heterogeneous processing world.
SYCL enables developers to:
- Build their own C++ SYCL compilers for a variety of new processors.
- Design their own optimizations.
- Build C++ libraries that can adapt to the performance requirements of lots of different systems.
- Integrate native compilation for different processors in one source file.
According to Richards, SYCL is “not actually that revolutionary.” SYCL essentially takes many of the techniques that developers were using in an ad hoc way and standardizes them. It also standardizes some of the interfaces between the compiler, the hardware, and frameworks like Kokkos and Eigen, which is why it enables developers to carry out the tasks listed above with relative ease: building their own C++ / SYCL compilers for new processors, designing their own optimizations, building C++ libraries that adapt to the performance requirements of various hardware, and integrating native compilation for different processors in one source file.
As the title of this keynote suggests, SYCL is very much reality, not fantasy. As Richards concludes, “You are programming in C++. And actually in SYCL, if you program the CPU code, it usually goes straight through your normal CPU C++ compiler. And if you’re compiling on a GPU or FPGA or whatever, it’s a native compilation. It’s like programming in C++ on that hardware. You’re going straight down to the hardware and using any of the hardware features; it’s just the same as C++. SYCL standardized the names of things, and we’ve found a way of gluing the two compilers together.”
HIGHLIGHTS OF NOTABLE PRESENTATIONS
Day one and day two of the DevSummit were replete with presentations, tech talks and panel discussions that illustrated the power of oneAPI and its contribution to a unified build environment, along with other integrated tools, for cross-platform productivity.
Peter Ma, co-founder of SiteMana, an AI company that predicts the purchasing intent of anonymous visitors, led a tech talk on leveraging oneAPI to train models and run inference on the purchasing behavior of anonymous website visitors at scale, which lets business owners know which anonymous visitors are likely to spend money on their website.
In another tech talk, UC Berkeley post-doctoral student Zhen Dong and UC Berkeley professor Kurt Keutzer spoke about their progress in applying efficient deep learning to both inference and training, in an effort to reduce the memory consumption and computational cost of sophisticated deep neural network models. They covered LTP, which uses pruning to accelerate inference, as well as staged training for transformers and TASC, which are designed to accelerate training. The speakers also outlined solutions for efficiently implementing large recommendation models: they systematically applied quantization in DQRM and leveraged the sparsity in DLRM to better support hot embeddings, achieving great performance and generalization ability.
At an eagerly anticipated session, several prominent Intel engineers shared various use cases with participants in a live code tutorial and demonstration, representing the latest and most interesting aspects of oneAPI programming. They outlined key aspects of the developer workflow using oneAPI tools to program for the Intel Data Center GPU Max Series, formerly codenamed Ponte Vecchio (PVC), all using cross-vendor, cross-architecture, standards-based techniques.
Although there were too many fascinating keynotes, tech talks, and hands-on tutorials to summarize in this article, those interested can watch the entire virtual conference on-demand on oneAPI.io, to get deep insights on AI and HPC that will catapult you to the next level in your developer journey.