The TAU Performance System® [ ] supports profiling and tracing of programs written using Intel oneAPI. Intel oneAPI provides two programming interfaces, OpenCL and DPC++/SYCL, for CPUs, GPUs, and other devices. With TAU, a user can observe the performance of a program at both the CPU and the GPU level. At the GPU level, TAU supports the OpenCL profiling interface as well as the Intel Level Zero API. With these two interfaces, it is possible to track the precise timings of kernels executing on the GPU. TAU has been tested on Intel Gen12 GPUs, now available in Intel Tiger Lake CPUs and on DG1 cards, using the Intel oneAPI Base Toolkit and HPC Toolkit.
This talk will present the advances in TAU that support oneAPI and show how a user can apply TAU to instrument code without any modifications to the application binary. Instrumenting and measuring an application's performance is the first step towards optimizing it. Instrumentation that requires changes to the source code and build system is typically viewed as cumbersome, but it does not have to be. TAU operates on an unmodified binary to generate detailed summary statistics (profiles) and event traces. With TAU, application performance can now be observed at the statement, loop, and function level for each MPI rank, thread, and even GPU kernel. TAU is a versatile profiling and tracing toolkit that has been ported to a wide range of HPC platforms. TAU is included in the DOE's Extreme-scale Scientific Software Stack (E4S) [https://e4s.io], a curated Spack-based release of HPC and AI/ML packages for containerized deployment and bare-metal builds.
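The difference between an event trace and a profile can be illustrated with a minimal sketch. It assumes each GPU kernel execution has been recorded as a start and end timestamp in nanoseconds, the unit in which the OpenCL profiling interface reports event times. The kernel names, the timestamps, and the `summarize` helper are hypothetical and chosen only for illustration; this is not TAU's implementation or data format.

```python
from collections import defaultdict

# Hypothetical event trace: one record per kernel execution,
# as (kernel name, start timestamp in ns, end timestamp in ns).
trace = [
    ("vector_add", 1_000, 4_000),
    ("matmul", 5_000, 25_000),
    ("vector_add", 30_000, 32_000),
]


def summarize(events):
    """Reduce a trace to per-kernel summary statistics (a profile)."""
    totals = defaultdict(lambda: {"calls": 0, "total_ns": 0})
    for name, start_ns, end_ns in events:
        if end_ns < start_ns:
            raise ValueError(f"event for {name!r} ends before it starts")
        entry = totals[name]
        entry["calls"] += 1
        entry["total_ns"] += end_ns - start_ns
    # Add the mean duration per call to each kernel's statistics.
    return {
        name: {**stats, "mean_ns": stats["total_ns"] / stats["calls"]}
        for name, stats in totals.items()
    }


profile = summarize(trace)
for name, stats in sorted(profile.items()):
    print(name, stats["calls"], stats["total_ns"], stats["mean_ns"])
```

The trace keeps every individual event and its position in time, while the profile keeps only aggregate counts and durations per kernel, which is why profiles stay compact as runs grow longer.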