C-DAC Achieves 1.75x Performance Improvement on Seismic Code Migration from CUDA on NVIDIA A100 to SYCL on Intel® Data Center GPU Max Series

Summary

The India-based premier R&D organization used tools in the Intel® oneAPI Base Toolkit to free itself from vendor hardware lock-in by migrating its open-source seismic modeling application from CUDA to SYCL. As a result, application performance improved by 1.75x on the Intel® Data Center GPU Max 1550 compared to NVIDIA A100* platform performance.

Introduction

C-DAC (Center for Development of Advanced Computing) is the premier R&D organization of India’s Ministry of Electronics and Information Technology for R&D in IT, Electronics, and associated areas. Created in 1987, its research spans multiple industries and domains such as HPC, cloud computing, embedded systems, cyber security, bioinformatics, geomatics, and quantum computing.
In the realm of geophysical exploration, C-DAC has developed an open-source seismic modeling application, SeisAcoMod2D. It performs acoustic wave propagation from multiple source locations through a 2D subsurface earth model using finite-difference time-domain (FDTD) modeling.
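To make the FDTD approach concrete, here is a minimal, illustrative sketch of a 2D second-order acoustic wave update. All names, grid parameters, and the source wavelet are assumptions for illustration, not SeisAcoMod2D's actual implementation:

```cpp
#include <cmath>
#include <vector>

// Illustrative 2D acoustic FDTD sketch (NOT SeisAcoMod2D's actual code):
// second-order leapfrog in time, 5-point Laplacian in space, constant velocity,
// a Ricker-wavelet point source at the grid center, one receiver sample returned.
float simulate(int nt) {
    const int nx = 101, nz = 101;
    const float h = 10.0f;                      // grid spacing (m)
    const float dt = 0.001f;                    // time step (s), satisfies the CFL limit
    const float v = 2000.0f;                    // constant P-wave velocity (m/s)
    const float c = (v * dt / h) * (v * dt / h);

    std::vector<float> prev(nx * nz, 0.0f), cur(nx * nz, 0.0f), next(nx * nz, 0.0f);
    auto idx = [nx](int ix, int iz) { return iz * nx + ix; };

    for (int it = 0; it < nt; ++it) {
        for (int iz = 1; iz < nz - 1; ++iz)
            for (int ix = 1; ix < nx - 1; ++ix) {
                float lap = cur[idx(ix - 1, iz)] + cur[idx(ix + 1, iz)]
                          + cur[idx(ix, iz - 1)] + cur[idx(ix, iz + 1)]
                          - 4.0f * cur[idx(ix, iz)];
                next[idx(ix, iz)] = 2.0f * cur[idx(ix, iz)] - prev[idx(ix, iz)] + c * lap;
            }
        float t = it * dt - 0.05f;              // 25 Hz Ricker source term
        float a = 3.14159265f * 25.0f * t;
        next[idx(nx / 2, nz / 2)] += (1.0f - 2.0f * a * a) * std::exp(-a * a);
        std::swap(prev, cur);
        std::swap(cur, next);
    }
    return cur[idx(nx / 2, nz / 4)];            // pressure at a single "receiver"
}
```

In a production code the inner stencil loop is exactly the part offloaded to the GPU, and the receiver samples collected each time step form the synthetic seismogram.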

Challenge: Vendor Hardware Lock-In

Fueled by their high computational throughput and energy efficiency, GPUs have been rapidly adopted as computing engines for High Performance Computing (HPC) applications in recent years. The growing prevalence of heterogeneous architectures combining general-purpose multicore platforms and accelerators has led developers to redesign their parallel application codes.

SeisAcoMod2D is developed in C to efficiently use multicore CPU architectures and in CUDA* C to make use of NVIDIA GPU architectures. The existing CUDA C program cannot run on Intel® GPUs (nor on any other vendor's GPUs, for that matter). This limits architecture choice, creates vendor lock-in, and forces developers to maintain separate code bases for CPU and GPU architectures.

Solution: Build a Single Code Base using Intel® oneAPI Tools

Intel® oneAPI tools enable single-language, cross-architecture applications to be ported to (and optimized for) multiple single- and heterogeneous-architecture platforms. Using a combination of optimized tools and libraries, the application’s native CUDA code was migrated to SYCL*, enabling it to run seamlessly on Intel® CPUs and GPUs.
The result: SeisAcoMod2D now comprises a single code base that runs on multiple architectures without losing performance. This was a perfect package for C-DAC: a single language with a boost in performance, without vendor lock-in.

Let’s walk through the steps, tools, and results.

Code Migration from CUDA to SYCL

SeisAcoMod2D has both CPU (C++) and GPU (CUDA) source. As a first step, the Intel® DPC++ Compatibility Tool (available as part of the Intel® oneAPI Base Toolkit) was used to migrate the CUDA source to SYCL. In this case, the Compatibility Tool achieved 100% migration in a very short time, completing the functional port of the seismic modeling code. Figure 1 and Figure 2 show snippets of the CUDA source before and the SYCL source after migration.

Figure 1: Snippet of CUDA source before migration
Figure 2: Snippet of SYCL source after migration

Due to the presence of multiple CUDA streams with asynchronous calls, the migrated code needed appropriately placed barrier/wait calls, or a single SYCL queue, to maintain data consistency. Incorporating these solutions resolved the correctness issue; the changes are shown in Figure 3 and Figure 4.

Figure 3: Placement of wait call

DPC++ Compatibility Tool migration from CUDA streams to SYCL queues:

User modification of multiple SYCL queues to single SYCL queue:

Figure 4: Single SYCL queue creation

The CUDA code of our open-source seismic modeling application, ‘SeisAcoMod2D’, was easily migrated to SYCL using SYCLomatic. The migrated code runs efficiently on the Intel® Data Center GPU Max Series and achieves competitive performance compared to currently available GPU solutions. As we look to the future, the combination of Intel® Xeon Max CPUs with High Bandwidth Memory plus the Intel Data Center GPU Max Series presents us with a seamless upgrade path, accelerating our applications without the need for code changes thanks to the oneAPI Toolkits.

– C-DAC, India

Code Optimization

As a next step, Intel® VTune™ Profiler was used to profile the kernels running on the GPU to identify bottlenecks and tuning opportunities. VTune Profiler supports GPU Offload and GPU Compute/Media Hotspots analyses, which help analyze the most time-consuming GPU kernels and identify whether the application is CPU- or GPU-bound.

Figure 5 shows the GPU Offload analysis result for the SYCL binary of SeisAcoMod2D with kernels running on an Intel GPU. As highlighted, the memset call performing the host-to-device memory transfer took the most time; memset uses a GPU's copy engine. It was replaced with a fill call, which serves the same purpose but uses a compute engine and is faster. This can be seen in Figure 6, where the time taken for the memory transfer was drastically reduced.

Figure 5: GPU offload analysis result of SYCL binary having memset function call.
Figure 6: GPU offload analysis result of SYCL binary having fill function call.

Performance Results

oneAPI, along with the Intel® Data Center GPU Max Series GPU, helped speed up C-DAC’s seismic modeling application run time by 7x compared to the CPU baseline and by ~1.75x compared to the NVIDIA A100* platform.
The CPU thread reads and moves data related to seismic source locations to GPU memory and calls the GPU kernels for computation. The GPU then computes the wavefield propagation forward in time. After finalization of the time-iteration loop, the CPU thread copies the computed synthetic seismogram from GPU memory to CPU memory and writes it to the file system.

Figure 7: Seismic workload execution time on Intel® Xeon® Platinum 8360Y CPU and Intel® Data Center GPU Max Series.

Workload:  seismic workload from C-DAC
Hardware configuration:

  • Intel® Data Center GPU Max Series (code name Ponte Vecchio), with 2 stacks, suitable for HPC and AI workloads. This GPU contains a total of 8 slices, 128 Xe-cores, 1024 Xe Vector Engines, 128 ray tracing units, 8 hardware contexts, 8 HBM2e controllers, and 16 Xe-Links.
  • Intel® Xeon® Platinum 8360Y CPU @ 2.40 GHz with 72 physical cores, 256 GB of DDR4 memory @ 3200 MT/s.
  • Intel® Xeon® CPU Max 9480 @ 1.90 GHz with 112 physical cores, 64 GB of HBM memory, 256 GB of DDR5 memory @ 4800 MT/s.
  • NVIDIA A100 with 80 GB HBM2e, base clock 1065 MHz, connected to an Intel® Xeon® Platinum 8360Y CPU.

Software configuration:

Operating system: RHEL 8

Compilers: Intel icx 2023.0, nvcc 11.7

Language and API: C, SYCL, CUDA C, OpenMP

Testing date: June 20, 2023

Figure 8: Seismic workload execution time on NVIDIA A100 and Intel® Data Center GPU Max Series.


Conclusion

Using Intel oneAPI tools made it easier to migrate CUDA source code to SYCL, which helped C-DAC overcome vendor lock-in for its optimized SeisAcoMod2D seismic modeling application and maintain a single code base across architectures. Intel VTune Profiler helped identify the bottlenecks in the SYCL binary running on the Intel GPU. Fixing this bottleneck gave the SYCL code on the Intel GPU a 7x performance improvement over the baseline CPU and a 1.75x improvement over native CUDA on an NVIDIA A100.

Explore More

Download the tools

Get the code samples [GitHub]



Learn about joining the UXL Foundation:

Join now