Computational fluid dynamics (CFD) solvers are computationally expensive because they require high grid resolutions to capture flow physics accurately, so parallelization is necessary to reduce simulation time. In this work, the performance of a Lattice Boltzmann flow solver was optimized on an Intel® Xeon® CPU, resulting in a 6X performance gain.

Analysis tools such as Intel® Advisor and Intel® VTune™ Profiler were used to guide the optimizations. The code was then ported to Intel® oneAPI using OpenMP-based offload to enable execution on Intel Xe GPUs, and the same analysis tools were used to guide the GPU porting and optimizations.

Significant performance gains were achieved by optimizing data allocations and data movements between the host and the device. Multiple schemes with varying degrees of offload were also tested. Performance on Xe GPUs was found to be better than on Xeon platforms for medium to large grids. The effect of further data transfer optimizations, GPU tile scaling, and process scaling will also be presented.
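
As an illustration of the kind of host-device data movement optimization mentioned above, the following is a minimal sketch in C with OpenMP offload directives. It is not taken from the solver itself: the function name `relax_on_device`, its parameters, and the simplified relaxation update (loosely modeled on a BGK-style collision step) are assumptions made for the example. The idea is to map the arrays once around the whole time loop so they remain resident on the device, rather than being transferred at every step.

```c
#include <stddef.h>

/* Hypothetical kernel, not the solver's actual code: relax each value
   toward an equilibrium value, f += omega * (feq - f). */
void relax_on_device(double *f, const double *feq, size_t n,
                     double omega, int steps)
{
    /* Map the arrays once for the whole time loop. f is copied to the
       device on entry and back to the host on exit; feq is copied to the
       device only. No per-step transfers are needed in between. */
    #pragma omp target data map(tofrom: f[0:n]) map(to: feq[0:n])
    {
        for (int t = 0; t < steps; ++t) {
            /* Each step runs on the device and reuses the data that the
               enclosing target data region already placed there. */
            #pragma omp target teams distribute parallel for
            for (size_t i = 0; i < n; ++i) {
                f[i] += omega * (feq[i] - f[i]);
            }
        }
    }
}
```

If the code is built without OpenMP support, the directives are ignored and the same loop runs serially on the host, which makes it straightforward to compare results between the host and the offloaded version.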