DP, Alibaba Cloud, and Intel: A Winning Solution to Maximize Computation Performance

Story at a Glance

  • Alibaba Cloud, founded in 2009, is a global leader in cloud computing and artificial intelligence, providing services to thousands of enterprises, developers, and government organizations in more than 200 countries and regions.
  • Alibaba Cloud Elastic High-Performance Computing (E-HPC) is a cloud-native, full-stack, high-performance computing PaaS platform in China. It provides users with one-stop public cloud HPC services. It is fast, flexible, and secure, and supports interoperability with other Alibaba Cloud products.
  • DP Technology, Ltd., founded in 2018, tackles challenges in many industries at the microscopic level via innovative molecular simulation technologies.
  • DP Technology is a global pioneer in artificial intelligence and molecular simulation algorithms and is building a simulation platform that can dramatically increase the productivity of research labs, from pharmaceutical and materials industries to academia. Its cloud platform relies on Alibaba’s E-HPC products.
  • DP needed to run LAMMPS workloads, which are particularly challenging due to their inherently complex simulations and changing dynamics.
  • Using the combination of Alibaba’s E-HPC Cloud Service and Intel® hardware and oneAPI software, DP Technology achieved a performance improvement of about 45.2%, measured as the reduction in LAMMPS execution time.

“With oneAPI support DP successfully carried out molecular dynamics simulation of millions of atoms on E-HPC.”

– Linfeng Zhang, Co-founder & Chief Scientist, DP Technology

“E-HPC provides individual users, education and research institutions, and public institutions with a fast, elastic, and secure cloud compute platform. With the Intel oneAPI toolkit, E-HPC can help customers build a high-performance, profiling computing platform on Intel Xeon scalable processors.”

– Jinzhong Tao, HPC Solution Product Manager, Alibaba Cloud

The Challenge:

DP Technology needed to run LAMMPS workloads. LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is used for simulating materials such as nanoclusters, metals, and polymers, as well as dynamic processes such as growth, cutting, and fusion.

LAMMPS workloads are particularly challenging due to their inherently complex simulations and changing dynamics. They often require individualized optimization to achieve best performance.

The Solution: Using oneAPI to Profile and Maximize Computation Performance

E-HPC Cloud Service provided a rich array of instances, enabling DP to choose the ones most suitable for its LAMMPS workloads and delivering more compute-resource flexibility than other cloud or on-premises options. As a result, Alibaba’s E-HPC Cloud Service eliminated DP’s need to build an on-premises HPC center. Additionally, DP used Intel® oneAPI tools to recompile its code and then used hotspot analysis to improve performance further.

In collaboration with Intel, E-HPC provides users an end-to-end HPC development environment (as shown in Figure 1) and a high-performance, cross-architecture E-HPC computation platform using Intel® Xeon® Processors. Complex, performance-dependent workloads like LAMMPS can benefit greatly from further optimization. Intel makes several tools available. In this instance, DP found three that directly helped improve their LAMMPS performance:

  1. Intel® oneAPI DPC++/C++ Compiler
  2. Intel® MPI Library
  3. Intel® VTune™ Profiler (Intel’s performance analysis tool)

Using oneAPI on E-HPC

Using the E-HPC client, DP was able to manage jobs and files, program in the integrated development environment (IDE), and visualize calculation and analysis results. Additionally, DP used Intel VTune Profiler to locate the most time-consuming parts of the code and identify the most significant issues affecting application performance. This enabled DP engineers to visualize the performance of their code and remove bottlenecks in just a few simple steps, as shown in Figure 1.

  1. Open E-HPC client and choose session
  2. Launch VNC session
  3. Browse E-HPC workspace and open VTune analysis results
  4. Launch VTune in E-HPC workspace

Figure 1. A complete end-to-end HPC development environment provided by E-HPC

The Results: Significant Performance Improvements

Using the Intel oneAPI DPC++/C++ Compiler and Intel MPI Library, the workload achieved a 16.2% performance improvement over the open source GCC and MPICH baseline. After fine-tuning the process and thread combinations, the total improvement reached roughly 45.2% relative to the same baseline, as shown in Figure 2.

The following sections walk through, step by step, the work that produced the total performance improvement shown in Figure 2.

Figure 2. Step-by-step performance improvement comparison (see configuration details below)

Using the Intel oneAPI DPC++/C++ Compiler and Intel MPI Library to accelerate LAMMPS, the computation time was reduced from 16 minutes and 22 seconds to 13 minutes and 43 seconds (Table 1) compared to the GCC and MPICH baseline, a roughly 16.2% performance improvement. The workload achieved that improvement in two steps:

  1. Replacing GCC with the Intel oneAPI DPC++/C++ Compiler, while keeping the open source MPICH library, reduced the LAMMPS execution time from 16 minutes and 22 seconds to 14 minutes and 48 seconds, a roughly 9.6% performance improvement.
  2. Replacing MPICH with the Intel MPI Library, while keeping the same Intel oneAPI DPC++/C++ Compiler, reduced the execution time from 14 minutes and 48 seconds to 13 minutes and 43 seconds, a roughly 7.3% additional performance improvement.

Table 1. LAMMPS execution with 32 processes: different compilers with the same MPICH library
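
The percentages quoted for the two steps can be reproduced from the reported execution times. The following is a minimal sketch, not part of the original evaluation. The helper names `to_seconds` and `reduction_pct` are hypothetical, and the sketch assumes that "performance improvement" means the relative reduction in execution time, which is consistent with the figures quoted in the text.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


def reduction_pct(before_s: float, after_s: float) -> float:
    """Relative reduction in execution time, in percent (assumed metric)."""
    return (before_s - after_s) / before_s * 100


# Execution times reported in the text for 32 processes.
gcc_mpich = to_seconds(16, 22)    # GCC + MPICH baseline
intel_mpich = to_seconds(14, 48)  # Intel compiler + MPICH
intel_impi = to_seconds(13, 43)   # Intel compiler + Intel MPI Library

step1 = round(reduction_pct(gcc_mpich, intel_mpich), 1)
step2 = round(reduction_pct(intel_mpich, intel_impi), 1)
combined = round(reduction_pct(gcc_mpich, intel_impi), 1)

print(f"Compiler change: {step1}%")
print(f"MPI change:      {step2}%")
print(f"Both changes:    {combined}%")
```

Expressed as a speedup factor instead of a time reduction, the same two changes correspond to 982 s / 823 s, or about 1.19x.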

Removing Bottlenecks

Intel VTune Profiler is an advanced performance-analysis tool used to optimize application performance, system performance, and system configuration for HPC, cloud, IoT, media, storage, and more on Intel® CPUs, GPUs, and FPGAs. It is part of the Intel® oneAPI Base Toolkit and supports important analysis types such as hotspot analysis and microarchitecture exploration.

From the top hotspots shown in Figures 3 and 4, we observe that PMPI_Wait is the most time-consuming hotspot, using about 27 seconds of CPU execution time.

Figure 3. Top hotspots

Hotspot execution time by function for 32 processes

Figure 4. Hotspot functions/call stack

After fine-tuning, Intel VTune Profiler shows that the CPU execution time of PMPI_Wait is reduced to about 11 seconds (Figures 5 and 6).

Figure 5. Top hotspots

Figure 6. Hotspot functions/call stack

The execution time of the workload was further reduced by fine-tuning the combination of processes and threads. For the current workload, we increased the thread count per process from 1 to 2 and reduced the process count from 32 to 16. After several rounds of fine-tuning, we obtained the comparison data shown in Table 2. As a result, the execution time was reduced from 13 minutes and 43 seconds to 8 minutes and 58 seconds, a performance improvement of about 34.6%.

Table 2. LAMMPS computing with different process and thread combinations
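
The configuration details list three process/thread layouts (32P1T, 16P2T, 8P4T), each of which fills the 32 vCPUs of the test instance. As a rough sketch of how such candidates might be enumerated before benchmarking, the hypothetical helper below lists layouts whose process count times thread count equals the vCPU count. Restricting thread counts to powers of two up to 4 is an assumption made here to match the listed layouts; the source does not say how the candidates were chosen.

```python
def process_thread_layouts(total_cpus: int, max_threads: int) -> list[tuple[int, int]]:
    """Return (processes, threads) pairs that exactly fill total_cpus.

    Thread counts are limited to powers of two up to max_threads
    (an assumption for this sketch, not a rule from the source).
    """
    layouts = []
    threads = 1
    while threads <= max_threads:
        if total_cpus % threads == 0:
            layouts.append((total_cpus // threads, threads))
        threads *= 2
    return layouts


def layout_label(processes: int, threads: int) -> str:
    """Format a layout in the 'NPnT' style used in the configuration details."""
    return f"{processes}P{threads}T"


labels = [layout_label(p, t) for p, t in process_thread_layouts(32, 4)]
print(labels)
```

Each candidate layout still has to be benchmarked on the actual workload. The text reports 8 minutes and 58 seconds for 16 processes with 2 threads each.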

After all steps were complete, the total execution time was reduced from a starting point of 16 minutes and 22 seconds to 8 minutes and 58 seconds, a total performance improvement of roughly 45.2%, shown as the green bar in Figure 2.
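
The per-step improvements (roughly 9.6%, 7.3%, and 34.6%) do not sum to 45.2% because each one is measured against the execution time of the previous step, so the gains compound rather than add. The sketch below illustrates this with the times reported in the text; the variable names are hypothetical and the calculation is an illustration, not part of the original evaluation.

```python
from math import prod

# Execution times in seconds after each stage, taken from the text:
# GCC + MPICH, Intel compiler + MPICH, Intel compiler + Intel MPI Library,
# and the tuned process/thread layout.
times_s = [982, 888, 823, 538]

# Fraction of the previous stage's time that remains after each step.
remaining = [after / before for before, after in zip(times_s, times_s[1:])]

# Per-step reductions, each relative to the preceding stage.
step_reductions = [round((1 - r) * 100, 1) for r in remaining]

# Multiplying the remaining fractions gives the fraction of the baseline left.
total_reduction = round((1 - prod(remaining)) * 100, 1)

# Simply adding the step percentages overstates the total.
naive_sum = round(sum(step_reductions), 1)

print("Per-step reductions (%):", step_reductions)
print("Compounded total (%):   ", total_reduction)
print("Naive sum (%):          ", naive_sum)
```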

DP, Alibaba Cloud, and Intel: A Winning Solution

The Intel oneAPI HPC Toolkit (HPC Kit) helps users in the scientific community analyze and optimize HPC application software quickly and efficiently, which speeds up computation and improves application performance. In the case of the molecular dynamics software LAMMPS, DP Technology, using the E-HPC platform from Alibaba Cloud, achieved a performance improvement of about 45.2% with the help of the HPC Kit. E-HPC customers get a cloud-native, full-stack, high-performance computing PaaS service that is fast, flexible, and secure, and that supports interoperability. In addition, oneAPI provides an excellent path to better performance on Intel platforms, in the cloud and beyond.


Configuration Details

Data source from DP – E-HPC Internal Evaluation.

Testing Date: Performance results are based on testing by Alibaba as of July 19, 2022. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Configuration Details and Workload Setup: 3rd Gen Intel® Xeon® Scalable processors 8369B CPU @ 2.70 GHz, 32 vCPUs, 64 GB memory, 40 GB ESSD. LAMMPS configuration file: LAMMPS default configuration file. Release: 23 Jun 2022. Iteration count: 2M. Number of test processes and threads: 32P1T, 16P2T, 8P4T. Compilers compared: GCC 10.2, Intel® oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123). Performance evaluation indicator: execution time.