byteLAKE’s CFD Suite (AI-accelerated CFD) — HPC scalability report
My team at byteLAKE has analyzed the performance of byteLAKE’s CFD Suite (Computational Fluid Dynamics accelerated with Artificial Intelligence), an HPC application, across various configurations, including a single server node with 2 Intel CPUs and 2 NVIDIA GPUs, and multiple nodes of a CPU-only HPC cluster. The report below summarizes CFD Suite’s scalability with Intel technologies for AI model training and OpenVINO-optimized inferencing.
For those unfamiliar with CFD Suite: it is a collection of innovative AI (Artificial Intelligence) models for Computational Fluid Dynamics (CFD) acceleration. The product was developed by byteLAKE, and you can learn more about it on the product website at https://www.bytelake.com/en/CFDSuite.
CFD Suite is a prime example of AI (Artificial Intelligence) and HPC (High Performance Computing) converging into a complete solution, turning simulations into predictions and delivering immediate or ultra-fast results. Each AI model we add to the collection has been carefully designed, trained, and calibrated by a team of world-class researchers and programmers (PhD, DSc), industry leaders (e.g. Tridiagonal Solutions, OpenFOAM experts), and various byteLAKE partners such as Lenovo and Intel. CFD Suite has also been optimized for various hardware configurations, ensuring maximum performance across desktop PCs, server nodes, and HPC/multi-node/cluster architectures.
Hardware configurations we tested
byteLAKE has collaborated with Lenovo (Lenovo Infrastructure Solutions Group), Intel, and Tridiagonal Solutions to perform the performance analysis of the CFD Suite across the following configurations:
1. Single HPC node: 2x Intel Xeon Gold 6148 CPU @ 2.40 GHz, 2x NVIDIA V100 16 GB, and 400 GB RAM (Gold and V100, respectively)
2. Intel CPU-only HPC cluster, BEM supercomputer (860 TFLOPS): 22,000 cores; 724 nodes; 1,600 Intel Xeon E5-2670 CPUs @ 2.30 GHz, 12 cores each; 74.6 TB RAM (BEM for the cluster, E5-2670 for a single node)
3. Desktop platform: Intel Core i7-3770 CPU @ 3.40 GHz, 4 cores (Core-i7 or i7) + NVIDIA GeForce GTX TITAN GPU (TITAN)
4. Single HPC node: Intel Xeon E5-2695 CPU @ 2.30 GHz, 12 cores (E5-2695)
Abbreviations in parentheses are used throughout the rest of this document when referring to a particular platform.
CFD Simulation used for benchmarking
The investigated phenomenon is chemical mixing. The considered CFD simulations belong to the group of steady-state simulations and utilize the MixIT tool from Tridiagonal Solutions, which is based on the open-source OpenFOAM CFD toolbox.
AI Models Training (benchmark)
The test aimed to validate the scalability of the CFD Suite’s training module, built on Horovod-based distributed parallelization, across various hardware configurations. We utilized 12 cores per BEM cluster node, 4 cores for i7, and 20 cores per Gold CPU. We also verified the performance of shared-memory and distributed-memory parallelization.
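Horovod itself is a heavyweight dependency, but the core idea behind its data-parallel training, averaging gradients across workers via an allreduce, can be sketched with plain NumPy. The worker count, gradient shapes, and learning rate below are illustrative assumptions, not values from the report:

```python
import numpy as np

def allreduce_average(worker_grads):
    """Average gradients across workers, mimicking the result of
    Horovod's ring-allreduce in data-parallel training."""
    return np.mean(worker_grads, axis=0)

# Hypothetical setup: 4 workers, each computing gradients on its own data shard.
rng = np.random.default_rng(0)
worker_grads = [rng.standard_normal(3) for _ in range(4)]

avg_grad = allreduce_average(worker_grads)

# Every worker applies the same averaged gradient, keeping model replicas in sync.
weights = np.zeros(3)
lr = 0.01
weights -= lr * avg_grad
```

In real Horovod runs the averaging happens across processes (one per node or GPU) rather than in a single-process list, but the synchronization semantics are the same.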
The CFD models are generally memory-bound algorithms. Here we use 3D meshes. We observed that:
· There is no meaningful speedup within a single node (1 Gold, 20 cores vs. 2 Golds, 40 cores): OpenMP (shared-memory) parallelization improves performance only by a factor of ~1.11x when using more than 20 threads, because the training is not compute-intensive enough.
· There is also no big difference between a single V100 and the TITAN, for the same reason.
· The performance improvement is much better for distributed training (1xV100 vs. 2xV100, or BEM with up to 64 nodes).
· The cluster implementation (BEM), based on the Horovod framework, outperforms a single V100 with 8 nodes, and 16 nodes are 1.3x faster than 2xV100. Comparing the Gold results with a single BEM node (one E5-2670 CPU), we estimate that a cluster of 8 Golds would achieve results comparable to 2xV100.
· The benchmark shows a time-to-results reduction by a factor of 48 on 64 nodes.
· Parallel efficiency remains stable (>90% up to 8 nodes, ~70% up to 64 nodes).
· Distributed communication has a low impact on performance for all three AI models.
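The efficiency figures above follow directly from the measured speedups. A minimal check of the 64-node data point (time-to-results reduced 48x) can be written as:

```python
def parallel_efficiency(speedup, nodes):
    """Parallel efficiency = achieved speedup / ideal (linear) speedup."""
    return speedup / nodes

# Report's 64-node data point: time to results reduced by a factor of 48.
eff_64 = parallel_efficiency(48, 64)
print(f"Efficiency at 64 nodes: {eff_64:.0%}")  # 48/64 = 75%, in the same range as the ~70% reported
```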
CFD Suite predictions — AI Models Inferencing (benchmark)
The AI-accelerated simulation (prediction/inferencing) consists of several steps. First, a fraction of the simulation, e.g. the first 10%, is executed with the conventional CFD solver. CFD Suite then predicts the remaining 90%.
To predict the results of a conventional CFD solver with AI, byteLAKE’s CFD Suite performs the following steps:
· data import from the conventional CFD solver (as indicated above, just a fraction of the initial results)
· data normalization
· inferencing with the trained AI models
· data export to the conventional solver’s format (so that the output is ready for analysis by existing CAE tools).
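The four steps above can be sketched end-to-end. The normalization scheme, field values, and the model itself are placeholders of my own (CFD Suite’s actual data formats and trained models are proprietary and not described in the report):

```python
import numpy as np

def normalize(field):
    """Min-max normalize a solver field to [0, 1] (assumed preprocessing)."""
    lo, hi = field.min(), field.max()
    return (field - lo) / (hi - lo), (lo, hi)

def denormalize(field, bounds):
    """Map a normalized field back to the solver's physical units."""
    lo, hi = bounds
    return field * (hi - lo) + lo

def predict_stub(field):
    """Placeholder for the trained AI model's inference step.
    Identity stand-in; the real model predicts the converged state."""
    return field

# 1. Import a fraction of the conventional solver's results (synthetic 3D field here).
partial_result = np.random.default_rng(1).uniform(280.0, 350.0, size=(8, 8, 8))

# 2. Normalize, 3. run inference, 4. denormalize/export back to solver units.
norm, bounds = normalize(partial_result)
predicted = denormalize(predict_stub(norm), bounds)
```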
The results we achieved are as follows (CFD Suite in ACCURATE mode):
· Inferencing is up to 9.5x faster on Gold (with CFD Suite optimized using OpenVINO) than on V100.
· In ACCURATE mode, 90% of the simulation (pure AI inferencing, without any overhead) is predicted 111x faster than the CFD solver would compute it, which amounts to a 9.25x end-to-end speedup once the 10% spent in the CFD solver is included.
· CFD Suite accelerates time to results for conventional CFD solvers by a factor of at least 10x while keeping accuracy at 93% or higher.
· byteLAKE constantly adds new AI models to CFD Suite, gradually increasing the number of CFD simulations it can handle off-the-shelf. To that end, byteLAKE collaborates with a growing number of industry leaders.
· CFD Suite is an add-on to existing CAE/CFD tools and its integration is a straightforward process.
· CFD Suite is a data-driven solution. Therefore, past simulations done by conventional CFD solvers are required to train its AI models so that they can predict the results.
· CFD Suite is a scalable solution, and we observed a stable efficiency across cluster nodes.
· 8 nodes of the presented HPC cluster (BEM) can outperform a single V100 GPU for training, and 16 nodes were 1.3x faster than 2xV100. Comparing the Gold results with a single BEM node (one E5-2670 CPU), we estimate that a cluster of 8 Golds would achieve results comparable to 2xV100.
· Memory requirements for CFD Suite training grow linearly with the number of mesh cells. Inferencing (prediction) can run on typical desktop configurations.
· The inference process runs up to 9.5x faster on Intel Xeon Gold (with CFD Suite optimized using OpenVINO) than on a V100 GPU.
· With AI models in ACCURATE mode on 2x Xeon Gold CPUs, 90% of the simulation is predicted 111x faster than the CFD solver would compute it (9.25x end-to-end, including the 10% spent in the CFD solver).
Read the full report here: https://www.slideshare.net/byteLAKE/bytelakes-cfd-suite-aiaccelerated-cfd-hpc-scalability-report-april21
Note: The report was published in April 2021. Check our website for the latest information: https://www.byteLAKE.com/en/CFDSuite
Learn more about CFD Suite by visiting our website www.byteLAKE.com/en/CFDSuite. We constantly update the product with new features, so follow our blog post series where we share the latest updates. Also, consider joining the dedicated LinkedIn and Facebook groups to stay in touch with the CFD Suite community. You might also be interested in our panel discussion about how CFD Suite has been successfully used in the CFD/chemical mixing space.
Reach out to us to get access to the on-demand recording (CFDSuite@byteLAKE.com).
- All blog posts in the series: www.byteLAKE.com/en/AI4CFD-toc