byteLAKE’s CFD Suite (AI-accelerated CFD) — recommended hardware for AI training at the Edge (part 2/3)

Jun 23, 2022

In part 1 of this miniseries, I explained how we picked the Lenovo ThinkEdge SE450 Edge Server (Product Guide, Press Release) powered by 2 NVIDIA A100 80GB Tensor Core GPUs (Learn More) as byteLAKE’s recommended hardware configuration for CFD Suite’s AI Training at the Edge.

So let’s go straight into our findings and benchmark results. In July/August 2022 I will update this blog post by embedding a complete report and will make it available for download through byteLAKE’s website as well.

Let me start by explaining how we set up the environment for benchmarking purposes.

Benchmark — environment

We examined the performance and memory requirements of byteLAKE’s CFD Suite, which uses AI to accelerate CFD simulations (hereinafter also referred to as the “framework”). The memory requirements analysis can be found in the full report; here I will focus mostly on performance. CFD Suite is based on a data-driven model. In the first step, we train the model on CFD simulations executed with a traditional CFD solver (historic simulations collected into a training dataset). Once trained, CFD Suite provides a prediction that significantly reduces the simulation’s execution time. In this benchmark, we focus on the AI training part, which requires highly parallel hardware to create an accurate model.

To train the model, we used a real-life scenario in which our framework takes 10 initial iterations generated by the CFD solver as input and returns the final (steady-state) iteration. Our dataset includes 50 such CFD simulations. From each simulation, the framework generates 20 different packages, each made of 10 input iterations and a single output iteration. As a result, we use 1000 packages (50 simulations x 20 packages), each containing 10 input iterations and 1 output iteration.
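To make the dataset arithmetic concrete, here is a minimal sketch in plain Python. The post does not say how the 20 packages are derived from a simulation, so the sliding window used below is purely an assumption, as are the helper name make_packages and the 100 iterations per toy simulation. Iteration labels stand in for full field snapshots.

```python
from typing import List, Tuple

N_SIMULATIONS = 50      # from the text
PACKAGES_PER_SIM = 20   # from the text
N_INPUT_ITERS = 10      # from the text
ITERS_PER_SIM = 100     # hypothetical: the text does not give this number

Package = Tuple[List[str], str]  # (10 input iterations, 1 output iteration)


def make_packages(iterations: List[str]) -> List[Package]:
    """Split one simulation into (inputs, output) packages.

    ASSUMPTION: package k takes iterations k .. k+9 as inputs (a sliding
    window) and the last (steady-state) iteration as the output. The real
    framework may select its inputs differently.
    """
    final_iteration = iterations[-1]
    return [
        (iterations[k:k + N_INPUT_ITERS], final_iteration)
        for k in range(PACKAGES_PER_SIM)
    ]


# Toy simulations: each iteration is represented by a label only.
simulations = [
    [f"sim{s}_it{i}" for i in range(ITERS_PER_SIM)]
    for s in range(N_SIMULATIONS)
]

dataset = [pkg for sim in simulations for pkg in make_packages(sim)]
print(len(dataset))  # 1000
```

Whatever the actual selection rule is, the count is fixed by the numbers in the text: 50 simulations x 20 packages = 1000 packages of 10 inputs and 1 output each.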

All the simulations are 3-dimensional and generated with the rhoSimpleFoam solver. rhoSimpleFoam is a steady-state solver for compressible, turbulent flow that uses the SIMPLE (Semi-Implicit Method for Pressure Linked Equations) algorithm. That means that a pressure equation is solved and the density is related to the pressure via an equation of state. We generated 3 different mesh configurations:

a mesh of 32 768 cells;
a mesh of 262 144 cells;
a mesh of 884 736 cells.
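As a side note, the three cell counts are perfect cubes, which would be consistent with uniform cubic grids, although the post does not state the mesh shape. The sketch below checks this and estimates the size of one scalar snapshot in float32, the data type listed in the software environment. The helper name cube_side is made up for this illustration.

```python
MESH_SIZES = [32_768, 262_144, 884_736]  # cell counts from the text
BYTES_PER_FLOAT32 = 4


def cube_side(cells: int) -> int:
    """Return n such that n**3 == cells, or raise if cells is not a cube."""
    n = round(cells ** (1.0 / 3.0))
    if n ** 3 != cells:
        raise ValueError(f"{cells} is not a perfect cube")
    return n


for cells in MESH_SIZES:
    side = cube_side(cells)
    # One value per cell for a single scalar field at a single iteration.
    snapshot_mib = cells * BYTES_PER_FLOAT32 / 2**20
    print(f"{cells:>7} cells = {side}^3, scalar snapshot = {snapshot_mib:.3f} MiB")
```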

Training our model produces a separate sub-model for each quantity used by the solver. There are 2 types of quantities: scalar quantities (pressure, temperature, …) and a vector quantity (velocity). The sub-models are trained sequentially, one by one, so for the purpose of this benchmark we analyze a single scalar quantity and a single vector quantity.

Hardware and software environment

This benchmark was executed on the Lenovo SE450 node, which is equipped with a single Intel Xeon Gold CPU and 2 x NVIDIA A100 Tensor Core GPUs. The performance results are also compared with a single node equipped with 2 x NVIDIA V100 Tensor Core GPUs. The server node includes 128GB of host memory. The platforms that we benchmarked are referred to below as:

A100 — NVIDIA A100 Tensor Core GPU with 80GB of GPU global memory;
V100 — NVIDIA V100 Tensor Core GPU with 16GB of GPU global memory;
CPU or Gold — Intel Xeon Gold 6330N CPU clocked 2.20GHz with 28 physical (56 logical) cores.

The software environment of the SE450 node includes:

Ubuntu 20.04.4 LTS (GNU/Linux 5.13.0-51-generic x86_64) OS;
NVIDIA Driver version: 510.73.05;
CUDA Version: 11.6;
cuDNN version: 8.4.0;
TensorFlow version: 2.9.0;
Keras (the Python deep learning API) version: 2.9.1;
Dataset: float32 data type (single-precision arithmetic).

Results

As we progressed with product development and related research, we significantly changed the underlying AI architectures to better address our clients’ and partners’ needs. This effort improved the accuracy of AI predictions and also brought better performance and more automation. Previously, CFD Suite’s users had to manually configure the number of iterations that the traditional CFD solver had to perform before AI could take over and generate the prediction. Now AI has taken over that task as well, and users no longer need to worry about how to calibrate CFD Suite for optimal results. That allowed us to implement a mechanism that finds the best tradeoff between performance and prediction accuracy, ultimately improving the overall quality of predictions.

Previously, the performance of CFD Suite’s AI Training phase could only be improved by adding more nodes within a multi-node HPC architecture. Thanks to our latest upgrade, it can now benefit from multiple NVIDIA GPU cards within a single node, a much-awaited feature for those who prefer such setups.

“We have significantly changed the architecture of the underlying AI within the CFD Suite. With the mechanisms like dynamic generating of the learning samples, we are now able to fully utilize multiple GPU cards within one node and provide better accuracy. Unlike in the previous versions, where CFD Suite’s AI training performance could only be increased by adding more nodes. Now we can greatly benefit from having more accelerators within a single node,” said DSc, PhD, CTO at byteLAKE.

The first test therefore compared the performance of AI Training on 1 vs. 2 A100 80GB GPU cards. Here are the results for a vector quantity; details about the scalars can be found in the full report.

1 vs. 2 A100 GPUs within a single edge server (SE450)

Performance comparison: one vs. two A100 GPUs (vector quantity training)

Conclusions for this part:

The speedup of 2xA100 over 1xA100 is stable across all 3 meshes and ranges from 1.76 to 1.87.
This gives an efficiency of up to 0.95, which confirms good scalability of the framework regardless of the type of quantity (scalar, vector).
The best performance is achieved with batch sizes of 32 and 16 for the meshes of 32768 and 262144 cells, respectively.
The tested scenario requires more than 128GB of host memory to train the network for the mesh of 884736 cells with the vector quantity.
Comparing this benchmark with our previous one (available here: https://marcrojek.medium.com/bytelakes-cfd-suite-ai-accelerated-cfd-hpc-scalability-report-25f9786e6123 or as the full report downloadable from byteLAKE’s website www.byteLAKE.com/en/CFDSuite here: https://www.bytelake.com/en/download/4013/), we observe that in the current version of byteLAKE’s CFD Suite scaling within a node pays off much more. This results from the fact that the current AI model is much more compute-intensive (it includes more layers). In this version of our framework, we also provided a mechanism that dynamically generates a set of training samples from a single input simulation, which improves the dataset size and reduces memory transfers.
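To give a feel for the host-memory point above, here is a rough back-of-the-envelope estimate. It is a naive sketch, not a description of how the framework stores data: it assumes all 1000 packages are held in host memory at once as raw float32 arrays, and that the vector quantity has 3 components (a 3D velocity). It ignores model weights, activations, and framework overhead. The function name naive_dataset_bytes is made up for this illustration.

```python
PACKAGES = 1000             # 50 simulations x 20 packages (from the text)
ITERS_PER_PACKAGE = 10 + 1  # 10 input iterations + 1 output iteration
BYTES_PER_FLOAT32 = 4       # float32 dataset (from the text)


def naive_dataset_bytes(cells: int, components: int) -> int:
    """Raw size of all packages if every one is fully materialized.

    ASSUMPTION: no sharing of iterations between packages and no
    compression; the actual framework may need far less (or more).
    """
    return PACKAGES * ITERS_PER_PACKAGE * cells * components * BYTES_PER_FLOAT32


for cells in (32_768, 262_144, 884_736):
    # ASSUMPTION: scalar = 1 component, vector = 3 components.
    for name, components in (("scalar", 1), ("vector", 3)):
        gigabytes = naive_dataset_bytes(cells, components) / 1e9
        print(f"{cells:>7} cells, {name}: {gigabytes:6.1f} GB")
```

Under these assumptions the largest vector case comes to roughly 117 GB of raw samples, which is already close to the 128GB of host memory before anything else is counted.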

Further tests, in which we compared the performance and memory requirements of vector and scalar quantities, showed that both are distributed efficiently across 2 x A100 GPUs, with an efficiency of up to 0.95.
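For readers less familiar with the terms, parallel efficiency is usually defined as the speedup divided by the number of devices. The short sketch below applies that standard definition to the numbers quoted above; the function names are made up for this illustration.

```python
def parallel_efficiency(speedup: float, n_devices: int) -> float:
    """Efficiency = speedup / number of devices (1.0 means perfect scaling)."""
    return speedup / n_devices


def speedup_for_efficiency(efficiency: float, n_devices: int) -> float:
    """Inverse relation: the speedup that corresponds to a given efficiency."""
    return efficiency * n_devices


N_GPUS = 2
for speedup in (1.76, 1.87):  # vector-quantity range quoted in the text
    eff = parallel_efficiency(speedup, N_GPUS)
    print(f"speedup {speedup:.2f}x on {N_GPUS} GPUs -> efficiency {eff:.3f}")

needed = speedup_for_efficiency(0.95, N_GPUS)
print(f"efficiency 0.95 on {N_GPUS} GPUs -> speedup {needed:.2f}x")
```

With this definition, the vector-quantity speedups of 1.76 and 1.87 correspond to efficiencies of 0.88 and 0.935, while an efficiency of 0.95 on 2 GPUs corresponds to a speedup of 1.9x.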

Platforms comparison: A100 vs. V100 vs. CPU

The performance comparison between A100, V100, and CPU gold for a mesh of size 32768 is listed below (the lower the better):

Performance comparison. Mesh size: 32 768. Training.

The performance comparison between A100, V100, and CPU gold for a mesh of size 262144 is listed below (the lower the better):

Performance comparison. Mesh size: 262 144. Training.

The performance comparison between A100, V100, and CPU gold for a mesh of size 884736 is listed below (the lower the better):

Performance comparison. Mesh size: 884 736. Training.

Conclusions for this part:

Each platform was measured with a configuration individually optimized for each mesh size.
For some batch sizes the V100 GPU outperforms the A100 (1.5x speedup for a batch of size 1), which results from the fact that the A100 needs more parallel tasks to be fully utilized.
Overall, with the optimized configs, the A100 GPU is ~1.8x faster than the V100 GPU.
With the optimized configs, the A100 GPU gives a speedup from 5.9x to 10.9x over the CPU Gold, while the V100 is from 3.3x to 6x faster. This is as expected and in line with our previous benchmarks, where we concluded that GPUs were preferred for AI Training workloads.
Using an entire SE450 node, 2 x A100 gives from 10.6x to 19.3x speedup over the CPU Gold, while the 2 x V100 node is from 6x to 10.7x faster than the CPU Gold.
The larger the mesh, the higher the speedup achieved by the GPUs over the CPU platform.
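The reported ranges can be cross-checked against each other with simple ratios. The sketch below rests on two assumptions that the post does not confirm: that the lower and upper ends of each range belong to the same mesh on every platform, and that the 6x to 10.7x figure refers to two V100 GPUs. The helper name ratio is made up for this illustration.

```python
from typing import Tuple

Range = Tuple[float, float]  # (lowest, highest) speedup over the CPU

# Speedups over the CPU as quoted in the text.
A100_1X: Range = (5.9, 10.9)
V100_1X: Range = (3.3, 6.0)
A100_2X: Range = (10.6, 19.3)
V100_2X: Range = (6.0, 10.7)  # ASSUMPTION: this figure is for 2 x V100


def ratio(a: Range, b: Range) -> Tuple[float, ...]:
    """Divide two ranges end by end (ASSUMES matching ends = same mesh)."""
    return tuple(round(x / y, 2) for x, y in zip(a, b))


print("A100 vs V100 (single GPU):", ratio(A100_1X, V100_1X))  # (1.79, 1.82)
print("2 x A100 vs 1 x A100:     ", ratio(A100_2X, A100_1X))  # (1.8, 1.77)
print("2 x V100 vs 1 x V100:     ", ratio(V100_2X, V100_1X))  # (1.82, 1.78)
```

Under these assumptions the ratios land close to the ~1.8x A100-over-V100 figure and to the 1.76 to 1.87 range reported for two GPUs over one.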

Key Takeaways

Lenovo ThinkEdge SE450 Edge Server (Product Guide, Press Release) powered by 2 NVIDIA A100 80GB Tensor Core GPUs (Learn More) is byteLAKE’s recommended hardware configuration to perform CFD Suite’s AI Training at the Edge.
The A100 GPU turned out to be ~1.8x faster than the V100 GPU in the scenarios benchmarked by byteLAKE and described in this report.
The performance of CFD Suite’s AI Training improves when we add more NVIDIA GPUs per node. The speedup of 2xA100 over 1xA100 was stable across all benchmarked meshes and ranged from 1.76 to 1.87.
The efficiency of the AI Training was up to 0.95, which confirmed the good scalability of CFD Suite.
The SE450 node powered by 2 x A100 gave from 10.6x to 19.3x speedup over the CPU Gold, while the 2 x V100 node was from 6x to 10.7x faster than the CPU Gold. The larger the mesh, the higher the speedup achieved by the GPUs over the CPU platform. Again, the results are based on the scenarios described in this report.
In theory, we are able to train the model for a mesh of 21 952 000 cells using a single A100 80GB GPU. This is based on the current architecture of byteLAKE’s CFD Suite and, as research in that space is ongoing, this will change in the future.

Continue: part 3.

Download the full report: https://www.bytelake.com/en/download/4018/.