Accelerating CFD with FPGA

Let me start this post by saying that it is going to be looong… On the flip side, however, I will summarize my team's current research & development efforts related to accelerating CFD simulations.

Some background though…

The team at byteLAKE has created a set of highly optimized CFD kernels that leverage the speed and energy efficiency of Xilinx Alveo FPGA accelerator cards, forming a high-performance platform for complex engineering analysis.

The kernels can be directly adapted to geophysical models such as the EULAG (Eulerian/semi-Lagrangian) fluid solver, designed to simulate all-scale geophysical flows.

The algorithms have been extended with additional quantities such as forces (implosion, explosion) and density vectors. In addition, they allow users to fully configure the boundary conditions (periodic, open).

What is CFD then?

CFD (Computational Fluid Dynamics) tools combine numerical analysis and algorithms to solve fluid flow problems. A range of industries such as automotive, chemical, aerospace, biomedical, power and energy, and construction rely on fast CFD analysis turnaround times. CFD is a key part of their design workflows, used to understand and design how liquids and gases flow and interact with surfaces.

Typical applications include weather simulations, aerodynamic characteristics modelling and optimization, and petroleum mass flow rate assessment.

Why does acceleration matter?

The ever-increasing demand for accuracy and capability in CFD workloads drives exponential growth in the required computational resources. Moving to heterogeneous HPC (High Performance Computing) configurations powered by Xilinx Alveo significantly improves performance: you get the results faster and within a radically reduced energy budget. Both of these factors help you drive the TCO down.

Kernels we adapted and optimized

First-order step of the non-linear iterative upwind advection MPDATA (Multidimensional Positive Definite Advection Transport Algorithm) scheme.

Computation of the pseudo-velocity for the second pass of the upwind algorithm in MPDATA.

Divergence part of the matrix-free linear operator formulation in the iterative Krylov scheme.

Tridiagonal Thomas algorithm for vertical matrix inversion inside the preconditioner for the iterative solver. The preconditioner operates on the diagonal part of the full linear problem. Effective preconditioning lies at the heart of multiscale flow simulation, including a broad range of geoscientific applications.
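To make the last of these kernels more concrete, here is a minimal, generic C++ sketch of the Thomas algorithm for a single vertical column. It only illustrates the method named above; the function name, array layout, and double precision are assumptions, not byteLAKE's actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Generic Thomas algorithm for a tridiagonal system:
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]
// Coefficient vectors are taken by value so the forward sweep can modify
// its own copies. Inside the preconditioner, this would be applied
// column-by-column along the vertical dimension of the 3D domain.
std::vector<double> thomas_solve(std::vector<double> a,  // sub-diagonal (a[0] unused)
                                 std::vector<double> b,  // main diagonal
                                 std::vector<double> c,  // super-diagonal (c[n-1] unused)
                                 std::vector<double> d)  // right-hand side
{
    const std::size_t n = b.size();
    // Forward elimination.
    for (std::size_t i = 1; i < n; ++i) {
        const double w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    // Back substitution.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; ) {
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    }
    return x;
}
```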

Quickly about the results so far…

Optimizing CFD codes for Alveo

The goal of the work was to adapt 4 CFD kernels to the Alveo U250 FPGA. All the kernels use a 3-dimensional compute domain consisting of 7 (Thomas) to 11 (pseudo-velocity) arrays. The computations are performed in a stencil fashion: computing a single element of the compute domain requires access to its neighboring elements. Since all the kernels belong to the group of memory-bound algorithms, our main challenge was to achieve the highest possible utilization of the global memory bandwidth.
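For readers unfamiliar with stencil codes, below is a tiny, hypothetical C++ sketch of a 7-point stencil sweep over a 3D domain. The coefficients and names are made up (this is not the MPDATA formula); the point is simply that every output element needs several neighbors, so very little arithmetic happens per byte loaded, which is what makes these kernels memory bound.

```cpp
#include <cstddef>
#include <vector>

// Illustrative 7-point stencil over a 3D domain stored as a flat array.
// Each output cell reads its six face neighbors; the low arithmetic
// intensity per loaded byte is the classic memory-bound pattern.
void stencil_sweep(const std::vector<float>& in, std::vector<float>& out,
                   std::size_t nx, std::size_t ny, std::size_t nz)
{
    auto idx = [=](std::size_t i, std::size_t j, std::size_t k) {
        return (k * ny + j) * nx + i;
    };
    for (std::size_t k = 1; k + 1 < nz; ++k)
        for (std::size_t j = 1; j + 1 < ny; ++j)
            for (std::size_t i = 1; i + 1 < nx; ++i)
                out[idx(i, j, k)] =
                    0.4f * in[idx(i, j, k)] +
                    0.1f * (in[idx(i - 1, j, k)] + in[idx(i + 1, j, k)] +
                            in[idx(i, j - 1, k)] + in[idx(i, j + 1, k)] +
                            in[idx(i, j, k - 1)] + in[idx(i, j, k + 1)]);
}
```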

The Alveo U250 FPGA provides 4 global memory banks, each connected to a single super logic region (SLR). To address this design feature, the compute domain was divided into 4 sub-domains, each assigned to a separate memory bank. Each kernel was distributed across 4 compute units, each placed in a different SLR. In this way, memory transfers between the global memory and the compute units occurred only between connected SLR/memory-bank pairs.
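A rough host-side sketch of that decomposition could look like the following. It only computes the four sub-domain slabs (split along the outermost dimension, with one-plane halos); the actual binding of each buffer to a DDR bank and of each compute unit to an SLR happens in the Vitis linker configuration and is not shown here. The names and the one-plane halo depth are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>

// One sub-domain per DDR-bank/SLR pair (4 of them on the Alveo U250).
struct SubDomain {
    std::size_t z_begin;  // first owned plane (inclusive)
    std::size_t z_end;    // last owned plane (exclusive)
    std::size_t halo_lo;  // halo planes needed from the previous sub-domain
    std::size_t halo_hi;  // halo planes needed from the next sub-domain
};

// Split the outermost (z) dimension into 4 nearly equal slabs.
// Illustrative only: the real partitioning and bank mapping may differ.
std::array<SubDomain, 4> decompose(std::size_t nz)
{
    constexpr std::size_t banks = 4;
    std::array<SubDomain, banks> parts{};
    const std::size_t base = nz / banks;
    const std::size_t rem  = nz % banks;
    std::size_t z = 0;
    for (std::size_t b = 0; b < banks; ++b) {
        const std::size_t len = base + (b < rem ? 1 : 0);
        parts[b].z_begin = z;
        parts[b].z_end   = z + len;
        parts[b].halo_lo = (b == 0) ? 0 : 1;           // one halo plane below
        parts[b].halo_hi = (b == banks - 1) ? 0 : 1;   // one halo plane above
        z += len;
    }
    return parts;
}
```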

To keep the data consistent across memory banks, halo areas (the borders of a sub-domain) must be exchanged between neighboring sub-domains. For this purpose, we utilized a memory object called a pipe. A pipe stores data organized as a FIFO. Pipes can be used to stream data from one kernel to another inside the FPGA without going through external memory, which greatly improves the overall system latency.
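The kernels here used OpenCL pipes; as a hedged illustration of the same idea in HLS C++ (where the analogous mechanism is an hls::stream, typically mapped to an AXI4-Stream connection between compute units), a halo-forwarding pair of stages might look roughly like this. The function names, the PLANE_SIZE value, and the float payload are assumptions for the sketch.

```cpp
#include "hls_stream.h"

#define PLANE_SIZE 1024  // illustrative element count of one halo plane

// Producer side: after updating its slab, a compute unit pushes its
// boundary plane into a FIFO stream instead of writing it to DDR.
void send_halo(const float plane[PLANE_SIZE], hls::stream<float>& to_neighbor)
{
    for (int i = 0; i < PLANE_SIZE; ++i) {
#pragma HLS PIPELINE II=1
        to_neighbor.write(plane[i]);
    }
}

// Consumer side: the neighboring compute unit pops the plane from the
// FIFO into its local ghost layer, never touching external memory.
void recv_halo(hls::stream<float>& from_neighbor, float ghost[PLANE_SIZE])
{
    for (int i = 0; i < PLANE_SIZE; ++i) {
#pragma HLS PIPELINE II=1
        ghost[i] = from_neighbor.read();
    }
}
```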

To minimize global memory traffic, we utilized the fast on-chip BRAM memory. Stencil computations access each array many times. Since there is not enough on-chip memory to store full 3D blocks of the compute domain, we utilized the 2.5D blocking technique to provide data locality. For this purpose, we kept only a small set of planes per array, organized as a queue of planes. In each iteration, only a single new plane was fetched from global memory, while the others migrated along the queue. In this way, the global memory traffic was minimized.
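Below is a simplified, hypothetical sketch of the 2.5D blocking idea in C++: a small rolling window of xy-planes lives in fast local memory (BRAM, once synthesized on the FPGA), and every step of the z-sweep fetches exactly one new plane from global memory while the existing planes rotate through the queue. The plane sizes, the 3-plane window, and the stencil coefficients are illustrative assumptions.

```cpp
#include <cstddef>
#include <cstring>

#define NX 64
#define NY 64

// Rolling window of three xy-planes (previous, current, next) kept in
// fast local memory -- enough depth for a 7-point stencil along z.
// The real kernels keep one such queue of planes per input array.
void sweep_25d(const float* global_in, float* global_out, std::size_t nz)
{
    static float plane[3][NY][NX];  // maps to BRAM when built for the FPGA

    // Preload the first two planes.
    std::memcpy(plane[0], global_in, sizeof(plane[0]));
    std::memcpy(plane[1], global_in + NX * NY, sizeof(plane[1]));

    for (std::size_t k = 1; k + 1 < nz; ++k) {
        // Only ONE new plane is fetched from global memory per iteration.
        std::memcpy(plane[(k + 1) % 3], global_in + (k + 1) * NX * NY,
                    sizeof(plane[0]));

        auto& prev = plane[(k - 1) % 3];
        auto& curr = plane[k % 3];
        auto& next = plane[(k + 1) % 3];

        for (std::size_t j = 1; j + 1 < NY; ++j)
            for (std::size_t i = 1; i + 1 < NX; ++i)
                global_out[(k * NY + j) * NX + i] =
                    0.4f * curr[j][i] +
                    0.1f * (curr[j][i - 1] + curr[j][i + 1] +
                            curr[j - 1][i] + curr[j + 1][i] +
                            prev[j][i] + next[j][i]);
    }
}
```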

Another key optimization was to organize the computation in a SIMD fashion by utilizing vector data types of size 16. This allowed us to use the full 512-bit AXI4 memory interface for global memory accesses.
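As a rough sketch of what this looks like in code, one can pack 16 single-precision values into a 512-bit element so that every global memory transaction fills a whole AXI4 data beat. The struct below is an illustrative stand-in for the vector types actually available in the toolchain (for example float16 in OpenCL C or hls::vector<float, 16> in newer Vitis HLS); the kernel body and pragmas are assumptions.

```cpp
// A 512-bit wide element: 16 x 32-bit float, matching one AXI4 data beat.
struct Float16 {
    float v[16];
};

// Simple scaling kernel body operating on whole 512-bit words, so the
// memory interface is exercised at full width on every access.
void scale_vec(const Float16* in, Float16* out, int n_words, float alpha)
{
    for (int w = 0; w < n_words; ++w) {
#pragma HLS PIPELINE II=1
        Float16 x = in[w];
        for (int l = 0; l < 16; ++l) {
#pragma HLS UNROLL
            x.v[l] *= alpha;
        }
        out[w] = x;
    }
}
```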

Lastly, here is a summary of how each optimization sped up the execution of the advection kernel, ultimately cutting the overall time to solution.

Below are example results for 500 time steps, showing acceleration from 90 minutes down to under 10 seconds.

It is worth mentioning that the above techniques translated into an almost 600x speedup!

To put the results in context, we also heavily optimized the code for CPU-only architectures. So let's quickly jump into some details there as well…

CPU optimization for reference

Our initial CPU implementation utilized 2 processors: Intel® Xeon® CPU E5-2695 v2, 2.40-3.2 GHz (2x12 cores). We then compared the results with several other configurations, including: 1x Intel Xeon E5-2695 CPU 2.4 GHz, Ivy Bridge (Ivy); Intel Xeon Gold 6148 CPU 2.4 GHz, Skylake (Gold); and Intel Xeon Platinum 8168 CPU 2.7 GHz, Skylake (Platinum).

To optimize the code, we implemented several techniques, including:

Depending on the kernel, the above techniques translated into an almost 92x speedup!

Below are example results for only 50 time steps.

Benchmark results

For the configuration with 2 CPUs (Ivy), we reached a maximum throughput of 3.7 GB/s. The corresponding power dissipation was 142 W.

The FPGA configuration delivered almost 6 GB/s of throughput at a power dissipation of slightly above 100 W.

Also, still speaking of FPGA, the results for the advection kernel were as follows:

It is very important to note that we reached 98.32% of the maximum attainable throughput.

In that case, we optimized the performance to the point where the execution time was completely “hidden” behind the data transfer running at the maximum possible throughput. In other words, we reached the best possible optimization for the given CFD kernel.

And here are the results for various configurations:

As we can see, a single Alveo U250 card was able to outperform even the Intel Xeon Platinum 8168 CPU, delivering results slightly faster and within a significantly lower energy budget.

It is important to emphasize that the presented results are for single kernels. Typical CFD applications consist of many kernels that execute various operations on the same data. Therefore, we should expect even better results, thanks to further opportunities to optimize the computations and reuse data across all the kernels.

Conclusions

Moving fluid simulations to heterogeneous computing architectures powered by Alveo FPGAs delivers faster results within significantly reduced energy budgets. For instance, nodes equipped with the Alveo U250 deliver up to a 4x speedup while reducing energy consumption by almost 80% vs. CPU-only nodes. As these algorithms are memory bound, upgrading the configuration to the U280 (equipped with HBM) gives an additional speedup and helps reduce the energy budget further.

The CFD market needs speed, heterogeneous-architecture-ready solutions, and scalability. The Alveo product family addresses these challenges very well. Moreover, CFD codes map well onto Alveo architecture features such as multiple memory banks, BRAM utilization, pipelining, vectorization, etc.

FPGAs introduce certain limitations, for example:

Therefore, FPGAs are good candidates for applications where:

Based on this, CFD codes in general are very good candidates to benefit from FPGA architectures.

Some comments about FPGA benefits over CPU

Access to global memory is a bottleneck for CFD codes (in our case)

On the FPGA card, the global memory is integrated with the accelerator, which allows us to fully utilize all memory banks in parallel and gives us higher-bandwidth access (DDR4/HBM) compared with the external RAM of a CPU.

A large set of arithmetic and logic resources (1.3M LUTs) allows us to hide up to 90% of the computation behind the data transfers.

Thanks to this, we can beat a 2.5 GHz CPU with an FPGA running at 300-500 MHz.

The lower clock frequency also makes the FPGA more energy efficient.

The small stencil structure of CFD kernels does not require the big cache memory of a CPU. The small but ultra-fast BRAM is enough to provide data reuse and data locality, allowing us to reduce the global memory traffic to a level comparable with a CPU, where the cache memory is much bigger.

Read our presentation and get in touch with us directly to learn more: bytelake.com/en/PPAM19

Update for 2020: byteLAKE is currently developing CFD Suite, an AI for CFD product: a collection of Artificial Intelligence models to accelerate and enable new features for CFD simulations. It is a cross-platform solution (not only for FPGAs). More: www.byteLAKE.com/en/CFDSuite.
