TY - JOUR
T1 - Efficient vectorised kernels for unstructured high-order finite element fluid solvers on GPU architectures in two dimensions
AU - Eichstädt, Jan
AU - Peiró, Joaquim
AU - Moxey, David
PY - 2022/11/30
Y1 - 2022/11/30
N2 - We develop efficient kernels for the elemental operators of matrix-free solvers of the Helmholtz equation, which are the core operations for incompressible Navier-Stokes solvers, for use on graphics processing units (GPUs). Our primary concern in this work is the extension of matrix-free routines to efficiently evaluate this elliptic operator on regular and curvilinear triangular elements in a tensor-product manner. We investigate two types of efficient CUDA kernels for a range of polynomial orders and thus varying arithmetic intensities: the first maps each elemental operation to a CUDA thread for a completely vectorised kernel, whilst the second maps each element to a CUDA block for nested parallelism. Our results show that the first option is beneficial for elements of low polynomial order, whereas the second is beneficial for elements of higher order. The crossover point between these two schemes for the hardware used in this study lies at polynomial orders of around $P = 4$--$5$, depending on element type. For both options, we highlight the importance of the layout of data structures, which necessitates the development of interleaved elemental data for vectorised kernels, and analyse the effect of selecting different memory spaces on the GPU. As the considered kernels are primarily memory-bandwidth bound, we develop kernels for curved elements that trade memory bandwidth against additional arithmetic operations, and demonstrate improved throughput in selected cases. We further compare our optimised CUDA kernels against optimised OpenACC kernels to contrast the performance of a native and a portable programming model for GPUs.
M3 - Article
SN - 0010-4655
JO - Computer Physics Communications
JF - Computer Physics Communications
ER -