TY - CHAP
T1 - A Generic Vectorization Scheme and a GPU Kernel for the Phylogenetic Likelihood Library
AU - Izquierdo-Carrasco, Fernando
AU - Alachiotis, Nikolaos
AU - Berger, Simon
AU - Flouri, Tomas
AU - Pissis, Solon P.
AU - Stamatakis, Alexandros
PY - 2013/12
Y1 - 2013/12
N2 - Highly optimized library implementations for important scientific kernels can improve scientific productivity. To this end, we are currently developing the Phylogenetic Likelihood Library (PLL), which implements functions to compute and optimize the phylogenetic likelihood score on evolutionary trees. Here, we focus on novel techniques to orchestrate likelihood computations on large vector-like processors such as GPUs. We present a novel scheme for vectorizing computations and organizing conditional likelihood arrays (CLAs) in such a way that they do not need to be transferred at all between the GPU and the CPU. We compare the performance of our GPU implementation for DNA data with a highly optimized x86 version of the PLL that relies on manually tuned AVX intrinsics. Our GPU implementation accelerates the likelihood computations by a factor of two compared to what is most probably the fastest x86 implementation currently available. We conclude that a hybrid GPU-CPU version needs to be developed and integrated into the PLL to leverage the computational power of modern desktop systems and clusters.
AB - Highly optimized library implementations for important scientific kernels can improve scientific productivity. To this end, we are currently developing the Phylogenetic Likelihood Library (PLL), which implements functions to compute and optimize the phylogenetic likelihood score on evolutionary trees. Here, we focus on novel techniques to orchestrate likelihood computations on large vector-like processors such as GPUs. We present a novel scheme for vectorizing computations and organizing conditional likelihood arrays (CLAs) in such a way that they do not need to be transferred at all between the GPU and the CPU. We compare the performance of our GPU implementation for DNA data with a highly optimized x86 version of the PLL that relies on manually tuned AVX intrinsics. Our GPU implementation accelerates the likelihood computations by a factor of two compared to what is most probably the fastest x86 implementation currently available. We conclude that a hybrid GPU-CPU version needs to be developed and integrated into the PLL to leverage the computational power of modern desktop systems and clusters.
U2 - 10.1109/IPDPSW.2013.103
DO - 10.1109/IPDPSW.2013.103
M3 - Conference paper
SN - 9781479913725
T3 - IPDPSW '13, IEEE Computer Society
SP - 530
EP - 538
BT - Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
PB - IEEE
T2 - 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW)
Y2 - 20 May 2013 through 24 May 2013
ER -