Paper title
Exploring the acceleration of Nekbone on reconfigurable architectures
Paper authors
Paper abstract
Hardware technological advances are struggling to match scientific ambition, and a key question is how we can use the transistors that we already have more effectively. This is especially true for HPC, where the tendency is often to throw computation at a problem, whereas codes themselves are commonly bound, at least to some extent, by other factors. By redesigning an algorithm and moving from a Von Neumann to a dataflow style, there is potentially more opportunity to address these bottlenecks on reconfigurable architectures than on more general-purpose architectures. In this paper we explore the porting of the AX kernel of Nekbone, a widely popular HPC mini-app, to FPGAs using High Level Synthesis via Vitis. Whilst computation is an important part of this code, it is also memory bound on CPUs, and a key question is whether one can ameliorate this by leveraging FPGAs. We first explore optimisation strategies for obtaining good performance, with over a 4000 times runtime difference between the first and final versions of our kernel on FPGAs. Subsequently, the performance and power efficiency of our approach on an Alveo U280 are compared against a 24 core Xeon Platinum CPU and an NVIDIA V100 GPU, with the FPGA outperforming the CPU by around four times, achieving almost three quarters of the GPU performance, and being significantly more power efficient than both. The result of this work is a comparison and a set of techniques that apply both to Nekbone on FPGAs specifically and, more widely, to accelerating HPC codes on reconfigurable architectures.
