Paper title
Integrating State of the Art Compute, Communication, and Autotuning Strategies to Multiply the Performance of the Application Program CPMD for Ab Initio Molecular Dynamics Simulations
Authors
Abstract
We present our recent code modernizations of the ab initio molecular dynamics program CPMD (www.cpmd.org), with a special focus on the ultra-soft pseudopotential (USPP) code path. Following the internal instrumentation of CPMD, all time-critical routines have been revised to maximize computational throughput and minimize communication overhead. Hybrid MPI+OpenMP parallelization has been added wherever it was missing in the program to improve scaling. For communication-intensive routines, such as the multiple distributed 3D FFTs of the electronic states and the distributed matrix-matrix multiplications related to the $β$-projectors of the pseudopotentials, this MPI+OpenMP parallelization now overlaps computation and communication. The necessary partitioning of the workload is optimized by an auto-tuning algorithm. In addition, the largest global MPI_Allreduce operation has been replaced by highly tuned node-local parallelized operations using MPI shared-memory windows to avoid inter-node communication. A batched algorithm for the multiple 3D FFTs improves the throughput of the MPI_Alltoall communication and thus the scalability of the implementation, both for the USPP and for the frequently used norm-conserving pseudopotential code path. The enhanced performance and scalability are demonstrated on a mid-sized benchmark system of 256 water molecules and on further water systems ranging from 32 to 2048 molecules.
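The abstract does not describe how the auto-tuning of the workload partitioning works. As a rough illustration of the general idea only, the following Python sketch times a set of candidate batch sizes and keeps the fastest. The names `partition`, `autotune_batch_size`, and `process_batch` are hypothetical and not taken from CPMD, and the timing-based selection rule is an assumption rather than the authors' algorithm.

```python
import time
from typing import Callable, Iterable, List


def partition(n_items: int, batch_size: int) -> List[range]:
    """Split the indices 0..n_items-1 into consecutive batches."""
    return [range(start, min(start + batch_size, n_items))
            for start in range(0, n_items, batch_size)]


def autotune_batch_size(
    process_batch: Callable[[range], None],
    n_items: int,
    candidates: Iterable[int],
    repeats: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> int:
    """Return the candidate batch size with the lowest best-of-`repeats` time.

    `process_batch` is a hypothetical stand-in for the real work on one
    batch (for example, transforming a group of electronic states).
    """
    best_size, best_time = None, float("inf")
    for size in candidates:
        if size < 1:
            raise ValueError("batch sizes must be positive")
        elapsed = float("inf")
        for _ in range(repeats):
            start = clock()
            for batch in partition(n_items, size):
                process_batch(batch)
            # Keep the fastest repeat to reduce the influence of noise.
            elapsed = min(elapsed, clock() - start)
        if elapsed < best_time:
            best_size, best_time = size, elapsed
    if best_size is None:
        raise ValueError("no candidate batch sizes given")
    return best_size
```

In a production code the measurement would typically be taken during the first few iterations of the actual simulation, so that the chosen partitioning reflects the real machine, process count, and system size.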
