Efficient magnetohydrodynamic simulations on distributed multi-GPU systems using a novel GPU Direct-MPI hybrid approach

Un Hong Wong, Takayuki Aoki, Hon Cheng Wong

Research output: Contribution to journalArticlepeer-review

5 Citations (Scopus)


Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems are needed to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct-MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speedup the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speedup the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to its low efficiency of the fragmented data exchange. Our approach has made a breakthrough to make 3D decomposition available on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experiment results show twice the FLOPS comparing to common 2D decomposition MPI-only implementation method. The proposed approach has been developed in an efficient implementation for MHD simulations on distributed multi-GPU systems, called MGPU-MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 12003 grid points using 216 GPUs.

Original languageEnglish
Pages (from-to)1901-1913
Number of pages13
JournalComputer Physics Communications
Issue number7
Publication statusPublished - Jul 2014
Externally publishedYes


  • CUDA
  • Distributed multi-GPU systems
  • GPUs
  • MHD simulations
  • Parallel computing


Dive into the research topics of 'Efficient magnetohydrodynamic simulations on distributed multi-GPU systems using a novel GPU Direct-MPI hybrid approach'. Together they form a unique fingerprint.

Cite this