The parallel programming comes a long way with the advances in the HPC (High Performance Computing). The high performance computing landscape is shifting from collections of homogeneous nodes towards heterogeneous systems, in which nodes consist of a combination of traditional out-of-order execution cores and accelerator devices. Accelerators, built around GPUs, many-core chips, FPGAs or DSPs, are used to offload compute intensive tasks. Large-scale GPU clusters are gaining popularity in the scientific computing community and having massive range of applications. However, their deployment and production use are associated with a number of new challenge.