The authors present the implementation and performance of a new gravitational N- body tree-code that is specifically designed for the Graphics Processing Unit (GPU). All parts of the tree-code algorithm are executed on the GPU. They present algorithms for parallel construction and traversing of sparse oc-trees. These algorithms are implemented in CUDA and tested on NVIDIA GPUs, but they are portable to OpenCL and can easily be used on many-core devices from other manufacturers. This portability is achieved by using general parallel-scan and sort methods.