Provided by: University of Waterloo
Date Added: Mar 2012
In this paper, the authors chronicles their experiences using CUDA to implement a parallelized variant of Pollard's rho algorithm to solve discrete logarithms in groups with cryptographically large moduli but smooth order using commodity GPUs. They first discuss some key design constraints imposed by modern GPU architectures and the CUDA framework and then explain how they were able to implement efficient arbitrary-precision modular multiplication within these constraints. Their implementation can execute roughly 51.9 million 768-bit modular multiplications per second - or a whopping 840 million 192-bit modular multiplications per second - on a single Nvidia Tesla M2050 GPU card, which is a notable improvement over all previous results on comparable hardware.