The heterogeneous multi-core architecture of the Cell processor offers great potential for high-performance computing. Its features include high memory bandwidth through DMA, user-managed local stores, and a SIMD architecture. In this paper, the authors present strategies for leveraging these features to develop a high-performance BLAS library. They propose techniques for partitioning and distributing data across the Synergistic Processing Elements (SPEs) so that DMA transfers are handled efficiently. They show that suitable pre-processing of data leads to significant performance improvements, particularly when the data is unaligned.
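As a rough illustration of the partitioning idea, the following minimal sketch splits a single-precision vector into per-SPE chunks that start on 16-byte boundaries and contain whole quadwords. It is not the authors' implementation. The names `chunk_t`, `unaligned_head`, and `partition` are hypothetical. The 16-byte alignment requirement is assumed from the general Cell DMA and SIMD model, not taken from the paper. The sketch is plain portable C and issues no actual DMA commands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_ALIGN 16u        /* assumed DMA/SIMD alignment in bytes */
#define FLOATS_PER_QUAD 4u   /* floats in one 16-byte quadword */

/* Hypothetical descriptor for the slice of the vector given to one SPE. */
typedef struct {
    size_t start;  /* index of the first element in the chunk */
    size_t count;  /* number of elements, always a multiple of 4 */
} chunk_t;

/* Number of leading elements that precede the first 16-byte-aligned
 * address. Assumes x is at least float-aligned. These elements would
 * be pre-processed separately (for example, by scalar code). */
static size_t unaligned_head(const float *x, size_t n)
{
    size_t misalign = (size_t)((uintptr_t)x % DMA_ALIGN);
    size_t head = misalign ? (DMA_ALIGN - misalign) / sizeof(float) : 0;
    return head < n ? head : n;
}

/* Split the aligned body [head, n) into num_workers chunks of whole
 * quadwords. Leftover quadwords go to the first workers, one each.
 * Elements after the last chunk (at most 3) form an unprocessed tail. */
static void partition(size_t head, size_t n, unsigned num_workers,
                      chunk_t *out)
{
    assert(num_workers > 0 && head <= n);

    size_t quads = (n - head) / FLOATS_PER_QUAD;
    size_t base  = quads / num_workers;
    size_t extra = quads % num_workers;
    size_t pos   = head;

    for (unsigned i = 0; i < num_workers; ++i) {
        size_t q = base + (i < extra ? 1u : 0u);
        out[i].start = pos;
        out[i].count = q * FLOATS_PER_QUAD;
        pos += out[i].count;
    }
}
```

In this sketch, the head and tail elements are the "unaligned" portion that needs separate handling, while every chunk handed to a worker starts on an aligned address and can be transferred and processed in whole quadwords. A real implementation would additionally have to respect the per-transfer DMA size limit and the local store capacity.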