Date Added: Jan 2011
This paper presents an algorithm that automatically maps code onto a generic intelligent memory system consisting of a host processor and a simpler memory processor. To achieve high performance on this type of architecture, code must be partitioned and scheduled so that each section is assigned to the processor on which it runs most efficiently. In addition, the two processors should overlap their execution as much as possible. The algorithm maps applications fully automatically, using both static and dynamic information. Using a set of standard applications and a simulated architecture, the authors show average speedups of 1.7 for numerical applications and 1.2 for non-numerical applications over a single host with plain memory.
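
The partitioning idea can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract gives no cost model or scheduling details, so the `Section` type, its per-processor time estimates, and the assumption that all sections are independent and can overlap perfectly are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """A code section with hypothetical run-time estimates (arbitrary units)."""
    name: str
    host_time: float    # assumed estimate on the host processor
    memory_time: float  # assumed estimate on the memory processor


def partition(sections):
    """Assign each section to the processor with the lower estimated time."""
    host, memory = [], []
    for s in sections:
        (host if s.host_time <= s.memory_time else memory).append(s)
    return host, memory


def overlapped_time(host, memory):
    """Total time if both processors run their sections fully in parallel.

    Assumes the sections are independent, which real code rarely is.
    """
    return max(sum(s.host_time for s in host),
               sum(s.memory_time for s in memory))


def speedup(sections):
    """Speedup over running every section on the host alone."""
    host, memory = partition(sections)
    baseline = sum(s.host_time for s in sections)
    return baseline / overlapped_time(host, memory)


# Made-up estimates, not data from the paper.
example = [
    Section("a", host_time=4.0, memory_time=8.0),
    Section("b", host_time=6.0, memory_time=3.0),
    Section("c", host_time=2.0, memory_time=5.0),
]
```

With these made-up estimates, sections "a" and "c" stay on the host and "b" moves to the memory processor. A real mapper would also have to respect data dependences and communication costs between the two processors, which this sketch ignores.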