Current loop buffer has been mainly explored as an effective architectural technique for low-power execution in embedded processor. Another avenue, however, for exploiting loop buffer is to obtain its performance benefit. In this paper, the authors propose an application specific loop buffer organization for vectorized processing kernels, to achieve low-power and high-performance goals. The Vectorized Loop Buffer (VLB) is simplified with single loop support for SIMD devices. Since significant data rearrangement overhead is required in order to use the SIMD capabilities, the VLB is specialized for zero-overhead implicit data permutation.