Date Added: Jun 2012
This paper focus on Big Data systems calls for a rethinking of data generation methods. The traditional sequential data generation approach is not well suited to large-scale systems as generating a terabyte of data may require days or even weeks depending on the number of constraints imposed on the generated model. The authors demonstrate Myriad, a new data generation toolkit that enables the specification of semantically rich data generator programs that can scale out linearly in a shared-nothing environment. Data generation programs built on top of Myriad implement an efficient parallel execution strategy leveraged by the extensive use of pseudo-random number generators with random access support.