Myriad - Parallel Data Generation on Shared-Nothing Architectures
The need for efficient data generation for testing and benchmarking newly developed massively parallel data processing systems has increased with the emergence of Big Data problems. As synthetic data model specifications evolve over time, the data generator programs implementing these models have to be adapted continuously - a task that can become complex as the set of model constraints grows. In this paper, the authors present Myriad - a new parallel data generation toolkit. Data generators created with the toolkit can produce very large datasets by exploiting a completely parallel execution model, while at the same time maintaining cross-partition dependencies, correlations and distributions in the generated data.
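To make the idea of communication-free generation with cross-partition dependencies concrete, here is a minimal Python sketch. It is not Myriad's API or algorithm, which the abstract does not describe. It assumes one possible technique: every field value is derived deterministically from a global seed and the record's position, so any worker can recompute a value owned by another partition instead of fetching it. The names `record_value` and `generate_partition` and the order/customer/region schema are hypothetical.

```python
import hashlib
from typing import Iterator, Tuple


def record_value(seed: int, record_id: int, field: str, modulus: int) -> int:
    """Derive a field value from (seed, record_id, field) alone.

    Hypothetical helper: because the value depends only on its inputs,
    every worker computes the same result without coordination.
    """
    digest = hashlib.sha256(f"{seed}:{record_id}:{field}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def generate_partition(
    seed: int,
    total_orders: int,
    num_partitions: int,
    partition: int,
    num_customers: int,
) -> Iterator[Tuple[int, int, int]]:
    """Yield (order_id, customer_id, customer_region) for one partition."""
    # Each partition owns a contiguous, non-overlapping range of order ids.
    start = partition * total_orders // num_partitions
    end = (partition + 1) * total_orders // num_partitions
    for order_id in range(start, end):
        # Reference to a customer record that may live in another partition.
        customer_id = record_value(seed, order_id, "customer", num_customers)
        # The referenced customer's attribute is recomputed locally,
        # so the dependency is kept without any network traffic.
        region = record_value(seed, customer_id, "region", 5)
        yield (order_id, customer_id, region)
```

Under these assumptions, concatenating the output of all partitions yields the same dataset as a single-node run, and a given customer carries the same region wherever it is referenced. A cryptographic hash is used here only for simplicity; a real generator would more likely use a faster random number generator that can jump to an arbitrary position.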