Being able to experiment with big data and queries in a safe, secure “sandbox” test environment is important to both IT and business users as companies get started with big data. However, setting up a big data sandbox is different from establishing traditional test environments for transactional data and reports. Here are ten key strategies to keep in mind for building and managing big data sandboxes:
1. Data mart or master data repository?
The database administrator needs to decide early on whether test sandboxes should use data directly from the master data repository that production uses, or whether it is better to replicate sections of this data into separate data marts reserved for testing only. The advantage of using the full repository is that tests run against the same data production uses, so test results will be more accurate. The disadvantage is the risk of contention with production itself. With the data mart strategy, you avoid that contention, but the data will need to be refreshed periodically to stay reasonably synchronized with production if the test environment is going to closely approximate it.
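The periodic refresh in the data mart strategy amounts to a scheduled copy job that re-syncs test data from the master repository. A minimal Python sketch; `refresh_data_mart`, the sample records, and the field names are all hypothetical:

```python
from datetime import datetime, timezone

def refresh_data_mart(master_records, mart, key="id"):
    """Re-sync a test data mart from the master repository.

    Replaces the mart's contents with a fresh copy of the master
    records, so tests run against reasonably current data without
    touching production tables directly.
    """
    mart.clear()
    for record in master_records:
        # Copy each record so test jobs can mutate mart data freely
        # with no risk of contention with production data.
        mart[record[key]] = dict(record)
    return {"refreshed_at": datetime.now(timezone.utc).isoformat(),
            "rows": len(mart)}

# Hypothetical master data and an empty mart reserved for testing.
master = [{"id": 1, "region": "EMEA"}, {"id": 2, "region": "APAC"}]
test_mart = {}
summary = refresh_data_mart(master, test_mart)
```

In practice the copy would go through a bulk-export tool rather than row-by-row Python, but the shape is the same: a full (or incremental) snapshot on a schedule the DBA agrees with the sandbox users.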
2. Work out scheduling
Scheduling is one of the most important big data sandbox activities. It ensures that all sandbox work runs as efficiently as possible, usually by scheduling a group of smaller jobs to run concurrently while a longer job completes. In this way, resources are allocated to as many jobs as possible. The key to this process is for IT to sit down with the various user areas that are using sandboxes so everyone has an upfront understanding of the schedule, the rationale behind it, and when they can expect their jobs to run.
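The idea of packing smaller jobs into the window a long job occupies can be sketched as a simple batching routine. This is a toy illustration, not a real cluster scheduler; the job names and hour estimates are hypothetical:

```python
def build_schedule(jobs, long_job_hours):
    """Group small jobs into batches that each fit within the
    runtime window of one long job, so cluster capacity that the
    long job does not consume is not left idle.

    jobs: list of (name, estimated_hours) tuples.
    Returns a list of batches; each batch runs alongside the
    long job, one batch after another.
    """
    batches, current, used = [], [], 0
    # Shortest jobs first makes the packing easy to reason about.
    for name, hours in sorted(jobs, key=lambda j: j[1]):
        if used + hours > long_job_hours and current:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += hours
    if current:
        batches.append(current)
    return batches

batches = build_schedule(
    [("a", 1), ("b", 2), ("c", 3), ("d", 5)], long_job_hours=4
)
# → [['a', 'b'], ['c'], ['d']]
```

Real schedulers (YARN queues, cron windows, workflow managers) do this with far more sophistication, but the negotiation with user departments is over exactly these inputs: estimated runtimes and the windows their jobs will land in.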
3. Set limits
If months go by without a specific data mart or sandbox being used, business users and IT should have mutually acceptable policies in place for purging these resources so they can be returned to a pool and re-provisioned for other activities. The test environment should be managed as actively as its production counterpart, with resources called into play only when they are actually being used.
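A purge policy like this reduces to comparing each sandbox's last-used date against an agreed idle limit. A minimal sketch, assuming a 90-day limit and last-used timestamps tracked per sandbox (all names hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical policy agreed between IT and business users:
# sandboxes idle for roughly three months become purge candidates.
IDLE_LIMIT = timedelta(days=90)

def find_purge_candidates(sandboxes, now=None):
    """Return names of sandboxes idle longer than the agreed limit,
    so their resources can be returned to the shared pool.

    sandboxes: mapping of sandbox name -> last-used datetime.
    """
    now = now or datetime.now()
    return [name for name, last_used in sandboxes.items()
            if now - last_used > IDLE_LIMIT]

now = datetime(2024, 6, 1)
sandboxes = {
    "marketing-sbx": now - timedelta(days=120),  # idle too long
    "finance-sbx": now - timedelta(days=10),     # recently used
}
candidates = find_purge_candidates(sandboxes, now=now)
```

The important part is not the code but the agreement behind `IDLE_LIMIT`: both sides sign off on the threshold before anything is reclaimed.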
4. Use clean data
One of the preliminary big data pipeline jobs should be to prepare and clean data so that it is of reasonable quality for testing, especially if you are using the data mart approach. It is a bad habit (dating back to testing for standard reports and transactions) to use data in test regions that is incomplete, inaccurate, or even broken, simply because it was never cleaned up before being dumped into a test region. Resist this temptation with big data.
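A basic cleaning step can be as simple as rejecting records that are missing required fields before they reach the test mart. A hedged sketch; the field names and sample records are hypothetical:

```python
def clean_records(raw_records, required_fields=("id", "amount")):
    """Drop incomplete or broken records before loading a test mart,
    keeping a count of what was rejected so data quality is visible.
    """
    clean, rejected = [], 0
    for rec in raw_records:
        # Reject records missing required fields or holding nulls.
        if all(rec.get(f) is not None for f in required_fields):
            clean.append(rec)
        else:
            rejected += 1
    return clean, rejected

raw = [
    {"id": 1, "amount": 10},      # complete record
    {"id": 2},                    # missing "amount"
    {"id": None, "amount": 5},    # null key field
]
clean, rejected = clean_records(raw)
# → 1 clean record, 2 rejected
```

Reporting the rejection count alongside the load gives testers a simple signal of how trustworthy their sandbox data actually is.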
5. Monitor resources
Assuming big data resources are centralized in the data center, IT should set resource allowances and monitor sandbox utilization. One area often requiring close attention is the tendency to over-provision resources as more end user departments engage in sandbox activities.
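Catching over-provisioning starts with comparing each department's actual sandbox usage against the allowance IT set for it. A minimal sketch with hypothetical departments and GB figures:

```python
def check_allowances(usage, allowances):
    """Flag departments whose sandbox usage exceeds the allowance
    IT set for them, a common early sign of over-provisioning.

    usage, allowances: mapping of department -> storage in GB.
    Returns a mapping of offending department -> GB over allowance.
    """
    return {dept: used - allowances.get(dept, 0)
            for dept, used in usage.items()
            if used > allowances.get(dept, 0)}

over = check_allowances(
    usage={"marketing": 500, "finance": 200},
    allowances={"marketing": 400, "finance": 300},
)
# → {'marketing': 100}
```

The same comparison works for compute hours or job counts; the point is that allowances exist and someone reviews the overages regularly.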
6. Watch for project overlap
At some point, it makes sense to have a corporate “steering committee” for big data that tracks the various sandbox projects going on throughout the company to ensure there is no overlap or duplicated effort.
7. Consider centralizing compute resources and management in IT
Some companies start out with big data projects in specific departments but quickly learn that those departments can’t work on big data, do their daily work, and manage compute resources, too. Ultimately, they move the equipment into the data center for IT to manage. This frees the departments to focus on the business and the ways big data can deliver value.
8. Use a data team
Even in sandbox experimentation, it’s important to have the requisite big data skills team on hand to assist with tasks. Typically, this team consists of a business analyst, a data scientist, and an IT support person who can fine-tune hardware and software resources and coordinate with database specialists.
9. Stay on task with business cases
It’s important to infuse creativity into sandbox activities, but not to the point where you lose sight of the business case you originally set out to serve.
10. Define what a sandbox is!
Participants coming from the business side, in particular, may not be familiar with the term “sandbox” or what it implies. Like the childhood sandbox, the purpose of a big data sandbox is to play and experiment freely with big data, but to do it with purpose. Part of that purposeful activity is abiding by the ground rules of the sandbox, such as when, where, and how to use it, and experimenting in ways that derive meaningful results for the business.
Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President of Product Research and Software Development for Summit Information Systems, a computer software company; and Vice President of Strategic Planning and Technology at FSI International, a multinational manufacturing company in the semiconductor industry. Mary is a keynote speaker and has more than 1,000 articles, research studies, and technology publications in print.