As companies learn to exploit their valuable data resources, unearthing the right combinations of data to interrogate, and then designing the best questions to dissect that data, are major areas of work. It is hard even to know the right questions to ask when navigating customer, Internet of Things (IoT), and other unstructured data toward conclusive, actionable results. Along the way, businesses must also experiment with big data drawn from a plethora of sources to see which combinations are best positioned to answer the questions the organization wants answered.
This experimentation is a resource-intensive activity that, by its very uncertain nature, cannot be carried out in a production environment. Instead, many organizations use “sandboxes,” which let IT pros and end users try different queries and data combinations to see which tests get them closer to what they want to know from their data. Here the role of the database administrator (DBA) is critical: if the DBA doesn’t provide policies and procedures for big data experimentation in a database environment, an organization can quickly find that resource consumption (and its cost) is running away and that the performance of production databases is being compromised.
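To make that concrete, here is a minimal sketch of the kind of guardrail a DBA might codify for a sandbox, assuming a PostgreSQL environment accessed through Python’s psycopg2 driver. The role name, schema name, and limit values are illustrative placeholders, not prescriptions from this article.

```python
# Sketch: per-role guardrails for a big data sandbox (assumes PostgreSQL + psycopg2).
# The role name "sandbox_analyst", the "sandbox" schema, and all limit values are
# illustrative placeholders for whatever policy the DBA and end users agree on.
import psycopg2

GUARDRAIL_SQL = """
CREATE ROLE sandbox_analyst LOGIN CONNECTION LIMIT 5;        -- cap concurrent sessions
ALTER ROLE sandbox_analyst SET statement_timeout = '15min';  -- cancel runaway queries
ALTER ROLE sandbox_analyst SET work_mem = '256MB';           -- bound per-sort/hash memory
GRANT USAGE ON SCHEMA sandbox TO sandbox_analyst;            -- confine access to the sandbox schema
GRANT SELECT ON ALL TABLES IN SCHEMA sandbox TO sandbox_analyst;
"""

def apply_sandbox_guardrails(dsn: str) -> None:
    """Apply the agreed resource limits so experimentation cannot starve production."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(GUARDRAIL_SQL)

if __name__ == "__main__":
    apply_sandbox_guardrails("host=analytics-db dbname=master_repo user=dba")
```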
An architectural decision
The DBA should begin with an architectural decision concerning test and production databases: should separate test databases be established outside the enterprise-wide data repository that production uses, or should separate virtual test databases be established within the same master data repository that production applications also use?
If the DBA opts to create separate physical databases (and data) outside the enterprise master data repository, more physical server resources and staff time will likely be consumed to maintain, track, and monitor all of these databases. Guidelines will also need to be in place to periodically refresh the data in the non-production databases so that it stays reasonably synchronized with the data in the master repository that production applications use.
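Such a refresh could, for example, be scheduled nightly. The sketch below assumes a PostgreSQL repository and the standard pg_dump/pg_restore utilities; the host names, database names, and dump location are invented for illustration.

```python
# Sketch: periodic refresh of a separate physical test database (assumes PostgreSQL,
# with pg_dump and pg_restore available on the PATH). Connection strings and the
# dump location are placeholders, not values from any particular environment.
import subprocess

PROD_DSN = "host=prod-db dbname=master_repo user=dba"
TEST_DSN = "host=test-db dbname=test_copy user=dba"
DUMP_FILE = "/var/tmp/master_repo.dump"

def refresh_test_database() -> None:
    """Dump the production repository and overwrite the test copy with it."""
    subprocess.run(
        ["pg_dump", "--format=custom", f"--dbname={PROD_DSN}", f"--file={DUMP_FILE}"],
        check=True,
    )
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", f"--dbname={TEST_DSN}", DUMP_FILE],
        check=True,
    )

if __name__ == "__main__":
    refresh_test_database()  # typically run from cron or the site's job scheduler
```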
If the DBA opts instead to virtualize separate databases within the master data repository that production also uses, the risk to production database performance is heightened, because all of these databases are accessing the same data resources. With big data, where data access and manipulation operations are intense, this contention can be significant. There is also the risk of virtual “sprawl,” which often arises when virtual instances are deployed on the fly and no one is actively monitoring how many instances of a system resource are accumulating. The advantage of this approach is that test and production users work from the same data, so data refreshes can be avoided.
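One hedge against that sprawl is a periodic inventory of how many virtual sandboxes exist and how much of the shared repository each one occupies. The sketch below assumes sandboxes are implemented as PostgreSQL schemas carrying a “sandbox_” naming prefix, a convention invented here purely for illustration.

```python
# Sketch: sprawl report for virtual sandboxes inside a shared repository.
# Assumes PostgreSQL, psycopg2, and a (hypothetical) convention that every
# sandbox is a schema whose name starts with "sandbox_".
import psycopg2

SPRAWL_SQL = """
SELECT n.nspname AS sandbox_schema,
       pg_size_pretty(SUM(pg_total_relation_size(c.oid))) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'                  -- ordinary tables (size includes their indexes)
  AND n.nspname LIKE 'sandbox\\_%'
GROUP BY n.nspname
ORDER BY SUM(pg_total_relation_size(c.oid)) DESC;
"""

def report_sandbox_sprawl(dsn: str) -> None:
    """List each sandbox schema and how much of the shared repository it consumes."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SPRAWL_SQL)
        for schema, size in cur.fetchall():
            print(f"{schema}: {size}")
```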
The governing best practice
There is no right or wrong way to set up data testing and the maintenance of big data production databases. Instead, the governing best practice is for the DBA to plan the big data database approach upfront and very carefully. That upfront work should be done with end users and should include agreement on timeframes for setting up and taking down test databases. In this way, the proliferation of databases and the resource consumption of testing can be balanced against the need to experiment and collaborate on big data.
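As one illustration of what an agreed takedown timeframe might look like in practice, the sketch below drops sandbox schemas whose window has expired. It assumes a small registry table (sandbox_registry) recording when each sandbox was created; the table, the 30-day window, and the schema-per-sandbox model are assumptions for this example, not requirements.

```python
# Sketch: enforce the agreed sandbox lifetime (assumes PostgreSQL + psycopg2).
# The sandbox_registry table, its columns, and the 30-day retention window are
# hypothetical; whatever the DBA and end users agreed on would go here instead.
import psycopg2
from psycopg2 import sql

EXPIRED_SQL = """
SELECT schema_name
FROM sandbox_registry
WHERE created_at < now() - interval '30 days';
"""

def drop_expired_sandboxes(dsn: str) -> None:
    """Drop sandbox schemas whose agreed test window has elapsed."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(EXPIRED_SQL)
        for (schema_name,) in cur.fetchall():
            cur.execute(
                sql.SQL("DROP SCHEMA IF EXISTS {} CASCADE").format(sql.Identifier(schema_name))
            )
            cur.execute(
                "DELETE FROM sandbox_registry WHERE schema_name = %s", (schema_name,)
            )
```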