Synchronizing big data: 5 ways to ensure big data accuracy

IT must assure that the data served up from applications that access big data and transactional data is accurate.


Earlier this year, I was shopping for a specific closet door at a home improvement store, and the store said it still had three such doors in stock. I drove to the store, and although the inventory shown on the associate's mobile device reported three units in stock, the reality was that the doors were not only out of stock at the store but had actually been discontinued.

I'm sure I'm not the only consumer who has been frustrated by the "stockout" problem. Stockouts are an all too common occurrence across many industries and have been exacerbated as companies struggle with synchronizing the data from the many disparate systems they run, including systems that house big data. When data flowing in from these systems is not adequately synchronized with what is going on in the real world, customers can be disappointed and management risks making decisions based upon data that isn't fact.

SEE: Feature comparison: Data analytics software and services (Tech Pro Research)

What exactly is data synchronization?

According to Wikipedia, data synchronization is "the process of establishing consistency among data from a source to a target data storage and vice versa and the continuous harmonization of the data over time."

Data synchronization is a highly technical topic. It is also a problem that weighs especially heavily on big data. Why? Because big data arrives from many more sources, and at far higher speeds, than transactional data does, yet it must still be reconciled accurately into a single version of the truth.

For example, if you build and sell boats, you will likely have purchasing and inventory systems that store and report parts, a production system that reports how many parts have been consumed in end item manufacture, sales systems that report what's available to be sold, and engineering systems with unstructured CAD big data that report on the current revision levels of products. If these systems aren't synchronized to reflect the up-to-the-minute status of the boats you sell, breakdowns are likely that disappoint consumers and salespersons, and that can lead to management decisions made on inaccurate data.
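A reconciliation check is one simple way to spot this kind of drift. The Python sketch below is illustrative only: it assumes each system can export a count per part number, and the system names, part numbers, and the `find_discrepancies` helper are all hypothetical.

```python
def find_discrepancies(counts_by_system):
    """Return {sku: {system: count}} for every SKU the systems disagree on.

    counts_by_system maps a system name to a {sku: count} dict.
    A count of None means that system has no record of the SKU at all.
    """
    all_skus = set()
    for counts in counts_by_system.values():
        all_skus.update(counts)

    discrepancies = {}
    for sku in sorted(all_skus):
        seen = {
            system: counts.get(sku)
            for system, counts in counts_by_system.items()
        }
        # More than one distinct value means the systems are out of sync.
        if len(set(seen.values())) > 1:
            discrepancies[sku] = seen
    return discrepancies


# Hypothetical exports from three systems at a boat builder.
systems = {
    "inventory": {"HULL-22": 3, "MAST-07": 5},
    "sales": {"HULL-22": 3, "MAST-07": 4},
    "production": {"HULL-22": 3},
}

report = find_discrepancies(systems)
for sku, seen in report.items():
    print(sku, seen)
```

A real reconciliation job would also need to decide which system wins when counts differ; this sketch only flags the disagreement.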

What can IT do to assure that the data served up from applications that access big data and transactional data is accurate? Find out with the five examples below.


1. Plan your data update processes

Every time you plan or modify an application, or admit a new big data source into your IT reporting to the business, your requirements planning should include how you will synchronize all incoming data so that it is as fresh and accurate as possible. This planning should specify how often you update data and synchronize it to master datasets. That frequency, along with any limitations, should be communicated to end users so they understand upfront how current the data is.

2. Consider the limitations of mobile devices and downloads

Increasingly, sales associates and others use mobile devices in the field. Because of Internet bandwidth limitations and the inability of these devices to process extensive data downloads quickly, the resident sales and inventory data on these devices may not always be in sync with what's "real" in the master database. As part of your end user communication process, IT should make users aware of these potential data accuracy constraints.
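One way to make these constraints visible is to show users how old the data on their device is. The following Python sketch assumes the device records a timezone-aware UTC timestamp for its last successful sync; the 15-minute tolerance and the `describe_freshness` helper are hypothetical choices, not a recommendation.

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerance: data older than this is flagged as stale.
MAX_AGE = timedelta(minutes=15)


def describe_freshness(last_synced, now=None, max_age=MAX_AGE):
    """Return a short message telling the user how old the local data is.

    last_synced must be a timezone-aware datetime.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - last_synced
    minutes = int(age.total_seconds() // 60)
    if age > max_age:
        return (
            f"STALE: last synced {minutes} min ago; "
            "confirm with the master database"
        )
    return f"Synced {minutes} min ago"


# Example with fixed times so the result is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(describe_freshness(datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc), now))
print(describe_freshness(datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc), now))
```

In the closet door story, a message like the second one would have told the associate that the "three in stock" figure needed checking before a customer made the drive.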

3. Develop a data synchronization methodology

Most sites already have data synchronization policies and update procedures for synchronizing their mission-critical transactional data, but they haven't necessarily addressed big data.

Big data comes from an immense number of sources and arrives at extreme velocity. Even so, the timestamps on the data, along with the time zones the data originates from, need to be reconciled so you know which copy of the data is freshest. There are also the realities of the data update process to face. Not all data can be updated in real time, so decisions have to be made about when data is synced with master data, and whether batch synchronizations run nightly or in scheduled batch "burst" modes throughout the day. These processes should be documented in IT operations guides, and they should be updated every time you add a new big data information source to your processing.
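The timestamp point can be illustrated with a short Python sketch. It assumes each incoming record carries an ISO 8601 timestamp with an explicit UTC offset; the record layout, source names, and the `to_utc` and `freshest` helpers are hypothetical.

```python
from datetime import datetime, timezone


def to_utc(iso_timestamp):
    """Parse an ISO 8601 timestamp with a UTC offset and convert it to UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Without an offset we cannot know where the time was recorded.
        raise ValueError(f"timestamp has no UTC offset: {iso_timestamp}")
    return parsed.astimezone(timezone.utc)


def freshest(records):
    """Return the record with the most recent update time, compared in UTC."""
    return max(records, key=lambda record: to_utc(record["updated_at"]))


# Two hypothetical sources reporting stock for the same item.
records = [
    {"source": "store_pos", "qty": 3, "updated_at": "2024-03-01T09:00:00-06:00"},
    {"source": "warehouse", "qty": 0, "updated_at": "2024-03-01T14:30:00+00:00"},
]

latest = freshest(records)
print(latest["source"], to_utc(latest["updated_at"]).isoformat())
```

Judged by local clock time alone, the warehouse record (14:30) looks newer than the store record (09:00). Once both are converted to UTC, the store record (15:00 UTC) turns out to be the later one, which is why comparing raw local timestamps across sources can select the wrong data.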

SEE: Building an effective data science team: A guide for business and tech leaders (free PDF) (TechRepublic)

4. Obtain the tooling you will need for synchronization

There are commercial tools available that can assist with data synchronization. These tools can help you with your big data synchronization efforts and also automate portions of your data synchronization operations.

5. Seek out service providers who can assist with data synchronization

Cloud big data processing services such as Amazon EMR recognize the data synchronization issue and offer synchronization methods that can perform this work for you. If you are executing your big data processing in the cloud, ask your cloud vendor what services it can provide to assure the freshest, highest-quality representations of your big data.


Image: Getty Images/iStockphoto