3 best practices for defining big data buckets

Once you understand the different sizes of big data, it will be easier to make IT investment decisions about storage and processing.

The immensity of big data demands that corporate IT find ways to classify it, distribute it, and discard the data that isn't needed. While there are many historical IT practices that can be applied to managing big data, there are differences between big data and traditional transactional data that cannot be ignored.

The most noticeable difference between transactional data and big data is that the former is structured, typically into fixed-length records, which makes the data inherently easier to manage. In contrast, big data comes in all shapes and sizes.

Understand the different sizes of big data

Big data can be unpredictable and unstable, which is why IT must find new ways of classifying the data for purposes of management. One of these methods is defining various data "buckets" into which data is placed for classification. These buckets are defined by the size of the data each one carries, as well as by the end-user groups for which the data is processed and to which it is ultimately delivered.

One way to define these big data buckets is by the size of the data; a simple bucketing rule along these lines is sketched after the list below.

  • "Big" data is used for historical analytics over broad timespans, and that runs in a batch mode on a big data engine like Hadoop. The data for jobs of this nature resides in large data lakes, and it can take many hours to process and distribute.
  • Medium-size data can be found in a data mart subset of data.
  • Small data comes in the form of high-volume data snippets that flow through data conduits quickly in real time or near real time, and that are immediately actionable, such as monitoring and responding to the temperature readings of a thermostat at a remote location.
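
To make this concrete, below is a minimal Python sketch of how such a bucketing rule might be expressed. The byte and latency thresholds are illustrative assumptions, not industry standards; each enterprise would tune its own.

    from dataclasses import dataclass

    # Illustrative thresholds only -- each enterprise would set its own.
    BIG_BYTES = 1024**4          # ~1 TB and up: data-lake scale
    MEDIUM_BYTES = 10 * 1024**3  # ~10 GB and up: data-mart scale

    @dataclass
    class Dataset:
        name: str
        size_bytes: int
        max_latency_seconds: float  # how quickly consumers need results

    def bucket(ds: Dataset) -> str:
        """Assign a dataset to a big data bucket by size and latency need."""
        if ds.max_latency_seconds < 1.0:
            return "small"   # real-time / near-real-time snippets
        if ds.size_bytes >= BIG_BYTES:
            return "big"     # batch analytics over a data lake
        if ds.size_bytes >= MEDIUM_BYTES:
            return "medium"  # data-mart subset
        return "small"

    print(bucket(Dataset("thermostat-feed", 2_000, 0.5)))           # small
    print(bucket(Dataset("sales-mart", 50 * 1024**3, 3600)))        # medium
    print(bucket(Dataset("clickstream-lake", 5 * 1024**4, 86400)))  # big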

By understanding the different sizes of big data, and where they are likely to be needed in the enterprise, IT is in a better position to assess the processing and storage resources that this data will require now and in the longer term. This knowledge also helps to shape overall IT architecture and investment decisions, as well as decisions on which big data to outsource and which to maintain internally.

To illustrate, an enterprise might require high security and the ability to immediately access real-time or near real-time small data. That enterprise will be looking for investments in solid-state and in-memory storage, and for processing that can sustain the required throughput. At the other end of the spectrum, an enterprise might have high security requirements but slower-moving big data of intermediate and large sizes that it chooses to place on cheaper hard disk or even tape. If the enterprise has data with low security requirements, regardless of the data's size, it could opt to place this data in a public cloud repository or archive.
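
As a sketch of how those trade-offs could be encoded as policy, the rules below pair a security level with an access-speed requirement to pick a storage tier. The tier names and the two-level security scheme are assumptions made for illustration, not prescribed categories.

    def storage_tier(security: str, needs_realtime: bool) -> str:
        """Pick a storage tier from a security level and access-speed need.

        security is "high" or "low" -- an assumed two-level scheme.
        """
        if security == "high":
            # High-security data stays within enterprise walls.
            if needs_realtime:
                return "in-memory/SSD (on premises)"
            return "HDD or tape (on premises)"
        # Low-security data, regardless of size, can go to a
        # public cloud repository or archive.
        return "public cloud repository/archive"

    print(storage_tier("high", True))   # in-memory/SSD (on premises)
    print(storage_tier("high", False))  # HDD or tape (on premises)
    print(storage_tier("low", False))   # public cloud repository/archive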

IT can assess the expense of meeting these different storage and security requirements on a per-business-unit basis to determine which business units require faster, more expensive data access with heightened security, and which have lesser security and access needs. This assists in the internal IT billing process.
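
One simple way to support that billing is to roll storage usage up by business unit at tier-specific rates, as in the sketch below. The per-GB rates and the usage records are placeholders invented for illustration, not real prices or figures.

    # Assumed per-GB monthly rates for each tier -- placeholders only.
    RATE_PER_GB = {
        "in-memory/SSD (on premises)": 0.50,
        "HDD or tape (on premises)": 0.05,
        "public cloud repository/archive": 0.01,
    }

    # Hypothetical usage records: (business unit, tier, gigabytes stored).
    usage = [
        ("operations", "in-memory/SSD (on premises)", 200),
        ("finance", "HDD or tape (on premises)", 5_000),
        ("marketing", "public cloud repository/archive", 20_000),
    ]

    bill = {}
    for unit, tier, gigabytes in usage:
        bill[unit] = bill.get(unit, 0.0) + gigabytes * RATE_PER_GB[tier]

    for unit, cost in sorted(bill.items()):
        print(f"{unit}: ${cost:,.2f}/month")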

Employ these best practices

1: Understand the sizes and payloads of your data, and who your power users are for each classification

The characteristics of your data payloads will likely determine the types of IT investments in storage and processing that you will need to make.

2: Determine what you can outsource and what you must maintain internally

High-security data should ideally be kept within enterprise walls. However, if you have large chunks of data that simply need to be stored, and potentially accessed at some future point, and that do not carry significant security requirements, you might consider using a cloud-based storage service.

3: Define the enterprise business cases that will use your data

These business cases can range from to-the-second readings on the instrumentation of power plants to historical trend modeling that spans 20 years. Without a business case, it is difficult to justify IT investments and the return on them, or to charge end users for services.

About Mary Shacklett

Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm.
