The world is generating 2.5 quintillion bytes of data per day, and unstructured data is a problem for 95% of companies. One problem companies face is how to store all of this data while also freeing up enough bandwidth to transfer it.
This is where data compression enters the conversation. In data compression, data is encoded so that it takes up fewer bits than its original representation. There are two approaches to data compression: lossless compression, which eliminates redundancy without losing any of the original data; and lossy compression, which shrinks data by discarding unnecessary or less important information.
Using data compression in the transmission and storage of big data is important because it reduces the amount of network bandwidth and storage that IT must provision for that data. Just as important, there are some types of big data that you don’t really want to keep—such as the jitter from device-to-device handshakes that are part of Internet of Things (IoT) communications data.
However, to get the most out of data compression on big data, you have to know when and where to use the different types of data compression tools and algorithms that are available. Here are several useful guidelines to keep in mind when you select a data compression methodology.
When to use lossless data compression
If you have a big data application and you can’t afford to lose any data, and you need to unpack every byte of data that you compress, you’ll want a lossless data compression methodology.
An example of when you would want lossless data compression, even if it means that you have to store more data, is when you’re compressing data that originates from a database. When you recommit this data to its database, you’ll need to unpack it in full so that every record matches what the database expects and can be stored.
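As a minimal sketch of that round trip, the Python example below compresses a batch of records with the standard library’s zlib module and checks that the unpacked bytes are identical to the originals. The rows and field names are hypothetical stand-ins for a database export, not taken from any particular system.

```python
import json
import zlib

# Hypothetical rows standing in for records exported from a database.
rows = [{"id": i, "status": "active", "region": "us-east"} for i in range(1000)]

# Serialize the rows to bytes, then compress them losslessly.
original = json.dumps(rows).encode("utf-8")
compressed = zlib.compress(original, level=6)

# Decompress before recommitting the data to the database.
restored = zlib.decompress(compressed)

# Lossless means the restored bytes match the original exactly.
assert restored == original
restored_rows = json.loads(restored.decode("utf-8"))

print(f"original: {len(original)} bytes, compressed: {len(compressed)} bytes")
```

Because the records repeat the same keys and values, the compressed payload is much smaller than the original, yet every byte comes back on decompression.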
When to use lossy data compression
There are times when you don’t need or want all of the data, such as jitter from IoT and network appliances. You need only the data that gives you contextual information for the business. A second example is the use of artificial intelligence (AI) in data compression algorithms that might be used at the front end of a data ingestion process. If you are studying a specific problem and you only want data that directly relates to that problem, you might decide to have your data compression algorithm exclude any data that isn’t relevant to the problem.
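One simple way to picture discarding jitter is a deadband filter, which keeps a reading only when it moves meaningfully away from the last value kept. The Python sketch below is an illustration under assumptions: the function name, the threshold, and the sample readings are all hypothetical, and a real IoT pipeline would choose its own rule for what counts as noise.

```python
def deadband_filter(readings, threshold):
    """Keep a reading only if it differs from the last kept value by more than threshold."""
    kept = []
    last = None
    for timestamp, value in readings:
        if last is None or abs(value - last) > threshold:
            kept.append((timestamp, value))
            last = value
    return kept


# Hypothetical (timestamp, temperature) readings with small jitter around each level.
readings = [
    (0, 20.0), (1, 20.02), (2, 19.99), (3, 20.01),
    (4, 21.5), (5, 21.52), (6, 21.49), (7, 23.0),
]

kept = deadband_filter(readings, threshold=0.1)
print(kept)  # [(0, 20.0), (4, 21.5), (7, 23.0)]
```

The discarded readings cannot be recovered afterward, which is the defining trade-off of a lossy approach.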
How to conserve processing
CPU processing cycles for big data are expensive, so part of the data compression process should focus on offloading processing from the CPU.
This can be done by using field-programmable gate arrays (FPGAs), which are integrated circuits that you can reconfigure after manufacture to act as dedicated hardware accelerators alongside your CPU. By using FPGAs, you can offload some of the compression processing from your CPU and accelerate the performance of your hardware.
How to select the right codec
A codec is a program, a hardware device, or a combination of the two that compresses and decompresses data, so it plays a central role in big data compression and decompression operations. There are many different kinds of codecs, so it is important to match the codec to the type of data or file.
The type of codec that you select will depend on the data and file type you are trying to compress. There are codecs for both lossless and lossy compression. Some codecs must process each data file as a whole, while others can split the data up so that it can be processed in parallel and then reassembled at its destination. Some codecs are designed for image and video data, while others process audio data only.
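To illustrate the split-and-reassemble idea, the Python sketch below cuts a payload into fixed-size chunks, compresses each chunk independently on a thread pool, and joins the decompressed chunks back together in order. This is a simplified illustration and not the on-disk format of any specific splittable codec; the chunk size, worker count, and function names are assumptions chosen for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size chosen for illustration


def split_chunks(data, size=CHUNK_SIZE):
    """Cut the payload into independent fixed-size pieces."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress_parallel(data, workers=4):
    """Compress each chunk independently on a pool of worker threads."""
    chunks = split_chunks(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so chunks stay in sequence.
        return list(pool.map(zlib.compress, chunks))


def decompress_and_reassemble(compressed_chunks):
    """Decompress every chunk and join them back into the original payload."""
    return b"".join(zlib.decompress(chunk) for chunk in compressed_chunks)


# Hypothetical log-style payload.
data = b"sensor_id=42,temp=20.1,status=ok\n" * 20000

compressed_chunks = compress_parallel(data)
restored = decompress_and_reassemble(compressed_chunks)
assert restored == data
```

Compressing chunks independently usually costs a little compression ratio, because redundancy that spans chunk boundaries is not exploited, in exchange for being able to spread the work across workers.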
Why is data compression important?
Determining the type of data compression you’re going to use for big data is a vital part of big data operations. On the resource end alone, IT can’t afford the cost of runaway processing and burgeoning storage. Data, even if it must be stored in its entirety, should be compressed as much as possible.
That said, there are additional steps you can take to limit storage and processing, and you can match the algorithms and methodologies you employ to each big data compression workload. Mastering these options is a key skill for IT.