With big data streaming into organizations worldwide at a rate of 2.5 quintillion bytes each day, it’s incumbent on organizations to determine how much of this data is vital, and which portions are excess and can be eliminated before the data ever enters corporate systems. If companies fail to do this, bandwidth, storage, and processing capabilities can be overrun, along with budgets.

For every operation and analysis companies perform with big data, the key is to define each business use case upfront and predetermine how much data you’ll really need to address the business case. Inevitably there will be some data that you just don’t need. Paring this data out of your data ingestion process is what I call narrowing the aperture of the lens through which data streams into your data repository.

Here are two divergent examples of data lens adjustment:

IBM RoboRXN and the mechanics of molecular formulation

When IBM designed its RoboRXN project, which takes in enormous quantities of unedited data from the worldwide open source community and others on potential molecular combinations for product formulation, it had to decide how much of this data was relevant to the project.

The RoboRXN project focused on designing new molecules for pharmaceutical solutions, such as the COVID-19 vaccine. This meant that white papers, statistical research findings, and other sources of research that weren’t directly germane to the molecular formulation work were not needed. IBM decided to implement artificial intelligence (AI) at the front of the data ingestion process, where this enormous trove of unedited data was streaming in.

The AI algorithm posed one major question: Did each element of incoming data contain anything relevant to the focus of the project? For research that was not at all related to the project, or that was only distantly and tangentially related, the AI eliminated the data so it was never admitted to the data repository. In other words, the aperture of the data lens to the project’s data repository was tightened, admitting only those elements of data that were relevant to the project. As a result, data storage and processing were reduced, and so was cost.
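The front-of-pipeline filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not IBM's implementation: the `PROJECT_TERMS` vocabulary, the `relevance_score` function, and the `threshold` value are all hypothetical, and a simple keyword overlap stands in for whatever AI model a real project would use.

```python
from typing import Iterable, Iterator

# Hypothetical project vocabulary; a real system would use a trained model.
PROJECT_TERMS = {"molecule", "synthesis", "reaction", "compound", "reagent"}


def relevance_score(text: str) -> float:
    """Toy stand-in for an AI model: share of project terms found in the text."""
    words = set(text.lower().split())
    return len(words & PROJECT_TERMS) / len(PROJECT_TERMS)


def narrow_aperture(stream: Iterable[str], threshold: float = 0.4) -> Iterator[str]:
    """Yield only items that meet the threshold; the rest are never stored."""
    for doc in stream:
        if relevance_score(doc) >= threshold:
            yield doc


incoming = [
    "New reaction pathway for compound synthesis using a cheap reagent",
    "Quarterly market outlook for the pharmaceutical sector",
    "A molecule was observed",
]

# Only items that pass the filter ever reach the repository.
repository = list(narrow_aperture(incoming))
print(len(repository), "of", len(incoming), "items admitted")
```

In this sketch the `threshold` plays the role of the aperture: raising it admits less data, and lowering it admits more.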

SETI and the search for extraterrestrial life

The SETI Institute was founded in 1984 with a mission to seek out extraterrestrial life. This was done by monitoring radio signals and emissions from space to determine if there were any repetitive patterns that could signify a communication from another life form. Scientists and volunteers participated in the SETI initiative, painstakingly examining mountains of unedited radio signals that flowed in ceaselessly.

In this effort, few assumptions could be made upfront about good versus bad data, because no one was entirely sure about what they were looking for. Consequently, there were few ways to “narrow” the aperture on the data lens, which had to be kept wide open. This resulted in high levels of processing, storage, and manual work.

What the Institute was able to do was narrow down the data after all of it had been searched for potential signals that might indicate intelligent life forms. At that point, only the signals with life potential needed to be stored, in much smaller databases.
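By contrast, narrowing after the search means every piece of incoming data is examined and only the results are trimmed. The following Python sketch shows that pattern under loud assumptions: the peak-to-mean statistic, the `cutoff` value, and the sample windows are hypothetical placeholders and do not describe the SETI Institute's actual signal processing.

```python
from statistics import mean


def peak_to_mean(window):
    """Hypothetical interest score: strongest sample relative to the average."""
    avg = mean(abs(x) for x in window)
    return max(abs(x) for x in window) / avg if avg else 0.0


def scan_then_keep(windows, cutoff=3.0):
    """Examine every window, but retain only the ones that look interesting."""
    candidates = []
    for index, window in enumerate(windows):
        score = peak_to_mean(window)  # the full data set is still processed
        if score >= cutoff:
            candidates.append((index, score))
    return candidates


windows = [
    [1.0, 1.0, 1.0, 1.0],   # flat noise
    [1.0, 1.0, 17.0, 1.0],  # one strong spike
    [2.0, 2.0, 2.0, 2.0],   # louder, but still flat
]

# Only candidate windows go into the smaller long-term store.
kept = scan_then_keep(windows)
print(len(kept), "of", len(windows), "windows kept")
```

Here the processing cost stays high because nothing is discarded up front; only the storage requirement shrinks.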

Lessons from SETI and IBM RoboRXN

The examples of IBM RoboRXN and SETI’s search for extraterrestrial life are at opposite ends of the data lens spectrum. IBM was able to narrow the data lens aperture at the front of the process. SETI was not.

What these use cases tell data scientists and IT is that you can tamp down big data ingestion at the pre-processing stage if your use case is tight enough that you are unlikely to need data later that was initially regarded as extraneous. In other cases, you have limited ability to tighten up data ingestion.

Every big data project should include a task line that addresses how wide you need to set the aperture of the data lens for incoming data. This aperture can be widened or narrowed based upon the needs of each use case.

When you do this, you have a realistic way of controlling the processing, storage, and funding that will be needed for each project.