Big data memory tech is improving genome research

In-memory data storage has the potential to unlock big data file processing—and now new virtualization concepts are bringing it to life.

I have long felt that storage and memory aren't emphasized enough in IT planning—especially in the area of the very large data files that characterize big data.

Imagine, for instance, that you could virtualize and scale in-memory processing to eliminate data bottlenecks and I/O problems, and by doing so dramatically shorten your time to results, whether in real time or batch. Now imagine that at the same time, without losing speed, your memory could take continuous snapshots of data and offer near-immediate failover and recovery when you need it.

For a genome research institute or a university that can take days to process large files of genomic data, these capabilities would be invaluable.

At Penn State University, the data used in genome research exceeded available memory. Software was constantly crashing with out-of-memory (OOM) errors that prevented researchers from doing gene alignment on large orthogroups, which are sets of genes descended from a single ancestral gene. OOM errors aren't uncommon on operating platforms, databases and programming environments that don't support large memory footprints, so the staff wasn't surprised. Unfortunately, these genome workloads can run for hours or even days. When a job crashes, it must be restarted from the beginning, which costs time and money.

"For real-time and long-running use cases, when data sets get to hundreds of gigabytes or terabytes in size, the root cause of various performance problems is Data is Greater than Memory, or DGM," said Yong Tian, vice president of product management at MemVerge. "Routine data management operations that should take seconds become painfully slow. Loading, saving, snapshotting, replicating and transporting hundreds of gigabytes of data takes minutes to hours."

Tian said that the main bottleneck with applications using big data is I/O to storage. "The fastest SSD (solid state drive) is 1,000 times slower than memory, and the fastest disk is 40,000 times slower than memory. The more DGM grows, the more I/O to storage, and the slower the application goes," he explained.
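To see why the slowdown grows with DGM, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: it assumes accesses are spread uniformly across the data set, treats a memory access as one unit of cost, and borrows Tian's 1,000x (SSD) and 40,000x (disk) figures as the storage penalty. The function name and the 512GB memory size are hypothetical.

```python
def average_access_cost(data_gb: float, memory_gb: float, storage_penalty: float) -> float:
    """Average cost per access, in units where one memory access costs 1.

    Assumes uniformly distributed accesses, so the share of accesses served
    from memory equals the share of the data set that fits in memory.
    """
    if data_gb <= 0 or memory_gb < 0:
        raise ValueError("data_gb must be positive and memory_gb non-negative")
    in_memory_fraction = min(1.0, memory_gb / data_gb)
    return in_memory_fraction * 1.0 + (1.0 - in_memory_fraction) * storage_penalty


SSD_PENALTY = 1_000    # "1,000 times slower than memory" (from the quote)
DISK_PENALTY = 40_000  # "40,000 times slower than memory" (from the quote)
MEMORY_GB = 512        # hypothetical server memory size

for data_gb in (512, 768, 1024, 2048):
    ssd = average_access_cost(data_gb, MEMORY_GB, SSD_PENALTY)
    disk = average_access_cost(data_gb, MEMORY_GB, DISK_PENALTY)
    print(f"{data_gb:>5} GB of data: SSD-backed cost {ssd:>9.1f}, disk-backed cost {disk:>9.1f}")
```

Under these assumptions, the moment the data set is twice the size of memory, half of all accesses pay the storage penalty, and the average access is hundreds of times more expensive than when everything fits in memory.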

One solution to the problem is in-memory resource virtualization, which functions as a software abstraction layer for memory resources in the same way that VMware vSphere is an abstraction layer for compute resources and VMware NSX abstracts networking.

MemVerge's data management uses virtualized dynamic random access memory (DRAM) and persistent memory to bypass the I/O that would normally be required to access storage media like SSD, which offers substantial capacity but is 1,000 times slower to access. Because data held in DRAM is already in memory, there is no storage I/O "drag" on it.

The end result is that you add higher-capacity, lower-cost persistent memory alongside DRAM. This enables you to cost-effectively scale up memory capacity so all data can fit into memory, thereby eliminating DGM.

What results are organizations seeing?

"In one case, Analytical Biosciences needed to load 250GB of data from storage at each of the 11 stages of their single-cell sequencing analytical pipeline," Tian said. "Loading data from storage and executing code with I/O to storage consumed 61% of their time-to-discovery (overall completion time for their pipeline)… . Now with virtualized DRAM, the repetitive data loading of 250GB of data that must be done at each stage of the genomic pipeline now happens in one second instead of 13 minutes."
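The figures in that quote can be worked through with a few lines of Python. This is a rough calculation, assuming the 13-minute load applied equally to each of the 11 stages; the variable names are illustrative only. It covers loading time alone, since the 61% figure also includes code execution with I/O to storage and cannot be broken down from the quote.

```python
STAGES = 11              # pipeline stages, from the quote
LOAD_GB = 250            # data loaded at each stage, from the quote
LOAD_BEFORE_S = 13 * 60  # 13 minutes per load from storage
LOAD_AFTER_S = 1         # one second per load with virtualized DRAM

# Total time spent purely on loading data across the whole pipeline.
total_before_s = STAGES * LOAD_BEFORE_S
total_after_s = STAGES * LOAD_AFTER_S

# Per-load speedup and the effective load throughput in each case.
speedup = LOAD_BEFORE_S / LOAD_AFTER_S
throughput_before = LOAD_GB / LOAD_BEFORE_S  # GB per second
throughput_after = LOAD_GB / LOAD_AFTER_S    # GB per second

print(f"Loading time per pipeline run: {total_before_s / 60:.0f} min before, {total_after_s} s after")
print(f"Per-load speedup: {speedup:.0f}x")
print(f"Effective load throughput: {throughput_before:.2f} GB/s before, {throughput_after:.0f} GB/s after")
```

On those assumptions, loading alone drops from well over two hours per pipeline run to a matter of seconds.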

Meanwhile, at Penn State, the move to virtualized in-memory DRAM storage has eliminated the system crashes. And should a crash occur, in-memory snapshots happen so fast that it is easy to restart quickly from the time of the last snapshot.
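The restart-from-last-snapshot idea can be sketched in a few lines of Python. To be clear, this is not MemVerge's mechanism: it is a simplified, file-based checkpoint written with the standard library, and the `run_job` and `demo` functions, the `fail_at` parameter and the toy workload (summing numbers) are all hypothetical. It only shows why a job that snapshots its state does not have to start over after a crash.

```python
import os
import pickle
import tempfile


def run_job(items, checkpoint_path, fail_at=None):
    """Sum `items`, saving a checkpoint after every step.

    Returns (total, steps_done_in_this_run). If a checkpoint file exists,
    the job resumes from it instead of starting from the beginning.
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"next_index": 0, "total": 0}

    steps_done = 0
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += items[i]
        state["next_index"] = i + 1
        steps_done += 1
        # Write to a temporary file, then rename it over the old checkpoint,
        # so a crash mid-write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"], steps_done


def demo():
    """Crash partway through a ten-step job, then resume it."""
    items = list(range(10))
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "job.ckpt")
        try:
            run_job(items, path, fail_at=6)  # the first attempt crashes at step 6
        except RuntimeError:
            pass
        return run_job(items, path)          # the second attempt resumes


print(demo())
```

In this toy run, the second attempt only has to process the steps that had not completed before the crash. The faster and cheaper a snapshot is to take, the more often it can be taken, and the less work is lost.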

Virtualized DRAM is a breakthrough in big data processing and data recovery for very large files, and it's useful beyond the university setting.

Examples of real-time big memory applications in the commercial sector include fraud detection in financial services, recommendation engines in retail, real-time animation/VFX editing, user profiling in social media and high-performance computing (HPC) risk analysis.

Tian added: "By pioneering a virtual memory fabric that can stretch from on prem to the cloud, we believe that a platform for big data management can be created at the speed of memory in ways never thought possible to meet the challenges facing modern data-centric applications."
