Leaders of the High-Performance Storage System (HPSS), built by US Department of Energy research laboratories and IBM more than 25 years ago, are planning ways to modernize the data management product for its upcoming eighth generation.
Hierarchical storage management is the concept of using organization policies and software automation to determine which data to save, where to save it, when to move it to different storage methods, and when to delete it. There are many commercial products that do this, but HPSS emphasizes the vast scale needed for government science research.
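As a rough illustration of that idea, the sketch below shows how an age-based policy might decide what happens to a file. It is not taken from HPSS; the `FileRecord` type, the `apply_policy` function, the tier names, and the 30-day and 10-year thresholds are all hypothetical, chosen only to make the concept concrete.

```python
from dataclasses import dataclass

DAY = 86400  # seconds in one day


@dataclass
class FileRecord:
    """Hypothetical catalog entry for one archived file."""
    path: str
    size_bytes: int
    last_access: float  # seconds since the epoch
    tier: str = "disk"  # assumed tier names: "disk" or "tape"


def apply_policy(record, now, migrate_after_days=30, delete_after_days=3650):
    """Return the action an illustrative age-based policy would take.

    The thresholds are assumptions for this sketch, not HPSS defaults.
    """
    idle_days = (now - record.last_access) / DAY
    if idle_days >= delete_after_days:
        return "delete"
    if record.tier == "disk" and idle_days >= migrate_after_days:
        return "migrate_to_tape"
    return "keep"


if __name__ == "__main__":
    now = 10_000 * DAY
    recent = FileRecord("/proj/run1.dat", 10**9, now - 5 * DAY)
    stale = FileRecord("/proj/run0.dat", 10**9, now - 60 * DAY)
    print(apply_policy(recent, now))
    print(apply_policy(stale, now))
```

A real site would typically drive such decisions from many more attributes, such as file size, owner, or project, but the shape of the logic is the same: policy in, placement decision out.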
“How do you know what you’re archiving? We’re talking about archives now that are hundreds of petabytes to an exabyte. We think we’re going to be there in 2-3 years,” said Todd Herr, a storage architect for supercomputing at Lawrence Livermore National Laboratory in Livermore, CA.
The HPSS website lists 37 publicly disclosed customers; the rest are undisclosed. HPSS reached version 7.5.1 last year, and according to the online roadmap, version 7.5.2 is imminent and 7.5.3 will follow next year.
Version 8 is not yet on the official roadmap, but what are insiders thinking about for it? “What I think our challenge is, is to become good data curators. And I think that’s where we’re going to point the product,” Herr said. That would make HPSS capable of mining its own data and assigning metadata to it.
To make that happen, “Step one is being able to reveal information in the archive to some of the overarching namespace applications… right now we are working on that,” Herr explained, referring to software made by companies such as Atempo, Robinhood, Starfish, and StrongLink. “I think the next step there is scaling out metadata performance,” such as database partitioning and virtualizing multiple processors when performing searches, he said.
A more general goal for HPSS, not associated with any particular version, is for the software to work more efficiently with tape storage. “What we’re trying to do is enable fast access to tape. If you look across the industry spectrum, the words fast and tape generally don’t go together,” Herr observed. Livermore’s scientists in the nuclear energy field can access research data on tape from 50 years ago, most of which is textual research results or software code portable enough to run in modern systems, perhaps in emulation of vintage hardware.
“HPSS, because it exists at these huge supercomputing centers that operate at these massive scales, the consequences that exist for IT are more for a trickle down benefit,” such as 1990s work on Fibre Channel host bus adapters, Herr said. “We’re going to hit a problem way faster than most sites, and certainly faster than the vendors themselves because they cannot replicate our environment in most testing.”
Speed-matching buffers can be placed between primary disk storage and archive tape storage and used for both reads and writes. There are also physical improvements, such as faster tape motors and faster head positioning.
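The buffering idea can be sketched in software as a bounded queue sitting between a fast producer and a slower consumer. The example below is only a toy model under stated assumptions: the `stream_through_buffer` function, the `capacity` parameter, and the in-memory list standing in for tape are all hypothetical and do not reflect how HPSS or any tape drive implements buffering.

```python
import queue
import threading


def stream_through_buffer(blocks, capacity=4):
    """Pass blocks from a fast producer to a slower consumer via a bounded buffer.

    The list `written` stands in for tape; `capacity` is an assumed buffer size.
    """
    buf = queue.Queue(maxsize=capacity)
    written = []
    sentinel = object()  # marks the end of the stream

    def tape_writer():
        # Consumer: drains the buffer at its own pace.
        while True:
            block = buf.get()
            if block is sentinel:
                break
            written.append(block)

    writer = threading.Thread(target=tape_writer)
    writer.start()
    for block in blocks:
        buf.put(block)  # waits here whenever the buffer is full
    buf.put(sentinel)
    writer.join()
    return written


if __name__ == "__main__":
    print(stream_through_buffer(list(range(10))))
```

Because `put` waits when the buffer is full, the faster side is throttled to the slower side's pace without dropping or reordering data, which is the essence of speed matching.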
“Physics will always be physics… What you’re trying to minimize are the amount of times you have to go out to tape in the first place,” Herr added. His employer’s upcoming supercomputer, named Sierra, will operate at up to 125 petaflops and have a 125-petabyte file system, which offers ample testing ground for new ways of speeding up and managing cutting-edge data storage mechanisms.
Such topics are common at the annual HPSS user group meeting and the upcoming International Conference on Massive Storage Systems and Technology.