A recent study conducted by Santa Barbara, CA-based Strategic Research Corporation revealed that over 60 percent of enterprise data is stored outside the data center and that up to 75 percent of that data is unprotected. According to the study, this is a risky practice because edge data can be as critical to a company's survival as its more closely managed centralized data. Randy Corke, Vice President of Marketing at Signiant, Inc., offers TechRepublic readers strategies for backing up and archiving that data so IT professionals can sleep at night.
TechRepublic: How widespread is this problem of unprotected non-data center data?
Corke: Companies with multiple, or in some cases hundreds of, remote sites typically have backup failure rates of 50 percent or more per night. But the real problem is that they don't know which remote backups worked and which ones didn't. It's that uncertainty that keeps IT professionals up at night. They know they just can't keep operating their companies with so much of their data unprotected.
TechRepublic: What's causing such high failure rates?
Corke: Part of the problem is that many companies just can't afford full-time IT folks out at every remote site. So they have their administrative, marketing, or sales people try to take on this additional IT responsibility of doing backups. And since it's not their main job and they're not trained, problems with the backup process might go undetected for quite a while. The other thing is that the volumes of data are growing significantly. A University of California at Berkeley School of Information Management and Systems study that was published just a month or so ago said that between 1999 and 2002 there was a 114-percent increase in data stored on disk. So the volume of data has grown to the point where staff just can't physically back it up in the three hours the company has allotted to the task every night. So you've got a shortage of trained resources out there and you've got data volumes growing.
TechRepublic: So how does a company gain control over this remote data?
Corke: Ideally, the approach to take here is some sort of central control of this data. Stop relying on the remote locations to individually back it up or manage their own data because there are just too many moving parts in that sort of approach. The first step in gaining central control over this data is to simplify and standardize the way remote data is managed.
One emerging trend is consolidated backup. Instead of individual backups to tape at each remote site, companies are starting to consolidate the data at either a headquarters or a regional hub, if they have many sites. Another approach is to implement information life-cycle management for that data out in the remote sites. If you have data out at the remote site that hasn't been accessed in 90 days or more, statistics say there's only a two-percent chance that that data will ever be accessed again. So why keep that expensive disk out at the remote site and have to back it up all the time when it's not really being used? Following information life-cycle management practices, move that data to either a hub site or the data center and store it on lower-cost ATA-type disk. It's still available to the people at the remote sites if they should need it. However, you're not taking up expensive, high-performance disk space at the remote site.
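As a rough illustration of the 90-day rule Corke describes, the sketch below walks a directory tree and lists files whose last-access time is older than a cutoff. It is not taken from any product mentioned here: the function name `find_cold_files` and its parameters are hypothetical, and it assumes the file system records access times reliably (many systems mount volumes with access-time updates reduced or disabled, in which case modification time may be a better signal).

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_cold_files(root, max_idle_days=90, now=None):
    """Return sorted paths under root not accessed within max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    cold = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # st_atime is the last-access timestamp reported by the OS.
                if path.stat().st_atime < cutoff:
                    cold.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(cold)
```

The resulting list would be the candidate set for migration to a hub site or the data center; actually moving the files, and leaving them reachable from the remote site, is outside this sketch.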
TechRepublic: What are some key factors to consider in implementing a consolidation strategy?
Corke: I'd say there are five key factors:
- Diversity of your remote network
- Volume and size of files
- Security of data in transit
- Central policy for managing remote data
- Interfacing with remote applications
There are a lot of technologies on the market that can help you move data effectively over wide area networks or Internet connections. But when you're dealing with remote sites, you've typically got different types of network connections going out to different sites. You don't have T1 or T3 connections to every remote site. [In] some cases you may have a 128-Kbps link to a remote site. You need to know what kind of network connections you have out at the remote sites in order to identify the best approach to gaining control over your remote data.
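To see why link type matters, here is a back-of-the-envelope sketch that estimates how long a nightly transfer would take over each kind of link and whether it fits the three-hour backup window mentioned earlier. The T1 and T3 line rates are the standard 1.544 Mbps and 44.736 Mbps; the 2 GB nightly change volume and the 80 percent usable-bandwidth figure are assumptions chosen only for illustration.

```python
# Nominal line rates in bits per second.
LINK_SPEEDS_BPS = {
    "128-Kbps link": 128_000,
    "T1": 1_544_000,
    "T3": 44_736_000,
}


def transfer_hours(data_bytes, link_bps, efficiency=0.8):
    """Estimate hours needed to move data_bytes over a link.

    efficiency is the assumed fraction of the line rate that is usable
    after protocol overhead and competing traffic.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    usable_bps = link_bps * efficiency
    return data_bytes * 8 / usable_bps / 3600


def fits_window(data_bytes, link_bps, window_hours=3.0, efficiency=0.8):
    """Return True if the transfer completes within the backup window."""
    return transfer_hours(data_bytes, link_bps, efficiency) <= window_hours


if __name__ == "__main__":
    nightly_change = 2 * 10**9  # hypothetical 2 GB of changed data
    for name, bps in LINK_SPEEDS_BPS.items():
        hours = transfer_hours(nightly_change, bps)
        ok = fits_window(nightly_change, bps)
        print(f"{name}: {hours:.1f} h, fits 3 h window: {ok}")
```

Under these assumptions the same nightly volume that a T3 moves in minutes would take well over a day on the slowest link, which is why the network inventory comes first.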
Also, file size is important. Having a large number of smaller files versus a small number of large files can make a difference in CPU load on the remote site during data transmission. So look for a product that moves data very efficiently, one that conserves network resources by going in and detecting what data has changed and moving only the bytes of the data that have changed. In some cases, that might mean moving only parts of files instead of whole files.
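One simple way to move only the changed parts of a file can be sketched as follows. The hub keeps a list of per-block digests from the previous copy, the remote site compares the current file against that list, and only blocks whose digests differ are sent. This is an illustrative fixed-offset scheme, not the method of any product named in this interview; the function names and the 4 KB block size are assumptions. Because blocks are compared at fixed offsets, an insertion near the start of a file shifts every later block and forces them all to be resent, a weakness that rolling-checksum approaches are designed to avoid.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes


def block_digests(data, block_size=BLOCK_SIZE):
    """Return the SHA-256 digest of each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old_digests, new_data, block_size=BLOCK_SIZE):
    """Return {block_index: bytes} for blocks that are new or differ."""
    delta = {}
    for index, offset in enumerate(range(0, len(new_data), block_size)):
        block = new_data[offset:offset + block_size]
        is_new = index >= len(old_digests)
        if is_new or hashlib.sha256(block).digest() != old_digests[index]:
            delta[index] = block
    return delta


def apply_delta(old_data, delta, new_length, block_size=BLOCK_SIZE):
    """Rebuild the new file at the hub from the old copy plus the delta."""
    # Truncate or zero-pad the old copy to the new length, then patch blocks.
    buf = bytearray(old_data[:new_length].ljust(new_length, b"\0"))
    for index, block in delta.items():
        start = index * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)
```

In use, the remote site would send `changed_blocks(...)` and the new file length across the link, and the hub would call `apply_delta` against its stored copy.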
Security is another area to investigate. Obviously when you start to move data from remote sites to the corporate data center or a hub site you need to be concerned with the security of that data while it's being transmitted. There are two areas that you need to address: node authentication and data encryption. With node authentication, the technology makes sure that the sending node and the receiving node computers are absolutely, 100-percent authenticated. Typically this is done using digital certificates to ensure that before any data flows you've verified that the correct machines are at either end of the transmission. Then there's the technology to protect the data during transmission using some form of encryption. The best tools out there will encrypt the data with AES. It's the successor to what's called Triple DES (3-DES), which, up until a year ago, was state-of-the-art encryption but was then supplanted by AES.
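The two requirements above, certificate-based authentication of both nodes and AES encryption in transit, can be illustrated with a TLS configuration. The interview does not say which protocol any product uses, so TLS is an assumption here, as are the function name and its parameters. The sketch builds a hub-side context with Python's standard `ssl` module that refuses any remote node that does not present a certificate and limits TLS 1.2 cipher suites to AES-GCM.

```python
import ssl


def make_hub_tls_context(cert_file=None, key_file=None, ca_file=None):
    """Build a server-side TLS context that requires a client certificate.

    cert_file/key_file identify the hub; ca_file is the certificate
    authority used to verify remote nodes. All three are optional here
    only so the sketch runs without real certificate files.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Node authentication: the remote side must present a valid certificate.
    context.verify_mode = ssl.CERT_REQUIRED
    # Data encryption: restrict TLS 1.2 suites to ECDHE with AES-GCM.
    # (TLS 1.3 suites are managed separately by the underlying library.)
    context.set_ciphers("ECDHE+AESGCM")
    if cert_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        context.load_verify_locations(cafile=ca_file)
    return context
```

A real deployment would pass actual certificate paths and wrap the listening socket with `context.wrap_socket(sock, server_side=True)`; the remote node would use a matching client context that verifies the hub's certificate in turn.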
Developing and enforcing a central policy is another very important element in managing remote data from a central perspective. You don't want to set up individual processes for every remote site to scan the data and see what's changed, get it, encrypt it, and send it to corporate headquarters. You need a technology that you can set at the central site, with rules and schedules, and have those things applied to all your remote sites. That way the initial setup is much easier and, maybe more importantly, the policy is much easier to change as your business needs change. With that in place you can also greatly reduce or eliminate the need for the IT folks on the edge [at the remote sites]. This is a huge cost savings because you have the ability to control what happens to the data out in the remote sites and affect the flow of it to the core, all from the central site.
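The idea of one centrally defined policy applied to every site can be sketched as a default rule set plus optional per-site exceptions. Everything in this example is hypothetical: the `BackupPolicy` fields, the site names, and the schedule format are placeholders, not the configuration model of any product discussed here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BackupPolicy:
    """Rules and schedule defined once at the central site (illustrative)."""
    schedule: str = "01:00"              # nightly start time, local to the site
    include: tuple = ("*.doc", "*.xls")  # file patterns to protect
    encrypt: bool = True                 # encrypt data in transit
    max_idle_days: int = 90              # archive files idle longer than this


def resolve_policies(default, sites, overrides=None):
    """Return the effective policy for each site.

    overrides maps a site name to the fields that differ from the default.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(sites)
    if unknown:
        raise KeyError(f"overrides given for unknown sites: {sorted(unknown)}")
    return {
        site: replace(default, **overrides.get(site, {}))
        for site in sites
    }
```

Changing a business rule then means editing the default once, for example `replace(BackupPolicy(), max_idle_days=60)`, and resolving again, rather than touching a process at each site.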
And then the last piece that's critical to this strategy is the ability to interface with the applications out in the remote sites. As you can imagine, not every remote site has all the same applications. You have a sales location or a manufacturing location or a shipment location, and so forth. So they're going to have different applications out there—all creating data that needs to be managed and backed up and archived. So you need the ability to interface with those applications. You need to be able to trigger those applications to extract the data that should be backed up.
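One way to picture this kind of application interface is a registry of export hooks, one per application type, that the central system calls before collecting a site's data. The application names, file names, and functions below are invented for illustration; a real hook would ask the application itself to write a consistent export rather than return a fixed path.

```python
from typing import Callable, Dict, List, Tuple

# Maps an application name to the hook that extracts its data.
EXPORTERS: Dict[str, Callable[[str], List[str]]] = {}


def exporter(app_name):
    """Decorator that registers an export hook for an application."""
    def register(func):
        EXPORTERS[app_name] = func
        return func
    return register


@exporter("order_db")
def export_order_db(staging_dir):
    # Placeholder: a real hook would trigger a database dump here.
    return [f"{staging_dir}/order_db.dump"]


@exporter("shipping_log")
def export_shipping_log(staging_dir):
    # Placeholder: a real hook would rotate and copy the live log.
    return [f"{staging_dir}/shipping.log"]


def collect_backup_set(site_apps, staging_dir) -> Tuple[List[str], List[str]]:
    """Run each site's export hooks; return (files, apps with no hook)."""
    files, missing = [], []
    for app in site_apps:
        hook = EXPORTERS.get(app)
        if hook is None:
            missing.append(app)
        else:
            files.extend(hook(staging_dir))
    return files, missing
```

Reporting the applications that have no hook matters as much as running the ones that do, since those are the ones whose data would otherwise go unprotected without anyone noticing.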
There are technologies out there that can do all of these things. Which you choose will ultimately be driven by your business requirements. For example, if your business needs to periodically move data from remote sites to a central site, perhaps nightly for a consolidated backup, then choose a product strong in periodic data movement. Two software solutions that immediately come to mind are Signiant's Mobilize and EMC's OnCourse. If your business needs continual data movement for business continuity, so that if a remote site goes down all of the data exists at a central site for recovery, then continuous movement technology would be best. Some of the continuous backup solutions out there include Legato's RepliStor, NSI's DoubleTake, and Veritas' Storage Replicator.
TechRepublic: Any last words of advice to companies dealing with unprotected remote data?
Corke: The problem of unprotected and unmanaged remote data is becoming a top-five issue for most sizable companies with remote sites. It's a big problem, and you can't ignore it any longer. There are technologies that have emerged over the last couple of years that can give companies central control over their remote data. They do that by automating remote processes and moving and aggregating data to central sites so that you can get rid of all the individual processes at the remote sites. By using centrally controlled, more consolidated approaches to backup and archive, you're going to solve a significant portion of your remote data exposure, meaning the data that sits unprotected out there.