Based upon that discussion, the real issue here is sloppy backup practices. If you are executing a normal backup rotation, you should not end up with a proliferation of the same data over and over again.

For example:

- Corporate off-site full backups are written to tape quarterly and stored in a secure off-site facility for 5 years (the retention period depends on your exposure).
- A full backup is written to tape on the first weekend day of each month and moved off-site at the end of the month, once the next month's full has been verified.
- Incremental backups of only USER GENERATED data are written to local disk storage every 6 hours.
- Incremental backups from the previous week are moved to tape every weekend, and the disk storage they used is freed for new incremental backups. These tapes stay on site.
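
As a rough illustration, the tape side of the schedule above could be sketched in Python like this. The job names are made up for the example, and two details the schedule leaves open are filled with assumptions: the quarterly full is cut alongside the monthly full in January, April, July and October, and "every weekend" is taken to mean Saturday.

```python
from datetime import date, timedelta


def first_weekend_day(year: int, month: int) -> date:
    """Return the first Saturday or Sunday of the given month."""
    d = date(year, month, 1)
    while d.weekday() < 5:  # Monday=0 ... Saturday=5, Sunday=6
        d += timedelta(days=1)
    return d


def backup_jobs(day: date) -> list[str]:
    """List the tape jobs the rotation calls for on a given day.

    The 6-hourly incrementals to disk run every day and are not listed.
    Job names are hypothetical labels, not any product's terminology.
    """
    jobs = []
    if day == first_weekend_day(day.year, day.month):
        # Assumption: the quarterly off-site full shares the date of the
        # monthly full in the first month of each calendar quarter.
        if day.month in (1, 4, 7, 10):
            jobs.append("quarterly-full-offsite")
        jobs.append("monthly-full")
    if day.weekday() == 5:
        # Assumption: "every weekend" means Saturday.
        jobs.append("move-week-incrementals-to-tape")
    return jobs
```

Calling `backup_jobs(date(2009, 4, 4))` returns all three jobs, since that Saturday is the first weekend day of a quarter-opening month; an ordinary weekday returns an empty list.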

In this example:

- The quarterly full tapes are stored for a long period depending on the data loss exposure requirements.
- The monthly full tapes are recycled on an 18-month rotation scheme.
- The incrementals, since they contain only modified or new files, automatically deduplicate the data store on the local disk.
- The tapes created from the incrementals are kept on-site after the disk space they occupied is cleared for new incrementals.
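
The reason an incremental does not pile up copies is that it only picks up files changed since the previous run. A minimal sketch of that selection step, assuming modification time is the change signal (real backup software may rely on archive bits, change journals or its own catalog instead), with `changed_since` being a name invented for this example:

```python
import os


def changed_since(root: str, last_backup: float) -> list[str]:
    """Return paths under root modified after last_backup (epoch seconds)."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) > last_backup:
                    changed.append(path)
            except OSError:
                # File vanished or became unreadable after listing; skip it.
                continue
    return sorted(changed)
```

A file that has not been touched since the last run never shows up in the result, so it is never written to the incremental a second time.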

The result is long-term data recoverability, automatic disk grooming, control of local storage requirements, and no need for deduplication, which can cause a restore failure for more than a single system or backup grouping if the "single instance" of the required data is lost.

Care to explain?
C_Tharp 6th May 2009
This is the first time that I have seen the term "deduplication". Please explain the problem that you are trying to solve and identify the proposed solutions.
In this instance, the implied "bad thing" is that an average IT shop will have hundreds (if not thousands) of copies of the same files over the bulk of their backups. This implies that there's a huge storage waste that is solved by a "monitor" that removes these duplicated files from the backups - thus "deduplication".
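
To make the idea concrete, here is a minimal file-level sketch of how such a tool might spot duplicates: hash every file's contents and group the paths that share a digest. This is an illustration only, not how any particular product works; many products compare blocks or chunks rather than whole files, and `find_duplicates` is a name made up for the example.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root: str) -> dict[str, list[str]]:
    """Map SHA-256 digest -> paths, for content that appears more than once."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            groups[digest.hexdigest()].append(path)
    # Keep only digests shared by two or more files.
    return {d: sorted(p) for d, p in groups.items() if len(p) > 1}
```

Every group in the result is a set of files a deduplicating store would keep only once, which is also why losing that one stored copy affects every backup that points at it.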

Since this is another of those IT "made-up-words", you'll notice that your spell check process will always flag it until you add it to your dictionary.

If you haven't already read my response above, please do. The reality of the situation is that a properly managed backup process will not suffer from duplication of data, which obviates the need for "deduplication" software.
Hmmm. It sounds like incremental backups to me.

My question was not naive, rather it was a request for clarification. I see a tremendous amount of duplication in the way that people work. I wanted to know if someone had created a product to manage the problem.

Files are broadcast to many people who then make personal copies and possibly modify them. The process repeats. Much of it is unnecessary. There needs to be version control on every file like there is in software code.

Email replies, often to a long list of recipients, include the original message. This is repeated many times, creating very long and often very confusing messages. No email system can stop this duplication. Attachments of modified files compound the problem.

Policies that require retention of everything, forever, compound the problem.

And then there is the culture that requires every wise person to practice "CYA". This generates a lot of record keeping and copies of everything for "just-in-case".

If space reduction is the goal, what about the form of the information? How many files or email messages contain copies of graphics, such as screen displays, that could have been reference links instead? How many of these could have used a different format for the same information and produced a much smaller file? There are many graphics formats and several text formats. Has anyone ever looked at the difference between a simple text file and a Word document for the same information? What about HTML or XML formats? What about Excel versus character-delimited formats? Not everything needs to have controlled, pretty presentation.

Well, how do you get control? How much is enough? I suspect that very few are willing to consider the actual costs of these problems. If they did, a lot less would be produced.

No backup system can manage these problems.
That's the big point. However, retention of data doesn't mean retention on disk is the requirement. This is why the mechanism I described of moving incrementals off to tape can resolve the disk capacity points of the deduplication issue.

The big issue that remains is - "how" do you find things after the fact? This is where the focus should be placed. By using a reliable software solution and tape, you -

A: keep disk space requirements manageable
B: save on energy, as tape is still the greenest high-capacity storage mechanism going
C: save on storage costs since tapes are easier to store than disks in an off-site scenario.

With that, the question then becomes one of - "How do I retrieve that one file or email needed to satisfy the legal discovery process?"

If I were a software vendor looking for a useful new technology, I'd be working on tying metadata search engines such as Microsoft's Filesystem Search Indexing, Apple's "Spotlight", and Linux's "Beagle" directly into my backup and archival operations.

This is old technology. I was doing these things twenty years ago.

Some of the backup utilities I have used, and any I would buy, will create a listing of every file written to the tape along with the identity of the tape. The listing files are easily searched with standard tools or an editor. In some cases the listing is a part of the backup utility. Once the tape is identified, load it and retrieve the file.
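
For anyone who has not worked with such listings, a small sketch of the search step follows. The listing layout is an assumption made for the example (one line per file, tape identity and path separated by a tab); real utilities each have their own format, and `find_tapes` is a hypothetical helper name.

```python
import fnmatch


def find_tapes(listing_lines, pattern):
    """Return sorted (tape_id, path) pairs whose path matches a glob pattern.

    Assumed line layout: "<tape id>\t<file path>". Lines that do not fit
    that layout are ignored.
    """
    hits = []
    for line in listing_lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue
        tape_id, path = line.split("\t", 1)
        # fnmatchcase keeps matching case-sensitive on every platform.
        if fnmatch.fnmatchcase(path, pattern):
            hits.append((tape_id, path))
    return sorted(hits)
```

Passing an open listing file and a pattern such as `"*/budget*.xls"` gives back every tape that holds a matching file, which tells you which tape to load.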

Of course, knowing what is in a file is another matter.
Bingo!
Timpraetor 9th May 2009
You know that and I know that, but most "new" admins only know what the backup software vendors "sell" them.

Apparently, for many it's not that straightforward.