Backups are a necessity of digital life. The more frequently you make them, the better off you’ll be. Of course, the more files you have, the bigger your backups will be. Quite often, however, what matters is not the total size of your backup, but the maximum size of each of its parts. In other words, it’s often not a problem if a complete system backup consists of 100 archive files, as long as none of them is bigger than X bytes.

The general reason for this is obvious: imposing a maximum size on each archive guarantees that all of them will fit on whatever backup media is available.

In the old days when, as Linus says, “men were men and wrote their own drivers”, the maximum size was 1.44 MB, if you wanted to fit every single tar archive on one floppy. Later on, the limit jumped to 700 MB with the arrival of CDs, and then to several GB with DVD drives. The advent of USB keys further increased the average size of backup devices but, paradoxically, lowered the maximum size of each archive file (at least for a while): if you wanted your key to be readable by any computer you might plug it into, it had to use a file system with a maximum file size of 2 GB.

Imposing certain limits can be quite handy even in these times of cloud computing. It will, for example, help you to automatically distribute your backups over several free online storage services with limited space. I’ve heard of people who get a Gmail address just to have online backups (as attachments to email they send to themselves) that remain available even when Dropbox and friends are blocked or monitored (it happens, in certain companies). In those cases, making sure that attachments never exceed some predefined size makes it possible to deliver email to those accounts, even through servers that refuse to handle very large messages.

Okay, so now that (I hope) we have agreed that it is handy to automatically enforce a size limit on each of your backup archives — how do you do it?

The easy solution…

Theoretically, one way is to use the split command, which was created for this very purpose. If your complete backup is 7100 MB, this command:

split --bytes=2000MB my_complete.backup.tar my_backup_

will split it into four files named my_backup_aa, my_backup_ab, my_backup_ac and my_backup_ad. The first three will be exactly 2000 MB each, the fourth one will be 1100 MB, and concatenating them with “cat” will recreate your complete, original backup file:

cat my_backup_* > same_as_my_complete.backup.tar
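
If you go this route, it is worth checking that the pieces really do add up to the original archive. One quick test, assuming the original file is still on disk, is to compare checksums; if the two hashes are identical, nothing was lost in the split:

sha256sum my_complete.backup.tar
cat my_backup_* | sha256sum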

Personally, I have stopped using split. It is very simple to use, but it has one big drawback. The way it works means that, if you need to extract even one single file from the original big backup archive, you must first concatenate all its pieces to rebuild it. In the best case, this is an annoyance. In any other case, it may make your backups unusable. In the example above, if any of the four “my_backup_xx” pieces got lost or corrupted, you would lose at least some of your files, starting with those that were split across two pieces! Oh, and what if, as actually happened to me, you needed to extract one file from those same four parts, using a netbook that only has 3 GB of free space?

… and a better solution

After experiencing all the problems above and then some, I came up with the script below. It builds a list of files to back up in one tar archive, until their cumulative size is as close as possible to a given $MAXSIZE. When adding one more file would exceed that threshold, the script builds the tar archive with all the files present in the list, then starts over, until there are no more files to back up. It has the same practical effects as split, without the drawbacks, because each file it produces is a complete, independent tar archive:

       1     #! /bin/bash
       2
       3     BASEDIR=$1                            # directory tree to back up (first argument)
       4     MAXSIZE=20000000000                   # maximum size of each archive, in bytes
       5     TOTSIZE=0                             # cumulative size of the files in the current list
       6     ALL_FILES=/tmp/files_sorted_by_name   # sorted list of all the files to back up
       7     CURRENT_LIST_NUM=0                    # index of the current file list / archive
       8
       9     rm -f "$ALL_FILES"
      10     find "$BASEDIR" -type f | sort > "$ALL_FILES"
      11
      12     while IFS= read -r line
      13       do
      14       CURFILESIZE=$(stat -c %s "$line")   # size of the current file, in bytes
      15       TOTSIZE=$(($TOTSIZE+$CURFILESIZE))
      16       if [ "$TOTSIZE" -gt "$MAXSIZE" ]
      17         then
      18            tar cf mybackup.$CURRENT_LIST_NUM.tar -T /tmp/file_list_$CURRENT_LIST_NUM
      19            rm  /tmp/file_list_$CURRENT_LIST_NUM
      20            TOTSIZE=$CURFILESIZE
      21            CURRENT_LIST_NUM=$(($CURRENT_LIST_NUM+1))
      22         fi
      23       echo "$line" >> /tmp/file_list_$CURRENT_LIST_NUM
      24     done <$ALL_FILES
      25     tar cf mybackup.$CURRENT_LIST_NUM.tar -T /tmp/file_list_$CURRENT_LIST_NUM
      26     rm  /tmp/file_list_$CURRENT_LIST_NUM
      27     exit

The script first saves (line 10) a list of all the files in the target directory ($BASEDIR, passed as the first argument) inside $ALL_FILES. Sorting the result of “find” guarantees that all the files in any given folder will end up, if not in the same archive, at least in consecutive ones. The loop of lines 12-24 reads $ALL_FILES one line at a time, calculates the size in bytes of the current file and adds it to $TOTSIZE. If the result is bigger than $MAXSIZE (line 16), we create a tar archive with all the files already listed inside /tmp/file_list_$CURRENT_LIST_NUM, reset $TOTSIZE to the size of the current file and increment $CURRENT_LIST_NUM. In both cases, the current file is then appended to the (possibly brand new) file list (line 23). When the loop ends, line 25 archives whatever files are left in the last list.
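
To see how this plays out in practice, here is how a typical run might look, assuming the script has been saved as split_backup.sh and made executable (the script name and all the paths below are just examples):

./split_backup.sh /home/me/Documents

ls mybackup.*.tar

tar tf mybackup.2.tar | grep some_report.odt
tar xf mybackup.2.tar home/me/Documents/work/some_report.odt

Each archive is self-contained: you can list it, copy it around or extract single files from it without ever touching, or even having, the other pieces. (The path given to the last command has no leading slash because tar strips it from absolute paths when creating the archive.)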

My actual backup script does a lot more (for one thing, it keeps email, pictures and other documents separate). However, the logic that prevents the tar files from getting too big is exactly what you see above. You may run it as is, use the current date instead of “mybackup” as the prefix, turn all the code into a shell function, and much more. If you do, please let us know!
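
The date-as-prefix variant, for example, can be as simple as defining the prefix once near the top of the script (the variable name is just my choice):

BACKUP_PREFIX=$(date +%Y-%m-%d)

and then replacing “mybackup” with $BACKUP_PREFIX in the tar commands of lines 18 and 25:

tar cf $BACKUP_PREFIX.$CURRENT_LIST_NUM.tar -T /tmp/file_list_$CURRENT_LIST_NUM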