The authors of Duplicati, an open-source file backup client, discuss the impetus for the creation of their project, keeping data secure in the cloud, and backup integrity with incremental data storage.
Duplicati is a free, open-source (LGPL) backup client that stores incremental, compressed backups on various public cloud services such as Amazon S3 and Microsoft OneDrive, or on private file servers using standard protocols such as WebDAV, SSH, and FTP. Backups with Duplicati are AES-256 encrypted, and the program features a scheduler to ensure that backups are always up to date.
Duplicati is maintained by Kenneth Skovhede and René Stach, who have generously offered to answer some questions about managing incremental data storage, cloud security, and backup integrity.
James Sanders: What was the impetus for the creation of Duplicati?
Kenneth Skovhede: I started using duplicity for my backups and modified it to run on Windows. But after some time I gave up. The engine did not really work well on Windows, and communication over stdin/stdout with a Python process had its limits when it came to building an easy-to-use user interface. So I started to rewrite the key component, which was the rdiff/rsync. From there I got to a new user interface, and later on I added features that required breaking compatibility with duplicity. That was the time when Duplicati was born as a completely new backup solution.
James Sanders: Do you believe commercial backup services are less secure?
Kenneth Skovhede: It all comes down to one question: who has access to your key?
Some providers store the key on the server system, and that is obviously insecure. Other providers claim that the encryption is done client side, and that they do not have access to your key. If the latter is true, then I would say they are safe.
But it is really hard to verify if the source code is not available. As we have seen with the Snowden revelations, even "nice" companies can be forced to turn rogue at any moment. In that sense I would say that anything closed-source is insecure, including backups.
Open source is not a panacea and has its own issues, but it makes it much harder to inject rogue access into a program. Where there are multiple unrelated people from different countries working on a project (or just viewing the code), too many people need to be "persuaded" before something nasty can be applied.
James Sanders: To this extent, do you have any concerns about the security of data stored in the cloud?
Kenneth Skovhede: Unless you control your key and encrypt everything stored remotely, you have no security. It does not matter who you consider the good guys, as you can be certain that if one outside party can access your data, another can as well. As we can see from both the Snowden documents and the current Microsoft Ireland case, geographical borders are no protection for your data, regardless of how the hosting provider describes it.
It is a real problem, as many services are REALLY convenient to use, and it makes perfect business sense to have some company pool together infrastructure services. For backups, I would say the solution is simple, but for something like a virtual machine, shared documents, or a hosted database, it is much harder to keep secure and operational at the same time.
So yes, I am very concerned about storing data in the cloud, and probably store way too much personal data unencrypted anyway (e.g., my email). I think any company that stores data in the cloud should accept the fact that it is not kept secret unless you own the infrastructure.
James Sanders: What type and quantity of data have you backed up for your personal needs?
Kenneth Skovhede: We have two main people in our team who are driving the project forward. I myself have:
- 50 GB of family photos
- 10 GB of source code and various documents
René [Stach] has:
- 115 GB family photo library
- 136 GB private documents of any kind (nicely compressed to 27 GB)
- 5 GB mixed office documents
James Sanders: For Duplicati 2.0, how much smaller have file backups become with the use of block backup and LZMA2?
Kenneth Skovhede: There are several things to consider. First, Duplicati 2.0 uses block-based storage and does not require chains of full and incremental backups, so there are never two or more full backups present. This is a huge saving compared to Duplicati 1.3. Second, the block-based storage never stores the same data twice. If there are two identical files, they are stored only once. If a file is moved, it is not treated as a new file and added to the backup again; its data is still stored just once. Third, we introduced LZMA2 compression, which results in about 10-30% smaller files, depending on the type of files that are backed up.
Summarizing, I assume that most backups will be shrunk down to 40% or less of the space that was required with Duplicati 1.3.
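To make the deduplication idea concrete, here is a minimal sketch of content-addressed block storage. It is an illustration only, not Duplicati's actual implementation: the `BlockStore` class, the use of SHA-256 as the block identifier, and the tiny 4-byte block size are all assumptions chosen to keep the example readable.

```python
import hashlib

# Hypothetical, deliberately tiny block size so the example is easy to follow.
BLOCK_SIZE = 4


class BlockStore:
    """Illustrative block store: each distinct block is kept exactly once."""

    def __init__(self):
        self.blocks = {}  # block hash -> block bytes
        self.files = {}   # file path -> ordered list of block hashes

    def add_file(self, path, data):
        hashes = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if this content has not been seen before.
            self.blocks.setdefault(digest, block)
            hashes.append(digest)
        self.files[path] = hashes

    def restore(self, path):
        # Rebuild the file by concatenating its blocks in order.
        return b"".join(self.blocks[h] for h in self.files[path])


store = BlockStore()
store.add_file("a/photo.jpg", b"abcdefgh")
store.add_file("b/photo.jpg", b"abcdefgh")  # same content under a new path
store.add_file("a/notes.txt", b"abcdXYZ!")  # shares its first block

print(len(store.blocks))  # 3 distinct blocks: "abcd", "efgh", "XYZ!"
```

In this sketch, a moved or duplicated file adds only a new list of hashes; no block data is stored a second time.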
James Sanders: For concerns about backup integrity, have you considered using something like Reed-Solomon error correction?
René Stach: Yes, that is a feature I very much look forward to implementing. It was registered as an issue back in 2010, when I was hoping to simply use an existing PAR2 library for this (issue #314). Unfortunately, it seems that none of the existing PAR2 implementations are flexible enough to incorporate into Duplicati, so I will look at rolling a new PAR2 library to support this.
Due to the technical challenges here -- that we sometimes just do not have the time for -- we added a backup verification feature to Duplicati 2.0. Whenever a backup runs (or if the user starts the verification manually), a random set of backup files is downloaded and its integrity is checked. That way, Duplicati 2.0 can quickly find out when files got corrupted or modified.
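The verification approach described above can be sketched roughly as follows. This is a hedged illustration, not Duplicati's code: the `verify_sample` function, the dictionary standing in for remote storage, and the choice of SHA-256 for the recorded hashes are assumptions made for the example.

```python
import hashlib
import random


def verify_sample(remote_files, expected_hashes, sample_size, rng=None):
    """Check a random sample of backup files against their recorded hashes.

    remote_files:    dict of name -> bytes, standing in for a download call
    expected_hashes: dict of name -> SHA-256 hex digest recorded at upload time
    Returns the sorted names of sampled files that are missing or altered.
    """
    rng = rng or random.Random()
    names = sorted(expected_hashes)
    chosen = rng.sample(names, min(sample_size, len(names)))
    damaged = []
    for name in chosen:
        data = remote_files.get(name)  # "download" the file
        if data is None or hashlib.sha256(data).hexdigest() != expected_hashes[name]:
            damaged.append(name)
    return sorted(damaged)


# Hypothetical backup volumes and the hashes recorded when they were uploaded.
volumes = {"vol1": b"alpha", "vol2": b"beta", "vol3": b"gamma"}
expected = {name: hashlib.sha256(data).hexdigest() for name, data in volumes.items()}

# Simulate corruption of one volume on the remote side.
remote = dict(volumes)
remote["vol2"] = b"bEta"

# Sampling all three volumes guarantees the corrupted one is checked.
damaged = verify_sample(remote, expected, sample_size=3)
print(damaged)  # ['vol2']
```

With a sample smaller than the full set, a single run may miss a damaged file; the trade-off is that repeated runs cover more of the backup over time without downloading everything at once.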