Bup, the backup tool with a clever idea
What backup tool are you using? You are using one, right? I am using one these days, namely git. My entire home directory is a collection of git repositories. Using git for backups is great because it is easy to synchronize data. It is also easy to restore files without needing access to the backup server. I keep my .git directories on a separate partition and symlink them into the right position. Every few days I push all my git repositories to my backup server, which has a user called 'git' with 'git-shell' as its login shell. So sending backups to the server can happen safely over ssh.
There are downsides to using git though. The main problem becomes obvious if you work with large files. I do not have a lot of large files, but I do have a few virtual machines. One way of backing those up is to put them on a disk that supports taking snapshots. ZFS or 'lvcreate --snapshot' are solutions for this. But what if you want to back up different versions of large files to a different storage medium?
This is where bup comes in. Bup is a backup tool with a clever idea on top of git: it splits files by using a rolling checksum. Then it stores each file fragment in a git pack file.
To understand how this works, you need to understand a bit about how git works and how rolling checksums work.
Git stores every version of every file in every directory that you ever committed. If you make a commit in git, you store the entire directory tree. Yet, the disk usage of git is not so large. That is because each file is stored under a name that is determined from the content of the file. A fancy name for this is 'content addressable storage'. If two files are the same, they will have the same name. A directory is a list of file names, each paired with the content-derived name of that file. If two directories have the same content, the list is the same, and the name under which the directory is stored in the .git folder is also the same. This concept is explained clearly in the git community book.
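To make the idea of content-derived names concrete, here is a minimal sketch in Python. It assumes a repository that uses git's default SHA-1 object format, where the name of a file's content (a 'blob') is the SHA-1 of a short header followed by the bytes of the file. The function name git_blob_id and the dictionary used as a store are my own illustrations, not part of git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the name git gives to a file with this content (SHA-1 format)."""
    # git hashes a header "blob <size>\0" followed by the raw content.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# A toy content-addressable store: the key is derived from the value.
store = {}
for path, content in [("a.txt", b"hello\n"),
                      ("copy-of-a.txt", b"hello\n"),
                      ("b.txt", b"something else\n")]:
    store[git_blob_id(content)] = content

# Three files went in, but the two identical ones share a single entry.
print(len(store), "objects stored for 3 files")
```

If git is installed, the output of git_blob_id can be compared against 'git hash-object <file>' for the same content.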
Git was designed for use with many small files. It does not work efficiently with large files. If one bit changes, the entire file has to be stored again. Git has a mechanism to store only the difference between files, but it is expensive to calculate this for large files.
Bup uses the idea of content addressable storage and solves the problem of big files with a rolling checksum. The concept is explained in an entertaining way in the bup DESIGN document. A rolling checksum is a checksum which you can slide along a data blob. You start at the start of the blob and calculate the checksum over a fixed-size window. Then you slide one byte along the blob; you subtract the byte at the low end of the sliding window and add a byte at the high end of the sliding window. This gives you a new checksum. With a rolling checksum it is efficient to calculate a checksum for all positions in a blob (apart from the part where the sliding window started). At each position where the lowest 13 bits of the checksum are all 1, bup splits the blob. Since that happens on average once every 2^13 positions, this gives chunks with an average size of 8192 bytes. All of these chunks are compressed and stored separately in a git packfile. If a bit is changed in the blob, two to three things change: the chunk where the bit is located, the following chunk (if there is one) and the list of chunks that is stored, similar to a directory, to restore the blob.
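The splitting can be sketched in a few lines of Python. This is an illustration under my own assumptions, not bup's actual code: the checksum is a simple two-part sum in the style of rsync, the window size of 64 bytes is picked arbitrarily, and the names split_chunks and chunk_ids are made up for this example. Only the rule of splitting where the lowest 13 bits are all 1 comes from the description above.

```python
import hashlib
from typing import List

WINDOW = 64                          # sliding window size in bytes (assumed)
SPLIT_BITS = 13                      # split when the lowest 13 bits are all 1
SPLIT_MASK = (1 << SPLIT_BITS) - 1   # 0x1FFF


def split_chunks(data: bytes) -> List[bytes]:
    """Cut data after every position where the rolling checksum matches."""
    window = bytearray(WINDOW)  # ring buffer of the last WINDOW bytes, zeros at first
    pos = 0                     # index of the oldest byte in the ring buffer
    a = 0                       # plain sum of the bytes in the window
    b = 0                       # position-weighted sum of the bytes in the window
    chunks = []
    start = 0
    for i, byte in enumerate(data):
        out = window[pos]       # byte leaving at the low end of the window
        window[pos] = byte      # byte entering at the high end
        pos = (pos + 1) % WINDOW
        # Slide the window: subtract the old byte, add the new one.
        a = (a - out + byte) & 0xFFFF
        b = (b - WINDOW * out + a) & 0xFFFF
        if (b & SPLIT_MASK) == SPLIT_MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last split
    return chunks


def chunk_ids(data: bytes) -> List[str]:
    """Content-derived names for the chunks, in order (the 'list of chunks')."""
    return [hashlib.sha1(chunk).hexdigest() for chunk in split_chunks(data)]
```

Because the checksum in this sketch depends only on the last 64 bytes, a changed bit can only move split points in its immediate neighbourhood. Comparing chunk_ids for the original and the modified blob should therefore show the same names everywhere except around the change, so only those few chunks need to be stored again.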
The current implementation of bup was written with speed of development in mind; git binaries are called from Python code. There is a read-only FUSE implementation on top of bup to make it convenient to browse old versions of files. Going one step further, write support could be added. That would turn bup into a filesystem, similar to Fossil with Venti or ZFS with deduplication. Bup would be better at deduplication, due to the rolling checksum.