Duplicity and dealing with large backup files. Life hack!
December 21st, 2018
So we recently ran into an issue for which we had to use the backups to restore some data. You know... one of those days. While we were restoring the latest SQL dump on the backup server, it took ages to build up the Duplicity cache. But why does that happen?
If you are not familiar with it, Duplicity is one of many backup tools available on Linux systems. It uses the rsync algorithm and GnuPG to create and encrypt backups of your application's files and directories. The crux is that, because it uses GnuPG to encrypt and sign these backup archives, they are safe from spying and modification by the server. If you don't have the key, tough luck getting in. Storing the encryption key on the same server would defeat the purpose of having secure archives, so in a way Duplicity isn't aware of what is in the backups. It obviously knows when it creates them, but after encrypting it quickly forgets.
This is where the cache comes storming in. That cache holds your backup chain's index files, a table of contents of the backup repository so to speak. Operations like fetching status and creating incremental backups need to know what is already backed up, and if that information is cached the archives do not need to be decrypted each and every time. From what we have seen, the cache is built by copying a lot of related backup information, like GPG signatures, volumes et cetera, to a separate cache directory. Can you see where our troubles began?
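To get a feel for how big that cache is on your own machine, a small sketch like the one below can help. It assumes Duplicity's default cache location, ~/.cache/duplicity (which can be changed with --archive-dir), where each backup target gets its own subfolder. The name cache_report is a made-up helper for this post, not something Duplicity ships.

```shell
# List every backup's cache folder with its file count and size on disk.
# Assumption: the cache lives in ~/.cache/duplicity unless a path is passed in.
# The body runs in a subshell ( ... ) so its variables do not leak out.
cache_report() (
    archive_dir=${1:-"$HOME/.cache/duplicity"}
    for dir in "$archive_dir"/*/; do
        [ -d "$dir" ] || continue              # nothing cached yet
        count=$(ls "$dir" | wc -l | tr -d ' ') # number of cached files
        size=$(du -sk "$dir" | cut -f1)        # size in kilobytes
        printf '%s\t%s files\t%s KB\n' "$(basename "$dir")" "$count" "$size"
    done
)
```

Run cache_report without arguments for the default location, or pass the folder you gave to --archive-dir.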
This particular application is not big in the way of Facebook big, but it's substantial for us. Having switched to a new and bigger backup server in January '18, we had somewhere around 322 incremental backups and 10 full backups. These are backups of a database currently ~16GB in size, so the storage starts racking up. Either Duplicity wasn't aware of its cache, or it was moved or never created, we do not know. But when trying to restore the dump it had to make a pass over each backup: first indexing which files it has, followed by "replaying" them to get to the final state. Over a span of a couple of hours we couldn't quite figure out why it had to replay all of them when there was a full backup almost at the end of that chain. It would surprise me if there isn't an option to say "restore this file starting from the last full backup", but if there is... we could not find it. After waiting for that restore to happen and searching for alternatives, we came across a StackOverflow post (that I sadly cannot find now) that made a very useful suggestion.
If you want to restore from a particular point in time, take the most recent full backup before that point and copy it, together with every incremental backup made after it, to a new folder.
Say you create a full backup on the first day of every month: find all volumes created on that day. That's your full backup. Copy those volumes, along with the incremental volumes and signatures from the days that follow.
cp duplicity-full.20181201T011505Z.vol1.difftar.gpg /home/restore_this
...
cp duplicity-new-signatures.20181201T011505Z.to.20181202T011507Z.sigtar.gpg ...
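Copying hundreds of volumes by hand gets old quickly, so here is a rough sketch of how the copy step could be scripted. It assumes Duplicity's default file names as in the cp examples above, where the second dot-separated part of the name is the (start) timestamp, and it stops at the next full backup if there is one. copy_chain is a made-up helper name, and /backups/mydatabase in the usage line is just a placeholder for wherever your backups live.

```shell
# copy_chain SRC_DIR FULL_TIMESTAMP DEST_DIR
# Copies the full backup with the given timestamp, plus every incremental that
# builds on it, into DEST_DIR. Files from older and newer chains are skipped.
# The body runs in a subshell ( ... ) so its variables do not leak out.
copy_chain() (
    src=$1 full_ts=$2 dest=$3

    # Timestamp of the next full backup, if any. The chain ends before it.
    stop=$(ls "$src" | awk -F. -v full="$full_ts" \
        '$1 == "duplicity-full" && $2 > full { print $2 }' | sort -u | head -n 1)

    mkdir -p "$dest"

    # Timestamps like 20181201T011505Z sort correctly as plain strings,
    # so string comparison is enough to select the files of one chain.
    ls "$src" | awk -F. -v full="$full_ts" -v stop="$stop" \
        '/^duplicity-/ && $2 >= full && (stop == "" || $2 < stop)' |
    while IFS= read -r name; do
        cp "$src/$name" "$dest/"
    done
)
```

For the example above that would be something like: copy_chain /backups/mydatabase 20181201T011505Z /home/restore_this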
Point your restore command to that new folder et voilà!
duplicity --file-to-restore mydatabase.sql file:///home/restore_this /path/to/mydatabase.sql
This way Duplicity only caches those backups and avoids caching almost a year's worth of data. The restore should be done - a lot - faster.