Nothing is more important than backups

There is one single most important thing for running a research group as a young scientist. It is not finding good students or writing high-impact papers (of course, you should do that, too), but it is having good backups. Of everything. And to have more than one backup. In different places. In short: be completely paranoid about your backups, and it will someday pay off.

Luckily, I have always been somewhat paranoid about my backups, but still not paranoid enough. Let me tell you two recent stories …

Part 1: The Macbook

A few months ago, I had to give a rather important talk. I spent a whole week preparing my slides. As usual, I worked on my beloved Macbook for that. While I was putting the final touches on my talk and doing a few practice runs on the afternoon before I had to leave, my Macbook started to show only a blue screen. It did not boot anymore, I could not log in, and I could not access any data on it.

Luckily, I have backups, you would think. I use an external hard disk with Time Machine, but this backup only runs when I connect the external disk. Usually I do this once a week, but sometimes I forget, especially when there are many (apparently) more important things to do (like preparing talks). So the Time Machine backup did not know anything about my talk and did not help. My secondary backup is done by Backblaze. This works online and continuously backs up new files in the background [1]. As a last resort, I copy important files (like upcoming talks) to Dropbox. So I was able to recover my talk from Backblaze and Dropbox and could give the talk using my iPad. Thank you, Backup!

Part 2: The Fileserver

All data of my research group is on a file server that is part of our computing cluster. Really, all data: raw data from all quantum-chemical calculations done in the last four years; all manuscripts, notes, etc.; all source code developed in the last four years; and probably much more that I cannot think of right now. It seems that backup is important here …

All this data is stored in a RAID-5 consisting of six disks. This means that one disk can fail without any data loss. This is the first line of defense. A few weeks ago, one disk failed. So far so good, no big problem: we ordered a new disk and swapped it in for the failed one. The RAID then rebuilds itself (which takes a few hours), and after that another disk can fail without data loss. All this can be done without switching off the server. Actually, without doing anything other than replacing the disk.
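To make the "one disk can fail" claim concrete, here is a minimal sketch of the XOR parity idea behind RAID-5. It is only an illustration: the block contents and the helper name xor_blocks are made up for this example, and a real RAID controller works on whole stripes and rotates the parity across the disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a six-disk array: five data blocks plus one parity block.
# (Hypothetical contents, chosen only for illustration.)
data = [b"disk0", b"disk1", b"disk2", b"disk3", b"disk4"]
parity = xor_blocks(data)

# Simulate losing the third disk and rebuild its block
# from the four surviving data blocks plus the parity block.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])

print(rebuilt == data[2])
```

With two blocks missing, the single parity block no longer carries enough information to recover both, which is why a second failure during the rebuild is fatal.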

However, what happened was that while the RAID was rebuilding, another disk failed. So we were completely screwed. All data on our file server was lost. Again, luckily we do have another backup: all data is backed up every night to the disks in our desktop computers using a home-built solution based on duplicity. So, we could restore basically everything from our backup [2].

While this all sounds good, things were actually more difficult than expected. First, it turned out that for one directory, the backup had failed a few weeks earlier because the desktop computer used for this backup had been switched off. So we were missing a few weeks of data there. Second, restoring data with duplicity was more difficult than backing up: for some files it failed with error messages or because the network connection was interrupted, and there is no option to resume restoring. Therefore, a lot of manual work was necessary, but this was successful in the end.
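A silently failing backup like this one can be caught with a small freshness check. The following is a rough sketch, not the script we actually use: it assumes that every backup run writes at least one new file into its target directory, and the function names and the 36-hour threshold are made up for illustration.

```python
import os
import time


def newest_mtime(directory):
    """Return the latest modification time of any file below directory, or None if there are no files."""
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def stale_backups(backup_dirs, max_age_hours=36, now=None):
    """Return the backup directories that are empty or whose newest file is older than max_age_hours."""
    now = time.time() if now is None else now
    stale = []
    for directory in backup_dirs:
        newest = newest_mtime(directory)
        if newest is None or now - newest > max_age_hours * 3600:
            stale.append(directory)
    return stale
```

Run once a day, with the result mailed to somebody who will actually read it, a check of this kind would have flagged the switched-off desktop after one missed night instead of after a few weeks.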

Lessons

  1. Back up your data! Back up everything. Have more than one backup. Have backups in different places. In short: be completely paranoid about your backups, and it will someday pay off!
  2. Backup should always be your number one priority. If something seems not to be right (like a desktop computer used for backups being switched off), fix it. Immediately.
  3. Practice restoring your backups! The moment you really rely on your backup software is not the best time to discover its problems. This is like a fire drill: you should practice it before there is an actual fire.
  4. Disks that you bought at the same time might start to fail at the same time. Do not rely on your RAID as a “backup”.
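For the fire drill in lesson 3, one simple approach is to restore a directory to a scratch location and compare it file by file with the live copy. Here is a minimal sketch under that assumption. The names file_digest and verify_restore are made up for this example, and the restore step itself (with duplicity or any other tool) is left out.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_dir, restored_dir):
    """Return a sorted list of (relative_path, problem) for files that are missing or differ in the restore."""
    problems = []
    for root, _dirs, files in os.walk(original_dir):
        for name in files:
            original = os.path.join(root, name)
            relative = os.path.relpath(original, original_dir)
            restored = os.path.join(restored_dir, relative)
            if not os.path.isfile(restored):
                problems.append((relative, "missing"))
            elif file_digest(original) != file_digest(restored):
                problems.append((relative, "differs"))
    return sorted(problems)
```

An empty result means every file in the original directory was restored with identical content. Files that changed after the last backup run will show up as "differs", so the drill is best done on a directory that is not being edited at the time.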

[1] We could discuss data security now, but for the moment I trust Backblaze. All data is encrypted before it is transmitted and should only be accessible by me with my encryption passphrase (which differs from the password). (Side note: do not store your password and encryption passphrase only on your computer, or you will be screwed.)

[2] In fact, we could save most data on our RAID by forcing one failed disk to go online again. This screwed up the filesystem somehow, but after doing a repair with some apparently pretty dangerous options, most files seemed to be there again so that we could at least get the server up and running again. Afterwards, we still did a full restore from the duplicity backup.
