Using rsync for Backup

Last Updated on 03/17/2018 by dboth

Backup options

There are many options for performing backups. Most Linux distributions are provided with one or more open source programs especially designed to perform backups. There are many commercial options available as well. But none of those directly met my needs so I decided to use basic Linux tools to do the job.

In a previous article, Using tar and ssh for backups, I showed that fancy and expensive backup programs are not really necessary to design and implement a viable backup program.

Since last year, I have been experimenting with another backup option, the rsync command which has some very interesting features that I have been able to use to good advantage. My primary objectives were to create backups from which users could locate and restore files without having to untar a backup tarball, and to reduce the amount of time taken to create the backups.

This article is intended only to describe my own use of rsync in a backup scenario. It is not a look at all of the capabilities of rsync or the many other interesting ways in which it can be used.

rsync

The rsync command was written by Andrew Tridgell and Paul Mackerras and first released in 1996. The primary intention for rsync is to remotely synchronize the files on one computer with those on another. Did you notice what they did to create the name there? rsync is open source software and is provided with all of the distros with which I am familiar.

The rsync command can be used to synchronize two directories or directory trees whether they are on the same computer or on different computers but it can do so much more than that. rsync creates or updates the target directory to be identical to the source directory. The target directory is freely accessible by all the usual Linux tools because it is not stored in a tarball or zip file or any other archival file type; it is just a regular directory with regular files that can be navigated by regular users using basic Linux tools. This meets one of my primary objectives.

One of the most important features of rsync is the method it uses to synchronize preexisting files that have changed in the source directory. Rather than copying the entire file from the source, it uses checksums to compare blocks of the source and target files. If all of the blocks in the two files are the same, no data is transferred. If the data differs, only the block that has changed on the source is transferred to the target. This saves an immense amount of time and network bandwidth for remote sync. For example, when I first used my rsync BASH script to back up all of my hosts to a large external USB hard drive, it took about 3 hours. That is because all of the data had to be transferred because none of it had been previously backed up. Subsequent syncs took between 3 and 8 minutes of real time, depending upon how many files had been changed or created since the previous sync. I used the time command to determine this so it is empirical data. Last night, for example, it took 3 minutes and 12 seconds to complete a sync of approximately 750GB of data from 6 remote systems and the local workstation. Of course, only a few hundred megabytes of data were actually altered during the day and needed to be synchronized.

The following simple rsync command can be used to synchronize the contents of two directories and any of their subdirectories. That is, the contents of the target directory are synchronized with the contents of the source directory so that at the end of the sync, the target directory is identical to the source directory.

rsync -aH sourcedir targetdir

The -a option is for archive mode which preserves permissions, ownerships and symbolic (soft) links. The -H is used to preserve hard links. Note that either the source or target directories can be on a remote host.

Now let’s assume that yesterday we used rsync to synchronized two directories. Today we want to resync them, but we have deleted some files from the source directory. The normal way in which rsync would do this is to simply copy all the new or changed files to the target location and leave the deleted files in place on the target. This may be the behavior you want, but if you would prefer that files deleted from the source also be deleted from the target, you can add the --delete option to make that happen.

Another interesting option, and my personal favorite because it increases the power and flexibility of rsync immensely, is the --link-dest option. The –-link-dest option allows a series of daily backups that take up very little additional space for each day and also take very little time to create.

Specify the previous day’s target directory with this option and a new directory for today. rsync then creates today’s new directory and a hard link for each file in yesterday’s directory is created in today’s directory. So we now have a bunch of hard links to yesterday’s files in today’s directory. No new files have been created or duplicated. Just a bunch of hard links have been created. Wikipedia has a very good description of hard links. After creating the target directory for today with this set of hard links to yesterday’s target directory, rsync performs its sync as usual, but when a change is detected in a file, the target hard link is replaced by a copy of the file from yesterday and the changes to the file are then copied from the source to the target.

So now our command looks like the following.

 rsync -aH --delete --link-dest=yesterdaystargetdir sourcedir todaystargetdir

There are also times when it is desirable to exclude certain directories or files from being synchronized. For this there is the --exclude option. Use this option and the pattern for the files or directories you want to exclude. You might want to exclude browser cache files so your new command will look like this.

rsync -aH --delete --exclude Cache --link-dest=yesterdaystargetdir sourcedir todaystargetdir

Note that each file pattern you want to exclude must have a separate exclude option.

rsync can sync files with remote hosts as either the source or the target. For the next example, let’s assume that the source directory is on a remote computer with the hostname remote1 and the target directory is on the local host. Even though SSH is the default communications protocol used when transferring data to or from a remote host, I always add the ssh option. The command now looks like this.

rsync -aH -e ssh --delete --exclude Cache --link-dest=yesterdaystargetdir remote1:sourcedir todaystargetdir

This is the final form of my rsync backup command.

rsync has a very large number of options that you can use to customize the synchronization process. For the most part, the relatively simple commands that I have described here are perfect for making backups for my personal needs. Be sure to read the extensive man page for rsync to learn about more of its capabilities as well as the options discussed here.

Performing backups

I automated my backups because – “automate everything.” I wrote a BASH script that handles the details of creating a series of daily backups using rsync. This includes ensuring that the backup medium is mounted, generating the names for yesterday and today’s backup directories, creating appropriate directory structures on the backup medium if they are not already there, performing the actual backups and unmounting the medium.

I run the script daily, early every morning, as a cron job to ensure that I never forget to perform my backups.

Recovery Testing

No backup regimen would be complete without testing. You should regularly test recovery of random files or entire directory structures to ensure not only that the backups are working, but that the data in the backups can be recovered for use after a disaster. I have seen too many instances where a backup could not be restored for one reason or another and valuable data was lost because the lack of testing prevented discovery of the problem.

Just select a file or directory to test and restore it to a test location such as /tmp so that you won’t overwrite a file that may have been updated since the backup was performed. Verify that the files’ contents are as you expect them to be. Restoring files from a backup made using the rsync commands above simply a matter of finding the file you want to restore from the backup and then copying it to the location you want to restore it to.

I have had a few circumstances where I have had to restore individual files and, occasionally, a complete directory structure. Most of the time this has been self-inflicted when I accidentally deleted a file or directory. At least a few times it has been due to a crashed hard drive. So those backups do come in handy.

Resources

Wikipedia: rsync

Wikipedia: Hard Links

Up one level: Backups and Disaster Recovery

Previous: Using tar and ssh for backups

Next: Scripting Languages