Adam M. Costello, Chris Umans, Felix Wu
This was a class project for CS 252, spring 1998.
Traditional backup systems perform periodic full dumps plus incremental dumps. This strategy requires high peak bandwidth and is difficult to adapt to an online setting. Systems that require 24-hour availability today, and the large file systems of the future, call for a different approach. We present an online backup system intended to achieve low peak bandwidth and small waiting times during an online restore, while maintaining a backup size comparable to that of existing systems. A simulation of our system on disk traces covering a two-month period indicates that these goals are substantially achieved. We believe that our system, or a similar scheme, would therefore be well suited to certain existing file systems, and will become increasingly attractive for large disks in general as disk capacity grows.
Traditional backup systems perform periodic full dumps every month or every few months, weekly incremental dumps, and daily incremental dumps. In the event of catastrophic failure, the entire file system can be restored from at most three dumps: one monthly dump, one weekly dump, and one daily dump. Individual files can be restored similarly. In order to ensure consistency of the backup, the file system must remain quiescent while dumps are being performed. To minimize impact on the user, the dumps are typically performed during periods of low use (overnight), using a file locking mechanism or multiple scans to achieve consistency.
A more attractive approach is possible for file systems that support snapshots. Snapshots are frozen, read-only copies of the disk. They are typically implemented using a copy-on-write scheme, and can thus be taken by pausing running applications very briefly. Consistent backup can be performed by dumping the contents of the snapshot to tape. Snapshots would seem to provide an elegant solution to the problem of providing online backup while maintaining 24-hour availability. However, performing the dump stresses many parts of the I/O system: disk seeks increase dramatically, utilization of I/O busses and the interconnect between the disk and tapes increases, and CPU time is consumed. The stress is most severe during full dumps, which have much more data to copy than the incrementals, but the same overnight window in which to do it.
Systems that are continuously in use would benefit if dumps were spread out over a long period of time, smoothing out the demands of the backups and reducing the peak impact.
In the future, there may be no choice but to spread dumps out over a long period of time, because of widening gaps in disk technology: over the last ten years, capacity, bandwidth, and access times have improved by factors of about 50, 20, and 3, respectively. For large disk systems, it may become infeasible to dump the entire contents (many terabytes) in a few hours, or even a few days.
A similar problem arises during a full restore. For the large disks of the future, a full restore will take longer than is typical for today's backup systems, so users must be able to access restored data even as the restore progresses. This immediately suggests that the restore should be “intelligent”, restoring frequently used blocks first. This improvement would be useful even for today's relatively quick restores.
There are other advantages to a backup system that permits dumps to occur over a longer time period. Lower peak bandwidth means that dumps over wide-area networks become feasible, enabling automatic off-site backup, and that on-site backups can use fewer tape drives and less expensive interconnects between the tapes and the disks. Most importantly, a backup system that allows dumps to proceed gradually has time to write a dump in many smaller units, where the data in successive units are of decreasing importance. This allows intelligent ordering of blocks within the dump (to facilitate online restores), a task that is nearly impossible if the overriding concern is to dump data to tape as fast as possible.
Our backup system takes the concept of gradual dumps to its logical conclusion: We permit a dump to run as long as the time originally required to write the data to disk. As a result we are able to achieve low peak bandwidth at all times to support online backup while maintaining 24-hour availability, and we are able to order blocks on tape to support intelligent online restores.
The goals of our backup system are as follows:
The backup and reliability component of the Personal Terabyte project is examining the question of how to back up terabyte disks when they become popular in home computers. The project has produced a good survey [CVK98] of existing backup techniques.
The backup system for the Spiralog log-structured file system [GBD96] leverages the log structure to perform efficient online backup. The restore process relies on the improved seek times of modern tape drive hardware.
The ADSTAR Distributed Storage Manager (ADSM) [CRS+95] is a hierarchical storage management system that also handles backup and restore. The challenge of doing incremental-only backups is how to do efficient restores when the data is scattered haphazardly across a long history of tape. ADSM's answer is to replace a passive backup medium (tape) with an active device (a hierarchical storage server using magnetic disks, optical jukeboxes, and tape), which can continually reorganize the data. ADSM has been used for online database backup [IBM98].
It is important to remember that brute force is always an option. A system based on the SGI Origin2000 achieved an online backup rate of one terabyte per hour [CCC], backing up a one-terabyte, 138-disk TPC-C database using 66 UltraSCSI busses and 38 tape drives.
The traditional backup scheme takes incremental dumps at multiple time scales (daily, weekly), but bottoms out at periodic full dumps. Instead, we continue the sequence of ever-coarser incrementals.
Rather than require a large dump to be written all at once, we write it gradually.
Because this causes a coarse dump to take a long time, we must not require the finer dumps to wait for it to finish; instead, we allow dumps at multiple time scales to proceed concurrently.
The most urgently needed data was probably dumped most recently, so dumps should be restored in reverse chronological order.
We target the Petal virtual disk [LT96]. Petal presents a virtual block device constructed from multiple disks on multiple servers. Therefore, we are backing up blocks rather than files. No matter what the underlying physical configuration is, we can design our backup system to operate on a single virtual disk. Also, Petal provides snapshots as described in the introduction.
We take periodic snapshots of the disk (probably once or a few times per day) numbered sequentially. Snapshot 0 is taken at the birth of the disk, so it is empty. We use t to denote the sequence number of the most recent snapshot.
At any given time, there are l levels of backup tapes, where l = floor(log_r t) + 1 (we number the levels from 0 to l - 1). For each level there is a sequence of records on tape, where a record between two snapshots is a set of blocks including all that changed between the two snapshots, plus some additional blocks to be discussed later. Records at level i store changes made in successive time windows of length r^i, starting at time 0. At the coarser levels, one record may span multiple tapes, while at the finer levels, multiple records may be written to a single tape. At any moment there are l tapes being written, one for each level; they may be in l separate tape drives, or they may time-share fewer tape drives.
We write the records as follows. During the time between two successive snapshots, we write one part of a record onto the currently active tape at each of the l levels. A part at level i is a 1/r^i fraction of the total data to be stored on the currently active record. As new levels are introduced (as t grows), we begin writing to them only when the first r records in the previous level are complete. This leads to the following picture (for r = 2):
In the time between taking snapshot t and t+1, we are writing the dark parts to the tapes in levels 0 through 3. We do not begin to write at level 4 until after snapshot t+15.
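The level and record bookkeeping above can be sketched in Python. This is a hypothetical illustration: the helper names and the integer-arithmetic logarithm are our own choices, not part of the system.

```python
def num_levels(t, r=2):
    """Number of backup levels after snapshot t: floor(log_r t) + 1.

    Computed with integer arithmetic to avoid floating-point error.
    Level i begins being written only once the first r records at
    level i-1 are complete, i.e. once t reaches r**i.
    """
    l, p = 0, 1
    while p <= t:
        l += 1
        p *= r
    return max(l, 1)

def record_and_part(t, i, r=2):
    """During the interval between snapshots t and t+1, level i writes
    part (t mod r**i) of record number (t div r**i); each part is a
    1/r**i fraction of the record."""
    return t // r ** i, t % r ** i
```

For r = 2 this reproduces the figure: during the interval after snapshot 15 there are four levels (0 through 3), and level 4 only appears at snapshot 16 = 2^4.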
Tape reliability is another issue in backup systems. Older tapes tend to wear out or become corrupted, even as they are in storage. Therefore, most backup systems include some redundancy among the tapes themselves. In our system, sets of records at finer levels serve as a redundant copy of records at coarser levels. One nice feature of our system is that the oldest tapes have the highest level of redundancy, as can be seen from the figure.
To facilitate the selection of blocks written to a record, each snapshot includes the following information for each virtual block:
An epoch number, which is the sequence number (t) from the time at which the block was last modified prior to this snapshot.
Additional information used in the calculation of the temperature of the block, which is a speculative measure of how soon the block is likely to be accessed, probably based on a last access time and perhaps an access count.
The existing Petal implementation already includes the epoch number, and might already include the information needed for the temperature.
We will first deal with the blocks that must be written to a given record. If this record covers the time window from snapshot t0 through snapshot t1, then we need to write the blocks that have changed during this interval. All of these blocks are present in snapshot t1; in fact, that snapshot contains all blocks that have changed since the birth of the disk. To limit ourselves to the ones that have changed during the window, we consider only blocks whose epoch number is >= t0. We call this set of blocks the necessary blocks.
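The selection of necessary blocks reduces to a filter on epoch numbers. As a minimal sketch, assuming the snapshot is represented as a mapping from block id to per-block metadata (our own representation, not Petal's):

```python
def necessary_blocks(snapshot, t0):
    """Blocks that must be written to the record covering snapshots
    t0 through t1, selected from snapshot t1: exactly those whose
    epoch number (time of last modification) is >= t0."""
    return {b for b, meta in snapshot.items() if meta["epoch"] >= t0}
```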
In order to optimize so that a full restore tends to restore earlier the blocks likely to be accessed sooner, we expand the set of blocks to be written, and write them in a sensible order.
To expand the set, we choose some constant u times the number of necessary blocks, and include that many of the hottest unnecessary blocks. These are blocks that would have been restored later at a coarser level of the restore, but we are willing to store them redundantly to make them available earlier.
To order the set, we rank the blocks by temperature, and write the hotter blocks earlier. An exact sorted sequence is not necessary; it is sufficient to sort the blocks into buckets covering disjoint temperature ranges. To do this, we scan all of the blocks to gather the temperature distributions of the necessary and unnecessary blocks, and to determine how many unnecessary blocks should be written. We then divide up the temperature range into buckets, and repeat the scan once for each bucket, writing only the blocks covered by that bucket.
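The expansion and bucketing steps can be sketched together. This is an in-memory simplification under assumed names (`plan_record`, a snapshot mapping block ids to epoch and temperature); the real system would make one scan per bucket rather than sort everything in memory.

```python
def plan_record(snapshot, t0, u=0.05, n_buckets=8):
    """Choose and order the blocks for one record: all necessary
    blocks, plus up to u times as many of the hottest unnecessary
    blocks, bucketed by temperature with hotter buckets first."""
    necessary = [(m["temp"], b) for b, m in snapshot.items()
                 if m["epoch"] >= t0]
    unnecessary = sorted(((m["temp"], b) for b, m in snapshot.items()
                          if m["epoch"] < t0), reverse=True)
    extra = unnecessary[:int(u * len(necessary))]  # hottest unnecessary
    chosen = necessary + extra
    if not chosen:
        return []
    temps = [temp for temp, _ in chosen]
    lo, hi = min(temps), max(temps)
    width = (hi - lo) / n_buckets or 1
    # Bucket by disjoint temperature range; an exact sort is not needed.
    buckets = [[] for _ in range(n_buckets)]
    for temp, b in chosen:
        idx = min(int((hi - temp) / width), n_buckets - 1)
        buckets[idx].append(b)
    return [b for bucket in buckets for b in bucket]
```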
All of the data required for writing a backup record is contained in the snapshot at the right boundary of the record. As the backup proceeds, snapshots become obsolete and should be deleted to conserve disk space. Just after completing the record for the time window from snapshot t0 to snapshot t1, we delete snapshot t1, unless this snapshot falls at a record boundary at the next coarser level.
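The deletion rule is a simple divisibility test, since record boundaries at level i+1 fall at multiples of r^(i+1). A hypothetical helper:

```python
def can_delete(t1, level, r=2):
    """After completing the level-`level` record ending at snapshot t1,
    snapshot t1 may be deleted unless it also falls on a record
    boundary at the next coarser level, i.e. a multiple of r**(level+1)."""
    return t1 % r ** (level + 1) != 0
```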
Given this arrangement, we can always restore the disk to its state at snapshot t-1, from tape, using the following procedure. Start at the coarsest level and write all complete records to the disk; shift to the next finer level and write all the complete records starting from the snapshot at which we left the previous level; shift to the next finer level and do the same thing; stop at the finest level with the last record, ending at snapshot t-1 (or t, if the record for that last time window was completely written before the crash).
In actuality, we restore these records in reverse order, so that more recently changed blocks are restored first. This ensures that the first version of a virtual block to be restored is the current version; other versions (with earlier epoch numbers) will be copied to disk as the restore proceeds, and are used to reconstruct the older snapshots that existed on disk at the time of the crash.
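The record-selection procedure can be sketched as follows (a hypothetical helper under our own names; `target` is the snapshot being restored to, e.g. t-1):

```python
def restore_plan(target, r=2):
    """Complete records to read for a full restore to snapshot `target`,
    listed in the order they are restored: reverse chronological, so
    the most recently written (finest, newest) records come first."""
    # Number of levels in existence.
    l, p = 0, 1
    while p <= target:
        l += 1
        p *= r
    # Forward pass, coarsest level to finest, as described in the text.
    plan, start = [], 0
    for i in reversed(range(l)):
        span = r ** i
        while start + span <= target:
            plan.append((i, start, start + span))  # (level, t0, t1)
            start += span
    plan.reverse()  # actually restore newest records first
    return plan
```

For example, restoring to snapshot 13 with r = 2 reads the level-0 record for window 12-13, then the level-2 record for 8-12, then the level-3 record for 0-8.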
If we want to restore a set of blocks much smaller than the whole disk (for example, a file) to its state at any time t0, we can copy from the snapshot whose time step is closest to t0, which may be close enough. The more recent t0 is, the more precisely we can aim at t0. If this is sufficient, then we can discard tapes containing obsolete records, as the next coarser level catches up.
Alternatively, if more precise file restores are required, we can keep all tapes archived, allowing old versions of the disk to be restored using a pattern similar to the full restore.
In a block-level backup system, restoring a snapshot containing the relevant file is the only way to restore an old version of a file from a particular date in the past.
Our simulation treats disk accesses during a restore as independent events. In reality, there are dependencies between accesses. To the extent that these dependencies are reflected in similar usage statistics for sets of dependent blocks, our system should be robust to dependencies. However, depending on the details of the file system, there may be a small set of blocks (probably a subset of the metadata) that are infrequently accessed, but potentially critical to the functioning of the file system. Without modification, our system would restore these blocks near the end of the restore. However, with a simple modification, we can solve this problem: we can allow the file system to tag a small number of blocks as essential, and our backup system can then include all essential blocks among the unnecessary blocks that it already writes to each record.
The goal of our simulation study was to answer the following questions:
The following sections describe our input data and simulator, then present the results.
We are using Drew Roselli's long-term file system traces from 1 October 1996 through 31 March 1997. The traces consist of logged system calls on about twenty instructional workstations, which were then filtered to remove file operations that would have been absorbed by a 32 MB file cache on each machine.
These traces are at a higher level (system calls, like read(), write(), and trunc()) than we need. Therefore, we convert them to block-level traces, assuming a block size of 4 kB.
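The byte-range-to-block conversion can be sketched as follows (a hypothetical helper; we assume a nonzero transfer length):

```python
BLOCK = 4096  # 4 kB block size assumed for the converted traces

def to_block_ops(op, offset, length):
    """Map a byte-range file operation (e.g. a read() or write() of
    `length` bytes at byte `offset`) onto the 4 kB blocks it touches."""
    first = offset // BLOCK
    last = (offset + length - 1) // BLOCK
    return [(op, b) for b in range(first, last + 1)]
```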
Another concern is that the traces are a record of all activity on a set of clients, not all activity on a file system. The file systems shared by these clients also had other clients that were not traced. Therefore, the traces contain accesses to data whose creation does not appear in the trace. There are two ways to deal with this problem:
Fabricate write operations to create the data. The tricky part is choosing the times of the write operations.
Consider only the subset of data that was created by the traced machines since the trace began, and ignore references to data outside this subset.
We originally started pursuing the former option, but because our system relies on knowing when data is written, and wants to operate from the birth of the disk, we later decided that the latter option made more sense.
A limitation of our conversion from system call traces to block-level traces is that operations on metadata (directories and inodes) are ignored. Refining the conversion process to handle metadata is future work.
A limitation of the system call traces themselves is that they have no way to track I/O performed by the virtual memory system. In particular, although the traces show calls to mmap(), they cannot show the resulting data transfers. As a first-order approximation, we assume that every call to mmap() on a file known to be non-empty results in an immediate read of the first block.
In order to get some sense of the nature of the trace files we used, we plotted the total size of the disk system over time, as shown below.
Our simulator tracks the current state of the disk by reading events from a block trace file and recording information about which blocks currently exist, which blocks have been deleted but not overwritten, and when a given block was last accessed and last modified. Conceptually, periodic snapshots of this disk would be taken and the backup records would be written gradually using the information contained in these snapshots. However, for our purposes, we need not actually simulate this process; instead, we simply track the sizes of the parts of each record, and write out the entire record as soon as it is available. Since we can determine which parts of which records are written out at each time interval, this allows us to determine the total amount of information that needs to go to tape at each interval.
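The per-block state the simulator tracks can be sketched as a small class (our own illustrative structure, not the actual simulator code):

```python
class DiskState:
    """Tracks, per virtual block: the epoch (snapshot interval of last
    modification), the last access time, and whether the block has been
    deleted but not yet overwritten."""

    def __init__(self):
        self.blocks = {}  # block id -> {"epoch", "last_access", "deleted"}

    def apply(self, event, block, snap):
        info = self.blocks.setdefault(
            block, {"epoch": snap, "last_access": snap, "deleted": False})
        info["last_access"] = snap
        if event == "write":
            info["epoch"] = snap      # modification bumps the epoch
            info["deleted"] = False   # an overwrite revives the block
        elif event == "delete":
            info["deleted"] = True    # tracked until overwritten
```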
The record output by the simulator is a list of block identifiers, each with an associated flag indicating whether that block had been deleted and not overwritten at the end of the time window. The flag is necessary to prevent us from restoring old versions of deleted blocks.
To simulate a crash and restore, we pick some point at which to divide the trace file. We assume the system crashes at that point, and then simulate a restore by reading the backup data that had been written up to that point. We assume that reading a block from the tape takes some fixed amount of time and stamp each block with the time of its restore. (Note that blocks on tape flagged as deleted would contain no data in the actual system; hence, we assume that the time for such reads is negligible.)
We then continue reading events from the remainder of the trace file, recording the difference between the time a block is requested and time of its restore. We assume the system would need to wait for a block to be restored before either reading or writing it. Presumably most writes are to portions of a block, and hence the entire block would need to be loaded before the write could complete.
When a block is written for the first time during the simulated restore, the simulator detects that a write is attempting to access a block that does not exist on the restored disk. We assume that in an actual system there is some mechanism for requesting a new block, distinct from writing an existing block, so that attempts to create files during a restore need not wait for the restore to complete. Therefore, we ignore newly created blocks during a restore.
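Putting the last three paragraphs together, the wait-time measurement amounts to the following sketch (hypothetical helper names; times are in arbitrary units after the crash):

```python
def wait_times(restore_times, accesses):
    """For each (time, block) access after the crash, how long the
    request must wait for that block's restore; zero if the block was
    already restored. Blocks absent from restore_times were created
    after the crash and are skipped, since new-block requests need
    not wait for the restore."""
    waits = []
    for when, block in accesses:
        if block not in restore_times:
            continue  # newly created block; ignored during a restore
        waits.append(max(0.0, restore_times[block] - when))
    return waits
```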
The following results are based on the two-month portion of the trace from October and November of 1996. We assume a disk block size of 4096 bytes and a file block size of 512 bytes. The ratio r of the lengths of time windows at neighboring levels is 2. The maximum ratio u of unnecessary to necessary blocks in each record is 5% except where otherwise specified.
We compare our system to a traditional backup system that uses periodic full dumps and intervening incrementals. We assume the traditional system is file-based, so that it can write out data in file blocks rather than disk blocks.
Bandwidth for our system is simply the amount of backup data that needs to be written to tape during the time between snapshots, divided by the amount of time between snapshots. For the traditional system, we measure only the bandwidth needed to do a full dump of the disk in six hours (overnight). Because our trace is relatively short, we show the traditional system doing full dumps once every two weeks; a real system would perform such dumps less frequently.
As the graph below indicates, the amount of data written by our system does fluctuate somewhat, reflecting the daily and weekly usage patterns of the disk system, but the peak bandwidth of our system is substantially lower than the bandwidth required for full dumps.
The following scatter plots show the wait time for every disk access, for the traditional backup and for our scheme with u values of 0 and 5%. We plot these points as a function of the time from the beginning of the restore, because later accesses will naturally wait less under any scheme. The first set of plots shows data for a crash at 9:00am Tue 26 Nov 1996. The second set shows data for a crash at 3:00pm on the same day. Other crash points tend to produce similar results. We assume that the time to restore a single block is 100 ms.
As can be seen in these plots, in a traditional system, no locality information is exploited, so wait times are uniformly distributed between zero and the time until the restore completes. In our system, even though the total restore time is longer, our use of locality information allows us to push down a large portion of the wait times.
Another way to view the same data is to plot the fraction of accesses that need to wait at all. The following graphs give the fraction of operations that block, for the same times and u values as above.
It is evident from the results that including a small number of unnecessary blocks significantly improves our system's performance. The tradeoff is that we extend the length of the restore, since each unnecessary block is a redundant copy of a block that will eventually get written anyway. We examined a full range of values for u, specifically 0, 0.05, 0.1, 0.2, 0.5, and 1.0, and reached two conclusions. First, the larger values of u do not significantly extend the length of the restore. Thus during most time intervals, there tend not to be many blocks that are read but not written. Second, values of u greater than 0.05 have only a marginal (and sometimes unnoticeable) impact on the performance of the restore. This is because 0.05 captures most unnecessary blocks during most intervals. In a sense, this is the best possible situation: a small sacrifice in the length of the restore achieves almost all of the possible benefit from the strategy of writing unnecessary blocks.
Even with u set to 0, our system's restore takes longer than the time to restore from a full dump, because a full dump contains only one copy of every block, whereas our tapes contain multiple copies, at most one per record. (Only one copy actually gets written to the disk, but the tape drive must scan over multiple copies, delaying other blocks.) However, a traditional scheme will also scan multiple copies of blocks: one from the full dump, and possibly one for each incremental dump. In our measurements of a traditional scheme, we only include the time to scan a single copy of each block, so we are underestimating the length of a traditional restore, as well as the wait times. Therefore our system should compare even more favorably in a real implementation.
The amount of backup data stored in any system will depend on whether or not the system intends to support restores of old copies of a file. In a traditional system, if we want such support, we need to keep all full dumps and all incrementals; if we just want to be able to restore the current disk, we need only keep the most recent full dump and the subsequent incrementals. In our system, we can support precise file restore by keeping all records; if we do not need such support, we can delete records as soon as they are made redundant by the corresponding record at the next level. In either system, one might also consider a hybrid approach: we could delete only some of the older dumps/records, allowing file restores to a given precision. In our system, we can also trade off restore time and restore precision by deleting certain levels of old records.
In either scenario, the amounts of data stored by our system and a traditional system are comparable. If we just want to restore the current disk, our system saves exactly the data needed for a full restore as described above (plus some partially completed records). The time to finish a restore in our simulation is proportional to the size of the backup data used. By looking at the restore performance graphs above, we can see that the data required by our system for a full restore is less than twice the amount of data in a full dump. Since a traditional system will also keep some incrementals, the difference in the amount of data stored is even less.
If we want to support precise file restores, our system potentially keeps all records. For the trace above, this totals approximately 34 GB. The total size of the full dumps during the same period was 7.7 GB, meaning our system outputs a little over 4 times the data contained in the traditional dumps. Accounting for traditional incrementals would decrease this gap: a rough calculation indicates that the incrementals would account for another 7-10 GB, meaning that our system takes about twice as much space as a traditional system on the trace data above. Notice, however, our trace was only two months long, and the full dumps were increasing in size quickly. Therefore, one might expect that over a longer period the comparison would be more favorable.
In preparing to design a backup system for large disks of the future, the Personal Terabyte project conducted a survey [CVK98] of backup systems in which they concluded:
As file systems grow to multiple terabytes, it is likely that new backup strategies will be required to protect them. The most promising techniques for handling very large file systems appear to be incremental-only backup schemes, device-based backup to use disk bandwidth efficiently and to avoid writing entire files based on small file changes, snapshots and copy-on-write for on-line backup, compression of data before backup, and automated creation of off-site backup files.
We have designed a system possessing many of the qualities envisioned in the quote above: it is an incremental-only device-based scheme built on a disk system that supports snapshots. Our simulation study indicates that it achieves low peak bandwidth and small waiting times during restores, without greatly increasing the size of the backup data or the restore time. These characteristics enable automatic off-site backup over a wide-area network, as well as the use of less expensive backup hardware. In sum, our backup scheme is suitable for current systems requiring 24-hour availability and will become increasingly attractive for large disks in general as disk capacity grows.