Showing posts with label storage. Show all posts
Showing posts with label storage. Show all posts

[ next | prev | up ]

Backups, heal thyself

MIT venti server

Around 2000, Sean Quinlan and Sean Dorward built a content-addressed storage system called Venti. Their paper, “Venti: a new approach to archival storage” (PDF; also HTML), won best paper at the storage conference where it was presented. Today's post is about how Venti can heal itself.

Venti names stored objects by their SHA1 hashes: you write a block, you get back the SHA1 hash. You show up with a SHA1 hash, you can get the block. Objects larger than a block are typically broken up into blocks and then stored in a tree. Each pointer in the tree is the SHA-1 hash of the block it points to:

Hash trees

Thus, if you've stored a big file (say, a disk image) and you change one block, only that one block and the (probably five or so) blocks that point at it actually change: the rest of the storage can be shared with the previous copy of the file.

The primary benefit of using content-addressing is that duplicates get removed for free. If two different people write the same data, Venti only stores the data once. Or if one person archives the same data twice, Venti only stores the data once. My research group at MIT runs a Venti server to store FFS disk images of its main file server. Writing the same images over and over again only stores changed blocks, and if multiple people have the same giant files, they take up extra space on the file server itself, but not in Venti. Our live file system usage is about 600GB, but the incremental backup cost in Venti is about 2 GB per night. All our nightly backups going back to 2005 take up about 3 TB, or 1.7 TB compressed.

Another benefit of using content-addressing is that if the data goes bad, you notice. You can check the SHA-1 hash of the data returned by a read to make sure it is the block you expected. Clients can verify the data returned by the server, and the server can verify the data returned by its disks. In fact the latter has been the bigger problem in Venti installations: disks go bad, and if you tend to notice more if you have checks like these.

A few years ago, I wanted to download all the 6 GB of file system traces used in the Venti paper. Not all of the files downloaded correctly: a few were truncated too early. It turned out that the Venti server now storing the traces at Bell Labs had some disk corruption, and the Venti server noticed because the SHA-1 hashes didn't match expectations. No problem, just use the earlier Venti server that the data was copied from. That server has had disk corruption too, some of it overlapping the bad blocks. So there are still traces that can't be read. Okay, well the traces were saved on backup tapes too. Dig those up, and the backup tapes aren't complete either: still two trace files with missing blocks. There are also backup tapes of the original file system blocks that were used to create the traces. Those are actually readable, so we can regenerate the bad trace files and have a complete set again.

Here's the fun part. The original trace files were stored as a single Venti archive, a giant hash tree like the one pictured above that contains pointers to all the trace files. Not all the files can be read out of the archive, since we have these pointers to missing blocks. Creating a new archive holding just the two missing trace files makes the old archive start working. Now that Venti has blocks with the desired SHA1 hashes, the pointers to missing blocks become pointers to existing blocks. Given the right raw materials, the broken archive healed automatically.

With a little engineering, you could create a typical incremental backup system that didn't use hashes to identify duplicate blocks, and you'd probably end up being within a factor of two as space-efficient as Venti. For a long time I had this lingering fear that perhaps indexing by SHA-1 hash, which is pretty expensive to implement efficiently, was overkill. Watching that previously broken archive start working again was the moment when I fully believed that Venti's was the right approach.


Further reading. The use of SHA1 specifically isn't really important: any strong collision-resistant hash function will do, though people disagree on the wisdom of this approach. Remember that Venti is being used in situations where there are no adversaries: the main concern is an accidental collision. The second paper just linked to summarizes:

We conclude that it is certainly fine to use a 160-bit hash function like SHA1 or RIPEMD-160 with compare-by-hash. The chances of an accidental collision is about 2^-160. This is unimaginably small. You are roughly 2^90 times more likely to win a U.S. state lottery and be struck by lightning simultaneously than you are to encounter this type of error in your file system.

Absolute File System Design

Butler Lampson and Robert Sproull's 1979 SOSP paper “An Open Operating System for a Single-User Machine” (PDF; also HTML) describes the operating system on the Alto, a groundbreaking personal networked workstation built at Xerox PARC in the late 1970s and early 1980s. Many Alto features became everyday aspects of modern computing environments, but one that didn't catch on was the Alto's robust file system.

The Alto file system was designed to withstand disk or programming errors. If some set of disk blocks were lost, the Alto file system would only lose the data contained in those sectors. If you lose a data block from a file, you lose just that data block. If you lose a block in a directory, you lose the name for that file but not the file itself.

Section 3 of the paper gives the details. The basic technique is that each disk block contains a small header giving the id number and version of the associated file along with the block's offset in the file and the amount of data in the block (the last block in a file might contain less than a block worth of data). Lampson and Sproull refer to this information as absolute: it is the single point of truth for the block in question. A disk scavenger (like Unix's fsck) can scan the entire file system reading block labels to reconstruct the contents of any file.

Scanning the entire file system to find block contents is of course very slow. The block label also contains pointers to the previous and next block in a file, so that reading a file can just follow the block pointers (the file system layout algorithm arranged that much of the time files would be laid out sequentially so that following these pointers would not be too expensive in terms of disk seeks). These pointers are called hints, because they exist only for performance, not for correctness. If the hints are incorrect, that fact will become clear when they are used (the label of the pointed at block will not agree with expectations), and the scavenger can be invoked to fix the file system.

Contrast this scheme to inodes in a modern Unix file system (say, BSD FFS or Linux Ext2/3). If you lose the 512-byte disk block containing a 1GB file's inode, the entire file is inaccessible. No amount of scavenging can help, because the list of blocks making up the file was only contained in the inode, and the blocks themselves are not tagged with which file they belong to. This is much less robust than the Alto file system. Of course, the situation is even worse if you lose the 512-byte disk block containing the file system superblock: then you've lost the entire file system. Because the superblock is so crucial, file systems usually maintain backup copies of the superblock scattered across the disk. Inodes are almost as important but not so carefully guarded.

Why don't modern file systems don't use labels like the Alto did?

Disk geometry might play a role: it is hard to reconcile a (say) 32 byte label with (say) a 4096 byte data block. You put the label in the data block, cutting the actual data size to 4064 bytes. You could put the label in the sector before the block, taking up 512 bytes for a 32 byte label. Or you could write a label sector containing 16 labels every 16 data blocks. The third is probably the best option, but all of them are somewhat cumbersome. (But you'd lose less data!)

Another reason is historical: the original Unix file system didn't have any such mechanism, and FFS and Ext2/3 differ little from the original Unix file system.

Another reason might be philosophical: you're supposed to handle disk failures with backups, not clever file system tricks. This is the least useful argument: who wouldn't want a file system that needed to resort to backups less often?

Another reason might be pragmatic: maybe disks fail in different ways now than they did in 1979. (Of course, that doesn't explain the original Unix file system or FFS.) Since there's so little data on disk failures, it's hard to say anything for certain.

Whatever the arguments against labels, it is undeniable that they produced a very robust system. In fact, the operating system on the Alto could be almost completely swapped out an replaced with user programs, each of which could supply its own file system implementation. Over time, there came to be three file system implementations sharing the disk, one in BCPL, one in Mesa, and one in Smalltalk. In a retrospective paper, Lampson wrote:

If a disk address is wrong, the label check will detect the error, and various recovery mechanisms can be invoked. A Scavenger program, written by Jim Morris, takes about a minute to check or restore the consistency of an Alto file system; it can be routinely run by nonprofessional users. As a result of this conservative design, the file system has proved to be very reliably; loss of any information other than bits physically damaged on the disk is essentially unheard of. This reliability is achieved in spite of the fact that many programs besides the standard operating system have access to the disk.

Who wouldn't want a robust system like that? And why aren't more of today's file systems that robust?