Showing posts with label file systems. Show all posts
Showing posts with label file systems. Show all posts

[ next | prev | up ]

Backups, heal thyself

MIT venti server

Around 2000, Sean Quinlan and Sean Dorward built a content-addressed storage system called Venti. Their paper, “Venti: a new approach to archival storage” (PDF; also HTML), won best paper at the storage conference where it was presented. Today's post is about how Venti can heal itself.

Venti names stored objects by their SHA1 hashes: you write a block, you get back the SHA1 hash. You show up with a SHA1 hash, you can get the block. Objects larger than a block are typically broken up into blocks and then stored in a tree. Each pointer in the tree is the SHA-1 hash of the block it points to:

Hash trees

Thus, if you've stored a big file (say, a disk image) and you change one block, only that one block and the (probably five or so) blocks that point at it actually change: the rest of the storage can be shared with the previous copy of the file.

The primary benefit of using content-addressing is that duplicates get removed for free. If two different people write the same data, Venti only stores the data once. Or if one person archives the same data twice, Venti only stores the data once. My research group at MIT runs a Venti server to store FFS disk images of its main file server. Writing the same images over and over again only stores changed blocks, and if multiple people have the same giant files, they take up extra space on the file server itself, but not in Venti. Our live file system usage is about 600GB, but the incremental backup cost in Venti is about 2 GB per night. All our nightly backups going back to 2005 take up about 3 TB, or 1.7 TB compressed.

Another benefit of using content-addressing is that if the data goes bad, you notice. You can check the SHA-1 hash of the data returned by a read to make sure it is the block you expected. Clients can verify the data returned by the server, and the server can verify the data returned by its disks. In fact the latter has been the bigger problem in Venti installations: disks go bad, and if you tend to notice more if you have checks like these.

A few years ago, I wanted to download all the 6 GB of file system traces used in the Venti paper. Not all of the files downloaded correctly: a few were truncated too early. It turned out that the Venti server now storing the traces at Bell Labs had some disk corruption, and the Venti server noticed because the SHA-1 hashes didn't match expectations. No problem, just use the earlier Venti server that the data was copied from. That server has had disk corruption too, some of it overlapping the bad blocks. So there are still traces that can't be read. Okay, well the traces were saved on backup tapes too. Dig those up, and the backup tapes aren't complete either: still two trace files with missing blocks. There are also backup tapes of the original file system blocks that were used to create the traces. Those are actually readable, so we can regenerate the bad trace files and have a complete set again.

Here's the fun part. The original trace files were stored as a single Venti archive, a giant hash tree like the one pictured above that contains pointers to all the trace files. Not all the files can be read out of the archive, since we have these pointers to missing blocks. Creating a new archive holding just the two missing trace files makes the old archive start working. Now that Venti has blocks with the desired SHA1 hashes, the pointers to missing blocks become pointers to existing blocks. Given the right raw materials, the broken archive healed automatically.

With a little engineering, you could create a typical incremental backup system that didn't use hashes to identify duplicate blocks, and you'd probably end up being within a factor of two as space-efficient as Venti. For a long time I had this lingering fear that perhaps indexing by SHA-1 hash, which is pretty expensive to implement efficiently, was overkill. Watching that previously broken archive start working again was the moment when I fully believed that Venti's was the right approach.


Further reading. The use of SHA1 specifically isn't really important: any strong collision-resistant hash function will do, though people disagree on the wisdom of this approach. Remember that Venti is being used in situations where there are no adversaries: the main concern is an accidental collision. The second paper just linked to summarizes:

We conclude that it is certainly fine to use a 160-bit hash function like SHA1 or RIPEMD-160 with compare-by-hash. The chances of an accidental collision is about 2^-160. This is unimaginably small. You are roughly 2^90 times more likely to win a U.S. state lottery and be struck by lightning simultaneously than you are to encounter this type of error in your file system.

The Hideous Name

In 1985, email addresses looked a lot different than they do today. They were full of now-foreign symbols like : ! and %, because each mailer had its own syntax and there was no translation between syntaxes at borders, resulting in addresses like

research!ucbvax!@cmu-cs-pt.arpa:@CMU-ITC-LINUS:dave%CMU-ITC-LINUS@CMU-CS-PT.

Rob Pike and Peter Weinberger used that address as the opening example in their paper “The Hideous Name,” which examines good and bad examples of naming in both file systems and mail addresses. Name spaces, they argue, are most powerful when new semantics can be added without adding new syntax.

The mail examples in “The Hideous Name” are not particularly relevant anymore. Everyone has standardized on the two-level ARPANET @ syntax, and the other syntaxes are used only by people looking for open mail relays. But the file system examples are more relevant today than they were 20 years ago.

Unix file names are the canonical example: a program that parses file names like /etc/passwd will have no problem parsing names of ‘special’ files like /dev/stdin. The latter is special only in semantics, not in syntax: if you run cp /dev/stdin /tmp, cp will have no trouble deciding the name of the target file (/tmp/stdin). Contrast this with the convention of using to mean standard input: what target should cp – /tmp use? In fact (on Linux, at least) it looks for a file named rather than using standard input, illustrating another problem. Special syntax is often implemented by individual programs rather than the operating system, resulting in non-uniform handling of the names. It's hard to justify why (again, on Linux) cat – >/tmp/a means something different from cp – /tmp/a.

Providing uniform interpretation of names means having some central arbiter, tradtionally the kernel. Then all programs use the same name space, and one can apply old tools to new resources. For example, if other machines' file systems were mounted on /n/system/,

sed 's!:.*!!' /n/*/etc/passwd | sort -u 

produces a list of all the user of all the machines. And if backups are mounted on /dump,

ls -l /dump/am/2007/02*/etc/passwd

shows all the password files from last February. Special semantics, but no special syntax.

Better than changing the kernel for each file system is having a general mechanism by which the kernel can defer interpretation of particular file trees to other programs. Plan 9 was the first system to make extensive use of such user-level file servers. FUSE now makes it possible to do the same on modern Unixes. For example, Amit Singh has built some interesting new file systems for OS X using MacFUSE.

Pike and Weinberger end their paper with an observation from Doug McIlroy:

bad notations can stifle progress. Roman numerals hobbled mathematics for a millennium but were propagated by custom and by natural deference to authority. Today we no longer meekly accept individual authority. Instead, we have “standards,” impersonal imprimaturs on convention. Some standards are sound and indispensable; some simply celebrate bureaucratic littleness of mind. A harvest of gimmicks to save appearances within the standard has grown up, then gimmicks to save the appearances within the appearances. You know how each one got there: an overnight hack to paste another tumor onto a wild cancerous growth. The concern was with method, regardless of results. The result is extravagantly worse than Roman numerals: you can't read the notation right to left or left to right. As an amalgam of languages, it can't be deciphered by a native speaker of any one of them, much as if we were to switch at random places in a number between Roman and Arabic signs and between big-endian and little-endian order. But now that it all “works” — at least for the strong of stomach — the tumors themselves are being standardized.

What bad notations are stifling your progress?

Absolute File System Design

Butler Lampson and Robert Sproull's 1979 SOSP paper “An Open Operating System for a Single-User Machine” (PDF; also HTML) describes the operating system on the Alto, a groundbreaking personal networked workstation built at Xerox PARC in the late 1970s and early 1980s. Many Alto features became everyday aspects of modern computing environments, but one that didn't catch on was the Alto's robust file system.

The Alto file system was designed to withstand disk or programming errors. If some set of disk blocks were lost, the Alto file system would only lose the data contained in those sectors. If you lose a data block from a file, you lose just that data block. If you lose a block in a directory, you lose the name for that file but not the file itself.

Section 3 of the paper gives the details. The basic technique is that each disk block contains a small header giving the id number and version of the associated file along with the block's offset in the file and the amount of data in the block (the last block in a file might contain less than a block worth of data). Lampson and Sproull refer to this information as absolute: it is the single point of truth for the block in question. A disk scavenger (like Unix's fsck) can scan the entire file system reading block labels to reconstruct the contents of any file.

Scanning the entire file system to find block contents is of course very slow. The block label also contains pointers to the previous and next block in a file, so that reading a file can just follow the block pointers (the file system layout algorithm arranged that much of the time files would be laid out sequentially so that following these pointers would not be too expensive in terms of disk seeks). These pointers are called hints, because they exist only for performance, not for correctness. If the hints are incorrect, that fact will become clear when they are used (the label of the pointed at block will not agree with expectations), and the scavenger can be invoked to fix the file system.

Contrast this scheme to inodes in a modern Unix file system (say, BSD FFS or Linux Ext2/3). If you lose the 512-byte disk block containing a 1GB file's inode, the entire file is inaccessible. No amount of scavenging can help, because the list of blocks making up the file was only contained in the inode, and the blocks themselves are not tagged with which file they belong to. This is much less robust than the Alto file system. Of course, the situation is even worse if you lose the 512-byte disk block containing the file system superblock: then you've lost the entire file system. Because the superblock is so crucial, file systems usually maintain backup copies of the superblock scattered across the disk. Inodes are almost as important but not so carefully guarded.

Why don't modern file systems don't use labels like the Alto did?

Disk geometry might play a role: it is hard to reconcile a (say) 32 byte label with (say) a 4096 byte data block. You put the label in the data block, cutting the actual data size to 4064 bytes. You could put the label in the sector before the block, taking up 512 bytes for a 32 byte label. Or you could write a label sector containing 16 labels every 16 data blocks. The third is probably the best option, but all of them are somewhat cumbersome. (But you'd lose less data!)

Another reason is historical: the original Unix file system didn't have any such mechanism, and FFS and Ext2/3 differ little from the original Unix file system.

Another reason might be philosophical: you're supposed to handle disk failures with backups, not clever file system tricks. This is the least useful argument: who wouldn't want a file system that needed to resort to backups less often?

Another reason might be pragmatic: maybe disks fail in different ways now than they did in 1979. (Of course, that doesn't explain the original Unix file system or FFS.) Since there's so little data on disk failures, it's hard to say anything for certain.

Whatever the arguments against labels, it is undeniable that they produced a very robust system. In fact, the operating system on the Alto could be almost completely swapped out an replaced with user programs, each of which could supply its own file system implementation. Over time, there came to be three file system implementations sharing the disk, one in BCPL, one in Mesa, and one in Smalltalk. In a retrospective paper, Lampson wrote:

If a disk address is wrong, the label check will detect the error, and various recovery mechanisms can be invoked. A Scavenger program, written by Jim Morris, takes about a minute to check or restore the consistency of an Alto file system; it can be routinely run by nonprofessional users. As a result of this conservative design, the file system has proved to be very reliably; loss of any information other than bits physically damaged on the disk is essentially unheard of. This reliability is achieved in spite of the fact that many programs besides the standard operating system have access to the disk.

Who wouldn't want a robust system like that? And why aren't more of today's file systems that robust?