research!rsc

Thoughts and links about programming, by

Absolute File System Design
Posted on Monday, February 11, 2008.

Butler Lampson and Robert Sproull's 1979 SOSP paper “An Open Operating System for a Single-User Machine” (PDF; also HTML) describes the operating system on the Alto, a groundbreaking personal networked workstation built at Xerox PARC in the late 1970s and early 1980s. Many Alto features became everyday aspects of modern computing environments, but one that didn't catch on was the Alto's robust file system.

The Alto file system was designed to withstand disk or programming errors. If some set of disk blocks were lost, the Alto file system would only lose the data contained in those sectors. If you lose a data block from a file, you lose just that data block. If you lose a block in a directory, you lose the name for that file but not the file itself.

Section 3 of the paper gives the details. The basic technique is that each disk block contains a small header giving the id number and version of the associated file along with the block's offset in the file and the amount of data in the block (the last block in a file might contain less than a block worth of data). Lampson and Sproull refer to this information as absolute: it is the single point of truth for the block in question. A disk scavenger (like Unix's fsck) can scan the entire file system reading block labels to reconstruct the contents of any file.

Scanning the entire file system to find block contents is of course very slow. The block label also contains pointers to the previous and next block in a file, so that reading a file can just follow the block pointers (the file system layout algorithm arranged that much of the time files would be laid out sequentially so that following these pointers would not be too expensive in terms of disk seeks). These pointers are called hints, because they exist only for performance, not for correctness. If the hints are incorrect, that fact will become clear when they are used (the label of the pointed at block will not agree with expectations), and the scavenger can be invoked to fix the file system.

Contrast this scheme to inodes in a modern Unix file system (say, BSD FFS or Linux Ext2/3). If you lose the 512-byte disk block containing a 1GB file's inode, the entire file is inaccessible. No amount of scavenging can help, because the list of blocks making up the file was only contained in the inode, and the blocks themselves are not tagged with which file they belong to. This is much less robust than the Alto file system. Of course, the situation is even worse if you lose the 512-byte disk block containing the file system superblock: then you've lost the entire file system. Because the superblock is so crucial, file systems usually maintain backup copies of the superblock scattered across the disk. Inodes are almost as important but not so carefully guarded.

Why don't modern file systems don't use labels like the Alto did?

Disk geometry might play a role: it is hard to reconcile a (say) 32 byte label with (say) a 4096 byte data block. You put the label in the data block, cutting the actual data size to 4064 bytes. You could put the label in the sector before the block, taking up 512 bytes for a 32 byte label. Or you could write a label sector containing 16 labels every 16 data blocks. The third is probably the best option, but all of them are somewhat cumbersome. (But you'd lose less data!)

Another reason is historical: the original Unix file system didn't have any such mechanism, and FFS and Ext2/3 differ little from the original Unix file system.

Another reason might be philosophical: you're supposed to handle disk failures with backups, not clever file system tricks. This is the least useful argument: who wouldn't want a file system that needed to resort to backups less often?

Another reason might be pragmatic: maybe disks fail in different ways now than they did in 1979. (Of course, that doesn't explain the original Unix file system or FFS.) Since there's so little data on disk failures, it's hard to say anything for certain.

Whatever the arguments against labels, it is undeniable that they produced a very robust system. In fact, the operating system on the Alto could be almost completely swapped out an replaced with user programs, each of which could supply its own file system implementation. Over time, there came to be three file system implementations sharing the disk, one in BCPL, one in Mesa, and one in Smalltalk. In a retrospective paper, Lampson wrote:

If a disk address is wrong, the label check will detect the error, and various recovery mechanisms can be invoked. A Scavenger program, written by Jim Morris, takes about a minute to check or restore the consistency of an Alto file system; it can be routinely run by nonprofessional users. As a result of this conservative design, the file system has proved to be very reliably; loss of any information other than bits physically damaged on the disk is essentially unheard of. This reliability is achieved in spite of the fact that many programs besides the standard operating system have access to the disk.

Who wouldn't want a robust system like that? And why aren't more of today's file systems that robust?

(Comments originally posted via Blogger.)

  • garyj (February 11, 2008 10:08 AM) the reason this doesn't matter at all is because disks don't fail in such a way as to make it matter. I.e., if you lose a few blocks, it is VERY LIKELY you've lost a lot of blocks, or that entire sectors are unreadable. Studies of disk failures also indicate that once you've lost a few sectors, you are likely going to lose a lot more soon (because such errors are due to platter warpage, foreign material, head alignment, etc).

    Unix's scheme of keeping multiple copies of inode's is actually much more reliable in dealing with such situations.

    Of course, with the move to solid state drives, maybe Lampson and Sproull's method deserves another look (anyone have any pointers to research on flash drive failures?).

  • seregine (February 11, 2008 7:37 PM) Interesting. So if I insert a block at the beginning of a 1GB file, the FS has to write to every other block to update its offset?

  • Russ Cox (February 12, 2008 3:19 AM) @seregine: Sure, but the traditional Unix file interface doesn't provide an "insert" operation anyway. "Inserting" at the beginning of the file is accomplished by writing the entire file out again at a different offset anyway.

    This is why MP3 tags are usually placed at the end of the MP3 file, making them easier to edit.

  • fanf (February 15, 2008 2:51 PM) One reason against block labels is that they interleave data and metadata, so users can't mmap files and the OS can't have a unified buffer cache.

  • Russ Cox (February 15, 2008 4:13 PM) @fanf: Placing a block full of N labels every N data blocks would address that problem.

  • mike.w.meyer (February 15, 2008 8:42 PM) Interleaving data and metadata also hoses the performance of
    multi-block (not even large) transfers. If I don't do that, I can just
    point the disk controller at consecutive chunks of RAM, and let it
    load blocks directly into the users output buffer. With metadata at
    the start of the block, I have to put the block in an intermediate
    buffer in order to copy out just the data.

  • Russ Cox (February 17, 2008 6:06 AM) @mike.w.meyer: I don't believe this is a concern in practice, since it's far more common to load the data first into the kernel's buffer cache. The buffer cache can certainly handle some of the buffers being label blocks, so it would still be a sequential transfer.

  • Marius Gedminas (March 15, 2008 5:52 AM) What are the performance implications of this kind of file system? Would seeking into the middle of a large file require you to read all the blocks from the beginning of the file to follow the metadata chain? How about appending to the end of a big file?

    How much disk space would this kind of extensive metadata use, compared to something like ext3?

  • Russ Cox (March 15, 2008 8:10 AM) @marius: Sure, if you used only the linked list to navigate, then the file system accesses would be slow. But there's nothing stopping you from building block trees like Unix file systems and using them as hints to find the blocks faster.

    The space overhead is pretty minimal: a few words per 4096-byte block would be under 1%. Make the block size bigger and the overhead goes down.

  • forsyth (April 30, 2008 3:25 PM) ken thompson's file server for Plan 9 tags each block with a little data (type and path) which isn't quite enough to allow recovering whole files automatically but does allow simple consistency checks that (for me) not only detected problems (hardware and software) but prevented them escalating to destroy file systems (eg, data, directory, and free list blocks have different types).

  • Keith (June 24, 2009 7:59 PM) Sorry to arrive at the party so late. But as a file system nerd I couldn't help commenting...

    Some modern file systems do stuff like this. Unfortunately, they're not systems that are well described in the academic literature. Sun's new ZFS file system stores multiple copies of every metadata block, along with checksums of every block. So if an inode or indirect block gets corrupted you can detect the corruption and find a good copy of the data. It's not the same as Alto, but is makes it much harder to loose track of your data blocks.

    NetApp's WAFL file system stores block identity information along with the block---either in an enlarged 4160-byte block or in a nearby block similar to "label sector" idea you describe.

    There are also a variety of real world issues that make the Alto approach less useful.

    With RAID, I can usually reconstruct lost data (or metadata) without the need to resort to scanning every block in the file system. This doesn't help the individual user with a single-disk file system, but a lot of the more advanced file systems are targeting folks who can pay for them, and these folks typically have RAID.

    With the sizes of modern disks it is increasingly impractical to do a full scan of a file system. Disk sizes are already over 1 TB, and there are folks running file systems that span dozens of those disks. If I lose a few inodes, it may be faster to restore from backup than to scan the entire file system looking for the unreferenced blocks.

    Performance and capacity. At the high-end, you can get disks that will support 520-byte sectors, giving you room to store block labels. But most data is stored on lower-end disks. There you can either use 9 sectors for a 4KB block, essentially wasting 10% of your storage, or you can do something like your "label sector." The label sectors would work, but would force you to do disk writes for each block you modify.

    As you'll notice, however, none of these issues invalidates the Alto approach. There would still be cases where it would save you. And given the numerous clever ideas people are coming up with in new file systems like btrfs, tux3, and HAMMER, you would think more people would find ways to use this idea in modern file systems.