Mel was Real

Things have been pretty crazy around here, so no new posts recently, but this caught my attention.

James Grimmelmann reports a few interesting facts about the Librascope/Royal McBee LGP-30. Specifically, Mel was real, and the hexadecimal digits were 0 1 2 3 4 5 6 7 8 9 F G J K Q W!

I can't resist telling a story. James is now a professor at the New York Law School, but he and I were fellow computer science students in college. He was a TA for the undergraduate introductory theory course when I took it, and his duties included, among other more enjoyable tasks, grading the first problem set. The “challenge” problem asked for a proof that in any group of six people, there is either a group of three that all know each other (a clique) or a group of three that all don't know each other (an anti-clique). I didn't see how to prove it*, so I just wrote a simple C program to iterate over all 215 possibilities and verify that the condition holds. I wrote something about using brute force and attached the C program to the problem set. This problem set was just a warm-up in proof techniques, but the course is all about what can and can't be computed by various systems. The pinnacle result, of course, is Turing's proof that some problems are uncomputable and Cook's definition of NP-completeness. In that light, my brute force proof ran against everything the course was trying to teach.

James was the grader for that question. Next to the comment about brute force, he wrote “Wait until the NP-completeness unit, Mr. Brute Force” At the end, he added, “Very, very clever. Such tricks will not avail you much where this course is headed.” I didn't appreciate it then, but in hindsight I find it very funny.

* Truth be told, I still don't see how to prove it. You're supposed to use the pigeonhole principle, and the details have been explained to me in the past, but I can never reconstruct them on demand. The exhaustive search still feels easier, though perhaps less enlightening.

Backups, heal thyself

MIT venti server

Around 2000, Sean Quinlan and Sean Dorward built a content-addressed storage system called Venti. Their paper, “Venti: a new approach to archival storage” (PDF; also HTML), won best paper at the storage conference where it was presented. Today's post is about how Venti can heal itself.

Venti names stored objects by their SHA1 hashes: you write a block, you get back the SHA1 hash. You show up with a SHA1 hash, you can get the block. Objects larger than a block are typically broken up into blocks and then stored in a tree. Each pointer in the tree is the SHA-1 hash of the block it points to:

Hash trees

Thus, if you've stored a big file (say, a disk image) and you change one block, only that one block and the (probably five or so) blocks that point at it actually change: the rest of the storage can be shared with the previous copy of the file.

The primary benefit of using content-addressing is that duplicates get removed for free. If two different people write the same data, Venti only stores the data once. Or if one person archives the same data twice, Venti only stores the data once. My research group at MIT runs a Venti server to store FFS disk images of its main file server. Writing the same images over and over again only stores changed blocks, and if multiple people have the same giant files, they take up extra space on the file server itself, but not in Venti. Our live file system usage is about 600GB, but the incremental backup cost in Venti is about 2 GB per night. All our nightly backups going back to 2005 take up about 3 TB, or 1.7 TB compressed.

Another benefit of using content-addressing is that if the data goes bad, you notice. You can check the SHA-1 hash of the data returned by a read to make sure it is the block you expected. Clients can verify the data returned by the server, and the server can verify the data returned by its disks. In fact the latter has been the bigger problem in Venti installations: disks go bad, and if you tend to notice more if you have checks like these.

A few years ago, I wanted to download all the 6 GB of file system traces used in the Venti paper. Not all of the files downloaded correctly: a few were truncated too early. It turned out that the Venti server now storing the traces at Bell Labs had some disk corruption, and the Venti server noticed because the SHA-1 hashes didn't match expectations. No problem, just use the earlier Venti server that the data was copied from. That server has had disk corruption too, some of it overlapping the bad blocks. So there are still traces that can't be read. Okay, well the traces were saved on backup tapes too. Dig those up, and the backup tapes aren't complete either: still two trace files with missing blocks. There are also backup tapes of the original file system blocks that were used to create the traces. Those are actually readable, so we can regenerate the bad trace files and have a complete set again.

Here's the fun part. The original trace files were stored as a single Venti archive, a giant hash tree like the one pictured above that contains pointers to all the trace files. Not all the files can be read out of the archive, since we have these pointers to missing blocks. Creating a new archive holding just the two missing trace files makes the old archive start working. Now that Venti has blocks with the desired SHA1 hashes, the pointers to missing blocks become pointers to existing blocks. Given the right raw materials, the broken archive healed automatically.

With a little engineering, you could create a typical incremental backup system that didn't use hashes to identify duplicate blocks, and you'd probably end up being within a factor of two as space-efficient as Venti. For a long time I had this lingering fear that perhaps indexing by SHA-1 hash, which is pretty expensive to implement efficiently, was overkill. Watching that previously broken archive start working again was the moment when I fully believed that Venti's was the right approach.


Further reading. The use of SHA1 specifically isn't really important: any strong collision-resistant hash function will do, though people disagree on the wisdom of this approach. Remember that Venti is being used in situations where there are no adversaries: the main concern is an accidental collision. The second paper just linked to summarizes:

We conclude that it is certainly fine to use a 160-bit hash function like SHA1 or RIPEMD-160 with compare-by-hash. The chances of an accidental collision is about 2^-160. This is unimaginably small. You are roughly 2^90 times more likely to win a U.S. state lottery and be struck by lightning simultaneously than you are to encounter this type of error in your file system.

Bi-quinary and other bases

Doug McIlroy's talk on History of Computing at Bell Labs mentioned that George Stibitz's relay-based decimal calculators that stored digits in what he called bi-quinary: “Arithmetic was done in bi-quinary, a two out of five representation for decimal integers, and if there weren't exactly two out of five relays activated it would stop.” This allowed a sequence of computing jobs to run unattended over the weekend: if the computer detected an error (not exactly two relays active in some digit), then it would stop the current job and start over.

McIlroy's description of the number system as being 5 bits differs from any of the representations given by the Wikipedia entries for bi-quinary coded decimal and biquinary, which describe a system that is basically unary: 5 bits to count 0 through 4 (only 1 bit on at a time) and two top bits: 01 = 0, 10 = 5. Other historical summaries you can find online by searching for Stibitz and bi-quinary seem to back up Wikipedia. In this version, bi-quinary was 1-of-2 plus 1-of-5, not 2-of-5. In the 7-bit version, the bit positions are worth 5, 0, 4, 3, 2, 1, and 0, which isn't very interesting. It's far more interesting to think about the 2-out-of-5 system.

The valid codes in a 2-of-5 system like McIlroy described would be 00011, 00101, 00110, 01001, 01010, 01100, 10001, 10010, 10100, 11000. If you assign these the values 0 through 9, you can set up a system of linear equations and solve them to find the value of each of the five bit positions b4, b3, b2, b1, and b0. The fact that 00011 = 0 means that b1 + b0 = 0. The fact that 00110 - 00101 = 2 - 1 = 1 means that b1 - b0 = 1. These two equalities imply b1 = ½ and b0 = -½. Given that you can pick out the other bits fairly easily: from right to left the "places" are -½, ½, 1½, 3½, and 6½. This works for 0 through 8 but then, sadly, fails for 9: 11000 = 6½ + 3½ = 10. No other code assignment has a solution either (I wrote a small program to try them all.)

I guess that's why the 1-of-2 plus 1-of-5 form was more popular than 2-of-5: it was probably easier to design the arithmetic circuits if each bit could be assigned a particular value. It's too bad, because 2-of-5 would have mathematical elegance on its side if it could be made to work.

If you dig around, you can find other esoteric number representations: negabinary has base -2 (negative two), so the positional values are 1, -2, 4, -8, and so on. Thus 5 is 101 but 3 is 111 = 1-2+4. Negabinary appears to have been first invented by Vittorio Grunwald in 1885. It was used in a few experimental computers in the 1950s, but it didn't catch on. Even weirder than negabinary numbers are the quater-imaginary numbers, invented by a high school student named Donald Knuth in 1955. Those use base 2i (where i is the imaginary square root of -1). (Knuth is also responsible for the negabinary history.)

Odder still are balanced ternary (ternary but digits are -1, 0, and 1) and Bubble Babble, but no discussion of interesting counting systems or high school math nerds would be complete without mentioning counting on your fingers in binary.