Lessons from the Debian/OpenSSL Fiasco
Posted on Wednesday, May 21, 2008.
Last week, Debian announced that in September 2006
they accidentally broke the OpenSSL pseudo-random number generator
while trying to silence a Valgrind warning.
One effect this had is that the
installed on recent Debian systems (and Debian-derived systems like Ubuntu)
could only generate 32,767 different possible SSH keys of a given type and size,
so there are a lot of people walking around with the same keys.
Many people have had fingers pointed at them, but it is not really interesting who made the mistake: everyone makes mistakes. What's interesting is the situation that encouraged making the mistake and that made it possible not to notice it for almost two years.
To do that, you have to understand the code involved and the details of the bug; those require understanding a little bit about entropy and random number generators.
Entropy is a measure of how surprising a piece of data is. Jon Lester didn't pitch in last night's Red Sox game, but that's not surprising--he only pitches in every fifth game, so most nights he doesn't pitch. On Monday night, though, he did pitch and (surprise!) threw a no-hitter. No one expected that before the game: it was a high-entropy event.
Entropy can be measured in bits: a perfect coin flip produces 1 bit of entropy. A biased coin produces less: flipping a coin that comes up heads only 25% of the time produces only about 0.4 bits for tails and 2 bits for heads, or 0.8 bits of entropy on average. The exact details aren't too important. What is important is that entropy is a quantitative measure of the amount of unpredictability in a piece of data.
An ideal random byte sequence has entropy equal to its length, but most of the time we have to make do with lower-entropy sequences. For example, the dates of Major League no-hitters this year would be a very unpredictable sequence, but also a very short one. On the other hand, collecting the dates that Lester pitches for the Red Sox this year is fairly predictable—it's basically every fifth day—but there are still unexpected events like injuries and rain-outs that make it not completely predictable. The no-hitter sequence is very unpredictable but is very short. The Lester starts sequence is only a little unpredictable but so much longer that it likely has more total entropy.
Computers are faced with the same problem: they have access to long low-entropy (mostly predictable) sources, like the interarrival times between device interrupts or static sampled from a sound or video card, but they need high-entropy (completely unpredictable) byte sequences where every bit is completely unpredictable. To convert the former into the latter, a secure pseudo-random number generator uses a deterministic function that acts as an “entropy juicer.” It takes a large amount of low-entropy data, called the entropy pool, and then squeezes it to produce a smaller amount of high-entropy random byte sequences. Deterministic programs can't create entropy, so the amount of entropy coming out can't be greater than the amount of entropy going in. The generator has just condensed the entropy pool into a more concentrated form. Cryptographic hash functions like MD5 and SHA1 turn out to be good entropy juicers. They let every input bit have some effect on each output bit, so that no matter which input bits were the unpredictable ones, the output has a uniformly high entropy. (In fact this is the very essence of a hash function, which is why cryptographers just say hash function instead of made-up terms like entropy juicer.)
The bad part about entropy is that you can't actually
measure it except in context: if you read a 32-bit number
/dev/random, as I just did,
you're likely to be happy with 4016139919,
but if you read ten more, you won't be happy if you
keep getting 4016139919.
(That last sentence is, technically, completely flawed, but you get the point.
See also this comic and this comic.)
The nice thing about entropy juicers is that as long as there's
some entropy somewhere in the pool, they'll extract it.
It never hurts to add some more data to the pool: either
it will add to the collective entropy, or it will have no effect.
OpenSSL implements such a hash-based entropy juicer.
It provides a function
RAND_add(buf, n, e) that adds a
buffer of length
n and entropy
to the entropy pool.
Internally, the entry pool is just an MD5 or SHA1 hash state:
MD_update to add the bytes
to a running hash computation.
e is an assertion made by the
caller about the entropy contained in
OpenSSL uses it to keep a running estimate of the total amount
of entropy in the buffer.
OpenSSL also provides a
RAND_bytes(buf, n) that
returns a high-entropy random byte sequence.
If the running estimate indicates that there isn't enough
entropy in the pool,
RAND_bytes returns an error.
This is entirely reasonable.
The Unix-specific code in OpenSSL seeds the entropy juicer
with some dodgy code. The essence of
rand_unix.c is (my words):
char buf; fd = open("/dev/random", O_RDONLY); n = read(fd, buf, sizeof buf); close(fd); RAND_add(buf, sizeof buf, n);
Notice that the parameters to
RAND_add say to
use the entire buffer but only expect
RAND_load_file does a similar
thing when reading from a file, and there the uninitialized
reference is explicitly marked as intentional (actual code):
i=fread(buf,1,n,in); if (i <= 0) break; /* even if n != i, use the full array */ RAND_add(buf,n,i);
The rationale here is that including the uninitialized fragment at the end of the buffer might actually increase the actual amount of entropy, and in any event being honest about the amount of entropy being claimed won't break the entropy pool estimates.
Similarly, the function
whose main purpose is to fetch
bytes from the juicer, adds the contents of
to the entropy pool (it behaves like
RAND_add(buf, n, 0))
before it fills in
In at least three different places, then, the OpenSSL developers explicitly chose to use uninitialized buffers as possible entropy sources. While this is theoretically defensible (it can't hurt), it's mostly a combination of voodoo and wishful thinking, and it makes the code difficult to understand and analyze.
In particular, the
RAND_bytes convention causes
problems at every call site that looks like:
char buf; n = RAND_bytes(buf, sizeof buf);
This idiom causes so many warnings with Valgrind (and its commercial cousin, Purify)
that there is an
#ifdef to turn the behavior off:
#ifndef PURIFY MD_Update(&m,buf,j); /* purify complains */ #endif
The other two cases occur much more rarely:
RAND_poll, one would have to be
the kernel had very little randomness to spare,
RAND_load_file on a file
that reported one size in
returned fewer bytes in
When they do occur, a different instance of
the one in
RAND_add, is on the stack trace
reported by Valgrind.
A Debian maintainer was faced with Valgrind reports of uses
of uninitialized data at a call to
which is used to mix a buffer into the entropy pool inside
Normally you might look farther up the call stack, but
there were multiple instances, with different callers.
The only common piece was
Since the second
MD_update was already known
to be non-essential—it could be safely disabled to run under Purify—it
stood to reason that perhaps the first one was non-essential as well.
The context of each call looks basically identical (the source code
is not particularly well factored).
The Debian maintainer knew he didn't understand the code, so he asked for help on the openssl-dev mailing list:
Subject: Random number generator, uninitialised data and valgrind. Date: 2006-05-01 19:14:00 When debbuging applications that make use of openssl using valgrind, it can show alot of warnings about doing a conditional jump based on an unitialised value. Those unitialised values are generated in the random number generator. It's adding an unintialiased buffer to the pool. The code in question that has the problem are the following 2 pieces of code in crypto/rand/md_rand.c: 247: MD_Update(&m,buf,j); 467: #ifndef PURIFY MD_Update(&m,buf,j); /* purify complains */ #endif Because of the way valgrind works (and has to work), the place where the unitialised value is first used, and the place were the error is reported can be totaly different and it can be rather hard to find what the problem is. ... What I currently see as best option is to actually comment out those 2 lines of code. But I have no idea what effect this really has on the RNG. The only effect I see is that the pool might receive less entropy. But on the other hand, I'm not even sure how much entropy some unitialised data has. What do you people think about removing those 2 lines of code?
He got two substantive replies. The first reply was from someone with an email address at openssl.org who appears to be one of the developers:
List: openssl-dev Subject: Re: Random number generator, uninitialised data and valgrind. Date: 2006-05-01 22:34:12 > What I currently see as best option is to actually comment out > those 2 lines of code. But I have no idea what effect this > really has on the RNG. The only effect I see is that the pool > might receive less entropy. But on the other hand, I'm not even > sure how much entropy some unitialised data has. > Not much. If it helps with debugging, I'm in favor of removing them. (However the last time I checked, valgrind reported thousands of bogus error messages. Has that situation gotten better?)
The second is from someone not at openssl.org who answers the developer's question:
List: openssl-dev Subject: Re: Random number generator, uninitialised data and valgrind. Date: 2006-05-02 6:08:10 > Not much. If it helps with debugging, I'm in favor of removing them. > (However the last time I checked, valgrind reported thousands of bogus > error messages. Has that situation gotten better?) I recently compiled vanilla OpenSSL 0.9.8a with -DPURIFY=1 and on Debian GNU/Linux 'sid' with valgrind version 3.1.1 was able to debug some application using both TLS/SSL as S/MIME without any warning or error about the OpenSSL code. Without -DPURIFY you're indeed flooded with warnings. So yes I think not using the uninitialized memory (it's only a single line, the other occurrence is already commented out) helps valgrind.
Both essentially said, “go ahead, remove the
The Debian maintainer did, causing
RAND_add not to add
anything to the entropy pool but still update the entropy estimate.
There were other
MD_update calls in the code that
buf, and those remained. The only one
that was a little unpredictable was one in
that added the current process ID to the entropy pool on each call.
That's why OpenSSH could still generate 32,767 possible SSH keys
of a given type and size (one for each pid) instead of just one.
Cascade of Failures
Like any true fiasco, a cascade of bad decisions and mistakes all lined up to make this happen:
- The OpenSSL code was too clever by half. The intentional uninitialized memory references obscured the code and made it hard to test for real bugs.
The OpenSSL code isn't very well organized. There was code in both
RAND_bytesto update entropy to the pool instead of having that functionality factored into one place. The fact that one instance wasn't essential made it look like the other instance wasn't essential.
The OpenSSL code's paranoia hid the Debian-introduced bug.
Throwing the pid into the entropy pool on each call to
RAND_bytesisn't actually helping create entropy, but it does keep the buggy Debian version from being completely deterministic. If it had been completely deterministic, the bug would likely have been noticed much sooner, probably long before it got into a stable release.
The Debian maintainer asked for help with code he didn't
understand, but the snippets in his post to the OpenSSL list
don't include enough context to understand where the
MD_updatecalls really are in the code. Showing a few lines around each call wouldn't have made the situation clearer, since the two code sections look pretty similar. Mentioning the names of the functions containing each call might have helped.
- The OpenSSL developers responded incorrectly, probably without actually looking at the code to see which calls were being talked about. The presence of the #ifdef PURIFY on the one call likely made it very easy to assume the other call was similarly unimportant.
Reading various blogs you find various involved party's intelligence being insulted, but this is really a failure of process, not of intelligence. The maintainer knew he wasn't an expert on the code in question, asked the experts for help, and was given bad information.
I've spent a lot of the past decade maintaining both Plan 9 and a port of the Plan 9 software to Unix. I've edited a lot of code I didn't fully understand, and I've answered questions about code that I did understand when other people came asking. I've made my share of embarrassing mistakes editing unfamiliar code, and I've probably given my share of unintentionally wrong advice. Even the best programmers are always going to make mistakes. The lessons of the fiasco, for me, are the steps that can be taken to make the mistakes less frequent and easier to find.
For me, the lessons I take away are:
- Try not to write clever code. Try to write well-organized code.
- Inevitably, you will write clever, poorly-organized code. If someone comes along asking questions about it, use that as a sign that perhaps the code is probably too clever or not well enough organized. Rewrite it to be simpler and easier to understand.
- Avoid voodoo code. Zeroing a variable multiple times, for example, doesn't affect correctness now, but it does make the code harder to understand and easier to break without noticing.
- Mailing list discussions aren't a substitute for real code review.
People respond to email when they're tired or on their way out the door.
Code reviews are supposed to be thorough and considered.
Showing a side-by-side file diff of the before and after
md_rand.cto an OpenSSL developer as a real code review would likely have turned up the mistake.
- Distributions like Debian have to maintain their own copies of some programs at least temporarily. That's inevitable, because not all projects will run on Debian's time constraints. But I'm surprised there was no followup with the OpenSSL developers once the patch was created, trying to get them to accept it into the main tree. That could have provoked a code review too. Failing that, I'm surprised Debian doesn't have a guy whose job it is to understand OpenSSL and other security-critical bits of code and vet local changes in a formal process.
In my own software maintenance, I think I'm doing a pretty good job on the first three, and not such a great job on the last two. It is tempting to say that for a low-profile project like Plan 9 they're not really needed, but that's more likely to be laziness than the truth.
Links and Notes
Here's the Debian announcement. I have not seen any account of how Luciano Bello discovered the bug. If you do, please post a comment.
Here's one of the original Debian bug reports about Valgrind complaints.
The original poster actually has a correct fix;
the maintainer incorrectly (but reasonably) argues
that since there are other contexts that also cause
uninitialized references, the bug must be in
I've been talking about
but in the code the actual functions are
The latter are used as the implementation of the former.
RAND_poll is in
RAND_load_file is in