research!rsc

Thoughts and links about programming, by

RSS

Floating Point to Decimal Conversion is Easy
Posted on Friday, July 1, 2011.

Floating point to decimal conversions have a reputation for being difficult. At heart, they're really very simple and straightforward. To prove it, I'll explain a working implementation. It only formats positive numbers, but expanding it to negative numbers, zero, infinities and NaNs would be very easy.

An IEEE 64-bit binary floating point number is an integer v in the range [252, 253) times a power of two: f = v × 2e. Constraining the fractional part of the unpacked float64 to the range [252, 253) makes the representation unique. We could have used any range that spans a multiplicative factor of two, but that range is the first one in which all the values are integers.

In Go, math.Frexp unpacks a float64 into f = fr × 2exp where fr is in the range [½, 1). (C's frexp does too.) Converting to our integer representation is easy:

fr, exp := math.Frexp(f)
v := int64(fr * (1<<53))
e := exp - 53

To convert the integer to decimal, we'll use strconv.Itoa64 out of laziness; you know how to write the direct code.

buf := make([]byte, 1000)
n := copy(buf, strconv.Itoa64(v))

The allocation of buf reserves space for 1000 digits. In general 1/2e requires about 0.7e non-zero decimal digits to write in full. For a float64, the smallest positive number is 1/21074, so 1000 digits is plenty. The second line sets n to the number of bytes copied in from the string representation of the (integer) v. Throughout, n will be the number of digits in buf. Note that we're working with ASCII decimal digits '0' to '9', not bytes 0-9.

Now we've got the decimal for v stored in buf. Since f = v × 2e, all that remains is to multiply or divide buf by 2 the appropriate number of times (e or -e times).

Here's the loop to handle positive e:

for ; e > 0; e-- {
    δ := 0
    if buf[0] >= '5' {
        δ = 1
    }
    x := byte(0)
    for i := n-1; i >= 0; i-- {
        x += 2*(buf[i] - '0')
        x, buf[i+δ] = x/10, x%10 + '0'
    }
    if δ == 1 {
        buf[0] = '1'
        n++
    }
}
dp := n

Each iteration of the inner loop overwrites buf with twice buf. To start, the code determines whether there will be a new digit (δ = 1), which happens when the leading digit is at least 5. Then it runs up the number from right to left, just as you learned in grade school, multiplying each digit by two and using x to carry the result. The digit buf[i] moves into buf[i+δ]. At the end, if the code needs to insert an extra digit, it does. Run e times.

After the loop finishes, we record in dp the current location of the decimal point. Since buf is still an integer (it started as an integer and we've only doubled things), the decimal point is just past all the digits.

Of course, e might have started out negative, in which case we've done nothing and still need to halve buf e times:

for ; e < 0; e++ {
    if buf[n-1]%2 != 0 {
        buf[n] = '0'
        n++
    }
    δ, x := 0, byte(0)
    if buf[0] < '2' {
        δ, x = 1, buf[0] - '0'
        n--
        dp--
    }
    for i := 0; i < n; i++ {
        x = x*10 + buf[i+δ] - '0'
        buf[i], x = x/2 + '0', x%2
    }
}

Dividing by two needs an extra digit if the last digit is odd, to store the final half; in that case we add a 0 to buf so that we'll have room to store a completely precise answer.

After adding the 0, we can set up for the division itself. If the first digit is less than 2, it's going to become a zero; in the interest of avoiding leading zeros we use an initial partial value x and move digits up (δ = 1) during the division. The multiplication ran from right to left copying from buf[i] to buf[i+δ]. The division runs left to right copying from buf[i+δ] to buf[i].

Now buf[0:n] has all the non-zero digits in the exact decimal representation of our f, and dp records where to put the decimal point. We could stop now, but we might as well implement correct rounding while we're here.

The interesting case is when we have more digits than requested (n > ndigit). To make the decision, just like in grade school, we look at the first digit being removed. If it's less than 5, we round down by truncating. If it's greater than 5, we round up by incrementing what's left after truncation. If it's equal to 5, we have to look at the rest of the digits being dropped. If any of the rest of the digits are nonzero, then rounding up is more accurate. Otherwise we're right on the line. In grade school I was taught to handle this case by rounding up: 0.5 rounds to 1. That's a simple rule to teach, but breaking this tie by rounding to the nearest even digit has better numerical properties, because it rounds up and down equally often, and it is the usual rule employed in real calculations.

This all results in the thunderclap rounding condition:

buf[prec] > '5' ||
buf[prec] == '5' && (nonzero(buf[prec+1:n]) || buf[prec-1]%2 == 1)

You can see why children are taught buf[prec] >= '5' instead.

Anyway, if we have to increment after the truncation, we're still working in decimal, so we have to handle carrying incremented 9s ourselves:

i := prec-1
for i >= 0 && buf[i] == '9' {
    buf[i] = '0'
    i--
}
if i >= 0 {
    buf[i]++
} else {
    buf[0] = '1'
    dp++
}

The loop handles the 9s. The if increments what's left, or, if we turned the whole string into zeros, it simulates inserting a 1 at the beginning by changing the first digit to a 1 and moving the decimal place.

Having done that and set n = prec, we can piece together the actual number:

return fmt.Sprintf("%c.%se%+d", buf[0], buf[1:n], dp-1)

That prints the first digit, then a decimal point, then the rest of the digits, and finally the N suffix. The exponent must adjust the number by dp−1 because we are printing all but the first digit after the decimal point.

That's all there is to it. To print  f = v × 2e  you write v in decimal, multiply or divide the decimal by 2 the right number of times, and print what you're left holding.

The Plan 9 C library and the Go library both use variants of the approach above. On my laptop, the code above does 100,000 conversions per second, which is plenty for most uses, and the code is easy to understand.

Why are some converters more complicated than this? Because you can make them a little faster. On my laptop, glibc's sprintf(buf, "%.50e", M_PI) is about 15 times faster than an equivalent print using the code above, because the implementation of sprintf uses much more sophisticated mathematics to speed the conversion.

If you have a stomach for pages of equations, there are many interesting papers that discuss how to do this conversion and its inverse more quickly:

Just remember: The conversion is easy. The optimizations are hard.

Code

The full program is available in this Gist or you can try it using the Go playground.

(Comments originally posted via Blogger.)

  • nicolas cellier (July 1, 2011 10:44 AM) In general 1/2e requires about 0.7e non-zero decimal digits to write in full:
    It requires exactly e digits after the decimal point, but removing the leading zeros is interesting...

    2^10 > 10^3, thus 2^-10e has at least 3e leading zeroes.
    Thus 2^-e has at least trunc(0.3e) leading zeroes, thus the 0.7e digits. Very good to know :)

  • Jan-willem (July 1, 2011 10:47 AM) Do you simply cap prec to avoid extra decimal places? I'm worried about the 1.99999999999 problem that Steele & White discuss, where you really want the shortest decimal that reads as the number you've got.

  • Russ Cox (July 1, 2011 10:55 AM) http://golang.org/src/pkg/strconv/ftoa.go's roundShortest has the logic for the '1.9999999999' problem.

    Basically, it's easy to figure out the (v', e') for the minimum (least) number you could print to get the right answer when converted back, and also the maximum number. Then you walk all three at the same time until they start to differ. Then you can round. If a short form is a valid answer, it you'll get a digit difference very quickly.

  • GlacJAY (July 27, 2011 2:08 AM) Lost minus sign in the power's range.