research!rsc tag:research.swtch.com,2012:research.swtch.com 2022-02-14T10:01:00-05:00 Russ Cox https://swtch.com/~rsc rsc@swtch.com Go’s Version Control History tag:research.swtch.com,2012:research.swtch.com/govcs 2022-02-14T10:00:00-05:00 2022-02-14T10:02:00-05:00 A tour of Go’s four version control systems. <p> Every once in a while someone notices the first commit in the Go repo is dated 1972: <pre>% git log --reverse --stat commit 7d7c6a97f815e9279d08cfaea7d5efb5e90695a8 Author: Brian Kernighan &lt;bwk&gt; AuthorDate: Tue Jul 18 19:05:45 1972 -0500 Commit: Brian Kernighan &lt;bwk&gt; CommitDate: Tue Jul 18 19:05:45 1972 -0500 hello, world R=ken DELTA=7 (7 added, 0 deleted, 0 changed) src/pkg/debug/macho/testdata/hello.b | 7 +++++++ 1 file changed, 7 insertions(+) ... </pre> <p> Obviously something silly is going on, and people usually stop there. But Go’s actual version control history is richer and more interesting. For example, there are a few more fake commits and then the fifth commit is the first real one: <pre>commit 18c5b488a3b2e218c0e0cf2a7d4820d9da93a554 Author: Robert Griesemer &lt;gri@golang.org&gt; AuthorDate: Sun Mar 2 20:47:34 2008 -0800 Commit: Robert Griesemer &lt;gri@golang.org&gt; CommitDate: Sun Mar 2 20:47:34 2008 -0800 Go spec starting point. SVN=111041 doc/go_spec | 1197 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1197 insertions(+) </pre> <p> Why does that commit have a different trailer than the three fake commits? <a class=anchor href="#subversion"><h2 id="subversion">Subversion</h2></a> <p> Go started out using Subversion, as part of a small experiment to evaluate Subversion for wider use inside Google. The experiment did not result in wider Subversion use, but the <code>SVN=111041</code> tag in the <a href="https://go.googlesource.com/go/+/18c5b488a3b2e218c0e0cf2a7d4820d9da93a554">first real commit above</a> records that on the original Subversion server, that Go commit was revision 111,041. (Subversion assigns revision numbers in increasing order, and the server was a small monorepo being used for a few other projects besides Go. There were not 111,040 other Go commits that didn’t make it out.) <a class=anchor href="#perforce"><h2 id="perforce">Perforce</h2></a> <p> The SVN tags continue in the logs <a href="https://go.googlesource.com/go/+/05caa7f82030327ccc9ae63a2b0121a029286501">until July 2008</a>, where we see one last SVN commit and then a new form: <pre>commit 777ee7163bba96f2c9b3dfe135d8ad4ab837c062 Author: Rob Pike &lt;r@golang.org&gt; AuthorDate: Mon Jul 21 16:18:04 2008 -0700 Commit: Rob Pike &lt;r@golang.org&gt; CommitDate: Mon Jul 21 16:18:04 2008 -0700 map delete SVN=128258 doc/go_lang.txt | 6 ++++++ 1 file changed, 6 insertions(+) commit 05caa7f82030327ccc9ae63a2b0121a029286501 Author: Rob Pike &lt;r@golang.org&gt; AuthorDate: Mon Jul 21 17:10:49 2008 -0700 Commit: Rob Pike &lt;r@golang.org&gt; CommitDate: Mon Jul 21 17:10:49 2008 -0700 help management of empty pkg and lib directories in perforce R=gri DELTA=4 (4 added, 0 deleted, 0 changed) OCL=13328 CL=13328 lib/place-holder | 2 ++ pkg/place-holder | 2 ++ src/cmd/gc/mksys.bash | 0 3 files changed, 4 insertions(+) </pre> <p> This was the first commit after Go moved from a lightly-used Subversion server to a lightly-used Perforce server. 
Google was a <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/39983.pdf">very heavy Perforce user</a> due to its use of a <a href="https://dl.acm.org/doi/pdf/10.1145/2854146">single giant monorepo for most code</a>. Go was not part of that monorepo world, and at the time, projects that didn’t have to be on the heavily-loaded main server were hosted instead on secondary ones. <p> At the transition to Perforce, you can see the introduction of telltale <code>DELTA=</code>, <code>OCL=</code>, and <code>CL=</code> tags. Perforce imposes a linear ordering on change lists, like Subversion revisions, but each change list ends up with two sequence numbers: it is assigned one when it is created and uses that number while it is a local, pending change list, including in our code review systems. Then, when it is submitted and becomes part of the official history, it is assigned a new number, to keep submitted changes in order. The <code>OCL=</code> is the original change list number, while the <code>CL=</code> is the final one. The <code>R=</code> line means that <code>gri</code> (Robert Griesemer) reviewed the change before it was submitted, using our internal code review system, then called Mondrian. Because the server was so lightly used (and presumably the review so quick), no new change lists had been created or submitted while this one was pending, and its final submission reused its original number instead of needing to create a new one. <p> Many other changes have the same <code>OCL=</code> and <code>CL=</code> because they were created and submitted in a single Perforce command, without review, like <a href="https://go.googlesource.com/go/+/c1f5eda7a2465dae196d1fa10baf6bfa9253808a">the next one</a>: <pre>commit c1f5eda7a2465dae196d1fa10baf6bfa9253808a Author: Rob Pike &lt;r@golang.org&gt; AuthorDate: Mon Jul 21 18:06:39 2008 -0700 Commit: Rob Pike &lt;r@golang.org&gt; CommitDate: Mon Jul 21 18:06:39 2008 -0700 change date OCL=13331 CL=13331 doc/go_lang.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) </pre> <p> You can also see from this that the <code>DELTA=</code> line is added by the code review system, not Perforce: this unreviewed change doesn’t have it. (The last two lines are being provided by the <code>git log --stat</code> command as we explore the Git version of these changes.) <p> The bulk of the pre-open-source development of Go was done on that Perforce server. <a class=anchor href="#mercurial"><h2 id="mercurial">Mercurial</h2></a> <p> The <code>OCL=</code> and <code>CL=</code> lines continue until we get to <a href="https://go.googlesource.com/go/+/b74fd8ecb17c1959bbf2dbba6ccb8bae6bfabeb8">October 2009</a>, when they switch to a new form: <pre>commit 942d6590d9005f89e971ed5af0374439a264a20e Author: Kai Backman &lt;kaib@golang.org&gt; AuthorDate: Fri Oct 23 11:03:16 2009 -0700 Commit: Kai Backman &lt;kaib@golang.org&gt; CommitDate: Fri Oct 23 11:03:16 2009 -0700 one more argsize fix. we were copying with the correct alignment but not enough (duh). 
R=rsc APPROVED=rsc DELTA=16 (13 added, 0 deleted, 3 changed) OCL=36020 CL=36024 src/cmd/5g/ggen.c | 2 +- test/arm-pass.txt | 17 +++++++++++++++-- 2 files changed, 16 insertions(+), 3 deletions(-) commit b74fd8ecb17c1959bbf2dbba6ccb8bae6bfabeb8 Author: Kai Backman &lt;kaib@golang.org&gt; AuthorDate: Fri Oct 23 12:43:01 2009 -0700 Commit: Kai Backman &lt;kaib@golang.org&gt; CommitDate: Fri Oct 23 12:43:01 2009 -0700 fix build issue cause by transition to hg R=rsc http://go/go-review/1013012 src/make-arm.bash | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) </pre> <p> That commit introduces yet another kind of trailer line, which says <code>http://go/go-review/1013012</code>. At one point that link pointed, inside Google, to an internal version of the Rietveld App Engine app, which we used for code review on the way to submission to a private Mercurial repo on Google Code Project Hosting. We were doing that conversion, in October 2009, as part of the preparation to open-source Go in November. <p> It was at this point, during the conversion to Mercurial, that I introduced the “hello, world” commits. The Subversion and Perforce repos had been Google-internal servers, and we had stored essentially all the Go code in the world alongside the Go implementation. We each had “user directories” like <code>/usr/rsc</code> in the repo. Those directories contained various code that wasn’t going out in the release, whether because it was targeting Google-internal technologies or because it was simply not worth publishing. (You can still see references to <code>/usr</code> in a few commit messages that did make it out.) <p> For the conversion, I ended up with a directory full of patches, one per commit, and a script to run the patches to create a new Mercurial repository. Then I edited the patches (with automated help) to remove the files that weren’t being released, including the entire <code>/usr</code> tree, and to add the new open-source copyright notices to every file. <p> This took me about a week, and it was annoyingly difficult. If I removed a file from some patches but then it got renamed in others, Mercurial would complain about the rename using a file that hadn’t been created in the first place. And if I added a copyright notice when the file was created, I had to be careful to update later patches to not have merge conflicts when changing the top of the file. And so on. I remember thinking that it was like being a con man who constructs an elaborate false identity and struggles to keep it all straight. 
<a class=anchor href="#hello_world"><h2 id="hello_world">Hello, world</h2></a> <p> A month earlier, I had created object file parsing packages like <a href="https://go.dev/pkg/debug/macho">debug/macho</a>, and as test data for each of those packages I <a href="https://go.googlesource.com/go/+/bf69025825fd2b8e7aac01f27d5c974bd30af542">checked in an object file</a> compiled from a trivial “hello, world” program, along with the source code for them: <pre>commit bf69025825fd2b8e7aac01f27d5c974bd30af542 Author: Russ Cox &lt;rsc@golang.org&gt; AuthorDate: Fri Sep 18 11:49:22 2009 -0700 Commit: Russ Cox &lt;rsc@golang.org&gt; CommitDate: Fri Sep 18 11:49:22 2009 -0700 Mach-O file reading R=r DELTA=784 (784 added, 0 deleted, 0 changed) OCL=34715 CL=34788 src/pkg/debug/macho/Makefile | 12 + src/pkg/debug/macho/file.go | 374 +++++++++++++++++++++ src/pkg/debug/macho/file_test.go | 159 +++++++++ src/pkg/debug/macho/macho.go | 230 +++++++++++++ src/pkg/debug/macho/testdata/gcc-386-darwin-exec | Bin 0 -&gt; 12588 bytes src/pkg/debug/macho/testdata/gcc-amd64-darwin-exec | Bin 0 -&gt; 8512 bytes .../macho/testdata/gcc-amd64-darwin-exec-debug | Bin 0 -&gt; 4540 bytes 7 files changed, 775 insertions(+) </pre> <p> This was the original commit that introduced <code>src/pkg/debug/macho/testdata/hello.c</code>, of course. As I added copyright notices to files, it seemed wrong to add a copyright notice to that <code>hello.c</code> file. Instead, since I had the repo split into this patch-file-per-commit form, it was easy to create a few fake commits that showed at least part of the real history of that program, as an Easter egg for people who looked that closely: <pre>commit 7d7c6a97f815e9279d08cfaea7d5efb5e90695a8 Author: Brian Kernighan &lt;bwk&gt; AuthorDate: Tue Jul 18 19:05:45 1972 -0500 Commit: Brian Kernighan &lt;bwk&gt; CommitDate: Tue Jul 18 19:05:45 1972 -0500 hello, world R=ken DELTA=7 (7 added, 0 deleted, 0 changed) src/pkg/debug/macho/testdata/hello.b | 7 +++++++ 1 file changed, 7 insertions(+) commit 0bb0b61d6a85b2a1a33dcbc418089656f2754d32 Author: Brian Kernighan &lt;bwk&gt; AuthorDate: Sun Jan 20 01:02:03 1974 -0400 Commit: Brian Kernighan &lt;bwk&gt; CommitDate: Sun Jan 20 01:02:03 1974 -0400 convert to C R=dmr DELTA=6 (0 added, 3 deleted, 3 changed) src/pkg/debug/macho/testdata/hello.b | 7 ------- src/pkg/debug/macho/testdata/hello.c | 3 +++ 2 files changed, 3 insertions(+), 7 deletions(-) commit 0744ac969119db8a0ad3253951d375eb77cfce9e Author: Brian Kernighan &lt;research!bwk&gt; AuthorDate: Fri Apr 1 02:02:04 1988 -0500 Commit: Brian Kernighan &lt;research!bwk&gt; CommitDate: Fri Apr 1 02:02:04 1988 -0500 convert to Draft-Proposed ANSI C R=dmr DELTA=5 (2 added, 0 deleted, 3 changed) src/pkg/debug/macho/testdata/hello.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) commit d82b11e4a46307f1f1415024f33263e819c222b8 Author: Brian Kernighan &lt;bwk@research.att.com&gt; AuthorDate: Fri Apr 1 02:03:04 1988 -0500 Commit: Brian Kernighan &lt;bwk@research.att.com&gt; CommitDate: Fri Apr 1 02:03:04 1988 -0500 last-minute fix: convert to ANSI C R=dmr DELTA=3 (2 added, 0 deleted, 1 changed) src/pkg/debug/macho/testdata/hello.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) </pre> <p> The <a href="https://go.googlesource.com/go/+/7d7c6a97f815e9279d08cfaea7d5efb5e90695a8">July 1972 commit</a> shows the very first “hello, world” program, copied from Brian Kernighan’s “<a href="https://www.bell-labs.com/usr/dmr/www/btut.html">A Tutorial Introduction to the Language B</a>”: <pre>main( ) 
{ extrn a, b, c; putchar(a); putchar(b); putchar(c); putchar(’!*n’); } a ’hell’; b ’o, w’; c ’orld’; </pre> <p> Various online sources refer to this as a “book”, but it definitely was not. It was a printed, 17-page document that was later included as half of January 1973’s “<a href="https://www.bell-labs.com/usr/dmr/www/bintro.html">Bell Laboratories Computing Science Technical Report #8: The Programming Language B</a>” (still not even close to a book!). The “hello, world” program appears in section 7, preceded by the less catchy “hi!” program in section 6. As I write this blog post in 2022, I cannot find any online reference for the B tutorial being originally dated July 1972, but I believe I must have had a good reason. <p> The <a href="https://go.googlesource.com/go/+/0bb0b61d6a85b2a1a33dcbc418089656f2754d32">January 1974 commit</a> converts the program to C, as found in Kernighan’s “<a href="https://www.bell-labs.com/usr/dmr/www/ctut.pdf">Programming in C — A Tutorial</a>”. The linked PDF is a retyped copy from Dennis Ritchie, without a date, but Ritchie’s “<a href="https://www.bell-labs.com/usr/dmr/www/cman74.pdf">C Reference Manual</a>” technical memo dated January 15, 1974 cites Kernighan’s tutorial as “Unpublished internal memorandum, Bell Laboratories, 1974”, implying that the tutorial must have been written in January as well. That C program was shorter than what we are used to today, but much nicer than the B program: <pre>main( ) { printf("hello, world"); } </pre> <p> I skipped over the presentation in the first edition of <i>The C Programming Language</i>, which looks like: <pre>main() { printf("hello, world\n"); } </pre> <p> The <a href="https://go.googlesource.com/go/+/0744ac969119db8a0ad3253951d375eb77cfce9e">April 1988 commit</a> shows the program from the “Draft-Proposed ANSI C” version of the second edition of <i>The C Programming Language</i>: <pre>#include &lt;stdio.h&gt; main() { printf("hello, world\n"); } </pre> <p> The <a href="https://go.googlesource.com/go/+/d82b11e4a46307f1f1415024f33263e819c222b8">second April 1988 commit</a> shows the final full ANSI C version we know today: <pre>#include &lt;stdio.h&gt; int main(void) { printf("hello, world\n"); return 0; } </pre> <p> Wikipedia says the second edition was published in April 1988. I didn’t have a day, but April 1 seemed appropriate. I didn’t have a time for either one of these, so I used 02:03:04 for the second commit and therefore 02:02:04 for the first. <p> The email addresses in the commits are also period-appropriate, although Mondrian’s <code>R=</code> and <code>DELTA=</code> tags clearly are not: there was no code review happening back then! <p> For what it’s worth, I’ve since heard from many people about how the 1972 dates have broken various presentations and analyses they’ve done on the Go repository as compared to other projects. Oops! My apologies. (I also heard from someone who was analyzing Mercurial repos for “branchiness” and decided to use the Go repo as a large test case. Their program said it had branchiness 0 and they spent a while debugging their program before realizing that the repo really was completely linear, thanks to our “rebase and merge” policy.) <a class=anchor href="#public_mercurial"><h2 id="public_mercurial">Public Mercurial</h2></a> <p> We published the Google Code Mercurial repo on November 10, 2009, and to mark the occasion we added one more Easter egg. 
A footnote in Brian Kernighan and Rob Pike’s 1984 book <i>The Unix Programming Environment</i> says:<blockquote> <p> Ken Thompson was once asked what he would do differently if he were redesigning the <small>UNIX</small> system. His reply: “I’d spell <code>creat</code> with an <code>e</code>.”</blockquote> <p> This refers to the <a href="https://man7.org/linux/man-pages/man2/open.2.html">creat(2) system call</a> and the <code>O_CREAT</code> file open flag. <p> Immediately once the release was public, <a href="https://go.googlesource.com/go/+/c90d392ce3d3203e0c32b3f98d1e68c4c2b4c49b">Ken mailed me a change</a> fixing this mistake: <pre>commit c90d392ce3d3203e0c32b3f98d1e68c4c2b4c49b Author: Ken Thompson &lt;ken@golang.org&gt; AuthorDate: Tue Nov 10 15:05:15 2009 -0800 Commit: Ken Thompson &lt;ken@golang.org&gt; CommitDate: Tue Nov 10 15:05:15 2009 -0800 spell it with an "e" R=rsc http://go/go-review/1025037 src/pkg/os/file.go | 1 + 1 file changed, 1 insertion(+) </pre> <p> Sadly, we didn’t execute the joke perfectly, in that the code review went to our internal Rietveld server instead of the public one. So the <a href="https://go.googlesource.com/go/+/44fb865a484b8e12adfa0a1413eacc807cec085b">very next commit</a> updated the code review server in our configuration. <p> But here, for posterity, is the review, preserved in my email: <p> <img name="creat" class="center pad" width=776 height=1008 src="creat.png" srcset="creat.png 1x, creat@2x.png 2x"> <a class=anchor href="#git"><h2 id="git">Git</h2></a> <p> So things stood from November 2009 until late 2014, when we knew that Google Code Project Hosting was going to shut down and we needed a new home. After investigating a few options, we ended up using <a href="https://www.gerritcodereview.com/">Gerrit Code Review</a>, which has been a fantastic choice. Many people think of Go as being hosted on GitHub, but GitHub is only primary for our issue tracker: the official primary copy of the sources is on <code>go.googlesource.com</code>. <p> You can see the transition from Mercurial to Git in the logs when the <code>R=</code> lines end, followed by a few completely unreviewed commits, and then <code>Reviewed-by:</code> lines begin: <pre>commit 94151eb2799809ece7e44ce3212aa3cbb9520849 Author: Russ Cox &lt;rsc@golang.org&gt; AuthorDate: Fri Dec 5 21:33:07 2014 -0500 Commit: Russ Cox &lt;rsc@golang.org&gt; CommitDate: Fri Dec 5 21:33:07 2014 -0500 encoding/xml: remove SyntaxError.Byte It is unused. It was introduced in the CL that added InputOffset. I suspect it was an editing mistake. LGTM=bradfitz R=bradfitz CC=golang-codereviews https://golang.org/cl/182580043 src/encoding/xml/xml.go | 1 - 1 file changed, 1 deletion(-) commit 258f53dee33b9055ea168cb186f8c076edee5905 Author: David Symonds &lt;dsymonds@golang.org&gt; AuthorDate: Mon Dec 8 13:50:49 2014 +1100 Commit: David Symonds &lt;dsymonds@golang.org&gt; CommitDate: Mon Dec 8 13:50:49 2014 +1100 remove .hgtags. .hgtags | 140 ---------------------------------------------------------------- 1 file changed, 140 deletions(-) commit 369873c6e5d00314ae30276363f58e5af11b149c Author: David Symonds &lt;dsymonds@golang.org&gt; AuthorDate: Mon Dec 8 13:50:49 2014 +1100 Commit: David Symonds &lt;dsymonds@golang.org&gt; CommitDate: Mon Dec 8 13:50:49 2014 +1100 convert .hgignore to .gitignore. 
.hgignore =&gt; .gitignore | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) commit f33fc0eb95be84f0a688a62e25361a117e5b995b Author: David Symonds &lt;dsymonds@golang.org&gt; AuthorDate: Mon Dec 8 13:53:11 2014 +1100 Commit: David Symonds &lt;dsymonds@golang.org&gt; CommitDate: Mon Dec 8 13:53:11 2014 +1100 cmd/dist: convert dist from Hg to Git. src/cmd/dist/build.c | 100 ++++++++++++++++++++++++++++++--------------------- 1 file changed, 59 insertions(+), 41 deletions(-) commit 26399948e3402d3512cb14fe5901afaef54482fa Author: David Symonds &lt;dsymonds@golang.org&gt; AuthorDate: Mon Dec 8 11:39:11 2014 +1100 Commit: David Symonds &lt;dsymonds@golang.org&gt; CommitDate: Mon Dec 8 04:42:22 2014 +0000 add bin/ to .gitignore. Change-Id: I5c788d324e56ca88366fb54b67240cebf5dced2c Reviewed-on: https://go-review.googlesource.com/1171 Reviewed-by: Andrew Gerrand &lt;adg@golang.org&gt; .gitignore | 1 + 1 file changed, 1 insertion(+) </pre> <p> As part of the conversion from Mercurial to Git, we did not add visible lines to the commit bodies showing the old Mercurial hashes, but we did record them in the underlying Git commit objects. For example: <pre>% git cat-file -p 7d7c6a97f815e9279d08cfaea7d5efb5e90695a8 tree e06bd601885e16ad3d72c2a8c9b411889b2e478e author Brian Kernighan &lt;bwk&gt; 80352345 -0500 committer Brian Kernighan &lt;bwk&gt; 80352345 -0500 golang-hg f6182e5abf5eb0c762dddbb18f8854b7e350eaeb hello, world R=ken DELTA=7 (7 added, 0 deleted, 0 changed) % </pre> <p> The <code>golang-hg</code> line records the original Mercurial commit hash. <p> And that’s the end of the story, until we move to a fifth version control system at some point in the future. What NPM Should Do Today To Stop A New Colors Attack Tomorrow tag:research.swtch.com,2012:research.swtch.com/npm-colors 2022-01-10T11:45:00-05:00 2022-01-10T11:47:00-05:00 Stop blindly installing the latest version of all transitive dependencies. <p> Over the weekend, a developer named <a href="https://github.com/marak">Marak Squires</a> intentionally sabotaged his popular NPM package <a href="https://github.com/Marak/colors.js/commit/074a0f8ed0c31c35d13d28632bd8a049ff136fb6">colors</a> and his less popular package <a href="https://github.com/Marak/faker.js/commit/2c4f82f0af819e2bdb2623f0e429754f38c2c2f2">faker</a>. As I write this, NPM claims <a href="https://www.npmjs.com/package/colors">18,971 direct dependents for colors</a> and <a href="https://www.npmjs.com/package/faker">2,751 for faker</a>. Open Source Insights counts at least 42,000 more <a href="https://deps.dev/npm/colors/1.4.0/dependents">indirect dependents for colors</a>. Many popular NPM packages depend on these packages. <p> A misfeature in NPM’s design means that as soon as the sabotaged version of colors was published, fresh installs of command-line tools depending on colors immediately started using it, with no testing that it was in any way compatible with each tool. (Spoiler alert: it wasn’t!) <p> The specific misfeature is that when you use NPM to install a package, including a command-line tool, NPM selects the dependency versions according to the requirements listed in package.json as well as the state of the world at that moment, preferring the newest possible allowed version of every dependency. 
This means that the moment Marak updated colors, installs of aws-cdk and the other tools started breaking, and the bug reports started rolling in, like <a href="https://github.com/aws/aws-cdk/issues/18322">this one</a>: <p> <img name="npm-colors" class="center pad" width=697 height=456 src="npm-colors.png" srcset="npm-colors.png 1x, npm-colors@1.5x.png 1.5x, npm-colors@2x.png 2x, npm-colors@3x.png 3x"> <p> And also these in <a href="https://github.com/apostrophecms/apostrophe/issues/3611">apostrophe</a>, <a href="https://github.com/cdk8s-team/cdk8s-cli/issues/171">cdk8s</a>, <a href="https://github.com/compodoc/compodoc/issues/1171">compodoc</a>, <a href="https://github.com/foreversd/forever/issues/1126">foreversd</a>, <a href="https://github.com/hexojs/hexo/issues/4865">hexo</a>, <a href="https://github.com/highcharts/node-export-server/issues/319">highcharts</a>, <a href="https://github.com/facebook/jest/issues/12226">jest</a>, <a href="https://github.com/netlify/cli/issues/3981">netlify</a>, <a href="https://github.com/oclif/oclif/issues/786">oclif</a>, and more. <p> NPM users may be upset at Marak today, but at least the change didn’t do anything worse than print garbage to the terminal. It could have been worse. A lot worse. Even ignoring this kind of intentional breakage, innocent bugs happen all the time too. Essentially every open source software license points out that the code is made available with no warranty at all. Modern package managers need to be designed to expect and mitigate this risk. <p> Anyone running modern production systems knows about testing followed by <a href="https://sre.google/sre-book/reliable-product-launches/#gradual-and-staged-rollouts-yDsrIPFV">gradual or staged rollouts</a>, in which changes to a running system are deployed gradually over a long period of time, to reduce the possibility of accidentally taking down everything at once. For example, the last time I needed to make a change to Google’s core DNS zone files, the change was tested against many many regression tests and then deployed to each of Google’s four name servers, one at a time, over a period of 24 hours. Regression testing checks that the change does not appear to affect answers it shouldn’t have, and then the gradual rollout gives plenty of time for both automated systems and reliability engineers to notice unexpected problems and stop the rollout. <p> NPM’s design choice is the exact opposite. The latest version of colors was promoted to use in all its dependents before any of them had a chance to test it and without any kind of gradual rollout. Users can disable this behavior today, by pinning the exact versions of all their dependencies. For example here is <a href="https://github.com/aws/aws-cdk/pull/18324/commits/9802d23b0359d3089dadc1b75e20db3b97a09921">the fix to aws-cdk</a>. That’s not a good answer, but at least it’s possible. <p> The right path forward for NPM and package managers like it is to stop preferring the latest possible version of all dependencies when installing a new package. Instead, they should prefer to use the dependency versions that the package was actually tested with, or versions as close as possible to those. I call that a <i><a href="vgo-mvs">high-fidelity build</a></i>. In contrast, the people who installed aws-cdk and other packages over the weekend got low-fidelity builds: NPM inserted a new version of colors that the developers of these other packages had never tested against. 
Users got to test that brand new configuration themselves over the weekend, and the test failed. <p> High-fidelity builds solve both the testing problem and the gradual rollout problem. A new version of colors wouldn’t get picked up by an aws-cdk install until the aws-cdk authors had gotten a chance to test it and push a new version configuration in a new version of aws-cdk. At that point, all new aws-cdk installs would get the new colors, but all the other tools would still be unaffected, until they too tested and officially adopted the new version of colors. <p> There are many ways to produce high-fidelity builds. In Go, a package declares the minimum required version of each dependency, and that’s what the build uses, unless some other constraint in the same build graph requests a newer one. And then, it only uses that specific newer one, not the one that just appeared over the weekend and is entirely untested by anyone. For more about this approach, see “<a href="vgo-principles">The Principles of Versioning in Go</a>.” <p> Package managers don’t have to adopt Go’s approach exactly. It would be enough, for example, to record the versions that the aws-cdk developers used for their testing and then reuse those versions during the install. In fact, NPM can already record those versions, in a lock file. But <code>npm install</code> of a new package does not use the information in that package’s lock file to decide the versions of dependencies: lock files are not transitive. <p> NPM also has an <a href="https://docs.npmjs.com/cli/v8/configuring-npm/npm-shrinkwrap-json"><code>npm shrinkwrap</code></a> command, as well as an <a href="https://docs.npmjs.com/cli/v8/commands/npm-ci"><code>npm ci</code></a> command, both of which appear to fix this problem in certain, limited circumstances. Most of the authors and users of commands affected by the colors sabotage should be looking carefully at those today. Kudos to NPM for providing those, but they shouldn’t be off to the side. The next step is for NPM to arrange for that kind of protection to happen by default. And then the same protection is needed when installing a new library package as a dependency, not just when installing a command. All this will require more work. <p> Other language package managers should take note too. Marak has done all of us a huge favor by highlighting the problems most package managers create with their policy of automatic adoption of new dependencies without the opportunity for gradual rollout or any kind of testing whatsoever. Fixing those problems is long overdue. Next time will be worse. On “Trojan Source” Attacks tag:research.swtch.com,2012:research.swtch.com/trojan 2021-11-01T23:33:00-04:00 2021-11-01T23:35:00-04:00 Code review is the answer to this and many other problems. <p> There is a <a href="https://www.trojansource.codes/trojan-source.pdf">paper making the rounds</a>, with a <a href="https://www.trojansource.codes/">slick accompanying web site</a>, in which the authors describe a software supply chain attack they call “Trojan Source: Invisible Vulnerabilities”. In short, if you use comments containing Unicode LTR and RTL code points, which control whether text is rendered left-to-right or right-to-left, you can make code look different in a standard Unicode rendering than it does to a program ignoring the comments. 
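<p>
For concreteness, here is a minimal sketch (mine, not the paper’s, and not any real tool) of the kind of scan an editor or review tool can run to flag the bidirectional formatting characters these attacks rely on:
<pre>// bidiscan reports Unicode bidirectional formatting characters,
// the raw material of a "Trojan Source" attack, in the named files.
package main

import (
	"fmt"
	"os"
)

// The Unicode bidirectional formatting characters.
var bidi = map[rune]string{
	'\u202A': "LEFT-TO-RIGHT EMBEDDING",
	'\u202B': "RIGHT-TO-LEFT EMBEDDING",
	'\u202C': "POP DIRECTIONAL FORMATTING",
	'\u202D': "LEFT-TO-RIGHT OVERRIDE",
	'\u202E': "RIGHT-TO-LEFT OVERRIDE",
	'\u2066': "LEFT-TO-RIGHT ISOLATE",
	'\u2067': "RIGHT-TO-LEFT ISOLATE",
	'\u2068': "FIRST STRONG ISOLATE",
	'\u2069': "POP DIRECTIONAL ISOLATE",
}

func main() {
	for _, file := range os.Args[1:] {
		data, err := os.ReadFile(file)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		// Ranging over a string yields byte offsets and decoded runes.
		for offset, r := range string(data) {
			if name, ok := bidi[r]; ok {
				fmt.Printf("%s: offset %d: U+%04X %s\n", file, offset, r, name)
			}
		}
	}
}
</pre>
<p>
As the rest of this post argues, this kind of check belongs in the tools that render code for us, not in every compiler.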
<p> The authors claim this is “a new type of attack” that “cannot be perceived directly by human code reviewers” and “pose[s] an immediate threat”, and they propose that compilers should be “upgraded to block this attack.” None of this is true. <a class=anchor href="#old"><h2 id="old">The attack is old</h2></a> <p> First, the attacks are not new. Here is a <a href="https://golang.org/issue/20209">Go example from 2017</a>, a <a href="https://twitter.com/eggleroy/status/1006150385067855872">Rust example from 2018</a>, a <a href="https://twitter.com/zygoloid/status/1187150150835195905">C++ example from 2019</a>, and a <a href="https://twitter.com/jupenur/status/1244286243518713857">Ruby example from 2020</a>. <a class=anchor href="#visible"><h2 id="visible">The change is visible</h2></a> <p> Second, it is technically true that the attack cannot be perceived directly by human code reviewers, but only in the sense that no program in a computer can be perceived directly by humans (except of course the <a href="https://xkcd.com/378/">real programmers using magnetized needles and a steady hand</a>). Instead, we build tools that convert the electrical charges or magnetized particles that represent our programs into some more recognizable form, like a <a href="https://xkcd.com/722/">pattern of lights</a>. And of course we have complete control over the exact conversion those programs use and therefore exactly what they enable us to perceive, a fact I will return to later. <p> (Also, as <a href="https://news.ycombinator.com/item?id=29062322">noted on Hacker News</a>, any editor that syntax highlights comments will suffice to detect the examples in the paper!) <a class=anchor href="#threat_is_not_immediate"><h2 id="threat_is_not_immediate">The threat is not immediate</h2></a> <p> Third, the attacks do not create an immediate threat, at least not a new one. The attack model in the paper assumes you take changes to your code from untrusted people. Not everyone does this! And the people who do take those changes are usually not reading the code carefully enough that this attack is necessary. In the <a href="https://blog.npmjs.org/post/180565383195/details-about-the-event-stream-incident.html">event-stream incident</a>, for example, many people focus on event-stream being hijacked by social engineering, while very few people focus on the fact that the malicious code was hidden by only including it in a “minimized” JavaScript file, a fact the Copay authors missed when upgrading their dependencies, if they looked at them at all. Perhaps most importantly, if you are letting untrusted people make arbitrary changes to your code, they can probably find <a href="https://freedom-to-tinker.com/2013/09/20/software-transparency-debian-openssl-bug/">far more subtle</a> (and deniable!) ways to hide a malicious change than using LTR/RTL code points. <a class=anchor href="#compilers"><h2 id="compilers">The compiler is the wrong place for a fix</h2></a> <p> Finally, compilers should not be “upgraded” to block the attack. Fixing compilers closes only one of many open doors. What about assemblers? What about build and configuration file parsers? What about general YAML parsers? What about every other possible program that in some way executes the content of a text file that allows comments or quoted string literals? <p> For example, the Rust project issued 1.56.1 today, changing <code>rustc</code> to disallow LTR/RTL code points in Rust source files. 
But as far as I can tell there has been no update to <code>cargo</code>, meaning that a similar attack can probably still be carried out using comments and string literals in <code>Cargo.toml</code>. <p> Considering build configuration files alone, closing all the doors would mean updating the programs associated with <code>BUILD.bazel</code>, <code>CMakefile</code>, <code>Cargo.toml</code>, <code>Dockerfile</code>, <code>GNUmakefile</code>, <code>Makefile</code>, <code>go.mod</code>, <code>package.json</code>, <code>pom.xml</code>, <code>requirements.txt</code>, and many, many more. And again, that’s just build configuration. It would be an unbounded amount of work to require every program that interprets a text file to be changed to exclude a list of certain Unicode code points. <p> Even excluding “bad” Unicode code points is pretty complicated. The next problem is <a href="https://github.com/golang/go/issues/20115">homograph attacks</a>, using characters that look the same but aren’t. <p> Changing compilers also arguably provides a false sense of security. It’s <a href="https://research.swtch.com/loopsnooper">impossible in general</a> for the compiler to tell whether code is good or bad, for almost any definition of good or bad. Even if the compiler rejects LTR/RTL code points, it can’t reject the many more subtle ways an attacker might introduce an unnoticed change. <a class=anchor href="#review"><h2 id="review">The answer is proper review</h2></a> <p> Instead of changing compilers, we should take advantage of the fact that we must use tools to render our programs for us—tools like text editors and code review web sites. We should make sure we are using those tools and also make sure their renderers make suspicious Unicode use easier to see. There are far fewer of those tools than there are programs that interpret or execute content from text files: most development efforts will have just one review tool and a few common editors. <p> Making LTR/RTL code points visible is particularly important for review tools, and it is good that <a href="https://github.blog/changelog/2021-10-31-warning-about-bidirectional-unicode-text/">GitHub now warns about these characters in PRs</a>. On the other hand, the blog post’s suggestion to view UTF-8 files in VSCode as CP 437 to “see” Unicode sequences is a kludge. Hopefully VSCode will add a proper warning and visible indicators in UTF-8 mode soon too. <a class=anchor href="#summary"><h2 id="summary">Summary</h2></a> <p> In summary: This is an old, already known bug. The source changes are hardly invisible: many programs can see them today, and no doubt more will be able to see them tomorrow. There is no new threat here: people who can modify the code you run can do a lot of damage already that not enough people watch for and that can be made arbitrarily subtle without Unicode tricks. Making the fix in every compiler is both a lot of work and not nearly good enough: it omits all the other programs that execute instructions from text files. Instead, the right answer to Unicode tricks is proper code review and dependency hygiene, which will handle a much broader class of problems that fully includes this one, especially using review tools that highlight suspicious Unicode, which isn’t limited to LTR/RTL attacks. <p> The authors of this paper have clearly done a good job promoting it. Kudos to them on that. But I am concerned that the attention and response this paper is getting is in general distracting from far more useful security efforts.
We should redirect that attention and response at improving general code and dependency review instead. <p> The most egregious example of breathless coverage of this relatively minor issue is Brian Krebs’s headline: <a href="https://krebsonsecurity.com/2021/11/trojan-source-bug-threatens-the-security-of-all-code/">‘Trojan Source’ Bug Threatens the Security of All Code</a>. Imagine the writeup if he’d attended <a href="https://dl.acm.org/doi/pdf/10.1145/358198.358210">Ken Thompson’s Turing Award Lecture</a>. <i>That’s</i> an invisible vulnerability! (For the record, back in 2013, Ken told me that the backdoored compiler did in fact exist and was deployed, but it had a subtle bug that exposed the backdoor: the reproduction code added an extra <code>\0</code> to the string each time it copied itself. This made the compiler binary grow by one byte every time it rebuilt itself. Eventually someone noticed and debugged what was going on, aided by the fact that the “print assembly” compilation mode did not insert the backdoor at all, and all traces were cleaned away. Or so we believe.) How Many Go Developers Are There? tag:research.swtch.com,2012:research.swtch.com/gophercount 2017-07-13T13:00:00-04:00 2021-08-05T10:02:00-04:00 Around two million. <p> How many Go developers are there in the world? <i>Around two million.</i> <p> As of August 2021, my best estimate is between 1.6 and 2.5 million. <p> Previously:<br> In July 2017, I estimated between half a million and a million.<br> In July 2018, I estimated between 0.8 and 1.6 million. <br> In November 2019, I estimated between 1.15 and 1.96 million. <a class=anchor href="#approach"><h2 id="approach">Approach</h2></a> <p> My approach is to compute: <div class=fig> <center> <i>Number of Go Developers &nbsp;&nbsp;</i> = &nbsp;&nbsp; <i>Number of Software Developers</i> &nbsp; × &nbsp; <i>Fraction using Go</i> </center> </div> <p> Then we need to answer how many software developers there are in the world and what percentage of them are using Go. <a class=anchor href="#number_of_software_developers"><h2 id="number_of_software_developers">Number of Software Developers</h2></a> <p> How many software developers are there in the world? <p> In January 2014, <a href="https://www.infoq.com/news/2014/01/IDC-software-developers">InfoQ reported</a> that IDC published a report (no longer available online, it would seem) estimating that there were 11,005,000 “professional software developers” and 7,534,500 “hobbyist software developers,” giving a total estimate of 18,539,500. <p> In October 2016, Evans Data Corporation issued a <a href="https://evansdata.com/press/viewRelease.php?pressID=244">press release</a> advertising their “<a href="https://web.archive.org/web/20160730050104/http://www.evansdata.com/reports/viewRelease.php?reportID=9">Global Developer Population and Demographic Study 2016</a>” in which they estimated the total worldwide population of software developers to be 21 million. <p> Maybe the Evans estimate is too high. The details of their methodology are key to their business and therefore not revealed publicly, so we can’t easily tell how strict or loose their definition of developer is. 
In January 2017, PK of the DQYDJ blog posted an analysis titled “<a href="https://dqydj.com/number-of-developers-in-america-and-per-state/">How Many Developers are There in America, and Where Do They Live?</a>” That post, which includes an admirably detailed methodology section, used data from the 2016 American Community Survey (ACS) and included these employment categories as “strict” software developers: <ul> <li> Computer Scientists and Systems Analysts / Network Systems Analysts / Web Developers <li> Computer Programmers <li> Software Developers, Applications and Systems Software <li> Database Administrators</ul> <p> Using that list, PK arrived at a total of 3,357,626 software developers in the United States. The post then added two less strict categories, which expanded the total to 4,185,114 software developers. A conservative estimate of the number of software developers worldwide would probably include only PK’s “strict” category, about 80% of the United States total. If we assume conservatively that the Evans estimate was similarly loose and that the 80% ratio holds worldwide, we can make the Evans estimate stricter by multiplying it by the same 80%, arriving at 16.8 million. <p> Maybe the Evans estimate is too low. In May 2017, RedMonk blogger <a href="http://redmonk.com/jgovernor/2017/05/26/just-how-many-darned-developers-are-there-in-the-world-github-is-puzzled/">James Governor reported</a> that, in a recent speech, GitHub CEO Chris Wanstrath claimed GitHub has 21 million active users and GitHub’s Atom editor has 2 million active users (who haven’t turned off metrics and tracking) and concluded that the IDC and Evans estimates are therefore too low. Governor went on to give a “wild assed guess” of 35 million developers worldwide. <p> <i>Based on all this (and ignoring wild-assed guesses), my estimate in July 2017 was that the number of software developers worldwide was likely to be in the range 16.8–21 million.</i> <p> The <a href="https://evansdata.com/press/viewRelease.php?pressID=268">2018 Evans Data Global Developer Population and Demographic Study</a> estimated 23 million developers worldwide, up from 21 million in the 2016 survey. <p> <i>Based on the confidence range in 2017 being 16.8–21 million, my estimate in July 2018 applied the Evans Data 10% growth to arrive at 18.4–23 million developers in 2018.</i> <p> The <a href="https://evansdata.com/press/viewRelease.php?pressID=278">2019 Evans Data Global Development Survey</a> estimated 23.9 million developers worldwide in May 2019, up from 23 million in the 2018 survey. <p> The <a href="https://www.idc.com/getdoc.jsp?containerId=US44363318">IDC Worldwide Developer Census, 2018</a> estimated 22.30 million developers worldwide, up from <a href="https://www.idc.com/getdoc.jsp?containerId=US44389218">21.25 million in 2017</a>. <p> SlashData’s <a href="https://slashdata-website-cms.s3.amazonaws.com/sample_reports/EiWEyM5bfZe1Kug_.pdf">Global Developer Population 2019</a> estimates 14.7 million developers in Q2 2017, 15.7 million in Q4 2017, 16.9 million in Q2 2018, and 18.9 million in Q4 2018, extrapolating to “at least 21M developers by the end of 2019 and possibly upwards of 23M.” Oddly, SlashData’s <a href="https://www.slashdata.co/free-resources/state-of-the-developer-nation-17th-edition">Developer Economics State of the Developer Nation 17th Edition</a> estimated “18 million active software developers in the world” as of Q2 2019.
It is unclear if this was rounded down from the 18.9, if SlashData believes the number of developers decreased in the first half of 2019, or if the numbers are completely unrelated. <p> <i>Based on these, my estimate in November 2019 is that the number of developers worldwide is likely to be in the range 18.9–23.9 million developers.</i> <p> The Evans Data Corporation Worldwide Developer Population and Demographic Study 2020, Volume 2 estimated 24.5 million developers worldwide, with lower growth than predicted attributed to the pandemic. <p> The <a href="https://www.idc.com/getdoc.jsp?containerId=US47513521">IDC Worldwide Developer Census, 2020</a> estimated 26.2 million developers worldwide in January 2020. <p> SlashData’s <a href="https://www.slashdata.co/free-resources/developer-economics-state-of-the-developer-nation-20th-edition">Developer Economics State of the Developer Nation 20th Edition</a> estimated 24.3 million developers worldwide. <p> These estimates seem to be converging, but to be conservative, I'm only going to move the low end halfway from the previous 18.9 to the 24.3 here, producing 21.6 million. <p> <i>Based on these, my estimate in August 2021 is that the number of developers worldwide is likely to be in the range 21.6–26.2 million developers.</i> <a class=anchor href="#fraction_using_go"><h2 id="fraction_using_go">Fraction using Go</h2></a> <p> What fraction of software developers use Go? <p> Stack Overflow has been running an annual developer survey for the past few years. In their <a href="https://insights.stackoverflow.com/survey/2017#technology">2017 survey</a>, 4.2% of all respondents and 4.6% of professional developers reported using Go. Unfortunately, we cannot sanity check Go against the year before, because the <a href="https://insights.stackoverflow.com/survey/2016#technology">2016 survey report</a> cut off the list of popular technologies after Objective-C (6.5%). <p> O’Reilly has been running annual software developer salary surveys for the past few years as well, and their survey asks about language use. The <a href="https://www.oreilly.com/ideas/2016-software-development-salary-survey-report#id-kAiqtmuE">2016 worldwide survey</a> reports that Go is used by 3.9% of respondents, while the <a href="https://www.oreilly.com/ideas/2016-european-software-development-salary-survey#id-4xiji1CA">2016 European survey</a> reports Go is used by 3.3% of respondents. (I derived both these numbers by measuring the bars in the graphs.) The <a href="https://www.oreilly.com/ideas/2017-software-development-salary-survey#tools">2017 worldwide survey</a> reports in commentary that 4.5% of respondents say they use Go. <p> Maybe the 4.2–4.6% estimate is too high. Both of these are online surveys with a danger of self-selection bias among respondents, and that developers who don’t answer online surveys use Go much less than those who do. For example, perhaps the surveys are skewed by geography or experience in a way that affects the result. Suppose, I think quite pessimistically, that the survey respondents are only representative of half of the software developer population, and that in the other half, Go is only half as popular. Then if a survey found 4.2% of developers use Go, the real percentage would be three quarters as much, or 3.15%. 
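<p>
Spelling out that arithmetic: half the population at the surveyed rate plus half at half that rate gives 0.5 × 4.2% + 0.5 × 2.1% = 2.1% + 1.05% = 3.15%, which is three quarters of the surveyed rate.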
<p> <i>Based on all this, my estimate in July 2017 was that the fraction of software developers using Go worldwide is at least 3% and possibly as high as 4.6%.</i> <p> The <a href="https://insights.stackoverflow.com/survey/2018/#most-popular-technologies">2018 Stack Overflow Developer Survey</a> reports that 7.1% of all developers and 7.2% of professional developers use Go. Applying the same pessimism as last year (multiplying by three quarters) suggests a lower bound of 5.3%, but I’ll be even more conservative and use last year’s 4.6% as a lower bound. (The O’Reilly surveys seem to have stopped being run.) <p> <i>Based on this, my estimate in July 2018 was that 4.6–7.1% of developers use Go.</i> <p> The <a href="https://insights.stackoverflow.com/survey/2019/#most-popular-technologies">2019 Stack Overflow Developer Survey</a> reports that 8.2% of all developers and 8.8% of professional developers use Go (compared with 7.1% and 7.2% in 2017). <p> The <a href="https://research.hackerrank.com/developer-skills/2019#skills2">HackerRank 2019 Developer Skills Report</a> reports that 8.8% of respondents know Go at the end of 2018 (compared with 6.08% at the end of 2017). <p> SlashData’s <a href="https://www.slashdata.co/free-resources/state-of-the-developer-nation-17th-edition">Developer Economics State of the Developer Nation 17th Edition</a> estimated 1.1 million active Go software developers, out of 18 million worldwide, or 6.15%. (That happens to match my usual “three quarters of Stack Overflow” conservative lower bound exactly.) <p> The <a href="https://www.jetbrains.com/lp/devecosystem-2019/">JetBrains Developer Ecosystem Survey 2019</a> reports that 18% of developers used Go in 2018. That number seems too high, so I will discount it for now. <p> Evans Data Corporation’s Global Development Survey reports even higher (and harder to believe) percentages for Go usage in the non-public part of the survey. <p> <i>Based on all this, my estimate in November 2019 is that 6.1–8.2% of developers use Go.</i> <p> The <a href="https://insights.stackoverflow.com/survey/2020#technology-programming-scripting-and-markup-languages-all-respondents">2020 Stack Overflow Developer Survey</a> reports that 8.8% of all developers and 9.4% of professional developers use Go. <p> The <a href="https://insights.stackoverflow.com/survey/2020#technology-programming-scripting-and-markup-languages-all-respondents">2021 Stack Overflow Developer Survey</a> reports that 9.55% of all developers and 10.51% of professional developers use Go. <p> SlashData’s <a href="https://www.slashdata.co/free-resources/developer-economics-state-of-the-developer-nation-20th-edition">Developer Economics State of the Developer Nation 20th Edition</a> estimated 2.1 million active Go software developers, out of 24.3 million worldwide, or 8.6%. <p> The <a href="https://www.jetbrains.com/lp/devecosystem-2021/#Main_programming-languages">JetBrains Developer Ecosystem Survey 2021</a> reports that 17% of developers used Go in 2020. That number still seems too high, so I will continue to discount it. <p> Evans Data Corporation’s Global Development Survey 2021 continues to report an even higher (and harder to believe) percentage for Go usage: 22.2% of developers spending at least 10% of their programming time using Go, roughly on par with Kotlin. 
<p> Like in the total developer population, the believable numbers seem to be converging, but to be conservative, I'm only going to move the low end halfway from the previous 6.1 to the current 8.6, producing 7.4%. <p> <i>Based on all this, my estimate in August 2021 is that 7.4–9.5% of developers use Go.</i> <a class=anchor href="#number_of_go_developers"><h2 id="number_of_go_developers">Number of Go Developers</h2></a> <p> How many Go developers are there? Multiply the low developer count and Go percentages and the high ones. <p> In July 2017, 3–4.6% of 16.8–21 million yielded an estimate of 0.50–0.97 million Go developers. <p> In July 2018, 4.6–7.1% of 18.4–23 million yielded an estimate of 0.85–1.63 million Go developers. <p> In November 2019, 6.1–8.2% of 18.9–23.9 million yields an estimate of 1.15–1.96 million Go developers. <p> In August 2021, 7.4–9.5% of 21.6–26.2 million yields an estimate of 1.6–2.5 million Go developers. Updating the Go Memory Model tag:research.swtch.com,2012:research.swtch.com/gomm 2021-07-12T09:45:00-04:00 2021-07-12T09:47:00-04:00 What changes should we make to Go's memory model? (Memory Models, Part 3) <p> <img name="gophers-racing" class="center pad resizable" width=2286 height=374 src="gophers-racing.jpg"> <p> The current Go language memory model was written in 2009, with minor updates since. It is clear that there are at least a few details that we should add to the current memory model, among them an explicit endorsement of race detectors and a clear statement of how the APIs in <code>sync/atomic</code> synchronize programs. <p> This post restates Go's overall philosophy and the current memory model and then outlines the relatively small adjustments I believe that we should make to the Go memory model. It assumes the background presented in the earlier posts “<a href="hwmm">Hardware Memory Models</a>” and “<a href="plmm">Programming Language Memory Models</a>.” <p> I have opened a <a href="https://golang.org/s/mm-discuss">GitHub discussion</a> to collect feedback on the ideas presented here. Based on that feedback, I intend to prepare a formal Go proposal later this month. The use of GitHub discussions is itself a bit of an experiment, continuing to try to find a <a href="https://research.swtch.com/proposals-discuss">reasonable way to scale discussions of important changes</a>. <a class=anchor href="#gos_design_philosophy"><h2 id="gos_design_philosophy">Go's Design Philosophy</h2></a> <p> Go aims to be a programming environment for building practical, efficient systems. It aims to be lightweight for small projects but also scale up gracefully to large projects and large engineering teams. <p> Go encourages approaching concurrency at a high level, in particular through communication. The first <a href="https://go-proverbs.github.io/">Go proverb</a> is “Don't communicate by sharing memory. Share memory by communicating.” Another popular proverb is that “Clear is better than clever.” In other words, Go encourages avoiding subtle bugs by avoiding subtle code. <p> Go aims not just for understandable programs but also for an understandable language and understandable package APIs. Complex or subtle language features or APIs contradict that goal. 
As Tony Hoare said in his <a href="https://www.cs.fsu.edu/~engelen/courses/COP4610/hoare.pdf">1980 Turing award lecture</a>:<blockquote> <p> I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are <i>obviously</i> no deficiencies and the other way is to make it so complicated that there are no <i>obvious</i> deficiencies. <p> The first method is far more difficult. It demands the same skill, devotion, insight, and even inspiration as the discovery of the simple physical laws which underlie the complex phenomena of nature. It also requires a willingness to accept objectives which are limited by physical, logical, and technological constraints, and to accept a compromise when conflicting objectives cannot be met.</blockquote> <p> This aligns pretty well with Go's philosophy for APIs. We typically spend a long time during design to make sure an API is right, working to reduce it to its minimal, most useful essence. <p> Another aspect of Go being a useful programming environment is having well-defined semantics for the most common programming mistakes, which aids both understandability and debugging. This idea is hardly new. Quoting Tony Hoare again, this time from his 1972 <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.4380020202">“Quality of Software” checklist</a>:<blockquote> <p> As well as being very simple to use, a software program must be very difficult to misuse; it must be kind to programming errors, giving clear indication of their occurrence, and never becoming unpredictable in its effects.</blockquote> <p> The common sense of having well-defined semantics for buggy programs is not as common as one might expect. In C/C++, undefined behavior has evolved into a kind of compiler writer's <i>carte blanche</i> to turn slightly buggy programs into very differently buggy programs, in ever more interesting ways. Go does not take this approach: there is no “undefined behavior.” In particular, bugs like null pointer dereferences, integer overflow, and unintentional infinite loops all have defined semantics in Go. <a class=anchor href="#gos_memory_model_today"><h2 id="gos_memory_model_today">Go's Memory Model Today</h2></a> <p> <a href="https://golang.org/ref/mem">Go's memory model</a> begins with the following advice, consistent with Go's overall philosophy:<blockquote> <p> Programs that modify data being simultaneously accessed by multiple goroutines must serialize such access. <p> To serialize access, protect the data with channel operations or other synchronization primitives such as those in the <code>sync</code> and <code>sync/atomic</code> packages. <p> If you must read the rest of this document to understand the behavior of your program, you are being too clever. <p> Don't be clever.</blockquote> <p> This remains good advice. The advice is also consistent with other languages' encouragement of DRF-SC: synchronize to eliminate data races, and then programs will behave as if sequentially consistent, leaving no need to understand the remainder of the memory model. <p> After this advice, the Go memory model defines a conventional happens-before-based definition of racing reads and writes. Like in Java and JavaScript, a read in Go can observe any earlier but not yet overwritten write in the happens-before order, or any racing write; arranging to have only one such write forces a specific outcome. 
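<p>
For example, here is a minimal sketch of forcing a specific outcome, using the channel send/receive rule listed just below (the variable names are mine):
<pre>// Without the channel, the read of msg in main races with the write
// in setup and could observe either "" or "hello, world".
// The send on done happens before the receive completes, and the
// write to msg happens before the send, so by the time main reads
// msg there is only one visible write: the program must print
// "hello, world".
package main

import "fmt"

var msg string
var done = make(chan bool)

func setup() {
	msg = "hello, world"
	done &lt;- true
}

func main() {
	go setup()
	&lt;-done
	fmt.Println(msg)
}
</pre>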
<p> The memory model then goes on to define the synchronization operations that establish cross-goroutine happens-before edges. The operations are the usual ones, with some Go-specific flavoring: <ul> <li> If a package <code>p</code> imports package <code>q</code>, the completion of <code>q</code>'s <code>init</code> functions happens before the start of any of <code>p</code>'s. <li> The start of the function <code>main.main</code> happens after all <code>init</code> functions have finished. <li> The <code>go</code> statement that starts a new goroutine happens before the goroutine's execution begins. <li> A send on a channel happens before the corresponding receive from that channel completes. <li> The closing of a channel happens before a receive that returns a zero value because the channel is closed. <li> A receive from an unbuffered channel happens before the send on that channel completes. <li> The <i>k</i>'th receive on a channel with capacity <i>C</i> happens before the <i>k</i>+<i>C</i>'th send from that channel completes. <li> For any <code>sync.Mutex</code> or <code>sync.RWMutex</code> variable <code>l</code> and <i>n</i> &lt; <i>m</i>, call <i>n</i> of <code>l.Unlock()</code> happens before call <i>m</i> of <code>l.Lock()</code> returns. <li> A single call of <code>f()</code> from <code>once.Do(f)</code> happens (returns) before any call of <code>once.Do(f)</code> returns.</ul> <p> This list notably omits any mention of <code>sync/atomic</code> as well as newer APIs in package <code>sync</code>. <p> The memory model ends with some examples of incorrect synchronization. It contains no examples of incorrect compilation. <a class=anchor href="#changes_to_gos_memory_model"><h2 id="changes_to_gos_memory_model">Changes to Go's Memory Model</h2></a> <p> In 2009, as we set out to write Go's memory model, the Java memory model was newly revised, and the C/C++11 memory model was being finalized. We were strongly encouraged by some to adopt the C/C++11 model, taking advantage of all the work that had gone into it. That seemed risky to us. Instead, we decided on a more conservative approach to what guarantees we would make, a decision confirmed by the subsequent decade of papers detailing very subtle problems in the Java/C/C++ line of memory models. Defining enough of a memory model to guide programmers and compiler writers is important, but defining one in complete formality—correctly!—seems still just beyond the grasp of the most talented researchers. It should suffice for Go to continue to say the minimum needed to be useful. <p> This section lists the adjustments I believe we should make. As noted earlier, I have opened a <a href="https://golang.org/s/mm-discuss">GitHub discussion</a> to collect feedback. Based on that feedback, I plan to prepare a formal Go proposal later this month. <a class=anchor href="#document_gos_overall_approach"><h3 id="document_gos_overall_approach">Document Go's overall approach</h3></a> <p> The “don't be clever” advice is important and should stay, but we also need to say more about Go's overall approach before diving into the details of happens-before. I have seen multiple incorrect summaries of Go's approach, such as claiming that Go's model is C/C++'s “DRF-SC or Catch Fire.” Misreadings are understandable: the document doesn't say what the approach is, and it is so short (and the material so subtle) that people see what they expect to see rather than what is or is not there.
<p> The text to be added would be something along the lines of:<blockquote> <a class=anchor href="#overview"><h3 id="overview">Overview</h3></a> <p> Go approaches its memory model in much the same way as the rest of the language, aiming to keep the semantics simple, understandable, and useful. <p> A <i>data race</i> is defined as a write to a memory location happening concurrently with another read or write to that same location, unless all the accesses involved are atomic data accesses as provided by the <code>sync/atomic</code> package. As noted already, programmers are strongly encouraged to use appropriate synchronization to avoid data races. In the absence of data races, Go programs behave as if all the goroutines were multiplexed onto a single processor. This property is sometimes referred to as DRF-SC: data-race-free programs execute in a sequentially consistent manner. <p> Other programming languages typically take one of two approaches to programs containing data races. The first, exemplified by C and C++, is that programs with data races are invalid: a compiler may break them in arbitrarily surprising ways. The second, exemplified by Java and JavaScript, is that programs with data races have defined semantics, limiting the possible impact of a race and making programs more reliable and easier to debug. Go's approach sits between these two. Programs with data races are invalid in the sense that an implementation may report the race and terminate the program. But otherwise, programs with data races have defined semantics with a limited number of outcomes, making errant programs more reliable and easier to debug.</blockquote> <p> This text should make clear how Go is and is not like other languages, correcting any prior expectations on the part of the reader. <p> At the end of the “Happens Before” section, we should also clarify that certain races can still lead to corruption. It currently ends with:<blockquote> <p> Reads and writes of values larger than a single machine word behave as multiple machine-word-sized operations in an unspecified order.</blockquote> <p> We should add:<blockquote> <p> Note that this means that races on multiword data structures can lead to inconsistent values not corresponding to a single write. When the values depend on the consistency of internal (pointer, length) or (pointer, type) pairs, as is the case for interface values, maps, slices, and strings in most Go implementations, such races can in turn lead to arbitrary memory corruption.</blockquote> <p> This will more clearly state the limitations of the guarantees on programs with data races. <a class=anchor href="#document_happens-before_for_sync_libraries"><h3 id="document_happens-before_for_sync_libraries">Document happens-before for sync libraries</h3></a> <p> New APIs have been added to the <a href="https://golang.org/pkg/sync"><code>sync</code> package</a> since the memory model was written. We need to add them to the memory model (<a href="https://golang.org/issue/7948">issue #7948</a>). Thankfully, the additions seem straightforward. I believe they are as follows. <ul> <li> <p> For <a href="https://golang.org/pkg/sync/#Cond"><code>sync.Cond</code></a>: <code>Broadcast</code> or <code>Signal</code> happens before the return of any <code>Wait</code> call that it unblocks. <li> <p> For <a href="https://golang.org/pkg/sync/#Map"><code>sync.Map</code></a>: <code>Load</code>, <code>LoadAndDelete</code>, and <code>LoadOrStore</code> are read operations. 
<code>Delete</code>, <code>LoadAndDelete</code>, and <code>Store</code> are write operations. <code>LoadOrStore</code> is a write operation when it returns <code>loaded</code> set to <code>false</code>. A write operation happens before any read operation that observes the effect of the write. <li> <p> For <a href="https://golang.org/pkg/sync/#Pool"><code>sync.Pool</code></a>: A call to <code>Put(x)</code> happens before a call to <code>Get</code> returning that same value <code>x</code>. Similarly, a call to <code>New</code> returning <code>x</code> happens before a call to <code>Get</code> returning that same value <code>x</code>. <li> <p> For <a href="https://golang.org/pkg/sync/#WaitGroup"><code>sync.WaitGroup</code></a>: A call to <code>Done</code> happens before the return of any <code>Wait</code> call that it unblocks.</ul> <p> Users of these APIs need to know the guarantees in order to use them effectively. Therefore, while we should keep the text in the memory model for illustrative purposes, we should also include it in the doc comments in package <code>sync</code>. This will also help set an example for third-party synchronization primitives of the importance of documenting the ordering guarantees established by an API. <a class=anchor href="#document_happens-before_for_sync/atomic"><h3 id="document_happens-before_for_sync/atomic">Document happens-before for sync/atomic</h3></a> <p> Atomic operations are missing from the memory model. We need to add them (<a href="https://golang.org/issue/5045">issue #5045</a>). I believe we should say:<blockquote> <p> The APIs in the <code>sync/atomic</code> package are collectively “atomic operations” that can be used to synchronize the execution of different goroutines. If the effect of an atomic operation A is observed by atomic operation B, then A happens before B. All the atomic operations executed in a program behave as though executed in some sequentially consistent order.</blockquote> <p> This is <a href="https://github.com/golang/go/issues/5045#issuecomment-66076297">what Dmitri Vyukov suggested in 2013</a> and <a href="https://github.com/golang/go/issues/5045#issuecomment-252730563">what I informally promised in 2016</a>. It also has the same semantics as Java's <code>volatile</code>s and C++'s default atomics. <p> In terms of the C/C++ menu, there are only two choices for synchronizing atomics: sequentially consistent or acquire/release. (Relaxed atomics do not create happens-before edges and therefore have no synchronizing effect.) The decision between those comes down to, first, how important it is to be able to reason about the relative order of atomic operations on multiple locations, and, second, how much more expensive sequentially consistent atomics are compared to acquire/release atomics. <p> On the first consideration, reasoning about the relative order of atomic operations on multiple locations is very important. In an earlier post I gave an <a href="plmm#cond">example of a condition variable with a lock-free fast path</a> implemented using two atomic variables, broken by using acquire/release atomics. This pattern appears again and again. For example, a past implementation of <code>sync.WaitGroup</code> used <a href="https://go.googlesource.com/go/+/ee6e1a3ff77a41eff5a606a5aa8c46bf8b571a13/src/pkg/sync/waitgroup.go#54">a pair of atomic <code>uint32</code> values</a>, <code>wg.counter</code> and <code>wg.waiters</code>. 
The <a href="https://go.googlesource.com/go/+/cf148f3d468f4d0648e7fc6d2858d2afdc37f70d/src/runtime/sema.go#134">Go runtime implementation of semaphores</a> also depends on two separate atomic words, namely the semaphore value <code>*addr</code> and the corresponding waiter count <code>root.nwait</code>. There are more. In the absence of sequentially consistent semantics (that is, if we instead adopt acquire/release semantics), people will still write code like this; it will just fail mysteriously, and only in certain contexts. <p> The fundamental problem is that using acquire/release atomics to make a program data-race-free does not result in a program that behaves in a sequentially consistent manner, because the atomics themselves don't. That is, such programs do not provide DRF-SC. This makes such programs very difficult to reason about and therefore difficult to write correctly. <p> On the second consideration, as noted in the earlier post, <a href="plmm#sc">hardware designers are starting to provide direct support for sequentially consistent atomics</a>. For example, ARMv8 adds the <code>ldar</code> and <code>stlr</code> instructions for implementing sequentially consistent atomics, and they are also the <a href="https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html">recommended implementation of acquire/release atomics</a>. If we adopted acquire/release semantics for <code>sync/atomic</code>, programs written on ARMv8 would be getting sequential consistency anyway. This would undoubtedly lead to programs that rely on the stronger ordering accidentally, breaking on weaker platforms. This may even happen on a single architecture, if the difference between acquire/release and sequentially consistent atomics is difficult to observe in practice due to race windows being small. <p> Both considerations strongly suggest we should adopt sequentially consistent atomics over acquire/release atomics: sequentially consistent atomics are more useful, and some chips have already completely closed the gap between the two levels. Presumably others will do the same if the gap is significant. <p> The same considerations, along with Go's overall philosophy of having minimal, easily understood APIs, argue against providing acquire/release as an additional, parallel set of APIs. It seems best to provide only the most understandable, most useful, least misusable set of atomic operations. <p> Another possibility would be to provide raw barriers instead of atomic operations. (C++ provides both, of course.) Barriers have the drawback of making expectations less clear and being somewhat more architecture-specific. Hans Boehm's page “<a href="http://www.hboehm.info/c++mm/ordering_integrated.html">Why atomics have integrated ordering constraints</a>” presents the arguments for providing atomics instead of barriers (he uses the term fences). Generally, the atomics are far easier to understand than fences, and since we already provide atomic operations today, we can't easily remove them. Better to have one mechanism than two. <a class=anchor href="#maybe"><h3 id="maybe">Maybe: Add a typed API to sync/atomic</h3></a> <p> The definitions above say that when a particular piece of memory must be accessed concurrently by multiple goroutines without other synchronization, the only way to eliminate the race is to make all the accesses use atomics. It is not enough to make only some of the accesses use atomics. 
For example, a non-atomic write concurrent with atomic reads or writes is still a race, and so is an atomic write concurrent with non-atomic reads or writes. <p> Whether a particular value should be accessed with atomics is therefore a property of the value and not of a particular access. Because of this, most languages put this information in the type system, like Java's <code>volatile int</code> and C++'s <code>atomic&lt;int&gt;</code>. Go's current APIs do not, meaning that correct usage requires careful annotation of which fields of a struct or global variables are expected to only be accessed using atomic APIs. <p> To improve program correctness, I am starting to think that Go should define a set of typed atomic values, analogous to the current <code>atomic.Value</code>: <code>Bool</code>, <code>Int</code>, <code>Uint</code>, <code>Int32</code>, <code>Uint32</code>, <code>Int64</code>, <code>Uint64</code>, and <code>Uintptr</code>. Like <code>Value</code>, these would have <code>CompareAndSwap</code>, <code>Load</code>, <code>Store</code>, and <code>Swap</code> methods. For example: <pre>type Int32 struct { v int32 }

func (i *Int32) Add(delta int32) int32 {
    return AddInt32(&amp;i.v, delta)
}

func (i *Int32) CompareAndSwap(old, new int32) (swapped bool) {
    return CompareAndSwapInt32(&amp;i.v, old, new)
}

func (i *Int32) Load() int32 {
    return LoadInt32(&amp;i.v)
}

func (i *Int32) Store(v int32) {
    StoreInt32(&amp;i.v, v)
}

func (i *Int32) Swap(new int32) (old int32) {
    return SwapInt32(&amp;i.v, new)
}
</pre> <p> I've included <code>Bool</code> on the list because we have constructed atomic booleans out of atomic integers multiple times in the Go standard library (in unexported APIs). There is clearly a need. <p> We could also take advantage of upcoming generics support and define an API for atomic pointers that is typed and free of package <code>unsafe</code> in its API: <pre>type Pointer[T any] struct { v *T }

func (p *Pointer[T]) CompareAndSwap(old, new *T) (swapped bool) {
    return CompareAndSwapPointer(... lots of unsafe ...)
}
</pre> <p> (And so on.) To answer an obvious suggestion, I don't see a clean way to use generics to provide just a single <code>atomic.Atomic[T]</code> that would let us avoid introducing <code>Bool</code>, <code>Int</code>, and so on as separate types, at least not without special cases in the compiler. And that's okay. <a class=anchor href="#maybe-unsync"><h3 id="maybe-unsync">Maybe: Add unsynchronized atomics</h3></a> <p> All other modern programming languages provide a way to make concurrent memory reads and writes that do not synchronize the program but also don't invalidate it (don't count as a data race). C, C++, Rust, and Swift have relaxed atomics. Java has <a href="https://docs.oracle.com/javase/9/docs/api/java/lang/invoke/VarHandle.html"><code>VarHandle</code>'s “plain” mode</a>. JavaScript has non-atomic accesses to the <code>SharedArrayBuffer</code> (the only shared memory). Go has no way to do this. Perhaps it should. I don't know. <p> If we wanted to add unsynchronized atomic reads and writes, we could add <code>UnsyncAdd</code>, <code>UnsyncCompareAndSwap</code>, <code>UnsyncLoad</code>, <code>UnsyncStore</code>, and <code>UnsyncSwap</code> methods to the typed atomics. Naming them “unsync” avoids a few problems with the name “relaxed.” First, some people use relaxed as a relative comparison, as in “acquire/release is a more relaxed memory order than sequential consistency.” You can argue that's not proper usage of the term, but it happens. 
Second, and more important, the critical detail about these operations is not the memory ordering of the operations themselves but the fact that they have <i>no effect</i> on the synchronization of the rest of the program. To people who are not experts in memory models, seeing <code>UnsyncLoad</code> should make clear that there is no synchronization, whereas <code>RelaxedLoad</code> probably would not. It's also nice that <code>Unsync</code> looks at a glance like <code>Unsafe</code>. <p> With the API out of the way, the real question is whether to add these at all. The usual argument for providing an unsynchronized atomic is that it really does matter for the performance of fast paths in certain data structures. My general impression is that it matters most on non-x86 architectures, although I don't have data to back this up. Not providing an unsynchronized atomic can be argued to penalize those architectures. <p> A possible argument against providing an unsynchronized atomic is that on x86, ignoring the effect of potential compiler reorderings, unsynchronized atomics are indistinguishable from acquire/release atomics. They might therefore be abused to write code that only works on x86. The counterargument is that such subterfuge would not pass muster with the race detector, which implements the actual memory model and not the x86 memory model. <p> Given the lack of evidence we have today, we would not be justified in adding this API. If anyone feels strongly that we should add it, the way to make the case would be to gather evidence of both (1) general applicability in code that programmers need to write, and (2) significant performance improvements on widely used systems arising from using non-synchronizing atomics. (It would be fine to show this using programs in languages other than Go.) <a class=anchor href="#document_disallowed_compiler_optimizations"><h3 id="document_disallowed_compiler_optimizations">Document disallowed compiler optimizations</h3></a> <p> The current memory model ends by giving examples of invalid programs. Since the memory model serves as a contract between the programmer and the compiler writers, we should add examples of invalid compiler optimizations. For example, we might add:<blockquote> <a class=anchor href="#incorrect_compilation"><h3 id="incorrect_compilation">Incorrect compilation</h3></a> <p> The Go memory model restricts compiler optimizations as much as it does Go programs. Some compiler optimizations that would be valid in single-threaded programs are not valid in Go programs. In particular, a compiler must not introduce a data race in a race-free program. It must not allow a single read to observe multiple values. And it must not allow a single write to write multiple values. <p> Not introducing data races into race-free programs means not moving reads or writes out of conditional statements in which they appear. For example, a compiler must not invert the conditional in this program: <pre>i := 0
if cond {
    i = *p
}
</pre> <p> That is, the compiler must not rewrite the program into this one: <pre>i := *p
if !cond {
    i = 0
}
</pre> <p> If <code>cond</code> is false and another goroutine is writing <code>*p</code>, then the original program is race-free but the rewritten program contains a race. <p> Not introducing data races also means not assuming that loops terminate. 
For example, a compiler must not move the accesses to <code>*p</code> or <code>*q</code> ahead of the loop in this program: <pre>n := 0
for e := list; e != nil; e = e.next {
    n++
}
i := *p
*q = 1
</pre> <p> If <code>list</code> pointed to a cyclic list, then the original program would never access <code>*p</code> or <code>*q</code>, but the rewritten program would. <p> Not introducing data races also means not assuming that called functions always return or are free of synchronization operations. For example, a compiler must not move the accesses to <code>*p</code> or <code>*q</code> ahead of the function call in this program (at least not without direct knowledge of the precise behavior of <code>f</code>): <pre>f()
i := *p
*q = 1
</pre> <p> If the call never returned, then once again the original program would never access <code>*p</code> or <code>*q</code>, but the rewritten program would. And if the call contained synchronizing operations, then the original program could establish happens-before edges preceding the accesses to <code>*p</code> and <code>*q</code>, but the rewritten program would not. <p> Not allowing a single read to observe multiple values means not reloading local variables from shared memory. For example, a compiler must not spill <code>i</code> and reload it a second time from <code>*p</code> in this program: <pre>i := *p
if i &lt; 0 || i &gt;= len(funcs) {
    panic("invalid function index")
}
... complex code ...
// compiler must NOT reload i = *p here
funcs[i]()
</pre> <p> If the complex code needs many registers, a compiler for single-threaded programs could discard <code>i</code> without saving a copy and then reload <code>i</code> <code>=</code> <code>*p</code> just before <code>funcs[i]()</code>. A Go compiler must not, because the value of <code>*p</code> may have changed. (Instead, the compiler could spill <code>i</code> to the stack.) <p> Not allowing a single write to write multiple values also means not using the memory where a local variable will be written as temporary storage before the write. For example, a compiler must not use <code>*p</code> as temporary storage in this program: <pre>*p = i + *p/2
</pre> <p> That is, it must not rewrite the program into this one: <pre>*p /= 2
*p += i
</pre> <p> If <code>i</code> and <code>*p</code> start equal to 2, the original code does <code>*p</code> <code>=</code> <code>3</code>, so a racing thread can read only 2 or 3 from <code>*p</code>. The rewritten code does <code>*p</code> <code>=</code> <code>1</code> and then <code>*p</code> <code>=</code> <code>3</code>, allowing a racing thread to read 1 as well. <p> Note that all these optimizations are permitted in C/C++ compilers: a Go compiler sharing a back end with a C/C++ compiler must take care to disable optimizations that are invalid for Go.</blockquote> <p> These categories and examples cover the most common C/C++ compiler optimizations that are incompatible with defined semantics for racing data accesses. They establish clearly that Go and C/C++ have different requirements. <a class=anchor href="#conclusion"><h2 id="conclusion">Conclusion</h2></a> <p> Go's general approach of being conservative in its memory model has served us well and should be continued. There are, however, a few changes that are overdue, including defining the synchronization behavior of new APIs in the <code>sync</code> and <code>sync/atomic</code> packages. 
The atomics in particular should be documented to provide sequentially consistent behavior that creates happens-before edges synchronizing the non-atomic code around them. This would match the default atomics provided by all other modern systems languages. <p> Perhaps the most distinctive part of the update is the idea of clearly stating that programs with data races may be stopped to report the race but otherwise have well-defined semantics. This constrains both programmers and compilers, and it prioritizes the debuggability and correctness of concurrent programs over convenience for compiler writers. <a class=anchor href="#acknowledgements"><h2 id="acknowledgements">Acknowledgements</h2></a> <p> This series of posts benefited greatly from discussions with and feedback from a long list of engineers I am lucky to work with at Google. My thanks to them. I take full responsibility for any mistakes or unpopular opinions. Programming Language Memory Models tag:research.swtch.com,2012:research.swtch.com/plmm 2021-07-06T10:50:00-04:00 2021-07-06T10:52:00-04:00 An introduction to programming language memory models. (Memory Models, Part 2) <p> Programming language memory models answer the question of what behaviors parallel programs can rely on to share memory between their threads. For example, consider this program in a C-like language, where both <code>x</code> and <code>done</code> start out zeroed. <pre>// Thread 1           // Thread 2
x = 1;                while(done == 0) { /* loop */ }
done = 1;             print(x);
</pre> <p> The program attempts to send a message in <code>x</code> from thread 1 to thread 2, using <code>done</code> as the signal that the message is ready to be received. If thread 1 and thread 2, each running on its own dedicated processor, both run to completion, is this program guaranteed to finish and print 1, as intended? The programming language memory model answers that question and others like it. <p> Although each programming language differs in the details, a few general answers are true of essentially all modern multithreaded languages, including C, C++, Go, Java, JavaScript, Rust, and Swift: <ul> <li> First, if <code>x</code> and <code>done</code> are ordinary variables, then thread 2's loop may never stop. A common compiler optimization is to load a variable into a register at its first use and then reuse that register for future accesses to the variable, for as long as possible. If thread 2 copies <code>done</code> into a register before thread 1 executes, it may keep using that register for the entire loop, never noticing that thread 1 later modifies <code>done</code>. <li> Second, even if thread 2's loop does stop, having observed <code>done</code> <code>==</code> <code>1</code>, it may still print that <code>x</code> is 0. Compilers often reorder program reads and writes based on optimization heuristics or even just the way hash tables or other intermediate data structures end up being traversed while generating code. The compiled code for thread 1 may end up writing to <code>x</code> after <code>done</code> instead of before, or the compiled code for thread 2 may end up reading <code>x</code> before the loop.</ul> <p> Given how broken this program is, the obvious question is how to fix it. <p> Modern languages provide special functionality, in the form of <i>atomic variables</i> or <i>atomic operations</i>, to allow a program to synchronize its threads. 
If we make <code>done</code> an atomic variable (or manipulate it using atomic operations, in languages that take that approach), then our program is guaranteed to finish and to print 1. Making <code>done</code> atomic has many effects: <ul> <li> The compiled code for thread 1 must make sure that the write to <code>x</code> completes and is visible to other threads before the write to <code>done</code> becomes visible. <li> The compiled code for thread 2 must (re)read <code>done</code> on every iteration of the loop. <li> The compiled code for thread 2 must read from <code>x</code> after the reads from <code>done</code>. <li> The compiled code must do whatever is necessary to disable hardware optimizations that might reintroduce any of those problems.</ul> <p> The end result of making <code>done</code> atomic is that the program behaves as we want, successfully passing the value in <code>x</code> from thread 1 to thread 2. <p> In the original program, after the compiler's code reordering, thread 1 could be writing <code>x</code> at the same moment that thread 2 was reading it. This is a <i>data race</i>. In the revised program, the atomic variable <code>done</code> serves to synchronize access to <code>x</code>: it is now impossible for thread 1 to be writing <code>x</code> at the same moment that thread 2 is reading it. The program is <i>data-race-free</i>. In general, modern languages guarantee that data-race-free programs always execute in a sequentially consistent way, as if the operations from the different threads were interleaved, arbitrarily but without reordering, onto a single processor. This is the <a href="hwmm#drf">DRF-SC property from hardware memory models</a>, adopted in the programming language context. <p> As an aside, these atomic variables or atomic operations would more properly be called “synchronizing atomics.” It's true that the operations are atomic in the database sense, allowing simultaneous reads and writes which behave as if run sequentially in some order: what would be a race on ordinary variables is not a race when using atomics. But it's even more important that the atomics synchronize the rest of the program, providing a way to eliminate races on the non-atomic data. The standard terminology is plain “atomic”, though, so that's what this post uses. Just remember to read “atomic” as “synchronizing atomic” unless noted otherwise. <p> The programming language memory model specifies the exact details of what is required from programmers and from compilers, serving as a contract between them. The general features sketched above are true of essentially all modern languages, but it is only recently that things have converged to this point: in the early 2000s, there was significantly more variation. Even today there is significant variation among languages on second-order questions, including: <ul> <li> What are the ordering guarantees for atomic variables themselves? <li> Can a variable be accessed by both atomic and non-atomic operations? <li> Are there synchronization mechanisms besides atomics? <li> Are there atomic operations that don't synchronize? <li> Do programs with races have any guarantees at all?</ul> <p> After some preliminaries, the rest of this post examines how different languages answer these and related questions, along with the paths they took to get there. The post also highlights the many false starts along the way, to emphasize that we are still very much learning what works and what does not. 
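<p> Before moving on, here is one way the fixed program might look in Go (a sketch of mine, not an example from the original text), using the <code>sync/atomic</code> package for <code>done</code>; Go's atomics are intended to be synchronizing atomics in exactly this sense: <pre>package main

import "sync/atomic"

var (
    x    int32
    done int32 // accessed only with atomic operations
)

func main() {
    go func() { // thread 1
        x = 1
        atomic.StoreInt32(&amp;done, 1) // synchronizing write
    }()
    for atomic.LoadInt32(&amp;done) == 0 { // thread 2: done is re-read every iteration
        /* loop */
    }
    println(x) // prints 1: the write to x is visible once done == 1 is observed
}
</pre>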
<a class=anchor href="#hw"><h2 id="hw">Hardware, Litmus Tests, Happens Before, and DRF-SC</h2></a> <p> Before we get to details of any particular language, a brief summary of lessons from <a href="hwmm">hardware memory models</a> that we will need to keep in mind. <p> Different architectures allow different amounts of reordering of instructions, so that code running in parallel on multiple processors can have different allowed results depending on the architecture. The gold standard is <a href="hwmm#sc">sequential consistency</a>, in which any execution must behave as if the programs executed on the different processors were simply interleaved in some order onto a single processor. That model is much easier for developers to reason about, but no significant architecture provides it today, because of the performance gains enabled by weaker guarantees. <p> It is difficult to make completely general statements comparing different memory models. Instead, it can help to focus on specific test cases, called <i>litmus tests</i>. If two memory models allow different behaviors for a given litmus test, this proves they are different and usually helps us see whether, at least for that test case, one is weaker or stronger than the other. For example, here is the litmus test form of the program we examined earlier:<blockquote> <p> <i>Litmus Test: Message Passing</i><br> Can this program see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 r1 = y y = 1 r2 = x </pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): no.<br> On ARM/POWER: <i>yes!</i> <br> In any modern compiled language using ordinary variables: <i>yes!</i></blockquote> <p> As in the previous post, we assume every example starts with all shared variables set to zero. The name <code>r</code><i>N</i> denotes private storage like a register or function-local variable; the other names like <code>x</code> and <code>y</code> are distinct, shared (global) variables. We ask whether a particular setting of registers is possible at the end of an execution. When answering the litmus test for hardware, we assume that there's no compiler to reorder what happens in the thread: the instructions in the listings are directly translated to assembly instructions given to the processor to execute. <p> The outcome <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code> corresponds to the original program's thread 2 finishing its loop (<code>done</code> is <code>y</code>) but then printing 0. This result is not possible in any sequentially consistent interleaving of the program operations. For an assembly language version, printing 0 is not possible on x86, although it is possible on more relaxed architectures like ARM and POWER due to reordering optimizations in the processors themselves. In a modern language, the reordering that can happen during compilation makes this outcome possible no matter what the underlying hardware. <p> Instead of guaranteeing sequential consistency, as we mentioned earlier, today's processors guarantee a property called <a href="hwmm#drf">“data-race-free sequential-consistency”, or DRF-SC</a> (sometimes also written SC-DRF). A system guaranteeing DRF-SC must define specific instructions called <i>synchronizing instructions</i>, which provide a way to coordinate different processors (equivalently, threads). 
Programs use those instructions to create a “happens before” relationship between code running on one processor and code running on another. <p> For example, here is a depiction of a short execution of a program on two threads; as usual, each is assumed to be on its own dedicated processor: <p> <img name="mem-adve-4" class="center pad" width=208 height=147 src="mem-adve-4.png" srcset="mem-adve-4.png 1x, mem-adve-4@2x.png 2x, mem-adve-4@3x.png 3x, mem-adve-4@4x.png 4x"> <p> We saw this program in the previous post too. Thread 1 and thread 2 execute a synchronizing instruction S(a). In this particular execution of the program, the two S(a) instructions establish a happens-before relationship from thread 1 to thread 2, so the W(x) in thread 1 happens before the R(x) in thread 2. <p> Two events on different processors that are <i>not</i> ordered by happens-before might occur at the same moment: the exact order is unclear. We say they execute <i>concurrently</i>. A data race is when a write to a variable executes concurrently with a read or another write of that same variable. Processors that provide DRF-SC (all of them, these days) guarantee that programs <i>without</i> data races behave as if they were running on a sequentially consistent architecture. This is the fundamental guarantee that makes it possible to write correct multithreaded assembly programs on modern processors. <p> As we saw earlier, DRF-SC is also the fundamental guarantee that modern languages have adopted to make it possible to write correct multithreaded programs in higher-level languages. <a class=anchor href="#compilers"><h2 id="compilers">Compilers and Optimizations</h2></a> <p> We have mentioned a couple times that compilers might reorder the operations in the input program in the course of generating the final executable code. Let's take a closer look at that claim and at other optimizations that might cause problems. <p> It is generally accepted that a compiler can reorder ordinary reads from and writes to memory almost arbitrarily, provided the reordering cannot change the observed single-threaded execution of the code. For example, consider this program: <pre>w = 1
x = 2
r1 = y
r2 = z
</pre> <p> Since <code>w</code>, <code>x</code>, <code>y</code>, and <code>z</code> are all different variables, these four statements can be executed in any order deemed best by the compiler. <p> As we noted above, the ability to reorder reads and writes so freely makes the guarantees of ordinary compiled programs at least as weak as the ARM/POWER relaxed memory model, since compiled programs fail the message passing litmus test. In fact, the guarantees for compiled programs are weaker. <p> In the hardware post, we looked at coherence as an example of something that ARM/POWER architectures do guarantee:<blockquote> <p> <i>Litmus Test: Coherence</i><br> Can this program see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>2</code>, <code>r3</code> <code>=</code> <code>2</code>, <code>r4</code> <code>=</code> <code>1</code>?<br> (Can Thread 3 see <code>x</code> <code>=</code> <code>1</code> before <code>x</code> <code>=</code> <code>2</code> while Thread 4 sees the reverse?)</blockquote> <pre>// Thread 1    // Thread 2    // Thread 3    // Thread 4
x = 1          x = 2          r1 = x         r3 = x
                              r2 = x         r4 = x
</pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): no.<br> On ARM/POWER: no. 
<br> In any modern compiled language using ordinary variables: <i>yes!</i></blockquote> <p> All modern hardware guarantees coherence, which can also be viewed as sequential consistency for the operations on a single memory location. In this program, one of the writes must overwrite the other, and the entire system has to agree about which is which. It turns out that, because of program reordering during compilation, modern languages do not even provide coherence. <p> Suppose the compiler reorders the two reads in thread 4, and then the instructions run as if interleaved in this order: <pre>// Thread 1    // Thread 2    // Thread 3    // Thread 4
                                             // (reordered)
(1) x = 1                     (2) r1 = x     (3) r4 = x
               (4) x = 2      (5) r2 = x     (6) r3 = x
</pre> <p> The result is <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>2</code>, <code>r3</code> <code>=</code> <code>2</code>, <code>r4</code> <code>=</code> <code>1</code>, which was impossible in the assembly programs but possible in high-level languages. In this sense, programming language memory models are all weaker than the most relaxed hardware memory models. <p> But there are some guarantees. Everyone agrees on the need to provide DRF-SC, which disallows optimizations that introduce new reads or writes, even if those optimizations would have been valid in single-threaded code. <p> For example, consider this code: <pre>if(c) {
    x++;
} else {
    ... lots of code ...
}
</pre> <p> There’s an <code>if</code> statement with lots of code in the <code>else</code> and only an <code>x++</code> in the <code>if</code> body. It might be cheaper to have fewer branches and eliminate the <code>if</code> body entirely. We can do that by running the <code>x++</code> before the <code>if</code> and then adjusting with an <code>x--</code> in the big else body if we were wrong. That is, the compiler might consider rewriting that code to: <pre>x++;
if(!c) {
    x--;
    ... lots of code ...
}
</pre> <p> Is this a safe compiler optimization? In a single-threaded program, yes. In a multithreaded program in which <code>x</code> is shared with another thread when <code>c</code> is false, no: the optimization would introduce a race on <code>x</code> that was not present in the original program. <p> This example is derived from one in Hans Boehm's 2004 paper, “<a href="https://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf">Threads Cannot Be Implemented As a Library</a>,” which makes the case that languages cannot be silent about the semantics of multithreaded execution. <p> The programming language memory model is an attempt to precisely answer these questions about which optimizations are allowed and which are not. By examining the history of attempts at writing these models over the past couple decades, we can learn what worked and what didn't, and get a sense of where things are headed. <a class=anchor href="#java96"><h2 id="java96">Original Java Memory Model (1996)</h2></a> <p> Java was the first mainstream language to try to write down what it guaranteed to multithreaded programs. It included mutexes and defined the memory ordering requirements they implied. It also included “volatile” atomic variables: all the reads and writes of volatile variables were required to execute in program order directly in main memory, making the operations on volatile variables behave in a sequentially consistent manner. Finally, Java also specified (or at least attempted to specify) the behavior of programs with data races. 
One part of this was to mandate a form of coherence for ordinary variables, which we will examine more below. Unfortunately, this attempt, in the first edition of the <a href="http://titanium.cs.berkeley.edu/doc/java-langspec-1.0.pdf"><i>Java Language Specification</i> (1996)</a>, had at least two serious flaws. They are easy to explain with the benefit of hindsight and using the preliminaries we've already set down. At the time, they were far less obvious. <a class=anchor href="#atomics"><h3 id="atomics">Atomics need to synchronize</h3></a> <p> The first flaw was that volatile atomic variables were non-synchronizing, so they did not help eliminate races in the rest of the program. The Java version of the message passing program we saw above would be: <pre>int x;
volatile int done;

// Thread 1           // Thread 2
x = 1;                while(done == 0) { /* loop */ }
done = 1;             print(x);
</pre> <p> Because <code>done</code> is declared volatile, the loop is guaranteed to finish: the compiler cannot cache it in a register and cause an infinite loop. However, the program is not guaranteed to print 1. The compiler was not prohibited from reordering the accesses to <code>x</code> and <code>done</code>, nor was it required to prohibit the hardware from doing the same. <p> Because Java volatiles were non-synchronizing atomics, you could not use them to build new synchronization primitives. In this sense, the original Java memory model was too weak. <a class=anchor href="#coherence"><h3 id="coherence">Coherence is incompatible with compiler optimizations</h3></a> <p> The original Java memory model was also too strong: mandating coherence—once a thread had read a new value of a memory location, it could not appear to later read the old value—disallowed basic compiler optimizations. Earlier we looked at how reordering reads would break coherence, but you might think, well, just don't reorder reads. Here's a more subtle way coherence might be broken by another optimization: common subexpression elimination. <p> Consider this Java program: <pre>// p and q may or may not point at the same object.
int i = p.x;
// ... maybe another thread writes p.x at this point ...
int j = q.x;
int k = p.x;
</pre> <p> In this program, common subexpression elimination would notice that <code>p.x</code> is computed twice and optimize the final line to <code>k = i</code>. But if <code>p</code> and <code>q</code> pointed to the same object and another thread wrote to <code>p.x</code> between the reads into <code>i</code> and <code>j</code>, then reusing the old value <code>i</code> for <code>k</code> violates coherence: the read into <code>i</code> saw an old value, the read into <code>j</code> saw a newer value, but then the read into <code>k</code> reusing <code>i</code> would once again see the old value. Not being able to optimize away redundant reads would hobble most compilers, making the generated code slower. <p> Coherence is easier for hardware to provide than for compilers because hardware can apply dynamic optimizations: it can adjust the optimization paths based on the exact addresses involved in a given sequence of memory reads and writes. In contrast, compilers can only apply static optimizations: they have to write out, ahead of time, an instruction sequence that will be correct no matter what addresses and values are involved. 
In the example, the compiler cannot easily change what happens based on whether <code>p</code> and <code>q</code> happen to point to the same object, at least not without writing out code for both possibilities, leading to significant time and space overheads. The compiler's incomplete knowledge about the possible aliasing between memory locations means that actually providing coherence would require giving up fundamental optimizations. <p> Bill Pugh identified this and other problems in his 1999 paper “<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.7914&rep=rep1&type=pdf">Fixing the Java Memory Model</a>.” <a class=anchor href="#java04"><h2 id="java04">New Java Memory Model (2004)</h2></a> <p> Because of these problems, and because the original Java Memory Model was difficult even for experts to understand, Pugh and others started an effort to define a new memory model for Java. That model became JSR-133 and was adopted in Java 5.0, released in 2004. The canonical reference is “<a href="http://rsim.cs.uiuc.edu/Pubs/popl05.pdf">The Java Memory Model</a>” (2005), by Jeremy Manson, Bill Pugh, and Sarita Adve, with additional details in <a href="https://drum.lib.umd.edu/bitstream/handle/1903/1949/umi-umd-1898.pdf;jsessionid=4A616CD05E44EA7D47B6CF4A91B6F70D?sequence=1">Manson's Ph.D. thesis</a>. The new model follows the DRF-SC approach: Java programs that are data-race-free are guaranteed to execute in a sequentially consistent manner. <a class=anchor href="#sync"><h3 id="sync">Synchronizing atomics and other operations</h3></a> <p> As we saw earlier, to write a data-race-free program, programmers need synchronization operations that can establish happens-before edges to ensure that one thread does not write a non-atomic variable concurrently with another thread reading or writing it. In Java, the main synchronization operations are: <ul> <li> The creation of a thread happens before the first action in the thread. <li> An unlock of mutex <i>m</i> happens before any subsequent lock of <i>m</i>. <li> A write to volatile variable <i>v</i> happens before any subsequent read of <i>v</i>.</ul> <p> What does “subsequent” mean? Java defines that all lock, unlock, and volatile variable accesses behave as if they occurred in some sequentially consistent interleaving, giving a total order over all those operations in the entire program. “Subsequent” means later in that total order. That is: the total order over lock, unlock, and volatile variable accesses defines the meaning of subsequent, then subsequent defines which happens-before edges are created by a particular execution, and then the happens-before edges define whether that particular execution had a data race. If there is no race, then the execution behaves in a sequentially consistent manner. 
<p> The fact that the volatile accesses must act as if in some total ordering means that in the <a href="hwmm#x86">store buffer litmus test</a>, you can’t end up with <code>r1</code> <code>=</code> <code>0</code> and <code>r2</code> <code>=</code> <code>0</code>:<blockquote> <p> <i>Litmus Test: Store Buffering</i> <br> Can this program see <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>0</code>?</blockquote> <pre>// Thread 1    // Thread 2
x = 1          y = 1
r1 = y         r2 = x
</pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): <i>yes!</i><br> On ARM/POWER: <i>yes!</i><br> On Java using volatiles: no.</blockquote> <p> In Java, for volatile variables <code>x</code> and <code>y</code>, the reads and writes cannot be reordered: one write has to come second, and the read that follows the second write must see the first write. If we didn’t have the sequentially consistent requirement—if, say, volatiles were only required to be coherent—the two reads could miss the writes. <p> There is an important but subtle point here: the total order over all the synchronizing operations is separate from the happens-before relationship. It is <i>not</i> true that there is a happens-before edge in one direction or the other between every lock, unlock, or volatile variable access in a program: you only get a happens-before edge from a write to a read that observes the write. For example, a lock and unlock of different mutexes have no happens-before edges between them, nor do volatile accesses of different variables, even though collectively these operations must behave as if following a single sequentially consistent interleaving. <a class=anchor href="#racy"><h3 id="racy">Semantics for racy programs</h3></a> <p> DRF-SC only guarantees sequentially consistent behavior to programs with no data races. The new Java memory model, like the original, defined the behavior of racy programs, for a number of reasons: <ul> <li> To support Java’s general security and safety guarantees. <li> To make it easier for programmers to find mistakes. <li> To make it harder for attackers to exploit problems, because the damage possible due to a race is more limited. <li> To make it clearer to programmers what their programs do.</ul> <p> Instead of relying on coherence, the new model reused the happens-before relation (already used to decide whether a program had a race at all) to decide the outcome of racing reads and writes. <p> The specific rules for Java are that for word-sized or smaller variables, a read of a variable (or field) <i>x</i> must see the value stored by some single write to <i>x</i>. A write <i>w</i> to <i>x</i> can be observed by a read <i>r</i> provided <i>r</i> does not happen before <i>w</i>. That means <i>r</i> can observe writes that happen before <i>r</i> (but that aren’t also overwritten before <i>r</i>), and it can observe writes that race with <i>r</i>. <p> Using happens-before in this way, combined with synchronizing atomics (volatiles) that could establish new happens-before edges, was a major improvement over the original Java memory model. It provided more useful guarantees to the programmer, and it made a large number of important compiler optimizations definitively allowed. This work is still the memory model for Java today. That said, it’s also still not quite right: there are problems with this use of happens-before for trying to define the semantics of racy programs. 
<a class=anchor href="#incoherence"><h3 id="incoherence">Happens-before does not rule out incoherence</h3></a> <p> The first problem with happens-before for defining program semantics has to do with coherence (again!). (The following example is taken from Jaroslav Ševčík and David Aspinall's paper, “<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.1790&rep=rep1&type=pdf">On the Validity of Program Transformations in the Java Memory Model</a>” (2007).) <p> Here’s a program with three threads. Let’s assume that Thread 1 and Thread 2 are known to finish before Thread 3 starts. <pre>// Thread 1 // Thread 2 // Thread 3 lock(m1) lock(m2) x = 1 x = 2 unlock(m1) unlock(m2) lock(m1) lock(m2) r1 = x r2 = x unlock(m2) unlock(m1) </pre> <p> Thread 1 writes <code>x</code> <code>=</code> <code>1</code> while holding mutex <code>m1</code>. Thread 2 writes <code>x</code> <code>=</code> <code>2</code> while holding mutex <code>m2</code>. Those are different mutexes, so the two writes race. However, only thread 3 reads <code>x</code>, and it does so after acquiring both mutexes. The read into <code>r1</code> can read either write: both happen before it, and neither definitively overwrites the other. By the same argument, the read into <code>r2</code> can read either write. But strictly speaking, nothing in the Java memory model says the two reads have to agree: technically, <code>r1</code> and <code>r2</code> can be left having read different values of <code>x</code>. That is, this program can end with <code>r1</code> and <code>r2</code> holding different values. Of course, no real implementation is going to produce different <code>r1</code> and <code>r2</code>. Mutual exclusion means there are no writes happening between those two reads. They have to get the same value. But the fact that the memory model <i>allows</i> different reads shows that it is, in a certain technical way, not precisely describing real Java implementations. <p> The situation gets worse. What if we add one more instruction, <code>x</code> <code>=</code> <code>r1</code>, between the two reads: <pre>// Thread 1 // Thread 2 // Thread 3 lock(m1) lock(m2) x = 1 x = 2 unlock(m1) unlock(m2) lock(m1) lock(m2) r1 = x x = r1 // !? r2 = x unlock(m2) unlock(m1) </pre> <p> Now, clearly the <code>r2</code> <code>=</code> <code>x</code> read must use the value written by <code>x</code> <code>=</code> <code>r1</code>, so the program must get the same values in <code>r1</code> and <code>r2</code>. The two values <code>r1</code> and <code>r2</code> are now guaranteed to be equal. <p> The difference between these two programs means we have a problem for compilers. A compiler that sees <code>r1</code> <code>=</code> <code>x</code> followed by <code>x</code> <code>=</code> <code>r1</code> may well want to delete the second assignment, which is “clearly” redundant. But that “optimization” changes the second program, which must see the same values in <code>r1</code> and <code>r2</code>, into the first program, which technically can have <code>r1</code> different from <code>r2</code>. Therefore, according to the Java Memory Model, this optimization is technically invalid: it changes the meaning of the program. To be clear, this optimization would not change the meaning of Java programs executing on any real JVM you can imagine. But somehow the Java Memory Model doesn’t allow it, suggesting there’s more that needs to be said. <p> For more about this example and others, see Ševčík and Aspinall's paper. 
<a class=anchor href="#acausality"><h2 id="acausality">Happens-before does not rule out acausality</h2></a> <p> That last example turns out to have been the easy problem. Here’s a harder problem. Consider this litmus test, using ordinary (not volatile) Java variables:<blockquote> <p> <i>Litmus Test: Racy Out Of Thin Air Values</i> <br> Can this program see <code>r1</code> <code>=</code> <code>42</code>, <code>r2</code> <code>=</code> <code>42</code>?</blockquote> <pre>// Thread 1 // Thread 2 r1 = x r2 = y y = r1 x = r2 </pre> <blockquote> <p> (Obviously not!)</blockquote> <p> All the variables in this program start out zeroed, as always, and then this program effectively runs <code>y</code> <code>=</code> <code>x</code> in one thread and <code>x</code> <code>=</code> <code>y</code> in the other thread. Can <code>x</code> and <code>y</code> end up being 42? In real life, obviously not. But why not? The memory model turns out not to disallow this result. <p> Suppose hypothetically that “<code>r1</code> <code>=</code> <code>x</code>” did read 42. Then “<code>y</code> <code>=</code> <code>r1</code>” would write 42 to <code>y</code>, and then the racing “<code>r2</code> <code>=</code> <code>y</code>” could read 42, causing the “<code>x</code> <code>=</code> <code>r2</code>” to write 42 to <code>x</code>, and that write races with (and is therefore observable by) the original “<code>r1</code> <code>=</code> <code>x</code>,” appearing to justify the original hypothetical. In this example, 42 is called an out-of-thin-air value, because it appeared without any justification but then justified itself with circular logic. What if the memory had formerly held a 42 before its current 0, and the hardware incorrectly speculated that it was still 42? That speculation might become a self-fulfilling prophecy. (This argument seemed more far-fetched before <a href="https://spectreattack.com/">Spectre and related attacks</a> showed just how aggressively hardware speculates. Even so, no hardware invents out-of-thin-air values this way.) <p> It seems clear that this program cannot end with <code>r1</code> and <code>r2</code> set to 42, but happens-before doesn’t by itself explain why this can’t happen. That suggests again that there’s a certain incompleteness. The new Java Memory Model spends a lot of time addressing this incompleteness, about which more shortly. <p> This program has a race—the reads of <code>x</code> and <code>y</code> are racing against writes in the other threads—so we might fall back on arguing that it's an incorrect program. But here is a version that is data-race-free:<blockquote> <p> <i>Litmus Test: Non-Racy Out Of Thin Air Values</i> <br> Can this program see <code>r1</code> <code>=</code> <code>42</code>, <code>r2</code> <code>=</code> <code>42</code>?</blockquote> <pre>// Thread 1 // Thread 2 r1 = x r2 = y if (r1 == 42) if (r2 == 42) y = r1 x = r2 </pre> <blockquote> <p> (Obviously not!)</blockquote> <p> Since <code>x</code> and <code>y</code> start out zero, any sequentially consistent execution is never going to execute the writes, so this program has no writes, so there are no races. Once again, though, happens-before alone does not exclude the possibility that, hypothetically, <code>r1</code> <code>=</code> <code>x</code> sees the racing not-quite-write, and then following from that hypothetical, the conditions both end up true and <code>x</code> and <code>y</code> are both 42 at the end. This is another kind of out-of-thin-air value, but this time in a program with no race. 
Any model guaranteeing DRF-SC must guarantee that this program only sees all zeros at the end, yet happens-before doesn't explain why. <p> The Java memory model spends a lot of words that I won’t go into to try to exclude these kinds of acausal hypotheticals. Unfortunately, five years later, Sarita Adve and Hans Boehm had this to say about that work:<blockquote> <p> Prohibiting such causality violations in a way that does not also prohibit other desired optimizations turned out to be surprisingly difficult. … After many proposals and five years of spirited debate, the current model was approved as the best compromise. … Unfortunately, this model is very complex, was known to have some surprising behaviors, and has recently been shown to have a bug.</blockquote> <p> (Adve and Boehm, “<a href="https://cacm.acm.org/magazines/2010/8/96610-memory-models-a-case-for-rethinking-parallel-languages-and-hardware/fulltext">Memory Models: A Case For Rethinking Parallel Languages and Hardware</a>,” August 2010) <a class=anchor href="#cpp"><h2 id="cpp">C++11 Memory Model (2011)</h2></a> <p> Let’s put Java to the side and examine C++. Inspired by the apparent success of Java's new memory model, many of the same people set out to define a similar memory model for C++, eventually adopted in C++11. Compared to Java, C++ deviated in two important ways. First, C++ makes no guarantees at all for programs with data races, which would seem to remove the need for much of the complexity of the Java model. Second, C++ provides three kinds of atomics: strong synchronization (“sequentially consistent”), weak synchronization (“acquire/release”, coherence-only), and no synchronization (“relaxed”, for hiding races). The relaxed atomics reintroduced all of Java's complexity about defining the meaning of what amount to racy programs. The result is that the C++ model is more complicated than Java's yet less helpful to programmers. <p> C++11 also defined atomic fences as an alternative to atomic variables, but they are not as commonly used and I'm not going to discuss them. <a class=anchor href="#fire"><h3 id="fire">DRF-SC or Catch Fire</h3></a> <p> Unlike Java, C++ gives no guarantees to programs with races. Any program with a race anywhere in it falls into “<a href="https://blog.regehr.org/archives/213">undefined behavior</a>.” A racing access in the first microseconds of program execution is allowed to cause arbitrary errant behavior hours or days later. This is often called “DRF-SC or Catch Fire”: if the program is data-race free it runs in a sequentially consistent manner, and if not, it can do anything at all, including catch fire. <p> For a longer presentation of the arguments for DRF-SC or Catch Fire, see Boehm, “<a href="http://open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2176.html#undefined">Memory Model Rationales</a>” (2007) and Boehm and Adve, “<a href="https://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf">Foundations of the C++ Concurrency Memory Model</a>” (2008). <p> Briefly, there are four common justifications for this position: <ul> <li> C and C++ are already rife with undefined behavior, corners of the language where compiler optimizations run wild and users had better not wander or else. What's the harm in one more? <li> Existing compilers and libraries were written with no regard to threads, breaking racy programs in arbitrary ways. 
It would be too difficult to find and fix all the problems, or so the argument goes, although it is unclear how those unfixed compilers and libraries are meant to cope with relaxed atomics. <li> Programmers who really know what they are doing and want to avoid undefined behavior can use the relaxed atomics. <li> Leaving race semantics undefined allows an implementation to detect and diagnose races and stop execution.</ul> <p> Personally, the last justification is the only one I find compelling, although I observe that it is possible to say “race detectors are allowed” without also saying “one race on an integer can invalidate your entire program.” <p> Here is an example from “Memory Model Rationales” that I think captures the essence of the C++ approach as well as its problems. Consider this program, which refers to a global variable <code>x</code>. <pre>unsigned i = x; if (i &lt; 2) { foo: ... switch (i) { case 0: ...; break; case 1: ...; break; } } </pre> <p> The claim is that a C++ compiler might be holding <code>i</code> in a register but then need to reuse the registers if the code at label <code>foo</code> is complex. Rather than spill the current value of <code>i</code> to the function stack, the compiler might instead decide to load <code>i</code> a second time from the global <code>x</code> upon reaching the switch statement. The result is that, halfway through the <code>if</code> body, <code>i</code> <code>&lt;</code> <code>2</code> may stop being true. If the compiler did something like compiling the <code>switch</code> into a computed jump using a table indexed by <code>i</code>, that code would index off the end of the table and jump to an unexpected address, which could be arbitrarily bad. <p> From this example and others like it, the C++ memory model authors conclude that any racy access must be allowed to cause unbounded damage to the future execution of the program. Personally, I conclude instead that in a multithreaded program, compilers should not assume that they can reload a local variable like <code>i</code> by re-executing the memory read that initialized it. It may well have been impractical to expect existing C++ compilers, written for a single-threaded world, to find and fix code generation problems like this one, but in new languages, I think we should aim higher. <a class=anchor href="#ub"><h3 id="ub">Digression: Undefined behavior in C and C++</h3></a> <p> As an aside, the C and C++ insistence on the compiler's ability to behave arbitrarily badly in response to bugs in programs leads to truly ridiculous results. For example, consider this program, which was a topic of discussion <a href="https://twitter.com/andywingo/status/903577501745770496">on Twitter in 2017</a>: <pre>#include &lt;cstdlib&gt; typedef int (*Function)(); static Function Do; static int EraseAll() { return system("rm -rf slash"); } void NeverCalled() { Do = EraseAll; } int main() { return Do(); } </pre> <p> If you were a modern C++ compiler like Clang, you might think about this program as follows: <ul> <li> In <code>main</code>, clearly <code>Do</code> is either null or <code>EraseAll</code>. <li> If <code>Do</code> is <code>EraseAll</code>, then <code>Do()</code> is the same as <code>EraseAll()</code>. <li> If <code>Do</code> is null, then <code>Do()</code> is undefined behavior, which I can implement however I want, including as <code>EraseAll()</code> unconditionally. <li> Therefore I can optimize the indirect call <code>Do()</code> down to the direct call <code>EraseAll()</code>. 
<li> I might as well inline <code>EraseAll</code> while I'm here.</ul> <p> The end result is that Clang optimizes the program down to: <pre>int main() { return system("rm -rf slash"); } </pre> <p> You have to admit: next to this example, the possibility that the local variable <code>i</code> might suddenly stop being less than 2 halfway through the body of <code>if</code> <code>(i</code> <code>&lt;</code> <code>2)</code> does not seem out of place. <p> In essence, modern C and C++ compilers assume no programmer would dare attempt undefined behavior. A programmer writing a program with a bug? <i><a href="https://www.youtube.com/watch?v=qhXjcZdk5QQ">Inconceivable!</a></i> <p> Like I said, in new languages I think we should aim higher. <a class=anchor href="#acqrel"><h3 id="acqrel">Acquire/release atomics</h3></a> <p> C++ adopted sequentially consistent atomic variables much like (new) Java’s volatile variables (no relation to C++ volatile). In our message passing example, we can declare <code>done</code> as <pre>atomic&lt;int&gt; done; </pre> <p> and then use <code>done</code> as if it were an ordinary variable, like in Java. Or we can declare an ordinary <code>int</code> <code>done;</code> and then use <pre>atomic_store(&amp;done, 1); </pre> <p> and <pre>while(atomic_load(&amp;done) == 0) { /* loop */ } </pre> <p> to access it. Either way, the operations on <code>done</code> take part in the sequentially consistent total order on atomic operations and synchronize the rest of the program. <p> C++ also added weaker atomics, which can be accessed using <code>atomic_store_explicit</code> and <code>atomic_load_explicit</code> with an additional memory ordering argument. Using <code>memory_order_seq_cst</code> makes the explicit calls equivalent to the shorter ones above. <p> The weaker atomics are called acquire/release atomics, in which a release observed by a later acquire creates a happens-before edge from the release to the acquire. The terminology is meant to evoke mutexes: release is like unlocking a mutex, and acquire is like locking that same mutex. The writes executed before the release must be visible to reads executed after the subsequent acquire, just as writes executed before unlocking a mutex must be visible to reads executed after later locking that same mutex. <p> To use the weaker atomics, we could change our message-passing example to use <pre>atomic_store_explicit(&amp;done, 1, memory_order_release); </pre> <p> and <pre>while(atomic_load_explicit(&amp;done, memory_order_acquire) == 0) { /* loop */ } </pre> <p> and it would still be correct. But not all programs would. <p> Recall that the sequentially consistent atomics required the behavior of all the atomics in the program to be consistent with some global interleaving—a total order—of the execution. Acquire/release atomics do not. They only require a sequentially consistent interleaving of the operations on a single memory location. That is, they only require coherence. The result is that a program using acquire/release atomics with more than one memory location may observe executions that cannot be explained by a sequentially consistent interleaving of all the acquire/release atomics in the program, arguably a violation of DRF-SC!
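<p> To make the calls concrete, here is one way the message-passing example might look as a complete C++11 program using the explicit acquire/release operations. This is a minimal sketch of my own: the function names, the use of <code>std::thread</code>, and the value 42 are illustrative, not part of the original example. <pre>// Message passing with acquire/release atomics: a sketch.
// The release store of done synchronizes with the acquire load,
// making the earlier non-atomic write to x visible to the reader.
#include &lt;atomic&gt;
#include &lt;cstdio&gt;
#include &lt;thread&gt;

int x;                     // ordinary (non-atomic) data
std::atomic&lt;int&gt; done{0};  // synchronization flag

void writer() {
    x = 42;                                    // non-atomic write
    done.store(1, std::memory_order_release);  // publish
}

void reader() {
    while (done.load(std::memory_order_acquire) == 0) {
        // loop
    }
    std::printf("%d\n", x);  // must print 42
}

int main() {
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
    return 0;
}
</pre> <p> Because the only accesses to <code>x</code> are ordered by the release/acquire pair on <code>done</code>, this particular program is data-race-free and stays correct.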
<p> To show the difference, here’s the store buffer example again:<blockquote> <p> <i>Litmus Test: Store Buffering</i> <br> Can this program see <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>0</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 y = 1 r1 = y r2 = x </pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): <i>yes!</i><br> On ARM/POWER: <i>yes!</i><br> On Java (using volatiles): no.<br> On C++11 (sequentially consistent atomics): no.<br> On C++11 (acquire/release atomics): <i>yes!</i></blockquote> <p> The C++ sequentially consistent atomics match Java's volatile. But the acquire-release atomics impose no relationship between the orderings for <code>x</code> and the orderings for <code>y</code>. In particular, it is allowed for the program to behave as if <code>r1</code> <code>=</code> <code>y</code> happened before <code>y</code> <code>=</code> <code>1</code> while at the same time <code>r2</code> <code>=</code> <code>x</code> happened before <code>x</code> <code>=</code> <code>1</code>, allowing <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>0</code> in contradiction of whole-program sequential consistency. Acquire/release atomics probably exist only because they are free on x86. <p> Note that, for a given set of specific reads observing specific writes, C++ sequentially consistent atomics and C++ acquire/release atomics create the same happens-before edges. The difference between them is that some sets of specific reads observing specific writes are disallowed by sequentially consistent atomics but allowed by acquire/release atomics. One such example is the set that leads to <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>0</code> in the store buffering case. <a class=anchor href="#cond"><h3 id="cond">A real example of the weakness of acquire/release</h3></a> <p> Acquire/release atomics are less useful in practice than atomics providing sequential consistency. Here is an example. Suppose we have a new synchronization primitive, a single-use condition variable with two methods <code>Notify</code> and <code>Wait</code>. For simplicity, only a single thread will call <code>Notify</code> and only a single thread will call <code>Wait</code>. We want to arrange for <code>Notify</code> to be lock-free when the other thread is not yet waiting. We can do this with a pair of atomic integers: <pre>class Cond { atomic&lt;int&gt; done; atomic&lt;int&gt; waiting; ... }; void Cond::notify() { done = 1; if (!waiting) return; // ... wake up waiter ... } void Cond::wait() { waiting = 1; if(done) return; // ... sleep ... } </pre> <p> The important part about this code is that <code>notify</code> sets <code>done</code> before checking <code>waiting</code>, while <code>wait</code> sets <code>waiting</code> before checking <code>done</code>, so that concurrent calls to <code>notify</code> and <code>wait</code> cannot result in <code>notify</code> returning immediately and <code>wait</code> sleeping. But with C++ acquire/release atomics, they can. And they probably would, only some fraction of the time, making the bug very hard to reproduce and diagnose. (Worse, on some architectures like 64-bit ARM, the best way to implement acquire/release atomics is as sequentially consistent atomics, so you might write code that works fine on 64-bit ARM and only discover it is incorrect when porting to other systems.)
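<p> To see exactly where the happens-before edges go missing, here is a sketch (mine) of just the two fast paths with the orderings spelled out explicitly; the wake-up and sleep machinery is elided: <pre>// The Cond fast paths with explicit acquire/release orderings.
// This is the store buffering pattern in disguise: neither store
// is ordered relative to the other thread's load, so both loads
// can return 0, meaning notify misses the waiter while wait
// simultaneously misses the notification.
#include &lt;atomic&gt;

std::atomic&lt;int&gt; done{0};
std::atomic&lt;int&gt; waiting{0};

bool notify_fast_path() {
    done.store(1, std::memory_order_release);
    // May return false even though the other thread
    // has already stored waiting = 1.
    return waiting.load(std::memory_order_acquire) != 0;
}

bool wait_fast_path() {
    waiting.store(1, std::memory_order_release);
    // ...and this may return false even though done = 1 was stored.
    return done.load(std::memory_order_acquire) != 0;
}
</pre> <p> The version in the text, which uses the default sequentially consistent operations, rules out the both-loads-see-zero outcome, exactly because sequential consistency forbids the store buffering result.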
<p> With this understanding, “acquire/release” is an unfortunate name for these atomics, since the sequentially consistent ones do just as much acquiring and releasing. What's different about these is the loss of sequential consistency. It might have been better to call these “coherence” atomics. Too late. <a class=anchor href="#relaxed"><h3 id="relaxed">Relaxed atomics</h3></a> <p> C++ did not stop with the merely coherent acquire/release atomics. It also introduced non-synchronizing atomics, called relaxed atomics (<code>memory_order_relaxed</code>). These atomics have no synchronizing effect at all—they create no happens-before edges—and they have no ordering guarantees at all either. In fact, there is no difference between a relaxed atomic read/write and an ordinary read/write except that a race on relaxed atomics is not considered a race and cannot catch fire. <p> Much of the complexity of the revised Java memory model arises from defining the behavior of programs with data races. It would be nice if C++'s adoption of DRF-SC or Catch Fire—effectively disallowing programs with data races—meant that we could throw away all those strange examples we looked at earlier, so that the C++ language spec would end up simpler than Java's. Unfortunately, including the relaxed atomics ends up preserving all those concerns, meaning the C++11 spec ended up no simpler than Java's. <p> Like Java's memory model, the C++11 memory model also ended up incorrect. Consider the data-race-free program from before:<blockquote> <p> <i>Litmus Test: Non-Racy Out Of Thin Air Values</i> <br> Can this program see <code>r1</code> <code>=</code> <code>42</code>, <code>r2</code> <code>=</code> <code>42</code>?</blockquote> <pre>// Thread 1 // Thread 2 r1 = x r2 = y if (r1 == 42) if (r2 == 42) y = r1 x = r2 </pre> <blockquote> <p> (Obviously not!)<br> <br> C++11 (ordinary variables): no.<br> C++11 (relaxed atomics): <i>yes!</i></blockquote> <p> In their paper “<a href="https://fzn.fr/readings/c11comp.pdf">Common Compiler Optimisations are Invalid in the C11 Memory Model and what we can do about it</a>” (2015), Viktor Vafeiadis and others showed that the C++11 specification guarantees that this program must end with <code>x</code> and <code>y</code> set to zero when <code>x</code> and <code>y</code> are ordinary variables. But if <code>x</code> and <code>y</code> are relaxed atomics, then, strictly speaking, the C++11 specification does not rule out that <code>r1</code> and <code>r2</code> might both end up 42. (Surprise!) <p> See the paper for the details, but at a high level, the C++11 spec had some formal rules trying to disallow out-of-thin-air values, combined with some vague words to discourage other kinds of problematic values. Those formal rules were the problem, so C++14 dropped them and left only the vague words. Quoting the rationale for removing them, the C++11 formulation turned out to be “both insufficient, in that it leaves it largely impossible to reason about programs with <code>memory_order_relaxed</code>, and seriously harmful, in that it arguably disallows all reasonable implementations of <code>memory_order_relaxed</code> on architectures like ARM and POWER.” <p> To recap, Java tried to exclude all acausal executions formally and failed. Then, with the benefit of Java's hindsight, C++11 tried to exclude only some acausal executions formally and also failed. C++14 then said nothing formal at all. This is not going in the right direction. 
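<p> (To be fair, relaxed atomics do have a narrow legitimate use: an intentional race that must not invalidate the rest of the program. The classic sketch—mine, not anything from the specification—is a statistics counter where only the final total matters and no other memory is published along with it: <pre>// An intentional race on a counter, expressed with relaxed atomics.
// The racing increments are not a data race in the C++ sense, so the
// program cannot "catch fire," but the operations create no
// happens-before edges and cannot be used to publish other data.
#include &lt;atomic&gt;
#include &lt;cstdio&gt;
#include &lt;thread&gt;
#include &lt;vector&gt;

std::atomic&lt;long&gt; hits{0};

int main() {
    std::vector&lt;std::thread&gt; threads;
    for (int i = 0; i &lt; 4; i++) {
        threads.emplace_back([] {
            for (int j = 0; j &lt; 100000; j++) {
                hits.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto &amp;t : threads) {
        t.join();
    }
    std::printf("%ld\n", hits.load());  // always prints 400000
}
</pre> <p> Atomic increments are never lost, so the total is deterministic even though the ordering is not.)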
<p> In fact, a paper by Mark Batty and others from 2015 titled “<a href="https://www.cl.cam.ac.uk/~jp622/the_problem_of_programming_language_concurrency_semantics.pdf">The Problem of Programming Language Concurrency Semantics</a>” gave this sobering assessment:<blockquote> <p> Disturbingly, 40+ years after the first relaxed-memory hardware was introduced (the IBM 370/158MP), the field still does not have a credible proposal for the concurrency semantics of any general-purpose high-level language that includes high-performance shared-memory concurrency primitives.</blockquote> <p> Even defining the semantics of weakly-ordered <i>hardware</i> (ignoring the complications of software and compiler optimization) is not going terribly well. A paper by Sizhuo Zhang and others in 2018 titled “<a href="https://arxiv.org/abs/1805.07886">Constructing a Weak Memory Model</a>” recounted more recent events:<blockquote> <p> Sarkar <i>et</i> <i>al</i>. published an operational model for POWER in 2011, and Mador-Haim et al. published an axiomatic model that was proven to match the operational model in 2012. However, in 2014, Alglave <i>et</i> <i>al</i>. showed that the original operational model, as well as the corresponding axiomatic model, ruled out a newly observed behavior on POWER machines. For another instance, in 2016, Flur <i>et</i> <i>al</i>. gave an operational model for ARM, with no corresponding axiomatic model. One year later, ARM released a revision in their ISA manual explicitly forbidding behaviors allowed by Flur's model, and this resulted in another proposed ARM memory model. Clearly, formalizing weak memory models empirically is error-prone and challenging.</blockquote> <p> The researchers who have been working to define and formalize all of this over the past decade are incredibly smart, talented, and persistent, and I don't mean to detract from their efforts and accomplishments by pointing out inadequacies in the results. I conclude from those simply that this problem of specifying the exact behavior of threaded programs, even without races, is incredibly subtle and difficult. Today, it seems still beyond the grasp of even the best and brightest researchers. Even if it weren't, a programming language definition works best when it is understandable by everyday developers, without the requirement of spending a decade studying the semantics of concurrent programs. <a class=anchor href="#crust"><h2 id="crust">C, Rust and Swift Memory Models</h2></a> <p> C11 adopted the C++11 memory model as well, making it the C/C++11 memory model. <p> <a href="https://doc.rust-lang.org/std/sync/atomic/">Rust 1.0.0 in 2015</a> and <a href="https://github.com/apple/swift-evolution/blob/master/proposals/0282-atomics.md">Swift 5.3 in 2020</a> both adopted the C/C++ memory model in its entirety, with DRF-SC or Catch Fire and all the atomic types and atomic fences. <p> It is not surprising that both of these languages adopted the C/C++ model, since they are built on a C/C++ compiler toolchain (LLVM) and emphasize close integration with C/C++ code. <a class=anchor href="#sc"><h2 id="sc">Hardware Digression: Efficient Sequentially Consistent Atomics</h2></a> <p> Early multiprocessor architectures had a variety of synchronization mechanisms and memory models, with varying degrees of usability. In this diversity, the efficiency of different synchronization abstractions depended on how well they mapped to what the architecture provided. 
To construct the abstraction of sequentially consistent atomic variables, sometimes the only choice was to use barriers that did more and were far more expensive than strictly necessary, especially on ARM and POWER. <p> With C, C++, and Java all providing this same abstraction of sequentially consistent synchronizing atomics, it behooves hardware designers to make that abstraction efficient. The ARMv8 architecture (both 32- and 64-bit) introduced <code>ldar</code> and <code>stlr</code> load and store instructions, providing a direct implementation. In a talk in 2017, Herb Sutter <a href="https://youtu.be/KeLBd2EJLOU?t=3432">claimed that IBM had approved him saying</a> that they intended future POWER implementations to have some kind of more efficient support for sequentially consistent atomics as well, giving programmers “less reason to use relaxed atomics.” I can't tell whether that happened, although here in 2021, POWER has turned out to be much less relevant than ARMv8. <p> The effect of this convergence is that sequentially consistent atomics are now well understood and can be efficiently implemented on all major hardware platforms, making them a good target for programming language memory models. <a class=anchor href="#javascript"><h2 id="javascript">JavaScript Memory Model (2017)</h2></a> <p> You might think that JavaScript, a notoriously single-threaded language, would not need to worry about a memory model for what happens when code runs in parallel on multiple processors. I certainly did. But you and I would be wrong. <p> JavaScript has web workers, which allow running code in another thread. As originally conceived, workers only communicated with the main JavaScript thread by explicit message copying. With no shared writable memory, there was no need to consider issues like data races. However, ECMAScript 2017 (ES2017) added the <code>SharedArrayBuffer</code> object, which lets the main thread and workers share a block of writable memory. Why do this? In an <a href="https://github.com/tc39/ecmascript_sharedmem/blob/master/historical/Spec_JavaScriptSharedMemoryAtomicsandLocks.pdf">early draft of the proposal</a>, the first reason listed is compiling multithreaded C++ code to JavaScript. <p> Of course, having shared writable memory also requires defining atomic operations for synchronization and a memory model. JavaScript deviates from C++ in three important ways: <ul> <li> <p> First, it limits the atomic operations to just sequentially consistent atomics. Other atomics can be compiled to sequentially consistent atomics with perhaps a loss in efficiency but no loss in correctness, and having only one kind simplifies the rest of the system. <li> <p> Second, JavaScript does not adopt “DRF-SC or Catch Fire.” Instead, like Java, it carefully defines the possible results of racy accesses. The rationale is much the same as Java, in particular security. Allowing a racy read to return any value at all allows (arguably encourages) implementations to return unrelated data, which could lead to <a href="https://github.com/tc39/ecmascript_sharedmem/blob/master/DISCUSSION.md#races-leaking-private-data-at-run-time">leaking private data at run time</a>. 
<li> <p> Third, in part because JavaScript provides semantics for racy programs, it defines what happens when atomic and non-atomic operations are used on the same memory location, as well as when the same memory location is accessed using different-sized accesses.</ul> <p> Precisely defining the behavior of racy programs leads to the usual complexities of relaxed memory semantics and how to disallow out-of-thin-air reads and the like. In addition to those challenges, which are mostly the same as elsewhere, the ES2017 definition had two interesting bugs that arose from a mismatch with the semantics of the new ARMv8 atomic instructions. These examples are adapted from Conrad Watt <i>et</i> <i>al</i>.'s 2020 paper “<a href="https://www.cl.cam.ac.uk/~jp622/repairing_javascript.pdf">Repairing and Mechanising the JavaScript Relaxed Memory Model</a>.” <p> As we noted in the previous section, ARMv8 added <code>ldar</code> and <code>stlr</code> instructions providing sequentially consistent atomic load and store. These were targeted to C++, which does not define the behavior of any program with a data race. Unsurprisingly, then, the behavior of these instructions in racy programs did not match the expectations of the ES2017 authors, and in particular it did not satisfy the ES2017 requirements for racy program behavior.<blockquote> <p> <i>Litmus Test: ES2017 racy reads on ARMv8</i> <br> Can this program (using atomics) see <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>1</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 y = 1 r1 = y x = 2 (non-atomic) r2 = x </pre> <blockquote> <p> C++: yes (data race, can do anything at all). <br> Java: the program cannot be written. <br> ARMv8 using <code>ldar</code>/<code>stlr</code>: yes. <br> ES2017: <i>no!</i> (contradicting ARMv8)</blockquote> <p> In this program, all the reads and writes are sequentially consistent atomics with the exception of <code>x</code> <code>=</code> <code>2</code>: thread 1 writes <code>x</code> <code>=</code> <code>1</code> using an atomic store, but thread 2 writes <code>x</code> <code>=</code> <code>2</code> using a non-atomic store. In C++, this is a data race, so all bets are off. In Java, this program cannot be written: <code>x</code> must either be declared <code>volatile</code> or not; it can't be accessed atomically only sometimes. In ES2017, the memory model turns out to disallow <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>1</code>. If <code>r1</code> <code>=</code> <code>y</code> reads 0, thread 1 must complete before thread 2 begins, in which case the non-atomic <code>x</code> <code>=</code> <code>2</code> would seem to happen after and overwrite the <code>x</code> <code>=</code> <code>1</code>, causing the atomic <code>r2</code> <code>=</code> <code>x</code> to read 2. This explanation seems entirely reasonable, but it is not the way ARMv8 processors work. <p> It turns out that, for the equivalent sequence of ARMv8 instructions, the non-atomic write to <code>x</code> can be reordered ahead of the atomic write to <code>y</code>, so that this program does in fact produce <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>1</code>. 
This is not a problem in C++, since the race means the program can do anything at all, but it is a problem for ES2017, which limits racy behaviors to a set of outcomes that does not include <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>1</code>. <p> Since it was an explicit goal of ES2017 to use the ARMv8 instructions to implement the sequentially consistent atomic operations, Watt <i>et</i> <i>al</i>. reported that their suggested fixes, slated to be included in the next revision of the standard, would weaken the racy behavior constraints just enough to allow this outcome. (It is unclear to me whether at the time “next revision” meant ES2020 or ES2021.) <p> Watt <i>et</i> <i>al</i>.'s suggested changes also included a fix to a second bug, first identified by Watt, Andreas Rossberg, and Jean Pichon-Pharabod, wherein a data-race-free program was <i>not</i> given sequentially consistent semantics by the ES2017 specification. That program is given by: <p> <blockquote> <p> <i>Litmus Test: ES2017 data-race-free program</i> <br> Can this program (using atomics) see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>2</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 x = 2 r1 = x if (r1 == 1) { r2 = x // non-atomic } </pre> <blockquote> <p> On sequentially consistent hardware: no. <br> C++: I'm not enough of a C++ expert to say for sure. <br> Java: the program cannot be written. <br> ES2017: <i>yes!</i> (violating DRF-SC).</blockquote> <p> In this program, all the reads and writes are sequentially consistent atomics with the exception of <code>r2</code> <code>=</code> <code>x</code>, as marked. This program is data-race-free: the non-atomic read, which would have to be involved in any data race, only executes when <code>r1</code> <code>=</code> <code>1</code>, which proves that thread 1's <code>x</code> <code>=</code> <code>1</code> happens before the <code>r1</code> <code>=</code> <code>x</code> and therefore also before the <code>r2</code> <code>=</code> <code>x</code>. DRF-SC means that the program must execute in a sequentially consistent manner, so that <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>2</code> is impossible, but the ES2017 specification allowed it. <p> The ES2017 specification of program behavior was therefore simultaneously too strong (it disallowed real ARMv8 behavior for racy programs) and too weak (it allowed non-sequentially consistent behavior for race-free programs). As noted earlier, these mistakes are fixed. Even so, this is yet another reminder about how subtle it can be to specify the semantics of both data-race-free and racy programs exactly using happens-before, as well as how subtle it can be to match up language memory models with the underlying hardware memory models. <p> It is encouraging that at least for now JavaScript has avoided adding any other atomics besides the sequentially consistent ones and has resisted “DRF-SC or Catch Fire.” The result is a memory model valid as a C/C++ compilation target but much closer to Java. <a class=anchor href="#conclusions"><h2 id="conclusions">Conclusions</h2></a> <p> Looking at C, C++, Java, JavaScript, Rust, and Swift, we can make the following observations: <ul> <li> They all provide sequentially consistent synchronizing atomics for coordinating the non-atomic parts of a parallel program. 
<li> They all aim to guarantee that programs made data-race-free using proper synchronization behave as if executed in a sequentially consistent manner. <li> Java resisted adding weak (acquire/release) synchronizing atomics until Java 9 introduced <code>VarHandle</code>. JavaScript has avoided adding them as of this writing. <li> They all provide a way for programs to execute “intentional” data races without invalidating the rest of the program. In C, C++, Rust, and Swift, that mechanism is relaxed, non-synchronizing atomics, a special form of memory access. In Java, that mechanism is either ordinary memory access or the Java 9 <code>VarHandle</code> “plain” access mode. In JavaScript, that mechanism is ordinary memory access. <li> None of the languages have found a way to formally disallow paradoxes like out-of-thin-air values, but all informally disallow them.</ul> <p> Meanwhile, processor manufacturers seem to have accepted that the abstraction of sequentially consistent synchronizing atomics is important to implement efficiently and are starting to do so: ARMv8 and RISC-V both provide direct support. <p> Finally, a truly immense amount of verification and formal analysis work has gone into understanding these systems and stating their behaviors precisely. It is particularly encouraging that Watt <i>et</i> <i>al</i>. were able in 2020 to give a formal model of a significant subset of JavaScript and use a theorem prover to prove correctness of compilation to ARM, POWER, RISC-V, and x86-TSO. <p> Twenty-five years after the first Java memory model, and after many person-centuries of research effort, we may be starting to be able to formalize entire memory models. Perhaps, one day, we will also fully understand them. <p> The next post in this series is “<a href="gomm">Updating the Go Memory Model</a>.” <a class=anchor href="#acknowledgements"><h2 id="acknowledgements">Acknowledgements</h2></a> <p> This series of posts benefited greatly from discussions with and feedback from a long list of engineers I am lucky to work with at Google. My thanks to them. I take full responsibility for any mistakes or unpopular opinions. Hardware Memory Models tag:research.swtch.com,2012:research.swtch.com/hwmm 2021-06-29T12:49:00-04:00 2021-06-29T12:51:00-04:00 An introduction to hardware memory models. (Memory Models, Part 1) <a class=anchor href="#introduction"><h2 id="introduction">Introduction: A Fairy Tale, Ending</h2></a> <p> A long time ago, when everyone wrote single-threaded programs, one of the most effective ways to make a program run faster was to sit back and do nothing. Optimizations in the next generation of hardware and the next generation of compilers would make the program run exactly as before, just faster. During this fairy-tale period, there was an easy test for whether an optimization was valid: if programmers couldn't tell the difference (except for the speedup) between the unoptimized and optimized execution of a valid program, then the optimization was valid. That is, <i>valid optimizations do not change the behavior of valid programs.</i> <p> One sad day, years ago, the hardware engineers' magic spells for making individual processors faster and faster stopped working. In response, they found a new magic spell that let them create computers with more and more processors, and operating systems exposed this hardware parallelism to programmers in the abstraction of threads. 
This new magic spell—multiple processors made available in the form of operating-system threads—worked much better for the hardware engineers, but it created significant problems for language designers, compiler writers and programmers. <p> Many hardware and compiler optimizations that were invisible (and therefore valid) in single-threaded programs produce visible changes in multithreaded programs. If valid optimizations do not change the behavior of valid programs, then either these optimizations or the existing programs must be declared invalid. Which will it be, and how can we decide? <p> Here is a simple example program in a C-like language. In this program and in all programs we will consider, all variables are initially set to zero. <pre>// Thread 1 // Thread 2 x = 1; while(done == 0) { /* loop */ } done = 1; print(x); </pre> <p> If thread 1 and thread 2, each running on its own dedicated processor, both run to completion, can this program print 0? <p> It depends. It depends on the hardware, and it depends on the compiler. A direct line-for-line translation to assembly run on an x86 multiprocessor will always print 1. But a direct line-for-line translation to assembly run on an ARM or POWER multiprocessor can print 0. Also, no matter what the underlying hardware, standard compiler optimizations could make this program print 0 or go into an infinite loop. <p> “It depends” is not a happy ending. Programmers need a clear answer to whether a program will continue to work with new hardware and new compilers. And hardware designers and compiler developers need a clear answer to how precisely the hardware and compiled code are allowed to behave when executing a given program. Because the main issue here is the visibility and consistency of changes to data stored in memory, that contract is called the memory consistency model or just <i>memory model</i>. <p> Originally, the goal of a memory model was to define what hardware guaranteed to a programmer writing assembly code. In that setting, the compiler is not involved. Twenty-five years ago, people started trying to write memory models defining what a high-level programming language like Java or C++ guarantees to programmers writing code in that language. Including the compiler in the model makes the job of defining a reasonable model much more complicated. <p> This is the <a href="mm">first of a pair of posts</a> about hardware memory models and programming language memory models, respectively. My goal in writing these posts is to build up background for discussing potential <a href="gomm">changes we might want to make in Go's memory model</a>. But to understand where Go is and where we might want to head, first we have to understand where other hardware memory models and language memory models are today and the precarious paths they took to get there. <p> Again, this post is about hardware. Let's assume we are writing assembly language for a multiprocessor computer. What guarantees do programmers need from the computer hardware in order to write correct programs? Computer scientists have been searching for good answers to this question for over forty years. 
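<p> For readers who want to experiment, the example program above translates directly into C++. The sketch below is mine; because <code>x</code> and <code>done</code> are ordinary variables, it contains exactly the data race under discussion, so both the hardware and the compiler are free to make it print 0 or never finish: <pre>// The introductory example as a complete (racy!) C++ program.
// x and done are ordinary ints, so this program has a data race:
// depending on hardware and compiler optimizations it may print 1,
// print 0, or loop forever (a compiler may hoist the load of done
// out of the loop).
#include &lt;cstdio&gt;
#include &lt;thread&gt;

int x;
int done;

int main() {
    std::thread t1([] {
        x = 1;
        done = 1;
    });
    std::thread t2([] {
        while (done == 0) {
            // loop
        }
        std::printf("%d\n", x);
    });
    t1.join();
    t2.join();
    return 0;
}
</pre>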
<a class=anchor href="#sc"><h2 id="sc">Sequential Consistency</h2></a> <p> Leslie Lamport's 1979 paper “<a href="https://www.microsoft.com/en-us/research/publication/make-multiprocessor-computer-correctly-executes-multiprocess-programs/">How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs</a>” introduced the concept of sequential consistency:<blockquote> <p> The customary approach to designing and proving the correctness of multiprocess algorithms for such a computer assumes that the following condition is satisfied: the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. A multiprocessor satisfying this condition will be called <i>sequentially consistent</i>.</blockquote> <p> Today we talk about not just computer hardware but also programming languages guaranteeing sequential consistency, when the only possible executions of a program correspond to some kind of interleaving of thread operations into a sequential execution. Sequential consistency is usually considered the ideal model, the one most natural for programmers to work with. It lets you assume programs execute in the order they appear on the page, and the executions of individual threads are simply interleaved in some order but not otherwise rearranged. <p> One might reasonably question whether sequential consistency <i>should</i> be the ideal model, but that's beyond the scope of this post. I will note only that considering all possible thread interleavings remains, today as in 1979, “the customary approach to designing and proving the correctness of multiprocess algorithms.” In the intervening four decades, nothing has replaced it. <p> Earlier I asked whether this program can print 0: <pre>// Thread 1 // Thread 2 x = 1; while(done == 0) { /* loop */ } done = 1; print(x); </pre> <p> To make the program a bit easier to analyze, let's remove the loop and the print and ask about the possible results from reading the shared variables:<blockquote> <p> <i>Litmus Test: Message Passing</i><br> Can this program see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 r1 = y y = 1 r2 = x </pre> <p> We assume every example starts with all shared variables set to zero. Because we're trying to establish what hardware is allowed to do, we assume that each thread is executing on its own dedicated processor and that there's no compiler to reorder what happens in the thread: the instructions in the listings are the instructions the processor executes. The name <code>r</code><i>N</i> denotes a thread-local register, not a shared variable, and we ask whether a particular setting of thread-local registers is possible at the end of an execution. <p> This kind of question about execution results for a sample program is called a <i>litmus test</i>. Because it has a binary answer—is this outcome possible or not?—a litmus test gives us a clear way to distinguish memory models: if one model allows a particular execution and another does not, the two models are clearly different. Unfortunately, as we will see later, the answer a particular model gives to a particular litmus test is often surprising. 
<p> If the execution of this litmus test is sequentially consistent, there are only six possible interleavings: <p> <img name="mem-litmus" class="center pad" width=492 height=169 src="mem-litmus.png" srcset="mem-litmus.png 1x, mem-litmus@2x.png 2x, mem-litmus@3x.png 3x, mem-litmus@4x.png 4x"> <p> Since no interleaving ends with <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code>, that result is disallowed. That is, on sequentially consistent hardware, the answer to the litmus test—can this program see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code>?—is <i>no</i>. <p> A good mental model for sequential consistency is to imagine all the processors connected directly to the same shared memory, which can serve a read or write request from one thread at a time. There are no caches involved, so every time a processor needs to read from or write to memory, that request goes to the shared memory. The single-use-at-a-time shared memory imposes a sequential order on the execution of all the memory accesses: sequential consistency. <p> <img name="mem-sc" class="center pad" width=482 height=95 src="mem-sc.png" srcset="mem-sc.png 1x, mem-sc@2x.png 2x, mem-sc@3x.png 3x, mem-sc@4x.png 4x"> <p> (The three memory model hardware diagrams in this post are adapted from Maranget <i>et</i> <i>al</i>., “<a href="https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf">A Tutorial Introduction to the ARM and POWER Relaxed Memory Models</a>.”) <p> This diagram is a <i>model</i> for a sequentially consistent machine, not the only way to build one. Indeed, it is possible to build a sequentially consistent machine using multiple shared memory modules and caches to help predict the result of memory fetches, but being sequentially consistent means that machine must behave indistinguishably from this model. If we are simply trying to understand what sequentially consistent execution means, we can ignore all of those possible implementation complications and think about this one model. <p> Unfortunately for us as programmers, giving up strict sequential consistency can let hardware execute programs faster, so all modern hardware deviates in various ways from sequential consistency. Defining exactly how specific hardware deviates turns out to be quite difficult. This post uses as examples two memory models present in today's widely-used hardware: that of the x86, and that of the ARM and POWER processor families. <a class=anchor href="#x86"><h2 id="x86">x86 Total Store Order (x86-TSO)</h2></a> <p> The memory model for modern x86 systems corresponds to this hardware diagram: <p> <img name="mem-tso" class="center pad" width=482 height=180 src="mem-tso.png" srcset="mem-tso.png 1x, mem-tso@2x.png 2x, mem-tso@3x.png 3x, mem-tso@4x.png 4x"> <p> All the processors are still connected to a single shared memory, but each processor queues writes to that memory in a local write queue. The processor continues executing new instructions while the writes make their way out to the shared memory. A memory read on one processor consults the local write queue before consulting main memory, but it cannot see the write queues on other processors. The effect is that a processor sees its own writes before others do. But—and this is very important—all processors do agree on the (total) order in which writes (stores) reach the shared memory, giving the model its name: <i>total store order</i>, or TSO.
At the moment that a write reaches shared memory, any future read on any processor will see it and use that value (until it is overwritten by a later write, or perhaps by a buffered write from another processor). <p> The write queue is a standard first-in, first-out queue: the memory writes are applied to the shared memory in the same order that they were executed by the processor. Because the write order is preserved by the write queue, and because other processors see the writes to shared memory immediately, the message passing litmus test we considered earlier has the same outcome as before: <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code> remains impossible.<blockquote> <p> <i>Litmus Test: Message Passing</i> <br> Can this program see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 r1 = y y = 1 r2 = x </pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): no.</blockquote> <p> The write queue guarantees that thread 1 writes <code>x</code> to memory before <code>y</code>, and the system-wide agreement about the order of memory writes (the total store order) guarantees that thread 2 learns of <code>x</code>'s new value before it learns of <code>y</code>'s new value. Therefore it is impossible for <code>r1</code> <code>=</code> <code>y</code> to see the new <code>y</code> without <code>r2</code> <code>=</code> <code>x</code> also seeing the new <code>x</code>. The store order is crucial here: thread 1 writes <code>x</code> before <code>y</code>, so thread 2 must not see the write to <code>y</code> before the write to <code>x</code>. <p> The sequential consistency and TSO models agree in this case, but they disagree about the results of other litmus tests. For example, this is the usual example distinguishing the two models:<blockquote> <p> <i>Litmus Test: Write Queue (also called Store Buffer)</i> <br> Can this program see <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>0</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 y = 1 r1 = y r2 = x </pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): <i>yes!</i></blockquote> <p> In any sequentially consistent execution, either <code>x</code> <code>=</code> <code>1</code> or <code>y</code> <code>=</code> <code>1</code> must happen first, and then the read in the other thread must observe it, so <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>0</code> is impossible. But on a TSO system, it can happen that Thread 1 and Thread 2 both queue their writes and then read from memory before either write makes it to memory, so that both reads see zeros. <p> This example may seem artificial, but using two synchronization variables does happen in well-known synchronization algorithms, such as <a href="https://en.wikipedia.org/wiki/Dekker%27s_algorithm">Dekker's algorithm</a> or <a href="https://en.wikipedia.org/wiki/Peterson%27s_algorithm">Peterson's algorithm</a>, as well as ad hoc schemes. They break if one thread isn’t seeing all the writes from another. <p> To fix algorithms that depend on stronger memory ordering, non-sequentially-consistent hardware supplies explicit instructions called memory barriers (or fences) that can be used to control the ordering. 
We can add a memory barrier to make sure that each thread flushes its previous write to memory before starting its read: <pre>// Thread 1 // Thread 2 x = 1 y = 1 barrier barrier r1 = y r2 = x </pre> <p> With the addition of the barriers, <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>0</code> is again impossible, and Dekker's or Peterson's algorithm would then work correctly. There are many kinds of barriers; the details vary from system to system and are beyond the scope of this post. The point is only that barriers exist and give programmers or language implementers a way to force sequentially consistent behavior at critical moments in a program. <p> One final example, to drive home why the model is called total store order. In the model, there are local write queues but no caches on the read path. Once a write reaches main memory, all processors not only agree that the value is there but also agree about when it arrived relative to writes from other processors. Consider this litmus test:<blockquote> <p> <i>Litmus Test: Independent Reads of Independent Writes (IRIW)</i><br> Can this program see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code>, <code>r3</code> <code>=</code> <code>1</code>, <code>r4</code> <code>=</code> <code>0</code>?<br> (Can Threads 3 and 4 see <code>x</code> and <code>y</code> change in different orders?)</blockquote> <pre>// Thread 1 // Thread 2 // Thread 3 // Thread 4 x = 1 y = 1 r1 = x r3 = y r2 = y r4 = x </pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): no.</blockquote> <p> If Thread 3 sees <code>x</code> change before <code>y</code>, can Thread 4 see <code>y</code> change before <code>x</code>? For x86 and other TSO machines, the answer is no: there is a <i>total order</i> over all stores (writes) to main memory, and all processors agree on that order, subject to the wrinkle that each processor knows about its own writes before they reach main memory. <a class=anchor href="#path_to_x86-tso"><h2 id="path_to_x86-tso">The Path to x86-TSO</h2></a> <p> The x86-TSO model seems fairly clean, but the path there was full of roadblocks and wrong turns. In the 1990s, the manuals available for the first x86 multiprocessors said next to nothing about the memory model provided by the hardware. <p> As one example of the problems, Plan 9 was one of the first true multiprocessor operating systems (without a global kernel lock) to run on the x86. During the port to the multiprocessor Pentium Pro, in 1997, the developers stumbled over unexpected behavior that boiled down to the write queue litmus test. A subtle piece of synchronization code assumed that <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>0</code> was impossible, and yet it was happening. Worse, the Intel manuals were vague about the memory model details. <p> In response to a mailing list suggestion that “it's better to be conservative with locks than to trust hardware designers to do what we expect,” one of the Plan 9 developers <a href="https://web.archive.org/web/20091124045026/http://9fans.net/archive/1997/04/76">explained the problem well</a>:<blockquote> <p> I certainly agree. We are going to encounter more relaxed ordering in multiprocessors. The question is, what do the hardware designers consider conservative? 
Forcing an interlock at both the beginning and end of a locked section seems to be pretty conservative to me, but I clearly am not imaginative enough. The Pro manuals go into excruciating detail in describing the caches and what keeps them coherent but don't seem to care to say anything detailed about execution or read ordering. The truth is that we have no way of knowing whether we're conservative enough.</blockquote> <p> During the discussion, an architect at Intel gave an informal explanation of the memory model, pointing out that in theory even multiprocessor 486 and Pentium systems could have produced the <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>0</code> result, and that the Pentium Pro simply had larger pipelines and write queues that exposed the behavior more often. <p> The Intel architect also wrote:<blockquote> <p> Loosely speaking, this means the ordering of events originating from any one processor in the system, as observed by other processors, is always the same. However, different observers are allowed to disagree on the interleaving of events from two or more processors. <p> Future Intel processors will implement the same memory ordering model.</blockquote> <p> The claim that “different observers are allowed to disagree on the interleaving of events from two or more processors” is saying that the answer to the IRIW litmus test can be “yes” on x86, even though in the previous section we saw that x86 answers “no.” How can that be? <p> The answer appears to be that Intel processors never actually answered “yes” to that litmus test, but at the time the Intel architects were reluctant to make any guarantee for future processors. What little text existed in the architecture manuals made almost no guarantees at all, making it very difficult to program against. <p> The Plan 9 discussion was not an isolated event. The Linux kernel developers spent over a hundred messages on their mailing list <a href="https://lkml.org/lkml/1999/11/20/76">starting in late November 1999</a> in similar confusion over the guarantees provided by Intel processors. <p> In response to more and more people running into these difficulties over the decade that followed, a group of architects at Intel took on the task of writing down useful guarantees about processor behavior, for both current and future processors. The first result was the “<a href="http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf">Intel 64 Architecture Memory Ordering White Paper</a>”, published in August 2007, which aimed to “provide software writers with a clear understanding of the results that different sequences of memory access instructions may produce.” AMD published a similar description later that year in the <a href="https://courses.cs.washington.edu/courses/cse351/12wi/supp-docs/AMD%20Vol%201.pdf"><i>AMD64 Architecture Programmer's Manual revision 3.14</i></a>. These descriptions were based on a model called “total lock order + causal consistency” (TLO+CC), intentionally weaker than TSO. In public talks, the Intel architects said that TLO+CC was “as strong as required but no stronger.” In particular, the model reserved the right for x86 processors to answer “yes” to the IRIW litmus test.
Unfortunately, the definition of the memory barrier was <a href="http://web.archive.org/web/20080512021617/http://blogs.sun.com/dave/entry/java_memory_model_concerns_on">not strong enough</a> to reestablish sequentially-consistent memory semantics, even with a barrier after every instruction. Even worse, researchers observed actual Intel x86 hardware violating the TLO+CC model. For example:<blockquote> <p> <i>Litmus Test: n6 (Paul Loewenstein)</i><br> Can this program end with <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code>, <code>x</code> <code>=</code> <code>1</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 y = 1 r1 = x x = 2 r2 = y </pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 TLO+CC model (2007): no.<br> On actual x86 hardware: <i>yes!</i><br> On x86 TSO model: <i>yes!</i> (Example from x86-TSO paper.)</blockquote> <p> Revisions to the Intel and AMD specifications later in 2008 guaranteed a “no” to the IRIW case and strengthened the memory barriers but still permitted unexpected behaviors that seem like they could not arise on any reasonable hardware. For example:<blockquote> <p> <i>Litmus Test: n5</i> <br> Can this program end with <code>r1</code> <code>=</code> <code>2</code>, <code>r2</code> <code>=</code> <code>1</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 x = 2 r1 = x r2 = x </pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 specification (2008): <i>yes!</i><br> On actual x86 hardware: no.<br> On x86 TSO model: no. (Example from x86-TSO paper.)</blockquote> <p> To address these problems, Owens <i>et</i> <i>al</i>. <a href="https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tphols.pdf">proposed the x86-TSO model</a>, based on the earlier <a href="https://research.swtch.com/sparcv8.pdf">SPARCv8 TSO model</a>. At the time they claimed that “To the best of our knowledge, x86-TSO is sound, is strong enough to program above, and is broadly in line with the vendors’ intentions.” A few months later Intel and AMD released new manuals broadly adopting this model. <p> It appears that all Intel processors did implement x86-TSO from the start, even though it took a decade for Intel to decide to commit to that. In retrospect, it is clear that the Intel and AMD architects were struggling with exactly how to write a memory model that left room for future processor optimizations while still making useful guarantees for compiler writers and assembly-language programmers. “As strong as required but no stronger” is a difficult balancing act. <a class=anchor href="#relaxed"><h2 id="relaxed">ARM/POWER Relaxed Memory Model</h2></a> <p> Now let's look at an even more relaxed memory model, the one found on ARM and POWER processors. At an implementation level, these two systems are different in many ways, but the guaranteed memory consistency model turns out to be roughly similar, and quite a bit weaker than x86-TSO or even x86-TLO+CC. <p> The conceptual model for ARM and POWER systems is that each processor reads from and writes to its own complete copy of memory, and each write propagates to the other processors independently, with reordering allowed as the writes propagate. <p> <img name="mem-weak" class="center pad" width=308 height=294 src="mem-weak.png" srcset="mem-weak.png 1x, mem-weak@2x.png 2x, mem-weak@3x.png 3x, mem-weak@4x.png 4x"> <p> Here, there is no total store order. 
Not depicted, each processor is also allowed to postpone a read until it needs the result: a read can be delayed until after a later write. In this relaxed model, the answer to every litmus test we’ve seen so far is “yes, that really can happen.” <p> For the original message passing litmus test, the reordering of writes by a single processor means that Thread 1's writes may not be observed by other threads in the same order:<blockquote> <p> <i>Litmus Test: Message Passing</i><br> Can this program see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 r1 = y y = 1 r2 = x </pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): no.<br> On ARM/POWER: <i>yes!</i></blockquote> <p> In the ARM/POWER model, we can think of thread 1 and thread 2 each having their own separate copy of memory, with writes propagating between the memories in any order whatsoever. If thread 1's memory sends the update of <code>y</code> to thread 2 before sending the update of <code>x</code>, and if thread 2 executes between those two updates, it will indeed see the result <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code>. <p> This result shows that the ARM/POWER memory model is weaker than TSO: it makes fewer requirements on the hardware. The ARM/POWER model still admits the kinds of reorderings that TSO does:<blockquote> <p> <i>Litmus Test: Store Buffering</i><br> Can this program see <code>r1</code> <code>=</code> <code>0</code>, <code>r2</code> <code>=</code> <code>0</code>?</blockquote> <pre>// Thread 1 // Thread 2 x = 1 y = 1 r1 = y r2 = x </pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): <i>yes!</i><br> On ARM/POWER: <i>yes!</i></blockquote> <p> On ARM/POWER, the writes to <code>x</code> and <code>y</code> might be made to the local memories but not yet have propagated when the reads occur on the opposite threads. <p> Here’s the litmus test that showed what it meant for x86 to have a total store order:<blockquote> <p> <i>Litmus Test: Independent Reads of Independent Writes (IRIW)</i> <br> Can this program see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>0</code>, <code>r3</code> <code>=</code> <code>1</code>, <code>r4</code> <code>=</code> <code>0</code>?<br> (Can Threads 3 and 4 see <code>x</code> and <code>y</code> change in different orders?)</blockquote> <pre>// Thread 1 // Thread 2 // Thread 3 // Thread 4 x = 1 y = 1 r1 = x r3 = y r2 = y r4 = x </pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): no.<br> On ARM/POWER: <i>yes!</i></blockquote> <p> On ARM/POWER, different threads may learn about different writes in different orders. They are not guaranteed to agree about a total order of writes reaching main memory, so Thread 3 can see <code>x</code> change before <code>y</code> while Thread 4 sees <code>y</code> change before <code>x</code>. <p> As another example, ARM/POWER systems have visible buffering or reordering of memory reads (loads), as demonstrated by this litmus test:<blockquote> <p> <i>Litmus Test: Load Buffering</i> <br> Can this program see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>1</code>? 
<br> (Can each thread's read happen <i>after</i> the other thread's write?)</blockquote> <pre>// Thread 1    // Thread 2
r1 = x         r2 = y
y = 1          x = 1
</pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): no.<br> On ARM/POWER: <i>yes!</i></blockquote> <p> Any sequentially consistent interleaving must start with either thread 1's <code>r1</code> <code>=</code> <code>x</code> or thread 2's <code>r2</code> <code>=</code> <code>y</code>. That read must see a zero, making the outcome <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>1</code> impossible. In the ARM/POWER memory model, however, processors are allowed to delay reads until after writes later in the instruction stream, so that <code>y</code> <code>=</code> <code>1</code> and <code>x</code> <code>=</code> <code>1</code> execute <i>before</i> the two reads. <p> Although both the ARM and POWER memory models allow this result, Maranget <i>et</i> <i>al</i>. <a href="https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf">reported (in 2012)</a> being able to reproduce it empirically only on ARM systems, never on POWER. Here the divergence between model and reality comes into play just as it did when we examined Intel x86: hardware implementing a stronger model than technically guaranteed encourages dependence on the stronger behavior and means that future, weaker hardware will break programs, validly or not. <p> Like on TSO systems, ARM and POWER have barriers that we can insert into the examples above to force sequentially consistent behaviors. But the obvious question is whether ARM/POWER without barriers excludes any behavior at all. Can the answer to any litmus test ever be “no, that can’t happen?” It can, when we focus on a single memory location. <p> Here’s a litmus test for something that can’t happen even on ARM and POWER:<blockquote> <p> <i>Litmus Test: Coherence</i><br> Can this program see <code>r1</code> <code>=</code> <code>1</code>, <code>r2</code> <code>=</code> <code>2</code>, <code>r3</code> <code>=</code> <code>2</code>, <code>r4</code> <code>=</code> <code>1</code>?<br> (Can Thread 3 see <code>x</code> <code>=</code> <code>1</code> before <code>x</code> <code>=</code> <code>2</code> while Thread 4 sees the reverse?)</blockquote> <pre>// Thread 1    // Thread 2    // Thread 3    // Thread 4
x = 1          x = 2          r1 = x         r3 = x
                              r2 = x         r4 = x
</pre> <blockquote> <p> On sequentially consistent hardware: no.<br> On x86 (or other TSO): no.<br> On ARM/POWER: no.</blockquote> <p> This litmus test is like the IRIW test above, but now both threads are writing to a single variable <code>x</code> instead of two distinct variables <code>x</code> and <code>y</code>. Threads 1 and 2 write conflicting values 1 and 2 to <code>x</code>, while Thread 3 and Thread 4 both read <code>x</code> twice. If Thread 3 sees <code>x</code> <code>=</code> <code>1</code> overwritten by <code>x</code> <code>=</code> <code>2</code>, can Thread 4 see the opposite? <p> The answer is no, even on ARM/POWER: threads in the system must agree about a total order for the writes to a single memory location. That is, threads must agree which writes overwrite other writes. This property is called <i>coherence</i>. Without the coherence property, processors either disagree about the final result of memory or else report a memory location flip-flopping from one value to another and back to the first. It would be very difficult to program such a system.
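<p> (These litmus tests can also be probed empirically. Below is a deliberately racy Go sketch of the message-passing test—an illustration of mine, not part of any official tooling. It is not valid synchronized Go: the races are the point, the race detector will rightly flag it, and the Go compiler itself may reorder the racy accesses, so a “weak” result does not by itself isolate hardware reordering.) <pre>// Deliberately racy sketch of the message-passing litmus test.
// NOT valid synchronized Go: the races are intentional, because we
// are probing what the compiler and hardware actually do. Results
// vary by architecture and prove nothing on their own.
package main

import "fmt"

var x, y int

func main() {
	start1 := make(chan bool)
	start2 := make(chan bool)
	done := make(chan bool)
	var r1, r2 int

	go func() {
		for range start1 { // Thread 1
			x = 1
			y = 1
			done &lt;- true
		}
	}()
	go func() {
		for range start2 { // Thread 2 (races with Thread 1)
			r1 = y
			r2 = x
			done &lt;- true
		}
	}()

	const n = 100000
	weak := 0
	for i := 0; i &lt; n; i++ {
		x, y = 0, 0 // reset; ordered by the channel operations below
		start1 &lt;- true
		start2 &lt;- true
		&lt;-done
		&lt;-done
		if r1 == 1 &amp;&amp; r2 == 0 {
			weak++
		}
	}
	fmt.Printf("r1=1, r2=0 in %d of %d runs\n", weak, n)
}
</pre> <p> On TSO hardware the weak outcome should not come from the hardware itself, while on ARM it may well appear—but, again, compiler reordering can produce it anywhere, which is part of why racy programs are so hard to reason about.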
<p> I'm purposely leaving out a lot of subtleties in the ARM and POWER weak memory models. For more detail, see any of <a href="https://www.cl.cam.ac.uk/~pes20/papers/topics.html#Power_and_ARM">Peter Sewell's papers on the topic</a>. Also, ARMv8 <a href="https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pdf">strengthened the memory model</a> by making it “multicopy atomic,” but I won't take the space here to explain exactly what that means. <p> There are two important points to take away. First, there is an incredible amount of subtlety here, the subject of well over a decade of academic research by very persistent, very smart people. I don't claim to understand anywhere near all of it myself. This is not something we should hope to explain to ordinary programmers, not something that we can hope to keep straight while debugging ordinary programs. Second, the gap between what is allowed and what is observed makes for unfortunate future surprises. If current hardware does not exhibit the full range of allowed behaviors—especially when it is difficult to reason about what is allowed in the first place!—then inevitably programs will be written that accidentally depend on the more restricted behaviors of the actual hardware. If a new chip is less restricted in its behaviors, the fact that the new behavior breaking your program is technically allowed by the hardware memory model—that is, the bug is technically your fault—is of little consolation. This is no way to write programs. <a class=anchor href="#drf"><h2 id="drf">Weak Ordering and Data-Race-Free Sequential Consistency</h2></a> <p> By now I hope you're convinced that the hardware details are complex and subtle and not something you want to work through every time you write a program. Instead, it would help to identify shortcuts of the form “if you follow these easy rules, your program will only produce results as if by some sequentially consistent interleaving.” (We're still talking about hardware, so we're still talking about interleaving individual assembly instructions.) <p> Sarita Adve and Mark Hill proposed exactly this approach in their 1990 paper “<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.5567">Weak Ordering – A New Definition</a>”. They defined “weakly ordered” as follows.<blockquote> <p> Let a synchronization model be a set of constraints on memory accesses that specify how and when synchronization needs to be done. <p> Hardware is weakly ordered with respect to a synchronization model if and only if it appears sequentially consistent to all software that obeys the synchronization model.</blockquote> <p> Although their paper was about capturing the hardware designs of that time (not x86, ARM, and POWER), the idea of elevating the discussion above specific designs keeps the paper relevant today. <p> I said before that “valid optimizations do not change the behavior of valid programs.” The rules define what valid means, and then any hardware optimizations have to keep those programs working as they might on a sequentially consistent machine. Of course, the interesting details are the rules themselves, the constraints that define what it means for a program to be valid. <p> Adve and Hill propose one synchronization model, which they call <i>data-race-free (DRF)</i>. This model assumes that hardware has memory synchronization operations separate from ordinary memory reads and writes.
Ordinary memory reads and writes may be reordered between synchronization operations, but they may not be moved across them. (That is, the synchronization operations also serve as barriers to reordering.) A program is said to be data-race-free if, for all idealized sequentially consistent executions, any two ordinary memory accesses to the same location from different threads are either both reads or else separated by synchronization operations forcing one to happen before the other. <p> Let’s look at some examples, taken from Adve and Hill's paper (redrawn for presentation). Here is a single thread that executes a write of variable <code>x</code> followed by a read of the same variable. <p> <img name="mem-adve-1" class="center pad" width=80 height=101 src="mem-adve-1.png" srcset="mem-adve-1.png 1x, mem-adve-1@2x.png 2x, mem-adve-1@3x.png 3x, mem-adve-1@4x.png 4x"> <p> The vertical arrow marks the order of execution within a single thread: the write happens, then the read. There is no race in this program, since everything is in a single thread. <p> In contrast, there is a race in this two-thread program: <p> <img name="mem-adve-2" class="center pad" width=208 height=101 src="mem-adve-2.png" srcset="mem-adve-2.png 1x, mem-adve-2@2x.png 2x, mem-adve-2@3x.png 3x, mem-adve-2@4x.png 4x"> <p> Here, thread 2 writes to x without coordinating with thread 1. Thread 2's write <i>races</i> with both the write and the read by thread 1. If thread 2 were reading x instead of writing it, the program would have only one race, between the write in thread 1 and the read in thread 2. Every race involves at least one write: two uncoordinated reads do not race with each other. <p> To avoid races, we must add synchronization operations, which force an order between operations on different threads sharing a synchronization variable. If the synchronization S(a) (synchronizing on variable a, marked by the dashed arrow) forces thread 2's write to happen after thread 1 is done, the race is eliminated: <p> <img name="mem-adve-3" class="center pad" width=208 height=192 src="mem-adve-3.png" srcset="mem-adve-3.png 1x, mem-adve-3@2x.png 2x, mem-adve-3@3x.png 3x, mem-adve-3@4x.png 4x"> <p> Now the write by thread 2 cannot happen at the same time as thread 1's operations. <p> If thread 2 were only reading, we would only need to synchronize with thread 1's write. The two reads can still proceed concurrently: <p> <img name="mem-adve-4" class="center pad" width=208 height=147 src="mem-adve-4.png" srcset="mem-adve-4.png 1x, mem-adve-4@2x.png 2x, mem-adve-4@3x.png 3x, mem-adve-4@4x.png 4x"> <p> Threads can be ordered by a sequence of synchronizations, even using an intermediate thread. This program has no race: <p> <img name="mem-adve-5" class="center pad" width=336 height=192 src="mem-adve-5.png" srcset="mem-adve-5.png 1x, mem-adve-5@2x.png 2x, mem-adve-5@3x.png 3x, mem-adve-5@4x.png 4x"> <p> On the other hand, the use of synchronization variables does not by itself eliminate races: it is possible to use them incorrectly. This program does have a race: <p> <img name="mem-adve-6" class="center pad" width=336 height=192 src="mem-adve-6.png" srcset="mem-adve-6.png 1x, mem-adve-6@2x.png 2x, mem-adve-6@3x.png 3x, mem-adve-6@4x.png 4x"> <p> Thread 2's read is properly synchronized with the writes in the other threads—it definitely happens after both—but the two writes are not themselves synchronized. This program is <i>not</i> data-race-free. 
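<p> To make the pictures concrete, here is a rough Go rendering of the racy and race-free two-thread examples—my own illustration, not from the paper—with a channel standing in for the synchronization variable S(a):<pre>package main

import "fmt"

var x int

func main() {
	// Racy version: thread 2's write would not be ordered against
	// thread 1's accesses, so this variant is not data-race-free.
	//
	//	go func() { x = 2 }()
	//	x = 1
	//	fmt.Println(x)

	// Data-race-free version: the synchronization on s forces
	// thread 2's write to happen after thread 1's accesses.
	s := make(chan int) // plays the role of S(a)
	done := make(chan bool)
	go func() { // thread 2
		&lt;-s   // S(a): wait until thread 1 is done with x
		x = 2 // no longer races
		done &lt;- true
	}()
	x = 1          // thread 1's write
	fmt.Println(x) // thread 1's read: prints 1
	s &lt;- 1         // S(a): release thread 2
	&lt;-done
	fmt.Println(x) // prints 2
}
</pre> <p> Run under the race detector (<code>go</code> <code>run</code> <code>-race</code>), the commented-out variant can be reported as a race, while the synchronized version is data-race-free by construction: every pair of conflicting accesses to <code>x</code> is ordered by the channel operations.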
<p> Adve and Hill presented weak ordering as “a contract between software and hardware,” specifically that if software avoids data races, then hardware acts as if it is sequentially consistent, which is easier to reason about than the models we were examining in the earlier sections. But how can hardware satisfy its end of the contract? <p> Adve and Hill gave a proof that hardware “is weakly ordered by DRF,” meaning it executes data-race-free programs as if by a sequentially consistent ordering, provided it meets a certain set of minimum requirements. I’m not going to go through the details, but the point is that after the Adve and Hill paper, hardware designers had a cookbook recipe backed by a proof: do these things, and you can assert that your hardware will appear sequentially consistent to data-race-free programs. And in fact, most relaxed hardware did behave this way and has continued to do so, assuming appropriate implementations of the synchronization operations. Adve and Hill were concerned originally with the VAX, but certainly x86, ARM, and POWER can satisfy these constraints too. This idea that a system guarantees to data-race-free programs the appearance of sequential consistency is often abbreviated <i>DRF-SC</i>. <p> DRF-SC marked a turning point in hardware memory models, providing a clear strategy for both hardware designers and software authors, at least those writing software in assembly language. As we will see in the next post, the question of a memory model for a higher-level programming language does not have as neat and tidy an answer. <p> The next post in this series is about <a href="plmm">programming language memory models</a>. <a class=anchor href="#acknowledgements"><h2 id="acknowledgements">Acknowledgements</h2></a> <p> This series of posts benefited greatly from discussions with and feedback from a long list of engineers I am lucky to work with at Google. My thanks to them. I take full responsibility for any mistakes or unpopular opinions. Memory Models tag:research.swtch.com,2012:research.swtch.com/mm 2021-06-29T12:48:00-04:00 2021-06-29T12:50:00-04:00 Topic Index <p> These are the posts in the “Memory Models” series that began in June 2021: <ul> <li> “<a href="hwmm">Hardware Memory Models</a>” [<a href="hwmm.pdf">PDF</a>]. <li> “<a href="plmm">Programming Language Memory Models</a>” [<a href="plmm.pdf">PDF</a>]. <li> “<a href="gomm">Updating the Go Memory Model</a>” [<a href="gomm.pdf">PDF</a>].</ul> The Principles of Versioning in Go tag:research.swtch.com,2012:research.swtch.com/vgo-principles 2019-12-03T14:00:00-05:00 2019-12-03T14:02:00-05:00 The rationale behind the Go modules design. (Go & Versioning, Part 11) <p> This blog post is about how we added package versioning to Go, in the form of Go modules, and the reasons we made the choices we did. It is adapted and updated from a <a href="https://www.youtube.com/watch?v=F8nrpe0XWRg">talk I gave at GopherCon Singapore in 2018</a>. <a class=anchor href="#why"><h2 id="why">Why Versions?</h2></a> <p> To start, let’s make sure we’re all on the same page, by taking a look at the ways the GOPATH-based <code>go</code> <code>get</code> breaks. <p> Suppose we have a fresh Go installation and we want to write a program that imports D. We run <code>go</code> <code>get</code> <code>D</code>. Remember that we are using the original GOPATH-based <code>go</code> <code>get</code>, not Go modules.
<pre>$ go get D </pre> <p> <img name="vgo-why-1" class="center pad" width=200 height=39 src="vgo-why-1.png" srcset="vgo-why-1.png 1x, vgo-why-1@1.5x.png 1.5x, vgo-why-1@2x.png 2x, vgo-why-1@3x.png 3x, vgo-why-1@4x.png 4x"> <p> That looks up and downloads the latest version of D, which right now is D 1.0. It builds. We’re happy. <p> Now suppose a few months later we need C. We run <code>go</code> <code>get</code> <code>C</code>. That looks up and downloads the latest version of C, which is C 1.8. <pre>$ go get C </pre> <p> <img name="vgo-why-2" class="center pad" width=201 height=96 src="vgo-why-2.png" srcset="vgo-why-2.png 1x, vgo-why-2@1.5x.png 1.5x, vgo-why-2@2x.png 2x, vgo-why-2@3x.png 3x, vgo-why-2@4x.png 4x"> <p> C imports D, but <code>go</code> <code>get</code> finds that it has already downloaded a copy of D, so it reuses that copy. Unfortunately, that copy is still D 1.0. The latest copy of C was written using D 1.4, which contains a feature or maybe a bug fix that C needs and which was missing from D 1.0. So C is broken, because the dependency D is too old. <p> Since the build failed, we try again, with <code>go</code> <code>get</code> <code>-u</code> <code>C</code>. <pre>$ go get -u C </pre> <p> <img name="vgo-why-3" class="center pad" width=201 height=104 src="vgo-why-3.png" srcset="vgo-why-3.png 1x, vgo-why-3@1.5x.png 1.5x, vgo-why-3@2x.png 2x, vgo-why-3@3x.png 3x, vgo-why-3@4x.png 4x"> <p> Unfortunately, an hour ago D’s author published D 1.6. Because <code>go</code> <code>get</code> <code>-u</code> uses the latest version of every dependency, including D, it turns out that C is still broken. C’s author used D 1.4, which worked fine, but D 1.6 has introduced a bug that keeps C from working properly. Before, C was broken because D was too old. Now, C is broken because D is too new. <p> Those are the two ways that <code>go</code> <code>get</code> fails when using GOPATH. Sometimes it uses dependencies that are too old. Other times it uses dependencies that are too new. What we really want in this case is the version of D that C’s author used and tested against. But GOPATH-based <code>go</code> <code>get</code> can’t do that, because it has no awareness of package versions at all. <p> Go programmers started asking for better handling of package versions as soon as we published <code>goinstall</code>, the original name for <code>go</code> <code>get</code>. Various tools were written over many years, separate from the Go distribution, to help make installing specific versions easier. But because those tools did not agree on a single approach, they didn’t work as a base for creating other version-aware tools, such as a version-aware godoc or a version-aware vulnerability checker. <p> We needed to add the concept of package versions to Go for many reasons. The most pressing reason was to help <code>go</code> <code>get</code> stop using code that’s too old or too new, but having an agreed-upon meaning of versions in the vocabulary of Go developers and tools enables the entire Go ecosystem to become version-aware. The <a href="https://blog.golang.org/module-mirror-launch">Go module mirror and checksum database</a>, which safely speed up Go package downloads, and the new <a href="https://blog.golang.org/go.dev#Explore">version-aware Go package discovery site</a> are both made possible by an ecosystem-wide understanding of what a version is. 
<a class=anchor href="#eng"><h2 id="eng">Versions for Software Engineering</h2></a> <p> Over the past two years, we have added support for package versions to Go itself, in the form of Go modules, built into the <code>go</code> command. Go modules introduce a new import path syntax called semantic import versioning, along with a new algorithm for selecting which versions to use, called minimal version selection. <p> You might wonder: Why not do what other languages do? Java has Maven, Node has NPM, Ruby has Bundler, Rust has Cargo. How is this not a solved problem? <p> You might also wonder: We introduced a new, experimental Go tool called Dep in early 2018 that implemented the general approach pioneered by Bundler and Cargo. Why did Go modules not reuse Dep’s design? <p> The answer is that we learned from Dep that the general Bundler/Cargo/Dep approach includes some decisions that make software engineering more complex and more challenging. Thanks to learning about the problems in Dep’s design, the Go modules design made different decisions, to make software engineering simpler and easier instead. <p> But what is software engineering? How is software engineering different from programming? I like <a href="https://research.swtch.com/vgo-eng">the following definition</a>:<blockquote> <p> <i>Software engineering is what happens to programming <br>when you add time and other programmers.</i></blockquote> <p> Programming means getting a program working. You have a problem to solve, you write some Go code, you run it, you get your answer, you’re done. That’s programming, and that’s difficult enough by itself. <p> But what if that code has to keep working, day after day? What if five other programmers need to work on the code too? What if the code must adapt gracefully as requirements change? Then you start to think about version control systems, to track how the code changes over time and to coordinate with the other programmers. You add unit tests, to make sure bugs you fix are not reintroduced over time, not by you six months from now, and not by that new team member who’s unfamiliar with the code. You think about modularity and design patterns, to divide the program into parts that team members can work on mostly independently. You use tools to help you find bugs earlier. You look for ways to make programs as clear as possible, so that bugs are less likely. You make sure that small changes can be tested quickly, even in large programs. You’re doing all of this because your programming has turned into software engineering. <p> (This definition and explanation of software engineering is my riff on an original theme by my Google colleague Titus Winters, whose preferred phrasing is “software engineering is programming integrated over time.” It’s worth seven minutes of your time to see <a href="https://www.youtube.com/watch?v=tISy7EJQPzI&t=8m17s">his presentation of this idea at CppCon 2017</a>, from 8:17 to 15:00 in the video.) <p> Nearly all of Go’s distinctive design decisions were motivated by concerns about software engineering. For example, most people think that we format Go code with <code>gofmt</code> to make code look nicer or to end debates among team members about program layout. And to some degree we do. But the more important reason for <code>gofmt</code> is that if an algorithm defines how Go source code is formatted, then programs, like <code>goimports</code> or <code>gorename</code> or <code>go</code> <code>fix</code>, can edit the source code more easily.
This helps you maintain code over time. <p> As another example, Go import paths are URLs. If code imported <code>"uuid"</code>, you’d have to ask which <code>uuid</code> package. Searching for <code>uuid</code> on <a href="https://pkg.go.dev/"><i>pkg.go.dev</i></a> turns up dozens of packages with that name. If instead the code imports <code>"github.com/google/uuid"</code>, now it’s clear which package we mean. Using URLs avoids ambiguity and also reuses an existing mechanism for giving out names, making it simpler and easier to coordinate with other programmers. Continuing the example, Go import paths are written in Go source files, not in a separate build configuration file. This makes Go source files self-contained, which makes it easier to understand, modify, and copy them. These decisions were all made toward the goal of simplifying software engineering. <a class=anchor href="#principles"><h2 id="principles">Principles</h2></a> <p> There are three broad principles behind the changes from Dep’s design to Go modules, all motivated by wanting to simplify software engineering. These are the principles of compatibility, repeatability, and cooperation. The rest of this post explains each principle, shows how it led us to make a different decision for Go modules than in Dep, and then responds, as fairly as I can, to objections against making that change. <a class=anchor href="#compatibility"><h2 id="compatibility">Principle #1: Compatibility</h2></a> <blockquote> <p> <i>The meaning of a name in a program should not change over time.</i></blockquote> <p> The first principle is compatibility. Compatibility—or, if you prefer, stability—is the idea that, in a program, the meaning of a name should not change over time. If a name meant one thing last year, it should mean the same thing this year and next year. <p> For example, programmers are sometimes confused by a detail of <code>strings.Split</code>. We all expect that splitting “<code>hello</code> <code>world</code>” produces two strings “<code>hello</code>” and “<code>world</code>.” But if the input has leading, trailing, or repeated spaces, the result contains empty strings too. <pre>Example: strings.Split(x, " ")

"hello world"    =&gt; {"hello", "world"}
"hello  world"   =&gt; {"hello", "", "world"}
" hello world"   =&gt; {"", "hello", "world"}
"hello world "   =&gt; {"hello", "world", ""}
</pre> <p> Suppose we decide that it would be better overall to change the behavior of <code>strings.Split</code> to omit those empty strings. Can we do that? <p> No. <p> We’ve given <code>strings.Split</code> a specific meaning. The documentation and the implementation agree on that meaning. Programs depend on that meaning. Changing the meaning would break those programs. It would break the principle of compatibility. <p> We <i>can</i> implement the new meaning; we just need to give a new name too. In fact, years ago, to solve this exact problem, we introduced <code>strings.Fields</code>, which is tailored to space-separated fields and never returns empty strings. <pre>Example: strings.Fields(x)

"hello world"    =&gt; {"hello", "world"}
"hello  world"   =&gt; {"hello", "world"}
" hello world"   =&gt; {"hello", "world"}
"hello world "   =&gt; {"hello", "world"}
</pre> <p> We didn’t redefine <code>strings.Split</code>, because we were concerned about compatibility. <p> Following the principle of compatibility simplifies software engineering, because it lets you ignore time when trying to understand programming.
People don’t have to think, “well this package was written in 2015, back when <code>strings.Split</code> returned empty strings, but this other package was written last week, so it expects <code>strings.Split</code> to leave them out.” And not just people. Tools don’t have to worry about time either. For example, a refactoring tool can always move a <code>strings.Split</code> call from one package to another without worrying that it will change its meaning. <p> In fact, the most important feature of Go 1 was not a language change or a new library feature. It was the declaration of compatibility:<blockquote> <p> It is intended that programs written to the Go 1 specification will continue to compile and run correctly, unchanged, over the lifetime of that specification. Go programs that work today should continue to work even as future “point” releases of Go 1 arise (Go 1.1, Go 1.2, etc.). <p> — <a href="https://golang.org/doc/go1compat"><i>golang.org/doc/go1compat</i></a></blockquote> <p> We committed that we would stop changing the meaning of names in the standard library, so that programs working with Go 1.1 could be expected to continue working in Go 1.2, and so on. That ongoing commitment makes it easy for users to write code and keep it working even as they upgrade to newer Go versions to get faster implementations and new features. <p> What does compatibility have to do with versioning? It’s important to think about compatibility because the most popular approach to versioning today—<a href="https://semver.org">semantic versioning</a>—instead encourages <i>incompatibility</i>. That is, semantic versioning has the unfortunate effect of making incompatible changes seem easy. <p> Every semantic version takes the form vMAJOR.MINOR.PATCH. If two versions have the same major number, the later (if you like, greater) version is expected to be backwards compatible with the earlier (lesser) one. But if two versions have different major numbers, they have no expected compatibility relationship. <p> Semantic versioning seems to suggest, “It’s okay to make incompatible changes to your packages. Tell your users about them by incrementing the major version number. Everything will be fine.” But this is an empty promise. Incrementing the major version number isn’t enough. Everything is not fine. If <code>strings.Split</code> has one meaning today and a different meaning tomorrow, simply reading your code is now software engineering, not programming, because you need to think about time. <p> It gets worse. <p> Suppose B is written to expect <code>strings.Split</code> v1, while C is written to expect <code>strings.Split</code> v2. That’s fine if you build each by itself. <p> <img name="vgo-why-4" class="center pad" width=312 height=60 src="vgo-why-4.png" srcset="vgo-why-4.png 1x, vgo-why-4@1.5x.png 1.5x, vgo-why-4@2x.png 2x, vgo-why-4@3x.png 3x, vgo-why-4@4x.png 4x"> <p> But what happens when your package A imports both B and C? If <code>strings.Split</code> has to have just one meaning, there’s no way to build a working program. <p> <img name="vgo-why-5" class="center pad" width=282 height=94 src="vgo-why-5.png" srcset="vgo-why-5.png 1x, vgo-why-5@1.5x.png 1.5x, vgo-why-5@2x.png 2x, vgo-why-5@3x.png 3x, vgo-why-5@4x.png 4x"> <p> For the Go modules design, we realized that the principle of compatibility is absolutely essential to simplifying software engineering and must be supported, encouraged, and followed. 
The Go FAQ has encouraged compatibility since Go 1.2 in November 2013:<blockquote> <p> Packages intended for public use should try to maintain backwards compatibility as they evolve. The <a href="https://golang.org/doc/go1compat.html">Go 1 compatibility guidelines</a> are a good reference here: don’t remove exported names, encourage tagged composite literals, and so on. If different functionality is required, add a new name instead of changing an old one. If a complete break is required, create a new package with a new import path.</blockquote> <p> For Go modules, we gave this old advice a new name, the <i>import compatibility rule</i>:<blockquote> <p> <i>If an old package and a new package have the same import path,<br> the new package must be backwards compatible with the old package.</i></blockquote> <p> But then what do we do about semantic versioning? If we still want to use semantic versioning, as many users expect, then the import compatibility rule requires that different semantic major versions, which by definition have no compatibility relationship, must use different import paths. The way to do that in Go modules is to put the major version in the import path. We call this <i>semantic import versioning</i>. <p> <img name="impver" class="center pad" width=458 height=223 src="impver.png" srcset="impver.png 1x, impver@1.5x.png 1.5x, impver@2x.png 2x, impver@3x.png 3x, impver@4x.png 4x"> <p> In this example, <code>my/thing/v2</code> identifies semantic version 2 of a particular module. Version 1 was just <code>my/thing</code>, with no explicit version in the module path. But when you introduce major version 2 or larger, you have to add the version after the module name, to distinguish from version 1 and other major versions, so version 2 is <code>my/thing/v2</code>, version 3 is <code>my/thing/v3</code>, and so on. <p> If the <code>strings</code> package were its own module, and if for some reason we really needed to redefine <code>Split</code> instead of adding a new function <code>Fields</code>, then we could create <code>strings</code> (major version 1) and <code>strings/v2</code> (major version 2), with different <code>Split</code> functions. Then the unbuildable program from before can be built: B says <code>import</code> <code>"strings"</code> while C says <code>import</code> <code>"strings/v2"</code>. Those are different packages, so it’s okay to build both into the program. And now B and C can each have the <code>Split</code> function they expect. <p> <img name="vgo-why-6" class="center pad" width=299 height=94 src="vgo-why-6.png" srcset="vgo-why-6.png 1x, vgo-why-6@1.5x.png 1.5x, vgo-why-6@2x.png 2x, vgo-why-6@3x.png 3x, vgo-why-6@4x.png 4x"> <p> Because <code>strings</code> and <code>strings/v2</code> have different import paths, people and tools automatically understand that they name different packages, just as people already understand that <code>crypto/rand</code> and <code>math/rand</code> name different packages. No one needs to learn a new disambiguation rule. <p> Let’s return to the unbuildable program, not using semantic import versioning. If we replace <code>strings</code> in this example with an arbitrary package D, then we have a classic “diamond dependency problem.” Both B and C build fine by themselves, but with different, conflicting requirements for D. If we try to use both in a build of A, then there’s no single choice of D that works.
<p> <img name="vgo-why-7" class="center pad" width=282 height=94 src="vgo-why-7.png" srcset="vgo-why-7.png 1x, vgo-why-7@1.5x.png 1.5x, vgo-why-7@2x.png 2x, vgo-why-7@3x.png 3x, vgo-why-7@4x.png 4x"> <p> Semantic import versioning cuts through diamond dependencies. There’s no such thing as conflicting requirements for D. D version 1.3 must be backwards compatible with D version 1.2, and D version 2.0 has a different import path, D/v2. <p> <img name="vgo-why-8" class="center pad" width=289 height=94 src="vgo-why-8.png" srcset="vgo-why-8.png 1x, vgo-why-8@1.5x.png 1.5x, vgo-why-8@2x.png 2x, vgo-why-8@3x.png 3x, vgo-why-8@4x.png 4x"> <p> A program using both major versions keeps them as separate as any two packages with different import paths and builds fine. <a class=anchor href="#aesthetics"><h2 id="aesthetics">Objection: Aesthetics</h2></a> <p> The most common objection to semantic import versioning is that people don’t like seeing the major versions in the import paths. In short, they’re ugly. Of course, what this really means is only that people are not used to seeing the major version in import paths. <p> I can think of two examples of major aesthetic shifts in Go code that seemed ugly at the time but were adopted because they simplified software engineering and now look completely natural. <p> The first example is export syntax. Back in early 2009, Go used an <code>export</code> keyword to mark a function as exported. We knew we needed something more lightweight to mark individual struct fields, and we were casting about for ideas, considering things like “leading underscore means unexported” or “leading plus in declaration means export.” Eventually we hit on the “upper-case for export” idea. Using an upper-case letter as the export signal looked strange to us, but that was the only drawback we could find. Otherwise, the idea was sound, it satisfied our goals, and it was more appealing than the other choices we’d been considering. So we adopted it. I remember thinking that changing <code>fmt.printf</code> to <code>fmt.Printf</code> in my code was ugly, or at least jarring: to me, <code>fmt.Printf</code> didn’t look like Go, at least not the Go I had been writing. But I had no good argument against it, so I went along with (and implemented) the change. After a few weeks, I got used to it, and now it is <code>fmt.printf</code> that doesn’t look like Go to me. What’s more, I came to appreciate the precision about what is and isn’t exported when reading code. When I go back to C++ or Java code now and I see a call like <code>x.dangerous()</code>, I miss being able to tell at a glance whether the <code>dangerous</code> method is a public method that anyone can call. <p> The second example is import paths, which I mentioned briefly earlier. In the early days of Go, before <code>goinstall</code> and <code>go</code> <code>get</code>, import paths were not full URLs. A developer had to manually download and install a package named <code>uuid</code> and then would write <code>import</code> <code>"uuid"</code>. Changing to URLs for import paths (<code>import</code> <code>"github.com/google/uuid"</code>) eliminated this ambiguity, and the added precision made <code>go</code> <code>get</code> possible. People did complain at first, but now the longer paths are second nature to us. We rely on and appreciate their precision, because it makes our software engineering work simpler.
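<p> To see semantic import versioning in action today, here is a complete program—a small illustration of mine—that uses two major versions of the <code>rsc.io/quote</code> module side by side: <pre>package main

import (
	"fmt"

	quoteV1 "rsc.io/quote"    // major version 1
	quoteV3 "rsc.io/quote/v3" // major version 3
)

func main() {
	fmt.Println(quoteV1.Hello())   // v1 API
	fmt.Println(quoteV3.HelloV3()) // v3 API
}
</pre> <p> Running <code>go</code> <code>mod</code> <code>tidy</code> fills in requirements on both major versions, and the build treats them as the distinct packages they are.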
<p> Both these changes—upper-case for export and full URLs for import paths—were motivated by good software engineering arguments to which the only real objection was visual aesthetics. Over time we came to appreciate the benefits, and our aesthetic judgements adapted. I expect the same to happen with major versions in import paths. We’ll get used to them, and we’ll come to value the precision and simplicity they bring. <a class=anchor href="#update"><h2 id="update">Objection: Updating Import Paths</h2></a> <p> Another common objection is that upgrading from (say) v2 of a module to v3 of the same module requires changing all the import paths referring to that module, even if the client code doesn’t need any other changes. <p> It’s true that the upgrade requires rewriting import paths, but it’s also easy to write a tool to do a global search and replace. We intend to make it possible to handle such upgrades with <code>go</code> <code>fix</code>, although we haven’t implemented that yet. <p> Both the previous objection and this one implicitly suggest keeping the major version information only in a separate version metadata file. If we do that, then an import path won’t be precise enough to identify semantics, like back when <code>import "uuid"</code> might have meant any one of dozens of different packages. All programmers and tools will have to look in the metadata file to answer the question: which major version is this? Which <code>strings.Split</code> am I calling? What happens when I copy a file from one module to another and forget to check the metadata file? If instead we keep import paths semantically precise, then programmers and tools don’t need to be taught a new way to keep different major versions of a package separate. <p> Another benefit of having the major version in the import path is that when you do update from v2 to v3 of a package, you can <a href="https://talks.golang.org/2016/refactor.article">update your program gradually</a>, in stages, maybe one package at a time, and it’s always clear which code has been converted and which has not. <a class=anchor href="#multiple"><h2 id="multiple">Objection: Multiple Major Versions in a Build</h2></a> <p> Another common objection is that having D v1 and D v2 in the same build should be disallowed entirely. That way, D’s author won’t have to think about the complexities that arise from that situation. For example, maybe package D defines a command line flag or registers an HTTP handler, so that building both D v1 and D v2 into a single program would fail without explicit coordination between those versions. <p> Dep enforces exactly this restriction, and some people say it is simpler. But this is simplicity only for D’s author. It’s not simplicity for D’s users, and normally users outnumber authors. If D v1 and D v2 cannot coexist in a single build, then diamond dependencies are back. You can’t convert a large program from D v1 to D v2 gradually, the way I just explained. In internet-scale projects, this will fragment the Go package ecosystem into incompatible groups of packages: those that use D v1 and those that use D v2. For a detailed example, see my 2018 blog post, “<a href="https://research.swtch.com/vgo-import">Semantic Import Versioning</a>.” <p> Dep was forced to disallow multiple major versions in a build because the Go build system requires each import path to name a unique package (and Dep did not consider semantic import versioning). In contrast, Cargo and other systems do allow multiple major versions in a build. 
As I understand it, the reason these systems allow multiple versions is the same reason that Go modules does: not allowing them makes it too hard to work on large programs. <a class=anchor href="#exp"><h2 id="exp">Objection: Too Hard to Experiment</h2></a> <p> A final objection is that versions in import paths are unnecessary overhead when you’re just starting to design a package, you have no users, and you’re making frequent backwards-incompatible changes. That’s absolutely true. <p> Semantic versioning makes an exception for exactly that situation. In major version 0, there are no compatibility expectations at all, so that you can iterate quickly when you’re first starting out and not worry about compatibility. For example, v0.3.4 doesn’t need to be backwards compatible with anything else: not v0.3.3, not v0.0.1, not v1.0.0. <p> Semantic import versioning makes a similar exception: major version 0 is not mentioned in import paths. <p> In both cases, the rationale is that time has not entered the picture. You’re not doing software engineering yet. You’re just programming. Of course, this means that if you use v0 versions of other people’s packages, then you are accepting that new versions of those packages might include breaking API changes without a corresponding import path change, and you take on the responsibility to update your code when that happens. <a class=anchor href="#repeatability"><h2 id="repeatability">Principle #2: Repeatability</h2></a> <blockquote> <p> <i>The result of a build of a given version of a package should not change over time.</i></blockquote> <p> The second principle is repeatability for program builds. By repeatability I mean that when you are building a specific version of a package, the build should decide which dependency versions to use in a way that’s repeatable, that doesn’t change over time. My build today should match your build of my code tomorrow and any other programmer’s build next year. Most package management systems don’t make that guarantee. <p> We saw earlier how GOPATH-based <code>go</code> <code>get</code> doesn’t provide repeatability. First <code>go</code> <code>get</code> used a version of D that was too old: <p> <img name="vgo-why-2" class="center pad" width=201 height=96 src="vgo-why-2.png" srcset="vgo-why-2.png 1x, vgo-why-2@1.5x.png 1.5x, vgo-why-2@2x.png 2x, vgo-why-2@3x.png 3x, vgo-why-2@4x.png 4x"> <p> Then <code>go</code> <code>get</code> <code>-u</code> used a version of D that was too new: <p> <img name="vgo-why-3" class="center pad" width=201 height=104 src="vgo-why-3.png" srcset="vgo-why-3.png 1x, vgo-why-3@1.5x.png 1.5x, vgo-why-3@2x.png 2x, vgo-why-3@3x.png 3x, vgo-why-3@4x.png 4x"> <p> You might think, “of course <code>go</code> <code>get</code> makes this mistake: it doesn’t know anything about versions at all.” But most other systems make the same mistake. I’m going to use Dep as my example here, but at least Bundler and Cargo work the same way. <p> Dep asks every package to include a metadata file called a manifest, which lists requirements for dependency versions. When Dep downloads C, it reads C’s manifest and learns that C needs D 1.4 or later. Then Dep downloads the newest version of D satisfying that constraint. 
Yesterday, that meant D 1.5: <p> <img name="vgo-why-9" class="center pad" width=201 height=60 src="vgo-why-9.png" srcset="vgo-why-9.png 1x, vgo-why-9@1.5x.png 1.5x, vgo-why-9@2x.png 2x, vgo-why-9@3x.png 3x, vgo-why-9@4x.png 4x"> <p> Today, that means D 1.6: <p> <img name="vgo-why-10" class="center pad" width=201 height=60 src="vgo-why-10.png" srcset="vgo-why-10.png 1x, vgo-why-10@1.5x.png 1.5x, vgo-why-10@2x.png 2x, vgo-why-10@3x.png 3x, vgo-why-10@4x.png 4x"> <p> The decision is time-dependent. It changes from day to day. The build is not repeatable. <p> The developers of Dep (and Bundler and Cargo and ...) understood the importance of repeatability, so they introduced a second metadata file called a lock file. If C is a whole program, what Go calls <code>package</code> <code>main</code>, then the lock file lists the exact version to use for every dependency of C, and Dep lets the lock file override the decisions it would normally make. Locking in those decisions ensures that they stop changing over time and makes the build repeatable. <p> But lock files only apply to whole programs, to <code>package</code> <code>main</code>. What if C is a library, being built as part of a larger program? Then a lock file meant for building only C might not satisfy the additional constraints in the larger program. So Dep and the others must ignore lock files associated with libraries and fall back to the default time-based decisions. When you add C 1.8 to a larger build, the exact packages you get depends on what day it is. <p> In summary, Dep starts with a time-based decision about which version of D to use. Then it adds a lock file, to override that time-based decision, for repeatability, but that lock file can only be applied to whole programs. <p> In Go modules, the <code>go</code> command instead makes its decision about which version of D to use in a way that does not change over time. Then builds are repeatable all the time, without the added complexity of a lock file override, and this repeatability applies to libraries, not just whole programs. <p> The algorithm used for Go modules is very simple, despite the imposing name “minimal version selection.” It works like this. Each package specifies a minimum version of each dependency. For example, suppose B 1.3 requests D 1.3 or later, and C 1.8 requests D 1.4 or later. In Go modules, the <code>go</code> command prefers to use those exact versions, not the latest versions. If we’re building B by itself, we’ll use D 1.3. If we’re building C by itself, we’ll use D 1.4. The builds of these libraries are repeatable. <p> <img name="vgo-why-12" class="center pad" width=270 height=178 src="vgo-why-12.png" srcset="vgo-why-12.png 1x, vgo-why-12@1.5x.png 1.5x, vgo-why-12@2x.png 2x, vgo-why-12@3x.png 3x, vgo-why-12@4x.png 4x"> <p> Also shown in the figure, if different parts of a build request different minimum versions, the <code>go</code> command uses the latest requested version. The build of A sees requests for D 1.3 and D 1.4, and 1.4 is later than 1.3, so the build chooses D 1.4. That decision does not depend on whether D 1.5 and D 1.6 exist, so it does not change over time. <p> I call this minimal version selection for two reasons. First, for each package it selects the minimum version satisfying the requests (equivalently, the maximum of the requests). And second, it seems to be just about the simplest approach that could possibly work. 
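<p> The core rule fits in a few lines of code. This toy sketch (mine, not the <code>go</code> command’s actual implementation, which also walks the requirement graph of the modules it selects) computes the version of D chosen for the build of A: <pre>package main

import "fmt"

// A toy sketch of minimal version selection's core rule:
// use the maximum of the requested minimum versions.
// Versions are reduced to D's minor number for illustration.
func selectVersion(requestedMinimums []int) int {
	selected := 0
	for _, m := range requestedMinimums {
		if m &gt; selected {
			selected = m
		}
	}
	return selected
}

func main() {
	// B 1.3 requests D 1.3 or later; C 1.8 requests D 1.4 or later.
	fmt.Println(selectVersion([]int{3, 4})) // 4, meaning D 1.4
	// The answer cannot change when D 1.5 or D 1.6 is released,
	// because versions that no one requested never enter the input.
}
</pre>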
<p> Minimal version selection provides repeatability, for whole programs and for libraries, always, without any lock files. It removes time from consideration. Every chosen version is always one of the versions mentioned explicitly by some package already chosen for the build. <a class=anchor href="#latest-feature"><h2 id="latest-feature">Objection: Using the Latest Version is a Feature</h2></a> <p> The usual first objection to prioritizing repeatability is to claim that preferring the latest version of a dependency is a feature, not a bug. The claim is that programmers either don’t want to or are too lazy to update their dependencies regularly, so tools like Dep should use the latest dependencies automatically. The argument is that the benefits of having the latest versions outweigh the loss of repeatability. <p> But this argument doesn’t hold up to scrutiny. Tools like Dep provide lock files, which then require programmers to update dependencies themselves, exactly because repeatable builds are more important than using the latest version. When you deploy a 1-line bug fix, you want to be sure that your bug fix is the only change, that you’re not also picking up different, newer versions of your dependencies. <p> You want to delay upgrades until you ask for them, so that you can be ready to run all your unit tests, all your integration tests, and maybe even production canaries, before you start using those upgraded dependencies in production. Everyone agrees about this. Lock files exist because everyone agrees about this: repeatability is more important than automatic upgrades. <a class=anchor href="#latest-library"><h2 id="latest-library">Objection: Using the Latest Version is a Feature When Building a Library</h2></a> <p> The more nuanced argument you could make against minimal version selection would be to admit that repeatability matters for whole program builds, but then argue that, for libraries, the balance is different, and having the latest dependencies is more important than a repeatable build. <p> I disagree. As programming increasingly means connecting large libraries together, and those large libraries are increasingly organized as collections of smaller libraries, all the reasons to prefer repeatability of whole-program builds become just as important for library builds. <p> The extreme limit of this trend is the recent move in cloud computing to “serverless” hosting, like Amazon Lambda, Google Cloud Functions, or Microsoft Azure Functions. The code we upload to those systems is a library, not a whole program. We certainly want the production builds on those servers to use the same versions of dependencies as on our development machines. <p> Of course, no matter what, it’s important to make it easy for programmers to update their dependencies regularly. We also need tools to report which versions of a package are in a given build or a given binary, including reporting when updates are available and when there are known security problems in the versions being used. <a class=anchor href="#cooperation"><h2 id="cooperation">Principle #3: Cooperation</h2></a> <blockquote> <p> <i>To maintain the Go package ecosystem, we must all work together.</i> <br> <i>Tools cannot work around a lack of cooperation.</i></blockquote> <p> The third principle is cooperation. We often talk about “the Go community” and “the Go open source ecosystem.” The words community and ecosystem emphasize that all our work is interconnected, that we’re building on—depending on—each other’s contributions. 
The goal is one unified system that works as a coherent whole. The opposite, what we want to avoid, is an ecosystem that is fragmented, split into groups of packages that can’t work with each other. <p> The principle of cooperation recognizes that the only way to keep the ecosystem healthy and thriving is for us all to work together. If we don’t, then no matter how technically sophisticated our tools are, the Go open source ecosystem is guaranteed to fragment and eventually fail. By implication, then, it’s okay if fixing incompatibilities requires cooperation. We can’t avoid cooperation anyway. <p> For example, once again we have C 1.8, which requires D 1.4 or later. Thanks to repeatability, a build of C 1.8 by itself will use D 1.4. If we build C as part of a larger build that needs D 1.5, that’s okay too. <p> Then D 1.6 is released, and some larger build, maybe continuous integration testing, discovers that C 1.8 does not work with D 1.6. <p> <img name="vgo-why-13" class="center pad" width=306 height=125 src="vgo-why-13.png" srcset="vgo-why-13.png 1x, vgo-why-13@1.5x.png 1.5x, vgo-why-13@2x.png 2x, vgo-why-13@3x.png 3x, vgo-why-13@4x.png 4x"> <p> No matter what, the solution is for C’s author and D’s author to cooperate and release a fix. The exact fix depends on what exactly went wrong. <p> Maybe C depends on buggy behavior fixed in D 1.6, or maybe C depends on unspecified behavior changed in D 1.6. Then the solution is for C’s author to release a new C version 1.9, cooperating with the evolution of D. <p> <img name="vgo-why-15" class="center pad" width=297 height=130 src="vgo-why-15.png" srcset="vgo-why-15.png 1x, vgo-why-15@1.5x.png 1.5x, vgo-why-15@2x.png 2x, vgo-why-15@3x.png 3x, vgo-why-15@4x.png 4x"> <p> Or maybe D 1.6 simply has a bug. Then the solution is for D’s author to release a fixed D 1.7, cooperating by respecting the principle of compatibility, at which point C’s author can release C version 1.9 that specifies that it requires D 1.7. <p> <img name="vgo-why-14" class="center pad" width=297 height=130 src="vgo-why-14.png" srcset="vgo-why-14.png 1x, vgo-why-14@1.5x.png 1.5x, vgo-why-14@2x.png 2x, vgo-why-14@3x.png 3x, vgo-why-14@4x.png 4x"> <p> Take a minute to look at what just happened. The latest C and the latest D didn’t work together. That introduced a small fracture in the Go package ecosystem. C’s author or D’s author worked to fix the bug, cooperating with each other and the rest of the ecosystem to repair the fracture. This cooperation is essential to keeping the ecosystem healthy. There is no adequate technical substitute. <p> The repeatable builds in Go modules mean that a buggy D 1.6 won’t be picked up until users explicitly ask to upgrade. That creates time for C’s author and D’s author to cooperate on a real solution. The Go modules system makes no other attempt to work around these temporary incompatibilities. <a class=anchor href="#sat"><h2 id="sat">Objection: Use Declared Incompatibilities and SAT Solvers</h2></a> <p> The most common objection to this approach of depending on cooperation is that it is unreasonable to expect developers to cooperate. Developers need some way to fix problems alone, the argument goes: they can only truly depend on themselves, not others.
The solution offered by package managers like Bundler, Cargo, and Dep is to allow developers to declare incompatibilities between their packages and others and then employ a <a href="https://research.swtch.com/version-sat">SAT solver</a> to find a package combination not ruled out by the constraints. <p> This argument breaks down for a few reasons. <p> First, the <a href="https://research.swtch.com/vgo-mvs">algorithm used by Go modules</a> to select versions already gives the developer of a particular module complete control over which versions are selected for that module, more control in fact than SAT constraints. The developer can force the use of any specific version of any dependency, saying “use this exact version no matter what anyone else says.” But that power is limited to the build of that specific module, to avoid giving other developers the same control over your builds. <p> Second, repeatability of library builds in Go modules means that the release of a new, incompatible version of a dependency has no immediate effect on builds, as we saw in the previous section. The breakage only surfaces when someone takes some step to add that version to their own build, at which point they can step back again. <p> Third, if version selection is phrased as a problem for a SAT solver, there are often many possible satisfying selections: the SAT solver must choose between them, and there is no clear criterion for doing so. As we saw earlier, SAT-based package managers choose between multiple valid possible selections by preferring newer versions. In the case where using the newest version of everything satisfies the constraints, that’s the clear “most preferred” answer. But what if the two possible selections are “latest of B, older C” and “older B, latest of C”? Which should be preferred? How can the developer predict the outcome? The resulting system is difficult to understand. <p> Fourth, the output of a SAT solver is only as good as its inputs: if any incompatibilities have been omitted, the SAT solver may well arrive at a combination that is still broken, just not declared as such. Incompatibility information is likely to be particularly incomplete for combinations involving dependencies with a significant age difference that may well never have been put together before. Indeed, an analysis of Rust’s Cargo ecosystem in 2018 found that Cargo’s preference for the latest version was <a href="https://illicitonion.blogspot.com/2018/06/rust-minimum-versions-semver-is-lie.html">masking many missing constraints</a> in Cargo manifests. If the latest version does not work, exploring old versions seems as likely to produce a combination that is “not yet known to be broken” as it is to produce a working one. <p> Overall, once you step off the happy path of selecting the newest version of every dependency, SAT solver-based package managers are not more likely to choose a working configuration than Go modules is. If anything, SAT solvers may well be less likely to find a working configuration. <a class=anchor href="#sat-example"><h2 id="sat-example">Example: Go Modules versus SAT Solving</h2></a> <p> The counter-arguments given in the previous section are a bit abstract. Let’s make them concrete by continuing the example we’ve been working with and looking at what happens when using a SAT solver, like in Dep. I’m using Dep for concreteness, because it is the immediate predecessor of Go modules, but the behaviors here are not specific to Dep and I don’t mean to single it out.
For the purposes of this example, Dep works the same way as many other package managers, and they all share the problems detailed here. <p> To set the stage, remember that C 1.8 works fine with D 1.4 and D 1.5, but the combination of C 1.8 and D 1.6 is broken. <p> <img name="vgo-why-13" class="center pad" width=306 height=125 src="vgo-why-13.png" srcset="vgo-why-13.png 1x, vgo-why-13@1.5x.png 1.5x, vgo-why-13@2x.png 2x, vgo-why-13@3x.png 3x, vgo-why-13@4x.png 4x"> <p> That gets noticed, perhaps by continuous integration testing, and the question is what happens next. <p> When C’s author finds out that C 1.8 doesn’t work with D 1.6, Dep allows and encourages issuing a new version, C 1.9. C 1.9 documents that it needs D 1.4 or later but before D 1.6. The idea is that documenting the incompatibility helps Dep avoid it in future builds. <p> <img name="vgo-why-16" class="center pad" width=323 height=130 src="vgo-why-16.png" srcset="vgo-why-16.png 1x, vgo-why-16@1.5x.png 1.5x, vgo-why-16@2x.png 2x, vgo-why-16@3x.png 3x, vgo-why-16@4x.png 4x"> <p> In Dep, avoiding the incompatibility is important—even urgent!—because the lack of repeatability in library builds means that as soon as D 1.6 is released, all future fresh builds of C will use D 1.6 and break. This is a build emergency: all of C’s new users are broken. If D’s author is unavailable, or C’s author doesn’t have time to fix the actual bug, the argument is that C’s author must be able to take some step to protect users from the breakage. That step is to release C 1.9, documenting the incompatibility with D 1.6. That fixes new builds of C by preventing the use of D 1.6. <p> This emergency doesn’t happen when using Go modules, because of minimal version selection and repeatable builds. Using Go modules, the release of D 1.6 does not affect C’s users, because nothing is explicitly requesting D 1.6 yet. Users keep using the older versions of D they already use. There’s no need to document the incompatibility, because nothing is breaking. There’s time to cooperate on a real fix. <p> Looking at Dep’s approach of documenting incompatibility again, releasing C 1.9 is not a great solution. For one thing, the premise was that D’s author created a build emergency by releasing D 1.6 and then was unavailable to release a fix, so it was important to give C’s author a way to fix things, by releasing C 1.9. But if D’s author might be unavailable, what happens if C’s author is unavailable too? Then the emergency caused by automatic upgrades continues and all of C’s new users stay broken. Repeatable builds in Go modules avoid the emergency entirely. <p> Also, suppose that the bug is in D, and D’s author issues a fixed D 1.7. The workaround C 1.9 requires D before 1.6, so it won’t use the fixed D 1.7. C’s author has to issue C 1.10 to allow use of D 1.7. <p> <img name="vgo-why-17" class="center pad" width=317 height=130 src="vgo-why-17.png" srcset="vgo-why-17.png 1x, vgo-why-17@1.5x.png 1.5x, vgo-why-17@2x.png 2x, vgo-why-17@3x.png 3x, vgo-why-17@4x.png 4x"> <p> In contrast, if we’re using Go modules, C’s author doesn’t have to issue C 1.9 and then also doesn’t have to undo it by issuing C 1.10. <p> In this simple example, Go modules end up working more smoothly for users than Dep. They avoid the build breakage automatically, creating time for cooperation on the real fix. Ideally, C or D gets fixed before any of C’s users even notice. <p> But what about more complex examples?
Maybe Dep’s approach of documenting incompatibilities is better in more complicated situations, or maybe it keeps things working even when the real fix takes a long time to arrive. <p> Let’s take a look. To do that, let’s rewind the clock a little, to before the buggy D 1.6 is released, and compare the decisions made by Dep and Go modules. This figure shows the documented requirements for all the relevant package versions, along with the way both Dep and Go modules would build the latest C and the latest A. <p> <img name="vgo-why-19" class="center pad" width=383 height=270 src="vgo-why-19.png" srcset="vgo-why-19.png 1x, vgo-why-19@1.5x.png 1.5x, vgo-why-19@2x.png 2x, vgo-why-19@3x.png 3x, vgo-why-19@4x.png 4x"> <p> Dep is using D 1.5 while the Go module system is using D 1.4, but both tools have found working builds. Everyone is happy. <p> But now suppose the buggy D 1.6 is released. <p> <img name="vgo-why-20" class="center pad" width=383 height=270 src="vgo-why-20.png" srcset="vgo-why-20.png 1x, vgo-why-20@1.5x.png 1.5x, vgo-why-20@2x.png 2x, vgo-why-20@3x.png 3x, vgo-why-20@4x.png 4x"> <p> Dep builds pick up D 1.6 automatically and break. Go modules builds keep using D 1.4 and keep working. This is the simple situation we were just looking at. <p> Before we move on, though, let’s fix the Dep builds. We release C 1.9, which documents the incompatibility with D 1.6: <p> <img name="vgo-why-21" class="center pad" width=383 height=270 src="vgo-why-21.png" srcset="vgo-why-21.png 1x, vgo-why-21@1.5x.png 1.5x, vgo-why-21@2x.png 2x, vgo-why-21@3x.png 3x, vgo-why-21@4x.png 4x"> <p> Now Dep builds pick up C 1.9 automatically, and builds start working again. Go modules can’t document incompatibility in this way, but Go modules builds also aren’t broken, so no fix is needed. <p> Now let’s create a build complex enough to break Go modules. We can do this in two steps. First, we will release a new B that requires D 1.6. Second, we will release a new A that requires the new B, at which point A’s build will use C with D 1.6 and break. <p> We start by releasing the new B 1.4 that requires D 1.6. <p> <img name="vgo-why-22" class="center pad" width=473 height=270 src="vgo-why-22.png" srcset="vgo-why-22.png 1x, vgo-why-22@1.5x.png 1.5x, vgo-why-22@2x.png 2x, vgo-why-22@3x.png 3x, vgo-why-22@4x.png 4x"> <p> Go modules builds are unaffected so far, thanks to repeatability. But look! Dep builds of A pick up B 1.4 automatically and now they are broken again. What happened? <p> Dep prefers to build A using the latest B and the latest C, but that’s not possible: the latest B wants D 1.6 and the latest C wants D before 1.6. But does Dep give up? No. It looks for alternate versions of B and C that do agree on an acceptable D. <p> In this case, Dep decided to keep the latest B, which means using D 1.6, which means <i>not</i> using C 1.9. Since Dep can’t use the latest C, it tries older versions of C. C 1.8 looks good: it says it needs D 1.4 or later, and that allows D 1.6. So Dep uses C 1.8, and it breaks. <p> <i>We</i> know that C 1.8 and D 1.6 are incompatible, but Dep does not. Dep can’t know it, because C 1.8 was released before D 1.6: C’s author couldn’t have predicted that D 1.6 would be a problem. And all package management systems agree that package contents must be immutable once they are published, which means there’s no way for C’s author to retroactively document that C 1.8 doesn’t work with D 1.6. (And if there were some way to change C 1.8’s requirements retroactively, that would violate repeatability.) 
Releasing C 1.9 with the updated requirement was the fix. <p> Because Dep prefers to use the latest C, most of the time it will use C 1.9 and know to avoid D 1.6. But if Dep can’t use the latest of everything, it will start trying earlier versions of some things, including maybe C 1.8. And using C 1.8 makes it look like D 1.6 is okay—even though we know better—and the build breaks. <p> Or it might not break. Strictly speaking, Dep didn’t have to make that decision. When Dep realized that it couldn’t use both the latest B and the latest C, it had many options for how it might proceed. We assumed Dep kept the latest B. But if instead Dep kept the latest C, then it would need to use an older D and then an older B, producing a working build, as shown in the third column of the diagram. <p> So maybe Dep’s builds are broken or maybe not, depending on the arbitrary decisions it makes in its <a href="https://research.swtch.com/version-sat">SAT-solver-based version selection</a>. (Last I checked, given a choice between a newer version of one package versus another, Dep prioritizes the one with the alphabetically earlier import path, at least in small test cases.) <p> This example demonstrates another way that Dep and systems like it (nearly all package managers besides Go modules) can produce surprising results: when the one most preferred answer (use the latest of everything) does not apply, there are often many choices with no clear preferences between them. The exact answer depends on the details of the SAT solving algorithm, its heuristics, and often the order in which the packages are presented to the solver. This underspecification and non-determinism in their solvers is another reason these systems need lock files. <p> In any event, for the sake of Dep users, let’s assume Dep lucked into the choice that keeps builds working. After all, we’re still trying to break the Go modules users’ builds. <p> To break Go modules builds, let’s release a new version of A, version 1.21, which requires the latest B, which in turn requires the latest D. Now, when the <code>go</code> command builds the latest A, it is forced to use the latest B and the latest D. In Go modules, there is no C 1.9, so the <code>go</code> command uses C 1.8, and the combination of C 1.8 and D 1.6 does not work. Finally, we have broken the Go modules builds! <p> <img name="vgo-why-23" class="center pad" width=339 height=270 src="vgo-why-23.png" srcset="vgo-why-23.png 1x, vgo-why-23@1.5x.png 1.5x, vgo-why-23@2x.png 2x, vgo-why-23@3x.png 3x, vgo-why-23@4x.png 4x"> <p> But look! The Dep builds are using C 1.8 and D 1.6 too, so they’re also broken. Before, Dep had to make a choice between the latest B and the latest C. If it chose the latest B, the build broke. If it chose the latest C, the build worked. The new requirement in A is forcing Dep to choose the latest B and the latest D, taking away Dep’s choice of latest C. So Dep uses the older C 1.8, and the build breaks just like before. <p> What should we conclude from all this? First of all, documenting an incompatibility for Dep does not guarantee that the incompatibility will be avoided. Second, a repeatable build like the one in Go modules does not guarantee that the incompatibility is avoided either. Both tools can end up building the incompatible pair of packages. But as we saw, it takes multiple intentional steps to lead Go modules to a broken build, steps that lead Dep to the same broken build. And along the way the Dep-based build broke two other times when the Go modules build did not.
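<p> To make that “step back” concrete, here is one way it can look with Go modules: a minimal, hypothetical <code>go.mod</code> sketch (the module paths are invented for this example) in which a developer explicitly excludes the broken D 1.6 from their own build: <pre>module example.com/A

go 1.13

require (
    example.com/B v1.4.0
    example.com/C v1.8.0
)

// Skip the known-bad D 1.6. The go command ignores an excluded version:
// a requirement on D 1.6 is treated as a requirement on the next
// higher available version instead.
exclude example.com/D v1.6.0
</pre> <p> An <code>exclude</code> directive applies only to builds of the module that contains it, matching the principle described earlier: each developer has complete control over their own build but not over anyone else’s.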
<p> I’ve been using Dep in these examples because it is the immediate predecessor of Go modules, but I don’t mean to single out Dep. In this respect, it works the same way as nearly every other package manager in every other language. They all have this problem. They’re not even really broken or misbehaving so much as unfortunately designed. They are designed to try to work around a lack of cooperation among the various package maintainers, and <i>tools cannot work around a lack of cooperation</i>. <p> The only real solution for the C versus D incompatibility is to release a new, fixed version of either C or D. Trying to avoid the incompatibility is useful only because it creates more time for C’s author and D’s author to cooperate on a fix. Compared to the Dep approach of preferring latest versions and documenting incompatibilities, the Go modules approach of repeatable builds with minimal version selection and no documented incompatibilities creates time for cooperation automatically, with no build emergencies, no declared incompatibilities, and no explicit work by users. <p> Then we can rely on cooperation for the real fix. <a class=anchor href="#conclusion"><h2 id="conclusion">Conclusion</h2></a> <p> These are the three principles of versioning in Go, the reasons that the design of Go modules deviates from the design of Dep, Cargo, Bundler, and others. <ul> <li> <i>Compatibility.</i> The meaning of a name in a program should not change over time. <li> <i>Repeatability.</i> The result of a build of a given version of a package should not change over time. <li> <i>Cooperation.</i> To maintain the Go package ecosystem, we must all work together. Tools cannot work around a lack of cooperation.</ul> <p> These principles are motivated by concerns about software engineering, which is what happens to programming when you add time and other programmers. Compatibility eliminates the effects of time on the meaning of a program. Repeatability eliminates the effects of time on the result of a build. Cooperation is an explicit recognition that, no matter how advanced our tools are, we do have to work with the other programmers. We can’t work around them. <p> The three principles also reinforce each other, in a virtuous cycle. <p> Compatibility enables a new version selection algorithm, which provides repeatability. Repeatability makes sure that buggy, new releases are ignored until explicitly requested, which creates more time to cooperate on fixes. That cooperation in turn reestablishes compatibility. And the cycle goes around. <p> As of Go 1.13, Go modules are ready for production use, and many companies, including Google, have adopted them. The Go 1.14 and Go 1.15 releases will bring additional ergonomic improvements, toward eventually deprecating and removing support for GOPATH. For more about adopting modules, see the blog post series on the Go blog, starting with “<a href="https://blog.golang.org/using-go-modules">Using Go Modules</a>.” Go Proposal Process: Representation tag:research.swtch.com,2012:research.swtch.com/proposals-representation 2019-10-03T11:45:00-04:00 2019-10-03T11:47:00-04:00 How do we increase user representation in the proposal process, and what does that mean? (Go Proposals, Part 6) <p> [<i>I’ve been thinking a lot recently about the <a href="https://golang.org/s/proposal">Go proposal process</a>, which is the way we propose, discuss, and decide changes to Go itself. 
Like <a href="https://blog.golang.org/experiment">nearly everything about Go</a>, the proposal process is an experiment, so it makes sense to reflect on what we’ve learned and try to improve it. This post is the sixth in <a href="https://research.swtch.com/proposals">a series of posts</a> about what works well and, more importantly, what we might want to change.</i>] <a class=anchor href="#who"><h2 id="who">Who is Represented?</h2></a> <p> At the contributor summit, we considered the question of who is well represented or over-represented in the Go development process and who is under-represented. <p> The question of who is well represented matters because diverse representation produces diverse viewpoints that can help us reach better overall decisions in the process. We cannot possibly hear from every single person with relevant input; instead we can try to hear from enough people that we still gather all the important viewpoints and observations. <p> On the well represented or possibly over-represented list, we have members of the Go team; GitHub users who have time to keep up with discussions; English-speaking users; people who keep up with tech social media on sites like Hacker News and Twitter; and people who attend and give talks at Go conferences. On the under-represented list, we have non-GitHub users or users who can’t keep up with GitHub discussions; non-English-speaking users; “heads down” users, meaning anyone who spends their time writing code to the exclusion of engaging on social media or attending Go conferences; business users in general; users with non-computer science backgrounds; users with non-programming language backgrounds; and non-users, people not using Go at all. As we consider ways to make the proposal process more accessible to more users and potential users, it is worth keeping these lists in mind to check whether new groups are being reached. <p> The <a href="https://golang.org/s/proposal-minutes">proposal minutes</a> help reach users who are on GitHub but can’t keep up with all the discussions, by providing a single issue to star and get roughly weekly updates about which proposals are under active discussion and which are close to being accepted or declined. The minutes are an improvement for the “can’t keep up with GitHub” category, but not for the other categories. <p> Reaching non-English-speaking users is one of the most difficult challenges. We on the Go team have attended Go conferences around the world to meet users in many countries; at many conferences there is simultaneous translation for the talks, which is wonderful. But there isn’t simultaneous translation for our proposal discussions. I don’t know whether the most significant proposals, like the latest version of the generics proposal, have been translated by users into their native languages, or if non-English-speakers muddle through with automatic translation, or something else. Twice in the past we’ve had questions about proposals that primarily affected Chinese users—specifically, whether to change case-based export for uncased languages and whether to build a separate Go distribution with different Go proxy defaults for China. In both these cases, we asked Asta Xie, the organizer of Gophercon China, to run a quick poll of users in a Chinese social media group of Go users. That was very helpful, but that doesn’t scale to all proposals. 
<p> Reaching “heads down” programmers and business users, those not attending Go conferences or engaging in Go-related social media, probably means branching out to more non-Go-specific places to publicize the most important Go changes, and possibly Go itself. <p> The final group is users with non-computer science or non-programming language backgrounds. Go proposals are usually written assuming significant familiarity with Go and often also familiarity with computer science or programming language concepts. In general, that’s more efficient than the alternative. But especially for large changes it would help to have alternate forms that are accessible to a larger audience. Ideas include longer tutorials or walkthroughs, short video demos, and recorded video-based Q&amp;A sessions. <a class=anchor href="#what"><h2 id="what">What Does Representation Mean?</h2></a> <p> Part of Go’s appeal for me as a user, and I think for many Go users, is the fact that it feels like a coherent system in which the pieces fit together and complement each other well. In my 2015 talk and blog post, “<a href="https://blog.golang.org/open-source">Go, Open Source Community</a>,” I said that “one of the most important things we the original authors of Go can offer is consistency of vision, to help keep Go Go.” I still believe that consistency of vision, to keep Go itself consistent, simple, and understandable, remains critically important to Go’s success. <p> In an <a href="https://www.itworld.com/article/2826125/the-future-according-to-dennis-ritchie--a-2000-interview-.html?page=2">interview with IT World in 2000</a>, Dennis Ritchie (the creator of C) was asked about control over C’s development. He said:<blockquote> <p> On the other hand, the “open evolution” idea has its own drawbacks, whether in official standards bodies or more informally, say over the Web or mailing lists. When I read commentary about suggestions for where C should go, I often think back and give thanks that it wasn’t developed under the advice of a worldwide crowd. C is peculiar in a lot of ways, but it, like many other successful things, has a certain unity of approach that stems from development in a small group.</blockquote> <p> I see much of that same unity of approach in Go’s design by a small group. The flip side is that if you have a small number of people making decisions alone, they can’t know enough about the million or more Go developers and their uses to anticipate all the important use cases and needs. That is, while a small number of ultimate designers provides consistency of vision, it can just as well result in a failure of vision, in which a design fundamentally fails to plan for or adapt to a requirement that becomes important later but was not seen or well understood at the time. <p> A decision can be <i>both</i> elegant for its time and short-sighted for the future. For example, see the recent critique of Unix’s <i>fork</i> system call, “<a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf">A <code>fork()</code> in the road</a>,” by Andrew Baumann <i>et</i> <i>al</i>., from HotOS ’19. In fact, nearly all designs end up being short-sighted given a long enough time scale. We’d be fortunate for all our designs to last the 50 years that <i>fork</i> has. The most important part is to guard against short-term failures of vision, not very long-term ones.
<p> To me, the most important place for broad representation and inclusion is in the proposals and discussions, because, as I said above, diverse representation produces diverse viewpoints that can help us reach better overall decisions in the process. It can help us avoid at least the short-term and hopefully medium-term failures of vision. At the same time, I hope we can maintain a long-term consistency of vision in the design of Go, by the continued active involvement of the original designers. It seems to me that having both the original designers and a diverse set of other voices in our proposal discussions and <a href="proposals-clarity#decisions">consistently working toward consensus decisions</a> will lead to the best outcomes and balance the desire for a consistency of vision against the need to avoid failures of vision. <a class=anchor href="#next"><h2 id="next">Next</h2></a> <p> Again, this is the sixth post <a href="proposals">in a series of posts</a> thinking and brainstorming about the Go proposal process. Everything about these posts is very rough. The point of posting this series—thinking out loud instead of thinking quietly—is so that anyone who is interested can join the thinking. <p> I encourage feedback, whether in the form of comments on these posts, comments on the newly filed issues, mail to rsc@golang.org, or your own blog posts (please leave links in the comments). Thanks for taking the time to read these and think with me. <p> The next post will be about how best to coordinate efforts across the Go community and ecosystem. Our Software Dependency Problem tag:research.swtch.com,2012:research.swtch.com/deps 2019-01-23T11:00:00-05:00 2019-01-23T11:02:00-05:00 Download and run code from strangers on the internet. What could go wrong? <p> For decades, discussion of software reuse was far more common than actual software reuse. Today, the situation is reversed: developers reuse software written by others every day, in the form of software dependencies, and the situation goes mostly unexamined. <p> My own background includes a decade of working with Google’s internal source code system, which treats software dependencies as a first-class concept,<a class=footnote id=body1 href="#note1"><sup>1</sup></a> and also developing support for dependencies in the Go programming language.<a class=footnote id=body2 href="#note2"><sup>2</sup></a> <p> Software dependencies carry with them serious risks that are too often overlooked. The shift to easy, fine-grained software reuse has happened so quickly that we do not yet understand the best practices for choosing and using dependencies effectively, or even for deciding when they are appropriate and when not. My purpose in writing this article is to raise awareness of the risks and encourage more investigation of solutions. <a class=anchor href="#what_is_a_dependency"><h2 id="what_is_a_dependency">What is a dependency?</h2></a> <p> In today’s software development world, a <i>dependency</i> is additional code that you want to call from your program. Adding a dependency avoids repeating work already done: designing, writing, testing, debugging, and maintaining a specific unit of code. In this article we’ll call that unit of code a <i>package</i>; some systems use terms like library or module instead of package.
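<p> In Go, for example, a dependency enters a program through an import of the package’s path. A minimal sketch, using the real <code>github.com/google/uuid</code> package as the dependency: <pre>package main

import (
    "fmt"

    "github.com/google/uuid" // the dependency: code someone else wrote and maintains
)

func main() {
    // The work of generating a correct UUID is outsourced to the dependency.
    fmt.Println(uuid.New())
}
</pre>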
<p> Taking on externally-written dependencies is an old practice: most programmers have at one point in their careers had to go through the steps of manually downloading and installing a required library, like C’s PCRE or zlib, or C++’s Boost or Qt, or Java’s JodaTime or JUnit. These packages contain high-quality, debugged code that required significant expertise to develop. For a program that needs the functionality provided by one of these packages, the tedious work of manually downloading, installing, and updating the package is easier than the work of redeveloping that functionality from scratch. But the high fixed costs of reuse mean that manually-reused packages tend to be big: a tiny package would be easier to reimplement. <p> A <i>dependency manager</i> (sometimes called a package manager) automates the downloading and installation of dependency packages. As dependency managers make individual packages easier to download and install, the lower fixed costs make smaller packages economical to publish and reuse. <p> For example, the Node.js dependency manager NPM provides access to over 750,000 packages. One of them, <code>escape-string-regexp</code>, provides a single function that escapes regular expression operators in its input. The entire implementation is: <pre>var matchOperatorsRe = /[|\\{}()[\]^$+*?.]/g;

module.exports = function (str) {
    if (typeof str !== 'string') {
        throw new TypeError('Expected a string');
    }
    return str.replace(matchOperatorsRe, '\\$&amp;');
};
</pre> <p> Before dependency managers, publishing an eight-line code library would have been unthinkable: too much overhead for too little benefit. But NPM has driven the overhead approximately to zero, with the result that nearly-trivial functionality can be packaged and reused. As of late January 2019, the <code>escape-string-regexp</code> package is explicitly depended upon by almost a thousand other NPM packages, not to mention all the packages developers write for their own use and don’t share. <p> Dependency managers now exist for essentially every programming language. Maven Central (Java), NuGet (.NET), Packagist (PHP), PyPI (Python), and RubyGems (Ruby) each host over 100,000 packages. The arrival of this kind of fine-grained, widespread software reuse is one of the most consequential shifts in software development over the past two decades. And if we’re not more careful, it will lead to serious problems.
That software was bought from known sources, often with some kind of support agreement. There was still a potential for bugs or outright mischief,<a class=footnote id=body3 href="#note3"><sup>3</sup></a> but at least we knew who we were dealing with and usually had commercial or legal recourses available. <p> The phenomenon of open-source software, distributed at no cost over the internet, has displaced many of those earlier software purchases. When reuse was difficult, there were fewer projects publishing reusable code packages. Even though their licenses typically disclaimed, among other things, any “implied warranties of merchantability and fitness for a particular purpose,” the projects built up well-known reputations that often factored heavily into people’s decisions about which to use. The commercial and legal support for trusting our software sources was replaced by reputational support. Many common early packages still enjoy good reputations: consider BLAS (published 1979), Netlib (1987), libjpeg (1991), LAPACK (1992), HP STL (1994), and zlib (1995). <p> Dependency managers have scaled this open-source code reuse model down: now, developers can share code at the granularity of individual functions of tens of lines. This is a major technical accomplishment. There are myriad available packages, and writing code can involve such a large number of them, but the commercial, legal, and reputational support mechanisms for trusting the code have not carried over. We are trusting more code with less justification for doing so. <p> The cost of adopting a bad dependency can be viewed as the sum, over all possible bad outcomes, of the cost of each bad outcome multiplied by its probability of happening (risk). <p> <img name="deps-cost" class="center pad" width=383 height=95 src="deps-cost.png" srcset="deps-cost.png 1x, deps-cost@1.5x.png 1.5x, deps-cost@2x.png 2x, deps-cost@3x.png 3x, deps-cost@4x.png 4x"> <p> The context where a dependency will be used determines the cost of a bad outcome. At one end of the spectrum is a personal hobby project, where the cost of most bad outcomes is near zero: you’re just having fun, bugs have no real impact other than wasting some time, and even debugging them can be fun. So the risk probability almost doesn’t matter: it’s being multiplied by zero. At the other end of the spectrum is production software that must be maintained for years. Here, the cost of a bug in a dependency can be very high: servers may go down, sensitive data may be divulged, customers may be harmed, companies may fail. High failure costs make it much more important to estimate and then reduce any risk of a serious failure. <p> No matter what the expected cost, experiences with larger dependencies suggest some approaches for estimating and reducing the risks of adding a software dependency. It is likely that better tooling is needed to help reduce the costs of these approaches, much as dependency managers have focused to date on reducing the costs of download and installation. <a class=anchor href="#inspect_the_dependency"><h2 id="inspect_the_dependency">Inspect the dependency</h2></a> <p> You would not hire a software developer you’ve never heard of and know nothing about. You would learn more about them first: check references, conduct a job interview, run background checks, and so on. Before you depend on a package you found on the internet, it is similarly prudent to learn a bit about it first. 
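<p> The expected-cost formula shown above can be made concrete with a small Go sketch, using entirely invented costs and probabilities: <pre>package main

import "fmt"

func main() {
    // Hypothetical bad outcomes for one dependency: a dollar cost and an
    // estimated probability of that outcome over the project's lifetime.
    outcomes := []struct {
        name string
        cost float64
        risk float64
    }{
        {"debugging a minor bug", 2000, 0.10},
        {"production outage", 100000, 0.01},
        {"security breach", 10000000, 0.001},
    }
    expected := 0.0
    for _, o := range outcomes {
        expected += o.cost * o.risk
    }
    fmt.Printf("expected cost: $%.0f\n", expected) // prints $11200
}
</pre> <p> Much of the inspection described below amounts to sharpening the probability estimates in a calculation like this one.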
<p> A basic inspection can give you a sense of how likely you are to run into problems trying to use this code. If the inspection reveals likely minor problems, you can take steps to prepare for or maybe avoid them. If the inspection reveals major problems, it may be best not to use the package: maybe you’ll find a more suitable one, or maybe you need to develop one yourself. Remember that open-source packages are published by their authors in the hope that they will be useful but with no guarantee of usability or support. In the middle of a production outage, you’ll be the one debugging it. As the original GNU General Public License warned, “The entire risk as to the quality and performance of the program is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.”<a class=footnote id=body4 href="#note4"><sup>4</sup></a> <p> The rest of this section outlines some considerations when inspecting a package and deciding whether to depend on it. <a class=anchor href="#design"><h3 id="design">Design</h3></a> <p> Is the package’s documentation clear? Does the API have a clear design? If the authors can explain the package’s API and its design well to you, the user, in the documentation, that increases the likelihood they have explained the implementation well to the computer, in the source code. Writing code for a clear, well-designed API is also easier, faster, and hopefully less error-prone. Have the authors documented what they expect from client code in order to make future upgrades compatible? (Examples include the C++<a class=footnote id=body5 href="#note5"><sup>5</sup></a> and Go<a class=footnote id=body6 href="#note6"><sup>6</sup></a> compatibility documents.) <a class=anchor href="#code_quality"><h3 id="code_quality">Code Quality</h3></a> <p> Is the code well-written? Read some of it. Does it look like the authors have been careful, conscientious, and consistent? Does it look like code you’d want to debug? You may need to. <p> Develop your own systematic ways to check code quality. For example, something as simple as compiling a C or C++ program with important compiler warnings enabled (for example, <code>-Wall</code>) can give you a sense of how seriously the developers work to avoid various undefined behaviors. Recent languages like Go, Rust, and Swift use an <code>unsafe</code> keyword to mark code that violates the type system; look to see how much unsafe code there is. More advanced semantic tools like Infer<a class=footnote id=body7 href="#note7"><sup>7</sup></a> or SpotBugs<a class=footnote id=body8 href="#note8"><sup>8</sup></a> are helpful too. Linters are less helpful: you should ignore rote suggestions about topics like brace style and focus instead on semantic problems. <p> Keep an open mind to development practices you may not be familiar with. For example, the SQLite library ships as a single 200,000-line C source file and a single 11,000-line header, the “amalgamation.” The sheer size of these files should raise an initial red flag, but closer investigation would turn up the actual development source code, a traditional file tree with over a hundred C source files, tests, and support scripts. It turns out that the single-file distribution is built automatically from the original sources and is easier for end users, especially those without dependency managers. (The compiled code also runs faster, because the compiler can see more optimization opportunities.)
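<p> Returning to the suggestion above about systematic checks: here is a sketch of a small Go program that reports which files in a source tree import Go’s <code>unsafe</code> package. Similar checks can be built for other red flags: <pre>package main

import (
    "fmt"
    "go/parser"
    "go/token"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    root := "."
    if len(os.Args) == 2 {
        root = os.Args[1]
    }
    count := 0
    // Walk the tree; errors from the walk itself are ignored in this sketch.
    filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return nil // skip unreadable files and directories
        }
        if info.IsDir() || !strings.HasSuffix(path, ".go") {
            return nil
        }
        // Parse only the import declarations, not entire files.
        f, perr := parser.ParseFile(token.NewFileSet(), path, nil, parser.ImportsOnly)
        if perr != nil {
            return nil // skip files that do not parse
        }
        for _, imp := range f.Imports {
            if imp.Path.Value == `"unsafe"` {
                fmt.Println(path)
                count++
            }
        }
        return nil
    })
    fmt.Println(count, "files import unsafe")
}
</pre>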
<a class=anchor href="#testing"><h3 id="testing">Testing</h3></a> <p> Does the code have tests? Can you run them? Do they pass? Tests establish that the code’s basic functionality is correct, and they signal that the developer is serious about keeping it correct. For example, the SQLite development tree has an incredibly thorough test suite with over 30,000 individual test cases as well as developer documentation explaining the testing strategy.<a class=footnote id=body9 href="#note9"><sup>9</sup></a> On the other hand, if there are few tests or no tests, or if the tests fail, that’s a serious red flag: future changes to the package are likely to introduce regressions that could easily have been caught. If you insist on tests in code you write yourself (you do, right?), you should insist on tests in code you outsource to others. <p> Assuming the tests exist, run, and pass, you can gather more information by running them with run-time instrumentation like code coverage analysis, race detection,<a class=footnote id=body10 href="#note10"><sup>10</sup></a> memory allocation checking, and memory leak detection. <a class=anchor href="#debugging"><h3 id="debugging">Debugging</h3></a> <p> Find the package’s issue tracker. Are there many open bug reports? How long have they been open? Are there many fixed bugs? Have any bugs been fixed recently? If you see lots of open issues about what look like real bugs, especially if they have been open for a long time, that’s not a good sign. On the other hand, if the closed issues show that bugs are rarely found and promptly fixed, that’s great. <a class=anchor href="#maintenance"><h3 id="maintenance">Maintenance</h3></a> <p> Look at the package’s commit history. How long has the code been actively maintained? Is it actively maintained now? Packages that have been actively maintained for an extended amount of time are more likely to continue to be maintained. How many people work on the package? Many packages are personal projects that developers create and share for fun in their spare time. Others are the result of thousands of hours of work by a group of paid developers. In general, the latter kind of package is more likely to have prompt bug fixes, steady improvements, and general upkeep. <p> On the other hand, some code really is “done.” For example, NPM’s <code>escape-string-regexp</code>, shown earlier, may never need to be modified again. <a class=anchor href="#usage"><h3 id="usage">Usage</h3></a> <p> Do many other packages depend on this code? Dependency managers can often provide statistics about usage, or you can use a web search to estimate how often others write about using the package. More users should at least mean more people for whom the code works well enough, along with faster detection of new bugs. Widespread usage is also a hedge against the question of continued maintenance: if a widely-used package loses its maintainer, an interested user is likely to step forward. <p> For example, libraries like PCRE or Boost or JUnit are incredibly widely used. That makes it more likely—although certainly not guaranteed—that bugs you might otherwise run into have already been fixed, because others ran into them first. <a class=anchor href="#security"><h3 id="security">Security</h3></a> <p> Will you be processing untrusted inputs with the package? If so, does it seem to be robust against malicious inputs? 
Does it have a history of security problems listed in the National Vulnerability Database (NVD)?<a class=footnote id=body11 href="#note11"><sup>11</sup></a> <p> For example, when Jeff Dean and I started work on Google Code Search<a class=footnote id=body12 href="#note12"><sup>12</sup></a>—<code>grep</code> over public source code—in 2006, the popular PCRE regular expression library seemed like an obvious choice. In an early discussion with Google’s security team, however, we learned that PCRE had a history of problems like buffer overflows, especially in its parser. We could have learned the same by searching for PCRE in the NVD. That discovery didn’t immediately cause us to abandon PCRE, but it did make us think more carefully about testing and isolation. <a class=anchor href="#licensing"><h3 id="licensing">Licensing</h3></a> <p> Is the code properly licensed? Does it have a license at all? Is the license acceptable for your project or company? A surprising fraction of projects on GitHub have no clear license. Your project or company may impose further restrictions on the allowed licenses of dependencies. For example, Google disallows the use of code licensed under AGPL-like licenses (too onerous) as well as WTFPL-like licenses (too vague).<a class=footnote id=body13 href="#note13"><sup>13</sup></a> <a class=anchor href="#dependencies"><h3 id="dependencies">Dependencies</h3></a> <p> Does the code have dependencies of its own? Flaws in indirect dependencies are just as bad for your program as flaws in direct dependencies. Dependency managers can list all the transitive dependencies of a given package, and each of them should ideally be inspected as described in this section. A package with many dependencies incurs additional inspection work, because those same dependencies incur additional risk that needs to be evaluated. <p> Many developers have never looked at the full list of transitive dependencies of their code and don’t know what they depend on. For example, in March 2016 the NPM user community discovered that many popular projects—including Babel, Ember, and React—all depended indirectly on a tiny package called <code>left-pad</code>, consisting of a single 8-line function body. They discovered this when the author of <code>left-pad</code> deleted that package from NPM, inadvertently breaking most Node.js users’ builds.<a class=footnote id=body14 href="#note14"><sup>14</sup></a> And <code>left-pad</code> is hardly exceptional in this regard. For example, 30% of the 750,000 packages published on NPM depend—at least indirectly—on <code>escape-string-regexp</code>. Adapting Leslie Lamport’s observation about distributed systems, a dependency manager can easily create a situation in which the failure of a package you didn’t even know existed can render your own code unusable. <a class=anchor href="#test_the_dependency"><h2 id="test_the_dependency">Test the dependency</h2></a> <p> The inspection process should include running a package’s own tests. If the package passes the inspection and you decide to make your project depend on it, the next step should be to write new tests focused on the functionality needed by your application. These tests often start out as short standalone programs written to make sure you can understand the package’s API and that it does what you think it does. (If you can’t or it doesn’t, turn back now!) It is worth then taking the extra effort to turn those programs into automated tests that can be run against newer versions of the package. 
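<p> For instance, such a test might pin down exactly the behavior the application relies on. A minimal sketch in Go, against a hypothetical dependency (the module path <code>example.com/quote</code> and its <code>Escape</code> function are invented for illustration): <pre>package myproject

import (
    "testing"

    "example.com/quote" // hypothetical dependency under evaluation
)

// TestEscapeDot checks only the behavior this project relies on, so the
// same test can be rerun unchanged against future versions of the package.
func TestEscapeDot(t *testing.T) {
    got := quote.Escape("a.b")
    want := `a\.b`
    if got != want {
        t.Errorf(`Escape("a.b") = %q, want %q`, got, want)
    }
}
</pre>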
If you find a bug and have a potential fix, you’ll want to be able to rerun these project-specific tests easily, to make sure that the fix did not break anything else. <p> It is especially worth exercising the likely problem areas identified by the basic inspection. For Code Search, we knew from past experience that PCRE sometimes took a long time to execute certain regular expression searches. Our initial plan was to have separate thread pools for “simple” and “complicated” regular expression searches. One of the first tests we ran was a benchmark, comparing <code>pcregrep</code> with a few other <code>grep</code> implementations. When we found that, for one basic test case, <code>pcregrep</code> was 70X slower than the fastest <code>grep</code> available, we started to rethink our plan to use PCRE. Even though we eventually dropped PCRE entirely, that benchmark remains in our code base today. <a class=anchor href="#abstract_the_dependency"><h2 id="abstract_the_dependency">Abstract the dependency</h2></a> <p> Depending on a package is a decision that you are likely to revisit later. Perhaps updates will take the package in a new direction. Perhaps serious security problems will be found. Perhaps a better option will come along. For all these reasons, it is worth the effort to make it easy to migrate your project to a new dependency. <p> If the package will be used from many places in your project’s source code, migrating to a new dependency would require making changes to all those different source locations. Worse, if the package will be exposed in your own project’s API, migrating to a new dependency would require making changes in all the code calling your API, which you might not control. To avoid these costs, it makes sense to define an interface of your own, along with a thin wrapper implementing that interface using the dependency. Note that the wrapper should include only what your project needs from the dependency, not everything the dependency offers. Ideally, that allows you to substitute a different, equally appropriate dependency later, by changing only the wrapper. Migrating your per-project tests to use the new interface tests the interface and wrapper implementation and also makes it easy to test any potential replacements for the dependency. <p> For Code Search, we developed an abstract <code>Regexp</code> class that defined the interface Code Search needed from any regular expression engine. Then we wrote a thin wrapper around PCRE implementing that interface. The indirection made it easy to test alternate libraries, and it kept us from accidentally introducing knowledge of PCRE internals into the rest of the source tree. That in turn ensured that it would be easy to switch to a different dependency if needed. <a class=anchor href="#isolate_the_dependency"><h2 id="isolate_the_dependency">Isolate the dependency</h2></a> <p> It may also be appropriate to isolate a dependency at run-time, to limit the possible damage caused by bugs in it. For example, Google Chrome allows users to add dependencies—extension code—to the browser. 
When Chrome launched in 2008, it introduced the critical feature (now standard in all browsers) of isolating each extension in a sandbox running in a separate operating-system process.<a class=footnote id=body15 href="#note15"><sup>15</sup></a> An exploitable bug in a badly-written extension therefore did not automatically have access to the entire memory of the browser itself and could be stopped from making inappropriate system calls.<a class=footnote id=body16 href="#note16"><sup>16</sup></a> For Code Search, until we dropped PCRE entirely, our plan was to isolate at least the PCRE parser in a similar sandbox. Today, another option would be a lightweight hypervisor-based sandbox like gVisor.<a class=footnote id=body17 href="#note17"><sup>17</sup></a> Isolating dependencies reduces the associated risks of running that code. <p> Even with these examples and other off-the-shelf options, run-time isolation of suspect code is still too difficult and rarely done. True isolation would require a completely memory-safe language, with no escape hatch into untyped code. That’s challenging not just in entirely unsafe languages like C and C++ but also in languages that provide restricted unsafe operations, like Java when including JNI, or like Go, Rust, and Swift when including their “unsafe” features. Even in a memory-safe language like JavaScript, code often has access to far more than it needs. In November 2018, the latest version of the NPM package <code>event-stream</code>, which provided a functional streaming API for JavaScript events, was discovered to contain obfuscated malicious code that had been added two and a half months earlier. The code, which harvested large Bitcoin wallets from users of the Copay mobile app, was accessing system resources entirely unrelated to processing event streams.<a class=footnote id=body18 href="#note18"><sup>18</sup></a> One of many possible defenses to this kind of problem would be to better restrict what dependencies can access. <a class=anchor href="#avoid_the_dependency"><h2 id="avoid_the_dependency">Avoid the dependency</h2></a> <p> If a dependency seems too risky and you can’t find a way to isolate it, the best answer may be to avoid it entirely, or at least to avoid the parts you’ve identified as most problematic. <p> For example, as we better understood the risks and costs associated with PCRE, our plan for Google Code Search evolved from “use PCRE directly,” to “use PCRE but sandbox the parser,” to “write a new regular expression parser but keep the PCRE execution engine,” to “write a new parser and connect it to a different, more efficient open-source execution engine.” Later we rewrote the execution engine as well, so that no dependencies were left, and we open-sourced the result: RE2.<a class=footnote id=body19 href="#note19"><sup>19</sup></a> <p> If you only need a tiny fraction of a dependency, it may be simplest to make a copy of what you need (preserving appropriate copyright and other legal notices, of course). You are taking on responsibility for fixing bugs, maintenance, and so on, but you’re also completely isolated from the larger risks.
The Go developer community has a proverb about this: “A little copying is better than a little dependency.”<a class=footnote id=body20 href="#note20"><sup>20</sup></a> <a class=anchor href="#upgrade_the_dependency"><h2 id="upgrade_the_dependency">Upgrade the dependency</h2></a> <p> For a long time, the conventional wisdom about software was “if it ain’t broke, don’t fix it.” Upgrading carries a chance of introducing new bugs; without a corresponding reward—like a new feature you need—why take the risk? This analysis ignores two costs. The first is the cost of the eventual upgrade. In software, the difficulty of making code changes does not scale linearly: making ten small changes is less work and easier to get right than making one equivalent large change. The second is the cost of discovering already-fixed bugs the hard way. Especially in a security context, where known bugs are actively exploited, every day you wait is another day that attackers can break in. <p> For example, consider the year 2017 at Equifax, as recounted by executives in detailed congressional testimony.<a class=footnote id=body21 href="#note21"><sup>21</sup></a> On March 7, a new vulnerability in Apache Struts was disclosed, and a patched version was released. On March 8, Equifax received a notice from US-CERT about the need to update any uses of Apache Struts. Equifax ran source code and network scans on March 9 and March 15, respectively; neither scan turned up a particular group of public-facing web servers. On May 13, attackers found the servers that Equifax’s security teams could not. They used the Apache Struts vulnerability to breach Equifax’s network and then steal detailed personal and financial information about 148 million people over the next two months. Equifax finally noticed the breach on July 29 and publicly disclosed it on September 4. By the end of September, Equifax’s CEO, CIO, and CSO had all resigned, and a congressional investigation was underway. <p> Equifax’s experience drives home the point that although dependency managers know the versions they are using at build time, you need other arrangements to track that information through your production deployment process. For the Go language, we are experimenting with automatically including a version manifest in every binary, so that deployment processes can scan binaries for dependencies that need upgrading. Go also makes that information available at run-time, so that servers can consult databases of known bugs and self-report to monitoring software when they are in need of upgrades. <p> Upgrading promptly is important, but upgrading means adding new code to your project, which should mean updating your evaluation of the risks of using the dependency based on the new version. At a minimum, you’d want to skim the diffs showing the changes being made from the current version to the upgraded versions, or at least read the release notes, to identify the most likely areas of concern in the upgraded code. If a lot of code is changing, so that the diffs are difficult to digest, that is also information you can incorporate into your updated risk assessment. <p> You’ll also want to re-run the tests you’ve written that are specific to your project, to make sure the upgraded package is at least as suitable for the project as the earlier version. It also makes sense to re-run the package’s own tests.
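<p> With Go modules, one convenient way to run both your project’s tests and the tests of every package in your build is the <code>all</code> pattern: <pre>% go test all
</pre> <p> In module mode, <code>all</code> matches the packages in the main module plus every package they depend on, so this exercises your dependencies in exactly the configuration your project builds them with.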
If the package has its own dependencies, it is entirely possible that your project’s configuration uses different versions of those dependencies (either older or newer ones) than the package’s authors use. Running the package’s own tests can quickly identify problems specific to your configuration. <p> Again, upgrades should not be completely automatic. You need to verify that the upgraded versions are appropriate for your environment before deploying them.<a class=footnote id=body22 href="#note22"><sup>22</sup></a> <p> If your upgrade process includes re-running the integration and qualification tests you’ve already written for the dependency, so that you are likely to identify new problems before they reach production, then, in most cases, delaying an upgrade is riskier than upgrading quickly. <p> The window for security-critical upgrades is especially short. In the aftermath of the Equifax breach, forensic security teams found evidence that attackers (perhaps different ones) had successfully exploited the Apache Struts vulnerability on the affected servers on March 10, only three days after it was publicly disclosed, but they’d only run a single <code>whoami</code> command. <a class=anchor href="#watch_your_dependencies"><h2 id="watch_your_dependencies">Watch your dependencies</h2></a> <p> Even after all that work, you’re not done tending your dependencies. It’s important to continue to monitor them and perhaps even re-evaluate your decision to use them. <p> First, make sure that you keep using the specific package versions you think you are. Most dependency managers now make it easy or even automatic to record the cryptographic hash of the expected source code for a given package version and then to check that hash when re-downloading the package on another computer or in a test environment. This ensures that your builds use the same dependency source code you inspected and tested. These kinds of checks prevented the <code>event-stream</code> attacker, described earlier, from silently inserting malicious code in the already-released version 3.3.5. Instead, the attacker had to create a new version, 3.3.6, and wait for people to upgrade (without looking closely at the changes). <p> It is also important to watch for new indirect dependencies creeping in: upgrades can easily introduce new packages upon which the success of your project now depends. They deserve your attention as well. In the case of <code>event-stream</code>, the malicious code was hidden in a different package, <code>flatmap-stream</code>, which the new <code>event-stream</code> release added as a new dependency. <p> Creeping dependencies can also affect the size of your project. During the development of Google’s Sawzall<a class=footnote id=body23 href="#note23"><sup>23</sup></a>—a JIT’ed logs processing language—the authors discovered at various times that the main interpreter binary contained not just Sawzall’s JIT but also (unused) PostScript, Python, and JavaScript interpreters. Each time, the culprit turned out to be unused dependencies declared by some library Sawzall did depend on, combined with the fact that Google’s build system eliminated any manual effort needed to start using a new dependency. This kind of error is the reason that the Go language makes importing an unused package a compile-time error. <p> Upgrading is a natural time to revisit the decision to use a dependency that’s changing. It’s also important to periodically revisit any dependency that <i>isn’t</i> changing.
Does it seem plausible that there are no security problems or other bugs to fix? Has the project been abandoned? Maybe it’s time to start planning to replace that dependency. <p> It’s also important to recheck the security history of each dependency. For example, Apache Struts disclosed different major remote code execution vulnerabilities in 2016, 2017, and 2018. Even if you have a list of all the servers that run it and update them promptly, that track record might make you rethink using it at all. <a class=anchor href="#conclusion"><h2 id="conclusion">Conclusion</h2></a> <p> Software reuse is finally here, and I don’t mean to understate its benefits: it has brought an enormously positive transformation for software developers. Even so, we’ve accepted this transformation without completely thinking through the potential consequences. The old reasons for trusting dependencies are becoming less valid at exactly the same time we have more dependencies than ever. <p> The kind of critical examination of specific dependencies that I outlined in this article is a significant amount of work and remains the exception rather than the rule. But I doubt there are any developers who actually make the effort to do this for every possible new dependency. I have only done a subset of them for a subset of my own dependencies. Most of the time the entirety of the decision is “let’s see what happens.” Too often, anything more than that seems like too much effort. <p> But the Copay and Equifax attacks are clear warnings of real problems in the way we consume software dependencies today. We should not ignore the warnings. I offer three broad recommendations. <ol> <li> <p> <i>Recognize the problem.</i> If nothing else, I hope this article has convinced you that there is a problem here worth addressing. We need many people to focus significant effort on solving it. <li> <p> <i>Establish best practices for today.</i> We need to establish best practices for managing dependencies using what’s available today. This means working out processes that evaluate, reduce, and track risk, from the original adoption decision through to production use. In fact, just as some engineers specialize in testing, it may be that we need engineers who specialize in managing dependencies. <li> <p> <i>Develop better dependency technology for tomorrow.</i> Dependency managers have essentially eliminated the cost of downloading and installing a dependency. Future development effort should focus on reducing the cost of the kind of evaluation and maintenance necessary to use a dependency. For example, package discovery sites might work to find more ways to allow developers to share their findings. Build tools should, at the least, make it easy to run a package’s own tests. More aggressively, build tools and package management systems could also work together to allow package authors to test new changes against all public clients of their APIs. Languages should also provide easy ways to isolate a suspect package.</ol> <p> There’s a lot of good software out there. Let’s work together to find out how to reuse it safely. <p> <a class=anchor href="#references"><h2 id="references">References</h2></a> <ol> <li><a name=note1></a> Rachel Potvin and Josh Levenberg, “Why Google Stores Billions of Lines of Code in a Single Repository,” <i>Communications of the ACM</i> 59(7) (July 2016), pp. 78-87. 
<a href="https://doi.org/10.1145/2854146">https://doi.org/10.1145/2854146</a> <a class=back href="#body1">(⇡)</a> <li><a name=note2></a> Russ Cox, “Go &amp; Versioning,” February 2018. <a href="https://research.swtch.com/vgo">https://research.swtch.com/vgo</a> <a class=back href="#body2">(⇡)</a> <li><a name=note3></a> Ken Thompson, “Reflections on Trusting Trust,” <i>Communications of the ACM</i> 27(8) (August 1984), pp. 761–763. <a href="https://doi.org/10.1145/358198.358210">https://doi.org/10.1145/358198.358210</a> <a class=back href="#body3">(⇡)</a> <li><a name=note4></a> GNU Project, “GNU General Public License, version 1,” February 1989. <a href="https://www.gnu.org/licenses/old-licenses/gpl-1.0.html">https://www.gnu.org/licenses/old-licenses/gpl-1.0.html</a> <a class=back href="#body4">(⇡)</a> <li><a name=note5></a> Titus Winters, “SD-8: Standard Library Compatibility,” C++ Standing Document, August 2018. <a href="https://isocpp.org/std/standing-documents/sd-8-standard-library-compatibility">https://isocpp.org/std/standing-documents/sd-8-standard-library-compatibility</a> <a class=back href="#body5">(⇡)</a> <li><a name=note6></a> Go Project, “Go 1 and the Future of Go Programs,” September 2013. <a href="https://golang.org/doc/go1compat">https://golang.org/doc/go1compat</a> <a class=back href="#body6">(⇡)</a> <li><a name=note7></a> Facebook, “Infer: A tool to detect bugs in Java and C/C++/Objective-C code before it ships.” <a href="https://fbinfer.com/">https://fbinfer.com/</a> <a class=back href="#body7">(⇡)</a> <li><a name=note8></a> “SpotBugs: Find bugs in Java Programs.” <a href="https://spotbugs.github.io/">https://spotbugs.github.io/</a> <a class=back href="#body8">(⇡)</a> <li><a name=note9></a> D. Richard Hipp, “How SQLite is Tested.” <a href="https://www.sqlite.org/testing.html">https://www.sqlite.org/testing.html</a> <a class=back href="#body9">(⇡)</a> <li><a name=note10></a> Alexander Potapenko, “Testing Chromium: ThreadSanitizer v2, a next-gen data race detector,” April 2014. <a href="https://blog.chromium.org/2014/04/testing-chromium-threadsanitizer-v2.html">https://blog.chromium.org/2014/04/testing-chromium-threadsanitizer-v2.html</a> <a class=back href="#body10">(⇡)</a> <li><a name=note11></a> NIST, “National Vulnerability Database – Search and Statistics.” <a href="https://nvd.nist.gov/vuln/search">https://nvd.nist.gov/vuln/search</a> <a class=back href="#body11">(⇡)</a> <li><a name=note12></a> Russ Cox, “Regular Expression Matching with a Trigram Index, or How Google Code Search Worked,” January 2012. <a href="https://swtch.com/~rsc/regexp/regexp4.html">https://swtch.com/~rsc/regexp/regexp4.html</a> <a class=back href="#body12">(⇡)</a> <li><a name=note13></a> Google, “Google Open Source: Using Third-Party Licenses.” <a href="https://opensource.google.com/docs/thirdparty/licenses/#banned">https://opensource.google.com/docs/thirdparty/licenses/#banned</a> <a class=back href="#body13">(⇡)</a> <li><a name=note14></a> Nathan Willis, “A single Node of failure,” LWN, March 2016. <a href="https://lwn.net/Articles/681410/">https://lwn.net/Articles/681410/</a> <a class=back href="#body14">(⇡)</a> <li><a name=note15></a> Charlie Reis, “Multi-process Architecture,” September 2008. <a href="https://blog.chromium.org/2008/09/multi-process-architecture.html">https://blog.chromium.org/2008/09/multi-process-architecture.html</a> <a class=back href="#body15">(⇡)</a> <li><a name=note16></a> Adam Langley, “Chromium’s seccomp Sandbox,” August 2009. 
<a href="https://www.imperialviolet.org/2009/08/26/seccomp.html">https://www.imperialviolet.org/2009/08/26/seccomp.html</a> <a class=back href="#body16">(⇡)</a> <li><a name=note17></a> Nicolas Lacasse, “Open-sourcing gVisor, a sandboxed container runtime,” May 2018. <a href="https://cloud.google.com/blog/products/gcp/open-sourcing-gvisor-a-sandboxed-container-runtime">https://cloud.google.com/blog/products/gcp/open-sourcing-gvisor-a-sandboxed-container-runtime</a> <a class=back href="#body17">(⇡)</a> <li><a name=note18></a> Adam Baldwin, “Details about the event-stream incident,” November 2018. <a href="https://blog.npmjs.org/post/180565383195/details-about-the-event-stream-incident">https://blog.npmjs.org/post/180565383195/details-about-the-event-stream-incident</a> <a class=back href="#body18">(⇡)</a> <li><a name=note19></a> Russ Cox, “RE2: a principled approach to regular expression matching,” March 2010. <a href="https://opensource.googleblog.com/2010/03/re2-principled-approach-to-regular.html">https://opensource.googleblog.com/2010/03/re2-principled-approach-to-regular.html</a> <a class=back href="#body19">(⇡)</a> <li><a name=note20></a> Rob Pike, “Go Proverbs,” November 2015. <a href="https://go-proverbs.github.io/">https://go-proverbs.github.io/</a> <a class=back href="#body20">(⇡)</a> <li><a name=note21></a> U.S. House of Representatives Committee on Oversight and Government Reform, “The Equifax Data Breach,” Majority Staff Report, 115th Congress, December 2018. <a href="https://republicans-oversight.house.gov/wp-content/uploads/2018/12/Equifax-Report.pdf">https://republicans-oversight.house.gov/wp-content/uploads/2018/12/Equifax-Report.pdf</a> <a class=back href="#body21">(⇡)</a> <li><a name=note22></a> Russ Cox, “The Principles of Versioning in Go,” GopherCon Singapore, May 2018. <a href="https://www.youtube.com/watch?v=F8nrpe0XWRg">https://www.youtube.com/watch?v=F8nrpe0XWRg</a> <a class=back href="#body22">(⇡)</a> <li><a name=note23></a> Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan, “Interpreting the Data: Parallel Analysis with Sawzall,” <i>Scientific Programming Journal</i>, vol. 13 (2005). <a href="https://doi.org/10.1155/2005/962135">https://doi.org/10.1155/2005/962135</a> <a class=back href="#body23">(⇡)</a></ol> <a class=anchor href="#coda"><h2 id="coda">Coda</h2></a> <p> A version of this post was published in <a href="https://queue.acm.org/detail.cfm?id=3344149">ACM Queue</a> (March-April 2019) and then <a href="https://dl.acm.org/doi/pdf/10.1145/3347446">Communications of the ACM</a> (August 2019) under the title “Surviving Software Dependencies.” What is Software Engineering? tag:research.swtch.com,2012:research.swtch.com/vgo-eng 2018-05-30T10:00:00-04:00 2018-05-30T10:02:00-04:00 What is software engineering and what does Go mean by it? (Go & Versioning, Part 9) <p> Nearly all of Go’s distinctive design decisions were aimed at making software engineering simpler and easier. We've said this often. The canonical reference is Rob Pike's 2012 article, “<a href="https://talks.golang.org/2012/splash.article">Go at Google: Language Design in the Service of Software Engineering</a>.” But what is software engineering?<blockquote> <p> <i>Software engineering is what happens to programming <br>when you add time and other programmers.</i></blockquote> <p> Programming means getting a program working. You have a problem to solve, you write some Go code, you run it, you get your answer, you’re done. That’s programming, and that's difficult enough by itself. 
But what if that code has to keep working, day after day? What if five other programmers need to work on the code too? Then you start to think about version control systems, to track how the code changes over time and to coordinate with the other programmers. You add unit tests, to make sure bugs you fix are not reintroduced over time, not by you six months from now, and not by that new team member who’s unfamiliar with the code. You think about modularity and design patterns, to divide the program into parts that team members can work on mostly independently. You use tools to help you find bugs earlier. You look for ways to make programs as clear as possible, so that bugs are less likely. You make sure that small changes can be tested quickly, even in large programs. You're doing all of this because your programming has turned into software engineering. <p> (This definition and explanation of software engineering is my riff on an original theme by my Google colleague Titus Winters, whose preferred phrasing is “software engineering is programming integrated over time.” It's worth seven minutes of your time to see <a href="https://www.youtube.com/watch?v=tISy7EJQPzI&t=8m17s">his presentation of this idea at CppCon 2017</a>, from 8:17 to 15:00 in the video.) <p> As I said earlier, nearly all of Go’s distinctive design decisions have been motivated by concerns about software engineering, by trying to accommodate time and other programmers into the daily practice of programming. <p> For example, most people think that we format Go code with <code>gofmt</code> to make code look nicer or to end debates among team members about program layout. But the <a href="https://groups.google.com/forum/#!msg/golang-nuts/HC2sDhrZW5Y/7iuKxdbLExkJ">most important reason for <code>gofmt</code></a> is that if an algorithm defines how Go source code is formatted, then programs, like <code>goimports</code> or <code>gorename</code> or <code>go</code> <code>fix</code>, can edit the source code more easily, without introducing spurious formatting changes when writing the code back. This helps you maintain code over time. <p> As another example, Go import paths are URLs. If code said <code>import</code> <code>"uuid"</code>, you’d have to ask which <code>uuid</code> package. Searching for <code>uuid</code> on <a href="https://godoc.org">godoc.org</a> turns up dozens of packages. If instead the code says <code>import</code> <code>"github.com/pborman/uuid"</code>, now it’s clear which package we mean. Using URLs avoids ambiguity and also reuses an existing mechanism for giving out names, making it simpler and easier to coordinate with other programmers. <p> Continuing the example, Go import paths are written in Go source files, not in a separate build configuration file. This makes Go source files self-contained, which makes it easier to understand, modify, and copy them. These decisions, and more, were all made with the goal of simplifying software engineering. <p> In later posts I will talk specifically about why versions are important for software engineering and how software engineering concerns motivate the design changes from dep to vgo. Go and Dogma tag:research.swtch.com,2012:research.swtch.com/dogma 2017-01-09T09:00:00-05:00 2017-01-09T09:02:00-05:00 Programming language dogmatics. 
<p> [<i>Cross-posting from last year’s <a href="https://www.reddit.com/r/golang/comments/46bd5h/ama_we_are_the_go_contributors_ask_us_anything/d05yyde/?context=3&st=ixq5hjko&sh=7affd469">Go contributors AMA</a> on Reddit, because it’s still important to remember.</i>] <p> One of the perks of working on Go these past years has been the chance to have many great discussions with other language designers and implementers, for example about how well various design decisions worked out or the common problems of implementing what look like very different languages (for example both Go and Haskell need some kind of “green threads”, so there are more shared runtime challenges than you might expect). In one such conversation, when I was talking to a group of early Lisp hackers, one of them pointed out that these discussions are basically never dogmatic. Designers and implementers remember working through the good arguments on both sides of a particular decision, and they’re often eager to hear about someone else’s experience with what happens when you make that decision differently. Contrast that kind of discussion with the heated arguments or overly zealous statements you sometimes see from users of the same languages. There’s a real disconnect, possibly because the users don’t have the experience of weighing the arguments on both sides and don’t realize how easily a particular decision might have gone the other way. <p> Language design and implementation is engineering. We make decisions using evaluations of costs and benefits or, if we must, using predictions of those based on past experience. I think we have an important responsibility to explain both sides of a particular decision, to make clear that the arguments for an alternate decision are actually good ones that we weighed and balanced, and to avoid the suggestion that particular design decisions approach dogma. I hope <a href="https://www.reddit.com/r/golang/comments/46bd5h/ama_we_are_the_go_contributors_ask_us_anything/d05yyde/?context=3&st=ixq5hjko&sh=7affd469">the Reddit AMA</a> as well as discussion on <a href="https://groups.google.com/group/golang-nuts">golang-nuts</a> or <a href="http://stackoverflow.com/questions/tagged/go">StackOverflow</a> or the <a href="https://forum.golangbridge.org/">Go Forum</a> or at <a href="https://golang.org/wiki/Conferences">conferences</a> help with that. <p> But we need help from everyone. Remember that none of the decisions in Go are infallible; they’re just our best attempts at the time we made them, not wisdom received on stone tablets. If someone asks why Go does X instead of Y, please try to present the engineering reasons fairly, including for Y, and avoid argument solely by appeal to authority. It’s too easy to fall into the “well that’s just not how it’s done here” trap. And now that I know about and watch for that trap, I see it in nearly every technical community, although some more than others. A Tour of Acme tag:research.swtch.com,2012:research.swtch.com/acme 2012-09-17T11:00:00-04:00 2012-09-17T11:00:00-04:00 A video introduction to Acme, the Plan 9 text editor <p class="lp"> People I work with recognize my computer easily: it's the one with nothing but yellow windows and blue bars on the screen. That's the text editor acme, written by Rob Pike for Plan 9 in the early 1990s. Acme focuses entirely on the idea of text as user interface. 
It's difficult to explain acme without seeing it, though, so I've put together a screencast explaining the basics of acme and showing a brief programming session. Remember as you watch the video that the 854x480 screen is quite cramped. Usually you'd run acme on a larger screen: even my MacBook Air has almost four times as much screen real estate. </p> <center> <div style="border: 1px solid black; width: 853px; height: 480px;"><iframe width="853" height="480" src="https://www.youtube.com/embed/dP1xVpMPn8M?rel=0" frameborder="0" allowfullscreen></iframe></div> </center> <p class=pp> The video doesn't show everything acme can do, nor does it show all the ways you can use it. Even small idioms like where you type text to be loaded or executed vary from user to user. To learn more about acme, read Rob Pike's paper &ldquo;<a href="/acme.pdf">Acme: A User Interface for Programmers</a>&rdquo; and then try it. </p> <p class=pp> Acme runs on most operating systems. If you use <a href="https://9p.io/">Plan 9 from Bell Labs</a>, you already have it. If you use FreeBSD, Linux, OS X, or most other Unix clones, you can get it as part of <a href="http://swtch.com/plan9port/">Plan 9 from User Space</a>. If you use Windows, I suggest trying acme as packaged in <a href="http://code.google.com/p/acme-sac/">acme stand alone complex</a>, which is based on the Inferno programming environment. </p> <p class=lp><b>Mini-FAQ</b>: <ul> <li><i>Q. Can I use scalable fonts?</i> A. On the Mac, yes. If you run <code>acme -f /mnt/font/Monaco/16a/font</code> you get 16-point anti-aliased Monaco as your font, served via <a href="http://swtch.com/plan9port/man/man4/fontsrv.html">fontsrv</a>. If you'd like to add X11 support to fontsrv, I'd be happy to apply the patch. <li><i>Q. Do I need X11 to build on the Mac?</i> A. No. The build will complain that it cannot build &lsquo;snarfer&rsquo; but it should complete otherwise. You probably don't need snarfer. </ul> <p class=pp> If you're interested in history, the predecessor to acme was called help. Rob Pike's paper &ldquo;<a href="/help.pdf">A Minimalist Global User Interface</a>&rdquo; describes it. See also &ldquo;<a href="/sam.pdf">The Text Editor sam</a>.&rdquo; </p> <p class=pp> <i>Correction</i>: the smiley program in the video was written by Ken Thompson. I got it from Dennis Ritchie, the more meticulous archivist of the pair. </p> Minimal Boolean Formulas tag:research.swtch.com,2012:research.swtch.com/boolean 2011-05-18T00:00:00-04:00 2011-05-18T00:00:00-04:00 Simplify equations with God <p><style type="text/css"> p { line-height: 150%; } blockquote { text-align: left; } pre.alg { font-family: sans-serif; font-size: 100%; margin-left: 60px; } td, th { padding-left: 5px; padding-right: 5px; vertical-align: top; } #times td { text-align: right; } table { padding-top: 1em; padding-bottom: 1em; } #find td { text-align: center; } </style> <p class=lp> <a href="http://oeis.org/A056287">28</a>. That's the minimum number of AND or OR operators you need in order to write any Boolean function of five variables. <a href="http://alexhealy.net/">Alex Healy</a> and I computed that in April 2010. Until then, I believe no one had ever known that little fact. This post describes how we computed it and how we almost got scooped by <a href="http://research.swtch.com/2011/01/knuth-volume-4a.html">Knuth's Volume 4A</a>, which considers the problem for AND, OR, and XOR.
</p> <h3>A Naive Brute Force Approach</h3> <p class=pp> Any Boolean function of two variables can be written with at most 3 AND or OR operators: the parity function on two variables X XOR Y is (X AND Y') OR (X' AND Y), where X' denotes &ldquo;not X.&rdquo; We can shorten the notation by writing AND and OR like multiplication and addition: X XOR Y = X*Y' + X'*Y. </p> <p class=pp> For three variables, parity is also a hardest function, requiring 9 operators: X XOR Y XOR Z = (X*Z'+X'*Z+Y')*(X*Z+X'*Z'+Y). </p> <p class=pp> For four variables, parity is still a hardest function, requiring 15 operators: W XOR X XOR Y XOR Z = (X*Z'+X'*Z+W'*Y+W*Y')*(X*Z+X'*Z'+W*Y+W'*Y'). </p> <p class=pp> The sequence so far prompts a few questions. Is parity always a hardest function? Does the minimum number of operators alternate between 2<sup>n</sup>&#8722;1 and 2<sup>n</sup>+1? </p> <p class=pp> I computed these results in January 2001 after hearing the problem from Neil Sloane, who suggested it as a variant of a similar problem first studied by Claude Shannon. </p> <p class=pp> The program I wrote to compute a(4) computes the minimum number of operators for every Boolean function of n variables in order to find the largest minimum over all functions. There are 2<sup>4</sup> = 16 settings of four variables, and each function can pick its own value for each setting, so there are 2<sup>16</sup> different functions. To make matters worse, you build new functions by taking pairs of old functions and joining them with AND or OR. 2<sup>16</sup> different functions means 2<sup>16</sup>&#183;2<sup>16</sup> = 2<sup>32</sup> pairs of functions. </p> <p class=pp> The program I wrote was a mangling of the Floyd-Warshall all-pairs shortest paths algorithm. That algorithm is: </p> <pre class="indent alg">
// Floyd-Warshall all pairs shortest path
func compute():
	for each node i
		for each node j
			dist[i][j] = direct distance, or &#8734;

	for each node k
		for each node i
			for each node j
				d = dist[i][k] + dist[k][j]
				if d &lt; dist[i][j]
					dist[i][j] = d
	return
</pre> <p class=lp> The algorithm begins with the distance table dist[i][j] set to an actual distance if i is connected to j and infinity otherwise. Then each round updates the table to account for paths going through the node k: if it's shorter to go from i to k to j, it saves that shorter distance in the table. The nodes are numbered from 0 to n&#8722;1, so the variables i, j, k are simply integers. Because there are only n nodes, we know we'll be done after the outer loop finishes. </p> <p class=pp> The program I wrote to find minimum Boolean formula sizes is an adaptation, substituting formula sizes for distance. </p> <pre class="indent alg">
// Algorithm 1
func compute()
	for each function f
		size[f] = &#8734;

	for each single variable function f = v
		size[f] = 0

	loop
		changed = false
		for each function f
			for each function g
				d = size[f] + 1 + size[g]
				if d &lt; size[f OR g]
					size[f OR g] = d
					changed = true
				if d &lt; size[f AND g]
					size[f AND g] = d
					changed = true
		if not changed
			return
</pre> <p class=lp> Algorithm 1 runs the same kind of iterative update loop as the Floyd-Warshall algorithm, but it isn't as obvious when you can stop, because you don't know the maximum formula size beforehand. So it runs until a round doesn't find any new functions to make, iterating until it finds a fixed point.
</p> <p class=pp> The pseudocode above glosses over some details, such as the fact that the per-function loops can iterate over a queue of functions known to have finite size, so that each loop omits the functions that aren't yet known. That's only a constant factor improvement, but it's a useful one. </p> <p class=pp> Another important detail missing above is the representation of functions. The most convenient representation is a binary truth table. For example, if we are computing the complexity of two-variable functions, there are four possible inputs, which we can number as follows. </p> <center> <table> <tr><th>X <th>Y <th>Value <tr><td>false <td>false <td>00<sub>2</sub> = 0 <tr><td>false <td>true <td>01<sub>2</sub> = 1 <tr><td>true <td>false <td>10<sub>2</sub> = 2 <tr><td>true <td>true <td>11<sub>2</sub> = 3 </table> </center> <p class=pp> The functions are then the 4-bit numbers giving the value of the function for each input. For example, function 13 = 1101<sub>2</sub> is true for all inputs except X=false Y=true. Three-variable functions correspond to 3-bit inputs generating 8-bit truth tables, and so on. </p> <p class=pp> This representation has two key advantages. The first is that the numbering is dense, so that you can implement a map keyed by function using a simple array. The second is that the operations &ldquo;f AND g&rdquo; and &ldquo;f OR g&rdquo; can be implemented using bitwise operators: the truth table for &ldquo;f AND g&rdquo; is the bitwise AND of the truth tables for f and g. </p> <p class=pp> That program worked well enough in 2001 to compute the minimum number of operators necessary to write any 1-, 2-, 3-, and 4-variable Boolean function. Each round takes asymptotically O(2<sup>2<sup>n</sup></sup>&#183;2<sup>2<sup>n</sup></sup>) = O(2<sup>2<sup>n+1</sup></sup>) time, and the number of rounds needed is O(the final answer). The answer for n=4 is 15, so the computation required on the order of 15&#183;2<sup>2<sup>5</sup></sup> = 15&#183;2<sup>32</sup> iterations of the innermost loop. That was plausible on the computer I was using at the time, but the answer for n=5, likely around 30, would need 30&#183;2<sup>64</sup> iterations to compute, which seemed well out of reach. At the time, it seemed plausible that parity was always a hardest function and that the minimum size would continue to alternate between 2<sup>n</sup>&#8722;1 and 2<sup>n</sup>+1. It's a nice pattern. </p> <h3>Exploiting Symmetry</h3> <p class=pp> Five years later, though, Alex Healy and I got to talking about this sequence, and Alex shot down both conjectures using results from the theory of circuit complexity. (Theorists!) Neil Sloane added this note to the <a href="http://oeis.org/history?seq=A056287">entry for the sequence</a> in his Online Encyclopedia of Integer Sequences: </p> <blockquote> <tt> %E A056287 Russ Cox conjectures that X<sub>1</sub> XOR ... XOR X<sub>n</sub> is always a worst f and that a(5) = 33 and a(6) = 63. But (Jan 27 2006) Alex Healy points out that this conjecture is definitely false for large n. So what is a(5)? </tt> </blockquote> <p class=lp> Indeed. What is a(5)? No one knew, and it wasn't obvious how to find out. </p> <p class=pp> In January 2010, Alex and I started looking into ways to speed up the computation for a(5). 30&#183;2<sup>64</sup> is too many iterations but maybe we could find ways to cut that number. 
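</p> <p class=pp> Before starting to cut, it may help to see the brute force concretely. Here is a minimal sketch of Algorithm 1 in Go for the easy case n = 2, using the 4-bit truth-table representation described above; the constant names and encoding details are mine, not the original program's: </p> <pre class=indent>
package main

import "fmt"

// Functions are 4-bit truth tables: bit i is the value on input i,
// with X as the high input bit and Y as the low one, so f AND g and
// f OR g are bitwise AND and OR of the tables. The four literals
// X, X', Y, Y' are the size-0 starting functions.
func main() {
	const inf = 1 &lt;&lt; 30
	var size [16]int
	for f := range size {
		size[f] = inf
	}
	for _, v := range []int{0xc, 0x3, 0xa, 0x5} { // X, X', Y, Y'
		size[v] = 0
	}
	for changed := true; changed; {
		changed = false
		for f := 0; f &lt; 16; f++ {
			for g := 0; g &lt; 16; g++ {
				if size[f] == inf || size[g] == inf {
					continue
				}
				d := size[f] + 1 + size[g]
				if d &lt; size[f|g] {
					size[f|g], changed = d, true
				}
				if d &lt; size[f&amp;g] {
					size[f&amp;g], changed = d, true
				}
			}
		}
	}
	fmt.Println(size[0x6]) // parity X XOR Y: prints 3
}
</pre> <p class=lp> It prints 3, matching the hand-written formula for two-variable parity above. The optimizations that follow are about taming this same doubly nested loop once the truth tables are 32 bits wide and there are 2<sup>32</sup> of them.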
</p> <p class=pp> In general, if we can identify a class of functions f whose members are guaranteed to have the same complexity, then we can save just one representative of the class as long as we recreate the entire class in the loop body. What used to be: </p> <pre class="indent alg">
for each function f
	for each function g
		visit f AND g
		visit f OR g
</pre> <p class=lp> can be rewritten as </p> <pre class="indent alg">
for each canonical function f
	for each canonical function g
		for each ff equivalent to f
			for each gg equivalent to g
				visit ff AND gg
				visit ff OR gg
</pre> <p class=lp> That doesn't look like an improvement: it's doing all the same work. But it can open the door to new optimizations depending on the equivalences chosen. For example, the functions &ldquo;f&rdquo; and &ldquo;&#172;f&rdquo; are guaranteed to have the same complexity, by <a href="http://en.wikipedia.org/wiki/De_Morgan's_laws">DeMorgan's laws</a>. If we keep just one of those two on the lists that &ldquo;for each function&rdquo; iterates over, we can unroll the inner two loops, producing: </p> <pre class="indent alg">
for each canonical function f
	for each canonical function g
		visit f OR g
		visit f AND g
		visit &#172;f OR g
		visit &#172;f AND g
		visit f OR &#172;g
		visit f AND &#172;g
		visit &#172;f OR &#172;g
		visit &#172;f AND &#172;g
</pre> <p class=lp> That's still not an improvement, but it's no worse. Each of the two loops considers half as many functions but the inner iteration is four times longer. Now we can notice that half of the tests aren't worth doing: &ldquo;f AND g&rdquo; is the negation of &ldquo;&#172;f OR &#172;g,&rdquo; and so on, so only half of them are necessary. </p> <p class=pp> Let's suppose that when choosing between &ldquo;f&rdquo; and &ldquo;&#172;f&rdquo; we keep the one that is false when presented with all true inputs. (This has the nice property that <code>f ^ (int32(f) &gt;&gt; 31)</code> is the truth table for the canonical form of <code>f</code>.) Then we can tell which combinations above will produce canonical functions when f and g are already canonical: </p> <pre class="indent alg">
for each canonical function f
	for each canonical function g
		visit f OR g
		visit f AND g
		visit &#172;f AND g
		visit f AND &#172;g
</pre> <p class=lp> That's a factor of two improvement over the original loop. </p> <p class=pp> Another observation is that permuting the inputs to a function doesn't change its complexity: &ldquo;f(V, W, X, Y, Z)&rdquo; and &ldquo;f(Z, Y, X, W, V)&rdquo; will have the same minimum size. For complex functions, each of the 5! = 120 permutations will produce a different truth table. A factor of 120 reduction in storage is good but again we have the problem of expanding the class in the iteration. This time, there's a different trick for reducing the work in the innermost iteration. Since we only need to produce one member of the equivalence class, it doesn't make sense to permute the inputs to both f and g. Instead, permuting just the inputs to f while fixing g is guaranteed to hit at least one member of each class that permuting both f and g would. So we gain the factor of 120 twice in the loops and lose it once in the iteration, for a net savings of 120. (In some ways, this is the same trick we did with &ldquo;f&rdquo; vs &ldquo;&#172;f.&rdquo;) </p> <p class=pp> A final observation is that negating any of the inputs to the function doesn't change its complexity, because X and X' have the same complexity.
The same argument we used for permutations applies here, for another constant factor of 2<sup>5</sup> = 32. </p> <p class=pp> The code stores a single function for each equivalence class and then recomputes the equivalent functions for f, but not g. </p> <pre class="indent alg">
for each canonical function f
	for each function ff equivalent to f
		for each canonical function g
			visit ff OR g
			visit ff AND g
			visit &#172;ff AND g
			visit ff AND &#172;g
</pre> <p class=lp> In all, we just got a savings of 2&#183;120&#183;32 = 7680, cutting the total number of iterations from 30&#183;2<sup>64</sup> = 5&#215;10<sup>20</sup> to 7&#215;10<sup>16</sup>. If you figure we can do around 10<sup>9</sup> iterations per second, that's still 800 days of CPU time. </p> <p class=pp> The full algorithm at this point is: </p> <pre class="indent alg">
// Algorithm 2
func compute():
	for each function f
		size[f] = &#8734;

	for each single variable function f = v
		size[f] = 0

	loop
		changed = false
		for each canonical function f
			for each function ff equivalent to f
				for each canonical function g
					d = size[ff] + 1 + size[g]
					changed |= visit(d, ff OR g)
					changed |= visit(d, ff AND g)
					changed |= visit(d, ff AND &#172;g)
					changed |= visit(d, &#172;ff AND g)
		if not changed
			return

func visit(d, fg):
	if size[fg] != &#8734;
		return false

	record fg as canonical

	for each function ffgg equivalent to fg
		size[ffgg] = d
	return true
</pre> <p class=lp> The helper function &ldquo;visit&rdquo; must set the size not only of its argument fg but also all equivalent functions under permutation or inversion of the inputs, so that future tests will see that they have been computed. </p> <h3>Methodical Exploration</h3> <p class=pp> There's one final improvement we can make. The approach of looping until things stop changing considers each function pair multiple times as their sizes go down. Instead, we can consider functions in order of complexity, so that the main loop builds first all the functions of minimum complexity 1, then all the functions of minimum complexity 2, and so on. If we do that, we'll consider each function pair at most once. We can stop when all functions are accounted for. </p> <p class=pp> Applying this idea to Algorithm 1 (before canonicalization) yields: </p> <pre class="indent alg">
// Algorithm 3
func compute()
	for each function f
		size[f] = &#8734;

	for each single variable function f = v
		size[f] = 0

	for k = 1 to &#8734;
		for each function f
			for each function g of size k &#8722; size(f) &#8722; 1
				if size[f AND g] == &#8734;
					size[f AND g] = k
					nsize++
				if size[f OR g] == &#8734;
					size[f OR g] = k
					nsize++
		if nsize == 2<sup>2<sup>n</sup></sup>
			return
</pre> <p class=lp> Applying the idea to Algorithm 2 (after canonicalization) yields: </p> <pre class="indent alg">
// Algorithm 4
func compute():
	for each function f
		size[f] = &#8734;

	for each single variable function f = v
		size[f] = 0

	for k = 1 to &#8734;
		for each canonical function f
			for each function ff equivalent to f
				for each canonical function g of size k &#8722; size(f) &#8722; 1
					visit(k, ff OR g)
					visit(k, ff AND g)
					visit(k, ff AND &#172;g)
					visit(k, &#172;ff AND g)
		if nvisited == 2<sup>2<sup>n</sup></sup>
			return

func visit(d, fg):
	if size[fg] != &#8734;
		return

	record fg as canonical

	for each function ffgg equivalent to fg
		if size[ffgg] == &#8734;
			size[ffgg] = d
			nvisited += 2  // counts ffgg and &#172;ffgg
	return
</pre> <p class=lp> The original loop in Algorithms 1 and 2 considered each pair f, g in every iteration of the loop after they were computed.
The new loop in Algorithms 3 and 4 considers each pair f, g only once, when k = size(f) + size(g) + 1. This removes the leading factor of 30 (the number of times we expected the first loop to run) from our estimation of the run time. Now the expected number of iterations is around 2<sup>64</sup>/7680 = 2.4&#215;10<sup>15</sup>. If we can do 10<sup>9</sup> iterations per second, that's only 28 days of CPU time, which I can deliver if you can wait a month. </p> <p class=pp> Our estimate does not include the fact that not all function pairs need to be considered. For example, if the maximum size is 30, then the functions of size 14 need never be paired against the functions of size 16, because any result would have size 14+1+16 = 31. So even 2.4&#215;10<sup>15</sup> is an overestimate, but it's in the right ballpark. (With hindsight I can report that only 1.7&#215;10<sup>14</sup> pairs need to be considered but also that our estimate of 10<sup>9</sup> iterations per second was optimistic. The actual calculation ran for 20 days, an average of about 10<sup>8</sup> iterations per second.) </p> <h3>Endgame: Directed Search</h3> <p class=pp> A month is still a long time to wait, and we can do better. Near the end (after k is bigger than, say, 22), we are exploring the fairly large space of function pairs in hopes of finding a fairly small number of remaining functions. At that point it makes sense to change from the bottom-up &ldquo;bang things together and see what we make&rdquo; to the top-down &ldquo;try to make this one of these specific functions.&rdquo; That is, the core of the current search is: </p> <pre class="indent alg">
for each canonical function f
	for each function ff equivalent to f
		for each canonical function g of size k &#8722; size(f) &#8722; 1
			visit(k, ff OR g)
			visit(k, ff AND g)
			visit(k, ff AND &#172;g)
			visit(k, &#172;ff AND g)
</pre> <p class=lp> We can change it to: </p> <pre class="indent alg">
for each missing function fg
	for each canonical function g
		for all possible f such that one of these holds
				* fg = f OR g
				* fg = f AND g
				* fg = &#172;f AND g
				* fg = f AND &#172;g
			if size[f] == k &#8722; size(g) &#8722; 1
				visit(k, fg)
				next fg
</pre> <p class=lp> By the time we're at the end, exploring all the possible f to make the missing functions&#8212;a directed search&#8212;is much less work than the brute force of exploring all combinations. </p> <p class=pp> As an example, suppose we are looking for f such that fg = f OR g. The equation is only possible to satisfy if fg OR g == fg. That is, if g has any extraneous 1 bits, no f will work, so we can move on. Otherwise, the remaining condition is that f AND &#172;g == fg AND &#172;g. That is, for the bit positions where g is 0, f must match fg. The other bits of f (the bits where g has 1s) can take any value. We can enumerate the possible f values by recursively trying all possible values for the &ldquo;don't care&rdquo; bits.
</p> <pre class="indent alg"> func find(x, any, xsize): if size(x) == xsize return x while any != 0 bit = any AND &#8722;any // rightmost 1 bit in any any = any AND &#172;bit if f = find(x OR bit, any, xsize) succeeds return f return failure </pre> <p class=lp> It doesn't matter which 1 bit we choose for the recursion, but finding the rightmost 1 bit is cheap: it is isolated by the (admittedly surprising) expression &ldquo;any AND &#8722;any.&rdquo; </p> <p class=pp> Given <code>find</code>, the loop above can try these four cases: </p> <center> <table id=find> <tr><th>Formula <th>Condition <th>Base x <th>&ldquo;Any&rdquo; bits <tr><td>fg = f OR g <td>fg OR g == fg <td>fg AND &#172;g <td>g <tr><td>fg = f OR &#172;g <td>fg OR &#172;g == fg <td>fg AND g <td>&#172;g <tr><td>&#172;fg = f OR g <td>&#172;fg OR g == fg <td>&#172;fg AND &#172;g <td>g <tr><td>&#172;fg = f OR &#172;g <td>&#172;fg OR &#172;g == &#172;fg <td>&#172;fg AND g <td>&#172;g </table> </center> <p class=lp> Rewriting the Boolean expressions to use only the four OR forms means that we only need to write the &ldquo;adding bits&rdquo; version of find. </p> <p class=pp> The final algorithm is: </p> <pre class="indent alg"> // Algorithm 5 func compute(): for each function f size[f] = &#8734; for each single variable function f = v size[f] = 0 // Generate functions. for k = 1 to max_generate for each canonical function f for each function ff equivalent to f for each canonical function g of size k &#8722; size(f) &#8722; 1 visit(k, ff OR g) visit(k, ff AND g) visit(k, ff AND &#172;g) visit(k, &#172;ff AND g) // Search for functions. for k = max_generate+1 to &#8734; for each missing function fg for each canonical function g fsize = k &#8722; size(g) &#8722; 1 if fg OR g == fg if f = find(fg AND &#172;g, g, fsize) succeeds visit(k, fg) next fg if fg OR &#172;g == fg if f = find(fg AND g, &#172;g, fsize) succeeds visit(k, fg) next fg if &#172;fg OR g == &#172;fg if f = find(&#172;fg AND &#172;g, g, fsize) succeeds visit(k, fg) next fg if &#172;fg OR &#172;g == &#172;fg if f = find(&#172;fg AND g, &#172;g, fsize) succeeds visit(k, fg) next fg if nvisited == 2<sup>2<sup>n</sup></sup> return func visit(d, fg): if size[fg] != &#8734; return record fg as canonical for each function ffgg equivalent to fg if size[ffgg] != &#8734; size[ffgg] = d nvisited += 2 // counts ffgg and &#172;ffgg return func find(x, any, xsize): if size(x) == xsize return x while any != 0 bit = any AND &#8722;any // rightmost 1 bit in any any = any AND &#172;bit if f = find(x OR bit, any, xsize) succeeds return f return failure </pre> <p class=lp> To get a sense of the speedup here, and to check my work, I ran the program using both algorithms on a 2.53 GHz Intel Core 2 Duo E7200. 
</p> <center> <table id=times> <tr><th> <th colspan=3>&#8212;&#8212;&#8212;&#8212;&#8212; # of Functions &#8212;&#8212;&#8212;&#8212;&#8212;<th colspan=2>&#8212;&#8212;&#8212;&#8212; Time &#8212;&#8212;&#8212;&#8212; <tr><th>Size <th>Canonical <th>All <th>All, Cumulative <th>Generate <th>Search <tr><td>0 <td>1 <td>10 <td>10 <tr><td>1 <td>2 <td>82 <td>92 <td>&lt; 0.1 seconds <td>3.4 minutes <tr><td>2 <td>2 <td>640 <td>732 <td>&lt; 0.1 seconds <td>7.2 minutes <tr><td>3 <td>7 <td>4420 <td>5152 <td>&lt; 0.1 seconds <td>12.3 minutes <tr><td>4 <td>19 <td>25276 <td>29696 <td>&lt; 0.1 seconds <td>30.1 minutes <tr><td>5 <td>44 <td>117440 <td>147136 <td>&lt; 0.1 seconds <td>1.3 hours <tr><td>6 <td>142 <td>515040 <td>662176 <td>&lt; 0.1 seconds <td>3.5 hours <tr><td>7 <td>436 <td>1999608 <td>2661784 <td>0.2 seconds <td>11.6 hours <tr><td>8 <td>1209 <td>6598400 <td>9260184 <td>0.6 seconds <td>1.7 days <tr><td>9 <td>3307 <td>19577332 <td>28837516 <td>1.7 seconds <td>4.9 days <tr><td>10 <td>7741 <td>50822560 <td>79660076 <td>4.6 seconds <td>[ 10 days ? ] <tr><td>11 <td>17257 <td>114619264 <td>194279340 <td>10.8 seconds <td>[ 20 days ? ] <tr><td>12 <td>31851 <td>221301008 <td>415580348 <td>21.7 seconds <td>[ 50 days ? ] <tr><td>13 <td>53901 <td>374704776 <td>790285124 <td>38.5 seconds <td>[ 80 days ? ] <tr><td>14 <td>75248 <td>533594528 <td>1323879652 <td>58.7 seconds <td>[ 100 days ? ] <tr><td>15 <td>94572 <td>667653642 <td>1991533294 <td>1.5 minutes <td>[ 120 days ? ] <tr><td>16 <td>98237 <td>697228760 <td>2688762054 <td>2.1 minutes <td>[ 120 days ? ] <tr><td>17 <td>89342 <td>628589440 <td>3317351494 <td>4.1 minutes <td>[ 90 days ? ] <tr><td>18 <td>66951 <td>468552896 <td>3785904390 <td>9.1 minutes <td>[ 50 days ? ] <tr><td>19 <td>41664 <td>287647616 <td>4073552006 <td>23.4 minutes <td>[ 30 days ? ] <tr><td>20 <td>21481 <td>144079832 <td>4217631838 <td>57.0 minutes <td>[ 10 days ? ] <tr><td>21 <td>8680 <td>55538224 <td>4273170062 <td>2.4 hours <td>2.5 days <tr><td>22 <td>2730 <td>16099568 <td>4289269630 <td>5.2 hours <td>11.7 hours <tr><td>23 <td>937 <td>4428800 <td>4293698430 <td>11.2 hours <td>2.2 hours <tr><td>24 <td>228 <td>959328 <td>4294657758 <td>22.0 hours <td>33.2 minutes <tr><td>25 <td>103 <td>283200 <td>4294940958 <td>1.7 days <td>4.0 minutes <tr><td>26 <td>21 <td>22224 <td>4294963182 <td>2.9 days <td>42 seconds <tr><td>27 <td>10 <td>3602 <td>4294966784 <td>4.7 days <td>2.4 seconds <tr><td>28 <td>3 <td>512 <td>4294967296 <td>[ 7 days ? ] <td>0.1 seconds </table> </center> <p class=pp> The bracketed times are estimates based on the work involved: I did not wait that long for the intermediate search steps. The search algorithm is quite a bit worse than generate until there are very few functions left to find. However, it comes in handy just when it is most useful: when the generate algorithm has slowed to a crawl. If we run generate through formulas of size 22 and then switch to search for 23 onward, we can run the whole computation in just over half a day of CPU time. </p> <p class=pp> The computation of a(5) identified the sizes of all 616,126 canonical Boolean functions of 5 inputs. In contrast, there are <a href="http://oeis.org/A000370">just over 200 trillion canonical Boolean functions of 6 inputs</a>. Determining a(6) is unlikely to happen by brute force computation, no matter what clever tricks we use. </p> <h3>Adding XOR</h3> <p class=pp>We've assumed the use of just AND and OR as our basis for the Boolean formulas. 
If we also allow XOR, functions can be written using many fewer operators. In particular, a hardest function for the 1-, 2-, 3-, and 4-input cases&#8212;parity&#8212;is now trivial. Knuth examines the complexity of 5-input Boolean functions using AND, OR, and XOR in detail in <a href="http://www-cs-faculty.stanford.edu/~uno/taocp.html">The Art of Computer Programming, Volume 4A</a>. Section 7.1.2's Algorithm L is the same as our Algorithm 3 above, given for computing 4-input functions. Knuth mentions that to adapt it for 5-input functions one must treat only canonical functions and gives results for 5-input functions with XOR allowed. So another way to check our work is to add XOR to our Algorithm 4 and check that our results match Knuth's. </p> <p class=pp> Because the minimum formula sizes are smaller (at most 12), the computation of sizes with XOR is much faster than before: </p> <center> <table> <tr><th> <th><th colspan=5>&#8212;&#8212;&#8212;&#8212;&#8212; # of Functions &#8212;&#8212;&#8212;&#8212;&#8212;<th> <tr><th>Size <th width=10><th>Canonical <th width=10><th>All <th width=10><th>All, Cumulative <th width=10><th>Time <tr><td align=right>0 <td><td align=right>1 <td><td align=right>10 <td><td align=right>10 <td><td> <tr><td align=right>1 <td><td align=right>3 <td><td align=right>102 <td><td align=right>112 <td><td align=right>&lt; 0.1 seconds <tr><td align=right>2 <td><td align=right>5 <td><td align=right>1140 <td><td align=right>1252 <td><td align=right>&lt; 0.1 seconds <tr><td align=right>3 <td><td align=right>20 <td><td align=right>11570 <td><td align=right>12822 <td><td align=right>&lt; 0.1 seconds <tr><td align=right>4 <td><td align=right>93 <td><td align=right>109826 <td><td align=right>122648 <td><td align=right>&lt; 0.1 seconds <tr><td align=right>5 <td><td align=right>366 <td><td align=right>936440 <td><td align=right>1059088 <td><td align=right>0.1 seconds <tr><td align=right>6 <td><td align=right>1730 <td><td align=right>7236880 <td><td align=right>8295968 <td><td align=right>0.7 seconds <tr><td align=right>7 <td><td align=right>8782 <td><td align=right>47739088 <td><td align=right>56035056 <td><td align=right>4.5 seconds <tr><td align=right>8 <td><td align=right>40297 <td><td align=right>250674320 <td><td align=right>306709376 <td><td align=right>24.0 seconds <tr><td align=right>9 <td><td align=right>141422 <td><td align=right>955812256 <td><td align=right>1262521632 <td><td align=right>95.5 seconds <tr><td align=right>10 <td><td align=right>273277 <td><td align=right>1945383936 <td><td align=right>3207905568 <td><td align=right>200.7 seconds <tr><td align=right>11 <td><td align=right>145707 <td><td align=right>1055912608 <td><td align=right>4263818176 <td><td align=right>121.2 seconds <tr><td align=right>12 <td><td align=right>4423 <td><td align=right>31149120 <td><td align=right>4294967296 <td><td align=right>65.0 seconds </table> </center> <p class=pp> Knuth does not discuss anything like Algorithm 5, because the search for specific functions does not apply to the AND, OR, and XOR basis. XOR is a non-monotone function (it can both turn bits on and turn bits off), so there is no test like our &ldquo;<code>if fg OR g == fg</code>&rdquo; and no small set of &ldquo;don't care&rdquo; bits to trim the search for f. The search for an appropriate f in the XOR case would have to try all f of the right size, which is exactly what Algorithm 4 already does. 
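</p> <p class=pp> In the AND and OR basis, by contrast, the feasibility test and the &ldquo;don't care&rdquo; recursion are only a few lines of real code. Here is a sketch in Go of the search step for the first of the four cases, fg = f OR g; the function names and the size-table plumbing are my own assumptions, not the actual program: </p> <pre class=indent>
// One case of the directed search: find f with fg = f OR g.
// size is the table from Algorithm 5; fsize = k - size(g) - 1.
func searchOr(fg, g uint32, fsize int, size []int) (uint32, bool) {
	if fg|g != fg {
		return 0, false // g has 1 bits outside fg: no f can work
	}
	// f must match fg wherever g is 0; where g is 1, f is free.
	return find(fg&amp;^g, g, fsize, size)
}

// find tries all ways of adding "don't care" bits (any) to the
// required bits (x), looking for a function of size xsize.
func find(x, any uint32, xsize int, size []int) (uint32, bool) {
	if size[x] == xsize {
		return x, true
	}
	for any != 0 {
		bit := any &amp; -any // isolate the rightmost 1 bit
		any &amp;^= bit
		if f, ok := find(x|bit, any, xsize, size); ok {
			return f, true
		}
	}
	return 0, false
}
</pre> <p class=lp> The up-front <code>fg|g != fg</code> rejection and the small set of free bits are exactly the dividends of monotonicity that XOR doesn't pay.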
</p> <p class=pp> Volume 4A also considers the problem of building minimal circuits, which are like formulas but can use common subexpressions additional times for free, and the problem of building the shallowest possible circuits. See Section 7.1.2 for all the details. </p> <h3>Code and Web Site</h3> <p class=pp> The web site <a href="http://boolean-oracle.swtch.com">boolean-oracle.swtch.com</a> lets you type in a Boolean expression and gives back the minimal formula for it. It uses tables generated while running Algorithm 5; those tables and the programs described in this post are also <a href="http://boolean-oracle.swtch.com/about">available on the site</a>. </p> <h3>Postscript: Generating All Permutations and Inversions</h3> <p class=pp> The algorithms given above depend crucially on the step &ldquo;<code>for each function ff equivalent to f</code>,&rdquo; which generates all the ff obtained by permuting or inverting inputs to f, but I did not explain how to do that. We already saw that we can manipulate the binary truth table representation directly to turn <code>f</code> into <code>&#172;f</code> and to compute combinations of functions. We can also manipulate the binary representation directly to invert a specific input or swap a pair of adjacent inputs. Using those operations we can cycle through all the equivalent functions. </p> <p class=pp> To invert a specific input, let's consider the structure of the truth table. The index of a bit in the truth table encodes the inputs for that entry. For example, the low bit of the index gives the value of the first input. So the even-numbered bits&#8212;at indices 0, 2, 4, 6, ...&#8212;correspond to the first input being false, while the odd-numbered bits&#8212;at indices 1, 3, 5, 7, ...&#8212;correspond to the first input being true. Changing just that bit in the index corresponds to changing the single variable, so indices 0, 1 differ only in the value of the first input, as do 2, 3, and 4, 5, and 6, 7, and so on. Given the truth table for f(V, W, X, Y, Z) we can compute the truth table for f(&#172;V, W, X, Y, Z) by swapping adjacent bit pairs in the original truth table. Even better, we can do all the swaps in parallel using a bitwise operation. To invert a different input, we swap larger runs of bits. </p> <center> <table> <tr><th>Function <th width=10> <th>Truth Table (<span style="font-weight: normal;"><code>f</code> = f(V, W, X, Y, Z)</span>) <tr><td>f(&#172;V, W, X, Y, Z) <td><td><code>(f&amp;0x55555555)&lt;&lt;&nbsp;1 | (f&gt;&gt;&nbsp;1)&amp;0x55555555</code> <tr><td>f(V, &#172;W, X, Y, Z) <td><td><code>(f&amp;0x33333333)&lt;&lt;&nbsp;2 | (f&gt;&gt;&nbsp;2)&amp;0x33333333</code> <tr><td>f(V, W, &#172;X, Y, Z) <td><td><code>(f&amp;0x0f0f0f0f)&lt;&lt;&nbsp;4 | (f&gt;&gt;&nbsp;4)&amp;0x0f0f0f0f</code> <tr><td>f(V, W, X, &#172;Y, Z) <td><td><code>(f&amp;0x00ff00ff)&lt;&lt;&nbsp;8 | (f&gt;&gt;&nbsp;8)&amp;0x00ff00ff</code> <tr><td>f(V, W, X, Y, &#172;Z) <td><td><code>(f&amp;0x0000ffff)&lt;&lt;16 | (f&gt;&gt;16)&amp;0x0000ffff</code> </table> </center> <p class=lp> Being able to invert a specific input lets us consider all possible inversions by building them up one at a time. 
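</p> <p class=pp> As a quick check of the first row of the table, here is that parallel pair swap written out as a Go sketch (the function name and the demonstration value are mine): </p> <pre class=indent>
package main

import "fmt"

// flipV negates the first input V of a 5-input function. The low
// bit of a truth-table index is V, so swapping adjacent bits of
// the 32-bit table does all 16 index pairs in parallel.
func flipV(f uint32) uint32 {
	return (f&amp;0x55555555)&lt;&lt;1 | (f&gt;&gt;1)&amp;0x55555555
}

func main() {
	const v uint32 = 0xaaaaaaaa   // truth table of f(V, W, X, Y, Z) = V
	fmt.Printf("%#x\n", flipV(v)) // prints 0x55555555, the table of &#172;V
}
</pre> <p class=lp>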
The <a href="http://oeis.org/A003188">Gray code</a> lets us enumerate all possible 5-bit input codes while changing only 1 bit at a time as we move from one input to the next: </p> <center> 0, 1, 3, 2, 6, 7, 5, 4, <br> 12, 13, 15, 14, 10, 11, 9, 8, <br> 24, 25, 27, 26, 30, 31, 29, 28, <br> 20, 21, 23, 22, 18, 19, 17, 16 </center> <p class=lp> This minimizes the number of inversions we need: to consider all 32 cases, we only need 31 inversion operations. In contrast, visiting the 5-bit input codes in the usual binary order 0, 1, 2, 3, 4, ... would often need to change multiple bits, like when changing from 3 to 4. </p> <p class=pp> To swap a pair of adjacent inputs, we can again take advantage of the truth table. For a pair of inputs, there are four cases: 00, 01, 10, and 11. We can leave the 00 and 11 cases alone, because they are invariant under swapping, and concentrate on swapping the 01 and 10 bits. The first two inputs change most often in the truth table: each run of 4 bits corresponds to those four cases. In each run, we want to leave the first and fourth alone and swap the second and third. For later inputs, the four cases consist of sections of bits instead of single bits. </p> <center> <table> <tr><th>Function <th width=10> <th>Truth Table (<span style="font-weight: normal;"><code>f</code> = f(V, W, X, Y, Z)</span>) <tr><td>f(<b>W, V</b>, X, Y, Z) <td><td><code>f&amp;0x99999999 | (f&amp;0x22222222)&lt;&lt;1 | (f&gt;&gt;1)&amp;0x22222222</code> <tr><td>f(V, <b>X, W</b>, Y, Z) <td><td><code>f&amp;0xc3c3c3c3 | (f&amp;0x0c0c0c0c)&lt;&lt;1 | (f&gt;&gt;1)&amp;0x0c0c0c0c</code> <tr><td>f(V, W, <b>Y, X</b>, Z) <td><td><code>f&amp;0xf00ff00f | (f&amp;0x00f000f0)&lt;&lt;1 | (f&gt;&gt;1)&amp;0x00f000f0</code> <tr><td>f(V, W, X, <b>Z, Y</b>) <td><td><code>f&amp;0xff0000ff | (f&amp;0x0000ff00)&lt;&lt;8 | (f&gt;&gt;8)&amp;0x0000ff00</code> </table> </center> <p class=lp> Being able to swap a pair of adjacent inputs lets us consider all possible permutations by building them up one at a time. Again it is convenient to have a way to visit all permutations by applying only one swap at a time. Here Volume 4A comes to the rescue. Section 7.2.1.2 is titled &ldquo;Generating All Permutations,&rdquo; and Knuth delivers many algorithms to do just that. The most convenient for our purposes is Algorithm P, which generates a sequence that considers all permutations exactly once with only a single swap of adjacent inputs between steps. Knuth calls it Algorithm P because it corresponds to the &ldquo;Plain changes&rdquo; algorithm used by <a href="http://en.wikipedia.org/wiki/Change_ringing">bell ringers in 17th century England</a> to ring a set of bells in all possible permutations. The algorithm is described in a manuscript written around 1653! </p> <p class=pp> We can examine all possible permutations and inversions by nesting a loop over all permutations inside a loop over all inversions, and in fact that's what my program does. Knuth does one better, though: his Exercise 7.2.1.2-20 suggests that it is possible to build up all the possibilities using only adjacent swaps and inversion of the first input. Negating arbitrary inputs is not hard, though, and still does minimal work, so the code sticks with Gray codes and Plain changes. </p></p> Zip Files All The Way Down tag:research.swtch.com,2012:research.swtch.com/zip 2010-03-18T00:00:00-04:00 2010-03-18T00:00:00-04:00 Did you think it was turtles? 
<p><p class=lp> Stephen Hawking begins <i><a href="http://www.amazon.com/-/dp/0553380168">A Brief History of Time</a></i> with this story: </p> <blockquote> <p class=pp> A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast collection of stars called our galaxy. At the end of the lecture, a little old lady at the back of the room got up and said: &ldquo;What you have told us is rubbish. The world is really a flat plate supported on the back of a giant tortoise.&rdquo; The scientist gave a superior smile before replying, &ldquo;What is the tortoise standing on?&rdquo; &ldquo;You're very clever, young man, very clever,&rdquo; said the old lady. &ldquo;But it's turtles all the way down!&rdquo; </p> </blockquote> <p class=lp> Scientists today are pretty sure that the universe is not actually turtles all the way down, but we can create that kind of situation in other contexts. For example, here we have <a href="http://www.youtube.com/watch?v=Y-gqMTt3IUg">video monitors all the way down</a> and <a href="http://www.amazon.com/gp/customer-media/product-gallery/0387900926/ref=cm_ciu_pdp_images_all">set theory books all the way down</a>, and <a href="http://blog.makezine.com/archive/2009/01/thousands_of_shopping_carts_stake_o.html">shopping carts all the way down</a>. </p> <p class=pp> And here's a computer storage equivalent: look inside <a href="http://swtch.com/r.zip"><code>r.zip</code></a>. It's zip files all the way down: each one contains another zip file under the name <code>r/r.zip</code>. (For the die-hard Unix fans, <a href="http://swtch.com/r.tar.gz"><code>r.tar.gz</code></a> is gzipped tar files all the way down.) Like the line of shopping carts, it never ends, because it loops back onto itself: the zip file contains itself! And it's probably less work to put together a self-reproducing zip file than to put together all those shopping carts, at least if you're the kind of person who would read this blog. This post explains how. </p> <p class=pp> Before we get to self-reproducing zip files, though, we need to take a brief detour into self-reproducing programs. </p> <h3>Self-reproducing programs</h3> <p class=pp> The idea of self-reproducing programs dates back to the 1960s. My favorite statement of the problem is the one Ken Thompson gave in his 1983 Turing Award address: </p> <blockquote> <p class=pp> In college, before video games, we would amuse ourselves by posing programming exercises. One of the favorites was to write the shortest self-reproducing program. Since this is an exercise divorced from reality, the usual vehicle was FORTRAN. Actually, FORTRAN was the language of choice for the same reason that three-legged races are popular. </p> <p class=pp> More precisely stated, the problem is to write a source program that, when compiled and executed, will produce as output an exact copy of its source. If you have never done this, I urge you to try it on your own. The discovery of how to do it is a revelation that far surpasses any benefit obtained by being told how to do it. The part about &ldquo;shortest&rdquo; was just an incentive to demonstrate skill and determine a winner. </p> </blockquote> <p class=lp> <b>Spoiler alert!</b> I agree: if you have never done this, I urge you to try it on your own. The internet makes it so easy to look things up that it's refreshing to discover something yourself once in a while. 
Go ahead and spend a few days figuring it out. This blog will still be here when you get back. (If you don't mind the spoilers, the entire <a href="http://cm.bell-labs.com/who/ken/trust.html">Turing award address</a> is worth reading.) </p> <center> <br><br> <i>(Spoiler blocker.)</i> <br> <a href="http://www.robertwechsler.com/projects.html"><img src="http://research.swtch.com/applied_geometry.jpg"></a> <br> <i><a href="http://www.robertwechsler.com/projects.html">http://www.robertwechsler.com/projects.html</a></i> <br><br> </center> <p class=pp> Let's try to write a Python program that prints itself. It will probably be a <code>print</code> statement, so here's a first attempt, run at the interpreter prompt: </p> <pre class=indent>
&gt;&gt;&gt; print '<span style="color: #005500">hello</span>'
hello
</pre> <p class=lp> That didn't quite work. But now we know what the program is, so let's print it: </p> <pre class=indent>
&gt;&gt;&gt; print "<span style="color: #005500">print 'hello'</span>"
print 'hello'
</pre> <p class=lp> That didn't quite work either. The problem is that when you execute a simple print statement, it only prints part of itself: the argument to the print. We need a way to print the rest of the program too. </p> <p class=pp> The trick is to use recursion: you write a string that is the whole program, but with itself missing, and then you plug it into itself before passing it to print. </p> <pre class=indent>
&gt;&gt;&gt; s = '<span style="color: #005500">print %s</span>'; print s % repr(s)
print 'print %s'
</pre> <p class=lp> Not quite, but closer: the problem is that the string <code>s</code> isn't actually the program. But now we know the general form of the program: <code>s = '<span style="color: #005500">%s</span>'; print s % repr(s)</code>. That's the string to use. </p> <pre class=indent>
&gt;&gt;&gt; s = '<span style="color: #005500">s = %s; print s %% repr(s)</span>'; print s % repr(s)
s = 's = %s; print s %% repr(s)'; print s % repr(s)
</pre> <p class=lp> Recursion for the win. </p> <p class=pp> This form of self-reproducing program is often called a <a href="http://en.wikipedia.org/wiki/Quine_(computing)">quine</a>, in honor of the philosopher and logician W. V. O. Quine, who discovered the paradoxical sentence: </p> <blockquote> &ldquo;Yields falsehood when preceded by its quotation&rdquo;<br>yields falsehood when preceded by its quotation. </blockquote> <p class=lp> The simplest English form of a self-reproducing quine is a command like: </p> <blockquote> Print this, followed by its quotation:<br>&ldquo;Print this, followed by its quotation:&rdquo; </blockquote> <p class=lp> There's nothing particularly special about Python that makes quining possible.
The most elegant quine I know is a Scheme program that is a direct, if somewhat inscrutable, translation of that sentiment: </p> <pre class=indent>
((lambda (x) `<span style="color: #005500">(</span>,x <span style="color: #005500">'</span>,x<span style="color: #005500">)</span>) '<span style="color: #005500">(lambda (x) `(,x ',x))</span>)
</pre> <p class=lp> I think the Go version is a clearer translation, at least as far as the quoting is concerned: </p> <pre class=indent>
/* Go quine */
package main

import "<span style="color: #005500">fmt</span>"

func main() {
	fmt.Printf("<span style="color: #005500">%s%c%s%c\n</span>", q, 0x60, q, 0x60)
}

var q = `<span style="color: #005500">/* Go quine */
package main

import "fmt"

func main() {
	fmt.Printf("%s%c%s%c\n", q, 0x60, q, 0x60)
}

var q = </span>`
</pre> <p class=lp>(I've colored the data literals green throughout to make it clear what is program and what is data.)</p> <p class=pp>The Go program has the interesting property that, ignoring the pesky newline at the end, the entire program is the same thing twice (<code>/* Go quine */ ... q = `</code>). That got me thinking: maybe it's possible to write a self-reproducing program using only a repetition operator. And you know what programming language has essentially only a repetition operator? The language used to encode Lempel-Ziv compressed files like the ones used by <code>gzip</code> and <code>zip</code>. </p> <h3>Self-reproducing Lempel-Ziv programs</h3> <p class=pp> Lempel-Ziv compressed data is a stream of instructions with two basic opcodes: <code>literal(</code><i>n</i><code>)</code> followed by <i>n</i> bytes of data means write those <i>n</i> bytes into the decompressed output, and <code>repeat(</code><i>d</i><code>,</code> <i>n</i><code>)</code> means look backward <i>d</i> bytes from the current location in the decompressed output and copy the <i>n</i> bytes you find there into the output stream. </p> <p class=pp> The programming exercise, then, is this: write a Lempel-Ziv program using just those two opcodes that prints itself when run. In other words, write a compressed data stream that decompresses to itself. Feel free to assume any reasonable encoding for the <code>literal</code> and <code>repeat</code> opcodes. For the grand prize, find a program that decompresses to itself surrounded by an arbitrary prefix and suffix, so that the sequence could be embedded in an actual <code>gzip</code> or <code>zip</code> file, which has a fixed-format header and trailer. </p> <p class=pp> <b>Spoiler alert!</b> I urge you to try this on your own before continuing to read. It's a great way to spend a lazy afternoon, and you have one critical advantage that I didn't: you know there is a solution. </p> <center> <br><br> <i>(Spoiler blocker.)</i> <br> <a href="http://www.robertwechsler.com/thebest.html"><img src="http://research.swtch.com/the_best_circular_bike(sbcc_sbma_students_roof).jpg"></a> <br> <i><a href="http://www.robertwechsler.com/thebest.html">http://www.robertwechsler.com/thebest.html</a></i> <br><br> </center> <p class=lp>By the way, here's <a href="http://swtch.com/r.gz"><code>r.gz</code></a>, gzip files all the way down. <pre class=indent>
$ gunzip &lt; r.gz &gt; r
$ cmp r r.gz
$
</pre> <p class=lp>The nice thing about <code>r.gz</code> is that even broken web browsers that ordinarily decompress downloaded gzip data before storing it to disk will handle this file correctly! </p> <p class=pp>Enough stalling to hide the spoilers.
Let's use this shorthand to describe Lempel-Ziv instructions: <code>L</code><i>n</i> and <code>R</code><i>n</i> are shorthand for <code>literal(</code><i>n</i><code>)</code> and <code>repeat(</code><i>n</i><code>,</code> <i>n</i><code>)</code>, and the program assumes that each code is one byte. <code>L0</code> is therefore the Lempel-Ziv no-op; <code>L5</code> <code>hello</code> prints <code>hello</code>; and so does <code>L3</code> <code>hel</code> <code>R1</code> <code>L1</code> <code>o</code>. </p> <p class=pp> Here's a Lempel-Ziv program that prints itself. (Each line is one instruction.) </p> <br> <center> <table border=0> <tr><th></th><th width=30></th><th>Code</th><th width=30></th><th>Output</th></tr> <tr><td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td><td></td><td><code>L0</code></td><td></td><td></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td><td></td><td><code>L0</code></td><td></td><td></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td><td></td><td><code>L0</code></td><td></td><td></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">L0 L0 L0 L4</span></code></td><td></td><td><code>L0 L0 L0 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>L0 L0 L0 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">R4 L4 R4 L4</span></code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">L0 L0 L0 L0</span></code></td><td></td><td><code>L0 L0 L0 L0</code></td></tr> </table> </center> <br> <p class=lp> (The two columns Code and Output contain the same byte sequence.) </p> <p class=pp> The interesting core of this program is the 6-byte sequence <code>L4 R4 L4 R4 L4 R4</code>, which prints the 8-byte sequence <code>R4 L4 R4 L4 R4 L4 R4 L4</code>. That is, it prints itself with an extra byte before and after. </p> <p class=pp> When we were trying to write the self-reproducing Python program, the basic problem was that the print statement was always longer than what it printed. We solved that problem with recursion, computing the string to print by plugging it into itself. Here we took a different approach. The Lempel-Ziv program is particularly repetitive, so that a repeated substring ends up containing the entire fragment. The recursion is in the representation of the program rather than its execution. Either way, that fragment is the crucial point. Before the final <code>R4</code>, the output lags behind the input. Once it executes, the output is one code ahead. 
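</p> <p class=pp> It is easy to check that claim by running the program. Here is a toy decompressor for just these two opcodes, in Go; the one-byte encoding (low bits the count, high bit distinguishing <code>repeat</code> from <code>literal</code>) is my own assumption, since the text above deliberately leaves the encoding open: </p> <pre class=indent>
package main

import (
	"bytes"
	"fmt"
)

// decode interprets the two-opcode language: byte n (n &lt; 0x80) is
// literal(n), copying the next n program bytes to the output, and
// byte 0x80|n is repeat(n, n), copying the last n output bytes again.
func decode(prog []byte) []byte {
	var out []byte
	for i := 0; i &lt; len(prog); {
		op := prog[i]
		i++
		if op &lt; 0x80 { // literal(n)
			n := int(op)
			out = append(out, prog[i:i+n]...)
			i += n
		} else { // repeat(n, n)
			n := int(op - 0x80)
			out = append(out, out[len(out)-n:]...)
		}
	}
	return out
}

func main() {
	const L0, L4, R4 = 0x00, 0x04, 0x84
	prog := []byte{
		L0, L0, L0, // no-ops
		L4, L0, L0, L0, L4, // print L0 L0 L0 L4
		R4,                 // repeat last 4
		L4, R4, L4, R4, L4, // print R4 L4 R4 L4
		R4,                 // repeat last 4
		L4, L0, L0, L0, L0, // print L0 L0 L0 L0
	}
	fmt.Println(bytes.Equal(decode(prog), prog)) // prints true
}
</pre> <p class=lp> It prints <code>true</code>: the 20-byte program and its 20-byte output are byte-for-byte identical.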
</p> <p class=pp> The <code>L0</code> no-ops are plugged into a more general variant of the program, which can reproduce itself with the addition of an arbitrary three-byte prefix and suffix: </p> <br> <center> <table border=0> <tr><th></th><th width=30></th><th>Code</th><th width=30></th><th>Output</th></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500"><i>aa bb cc</i> L4</span></code></td><td></td><td><code><i>aa bb cc</i> L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code><i>aa bb cc</i> L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">R4 L4 R4 L4</span></code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>R4 L4 R4 L4</code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td><td></td><td><code>L4 <span style="color: #005500">R4 <i>xx yy zz</i></span></code></td><td></td><td><code>R4 <i>xx yy zz</i></code></td></tr> <tr><td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td><td></td><td><code>R4</code></td><td></td><td><code>R4 <i>xx yy zz</i></code></td></tr> </table> </center> <br> <p class=lp> (The byte sequence in the Output column is <code><i>aa bb cc</i></code>, then the byte sequence from the Code column, then <code><i>xx yy zz</i></code>.) </p> <p class=pp> It took me the better part of a quiet Sunday to get this far, but by the time I got here I knew the game was over and that I'd won. From all that experimenting, I knew it was easy to create a program fragment that printed itself minus a few instructions or even one that printed an arbitrary prefix and then itself, minus a few instructions. The extra <code>aa bb cc</code> in the output provides a place to attach such a program fragment. Similarly, it's easy to create a fragment to attach to the <code>xx yy zz</code> that prints itself, minus the first three instructions, plus an arbitrary suffix. We can use that generality to attach an appropriate header and trailer. </p> <p class=pp> Here is the final program, which prints itself surrounded by an arbitrary prefix and suffix. <code>[P]</code> denotes the <i>p</i>-byte compressed form of the prefix <code>P</code>; similarly, <code>[S]</code> denotes the <i>s</i>-byte compressed form of the suffix <code>S</code>. 
</p> <br> <center> <table border=0> <tr><th></th><th width=30></th><th>Code</th><th width=30></th><th>Output</th></tr> <tr> <td align=right><i><span style="font-size: 0.8em;">print prefix</span></i></td> <td></td> <td><code>[P]</code></td> <td></td> <td><code>P</code></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>print </i>p<i>+1 bytes</i></span></td> <td></td> <td><code>L</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code> <span style="color: #005500">[P] L</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>p</i>+1</span></span><code></code></td> <td></td> <td><code>[P] L</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>repeat last </i>p<i>+1 printed bytes</i></span></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td> <td></td> <td><code>[P] L</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>print 1 byte</i></span></td> <td></td> <td><code>L1 <span style="color: #005500">R</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>p</i>+1</span></span><code></code></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code></code></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>print 1 byte</i></span></td> <td></td> <td><code>L1 <span style="color: #005500">L1</span></code></td> <td></td> <td><code>L1</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td> <td></td> <td><code>L4 <span style="color: #005500">R</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>p</i>+1</span></span><code><span style="color: #005500"> L1 L1 L4</span></code></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code> L1 L1 L4</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td> <td></td> <td><code>R4</code></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span><code> L1 L1 L4</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td> <td></td> <td><code>L4 <span style="color: #005500">R4 L4 R4 L4</span></code></td> <td></td> <td><code>R4 L4 R4 L4</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td> <td></td> <td><code>R4</code></td> <td></td> <td><code>R4 L4 R4 L4</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">print 4 bytes</span></i></td> <td></td> <td><code>L4 <span style="color: #005500">R4 L0 L0 L</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>s</i>+1</span></span><code><span style="color: #005500"></span></code></td> <td></td> <td><code>R4 L0 L0 L</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code></code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">repeat last 4 printed bytes</span></i></td> <td></td> <td><code>R4</code></td> <td></td> <td><code>R4 L0 L0 L</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code></code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td> <td></td> <td><code>L0</code></td> <td></td> <td></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">no-op</span></i></td> <td></td> 
<td><code>L0</code></td> <td></td> <td></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>print </i>s<i>+1 bytes</i></span></td> <td></td> <td><code>L</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code> <span style="color: #005500">R</span></code><span style="color: #005500"><span style="font-size: 0.8em;"><i>s</i>+1</span></span><code><span style="color: #005500"> [S]</span></code></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code> [S]</code></td> </tr> <tr> <td align=right><span style="font-size: 0.8em;"><i>repeat last </i>s<i>+1 printed bytes</i></span></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code></code></td> <td></td> <td><code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span><code> [S]</code></td> </tr> <tr> <td align=right><i><span style="font-size: 0.8em;">print suffix</span></i></td> <td></td> <td><code>[S]</code></td> <td></td> <td><code>S</code></td> </tr> </table> </center> <br> <p class=lp> (The byte sequence in the Output column is <code><i>P</i></code>, then the byte sequence from the Code column, then <code><i>S</i></code>.) </p> <h3>Self-reproducing zip files</h3> <p class=pp> Now the rubber meets the road. We've solved the main theoretical obstacle to making a self-reproducing zip file, but there are a couple of practical obstacles still in our way. </p> <p class=pp> The first obstacle is to translate our self-reproducing Lempel-Ziv program, written in simplified opcodes, into the real opcode encoding. <a href="http://www.ietf.org/rfc/rfc1951.txt">RFC 1951</a> describes the DEFLATE format used in both gzip and zip: a sequence of blocks, each of which is a sequence of opcodes encoded using Huffman codes. Huffman codes assign different-length bit strings to different opcodes, breaking our assumption above that opcodes have fixed length. But wait! We can, with some care, find a set of fixed-size encodings that can express everything we need. </p> <p class=pp> In DEFLATE, there are literal blocks and opcode blocks. The header at the beginning of a literal block is 5 bytes: </p> <center> <img src="http://research.swtch.com/zip1.png"> </center> <p class=pp> If the translations of our <code>L</code> opcodes above are 5 bytes each, the translations of the <code>R</code> opcodes must also be 5 bytes each, with all the byte counts above scaled by a factor of 5. (For example, <code>L4</code> now has a 20-byte argument, and <code>R4</code> repeats the last 20 bytes of output.) The opcode block with a single <code>repeat(20,20)</code> instruction falls well short of 5 bytes: </p> <center> <img src="http://research.swtch.com/zip2.png"> </center> <p class=lp>Luckily, an opcode block containing two <code>repeat(20,10)</code> instructions has the same effect and is exactly 5 bytes: </p> <center> <img src="http://research.swtch.com/zip3.png"> </center> <p class=lp> Encoding the other-sized repeats (<code>R</code><span style="font-size: 0.8em;"><i>p</i>+1</span> and <code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span>) takes more effort and some sleazy tricks, but it turns out that we can design 5-byte codes that repeat any amount from 9 to 64 bytes.
For example, here are the repeat blocks for 10 bytes and for 40 bytes: </p> <center> <img src="http://research.swtch.com/zip4.png"> <br> <img src="http://research.swtch.com/zip5.png"> </center> <p class=lp> The repeat block for 10 bytes is two bits too short, but every repeat block is followed by a literal block, which starts with three zero bits and then padding to the next byte boundary. If a repeat block ends two bits short of a byte but is followed by a literal block, the literal block's padding will insert the extra two bits. Similarly, the repeat block for 40 bytes is five bits too long, but they're all zero bits. Starting a literal block five bits too late steals the bits from the padding. Both of these tricks only work because the last 7 bits of any repeat block are zero and the bits in the first byte of any literal block are also zero, so the boundary isn't directly visible. If the literal block started with a one bit, this sleazy trick wouldn't work. </p> <p class=pp>The second obstacle is that zip archives (and gzip files) record a CRC32 checksum of the uncompressed data. Since the uncompressed data is the zip archive, the data being checksummed includes the checksum itself. So we need to find a value <i>x</i> such that writing <i>x</i> into the checksum field causes the file to checksum to <i>x</i>. Recursion strikes back. </p> <p class=pp> The CRC32 checksum computation interprets the entire file as a big number and computes the remainder when you divide that number by a specific constant using a specific kind of division. We could go through the effort of setting up the appropriate equations and solving for <i>x</i>. But frankly, we've already solved one nasty recursive puzzle today, and <a href="http://www.youtube.com/watch?v=TQBLTB5f3j0">enough is enough</a>. There are only four billion possibilities for <i>x</i>: we can write a program to try each in turn, until it finds one that works (a sketch of such a search appears below). </p> <p class=pp> If you want to recreate these files yourself, there are a few more minor obstacles, like making sure the tar file is a multiple of 512 bytes and compressing the rather large zip trailer to at most 59 bytes so that <code>R</code><span style="font-size: 0.8em;"><i>s</i>+1</span> is at most <code>R</code><span style="font-size: 0.8em;">64</span>. But they're just a simple matter of programming. </p> <p class=pp> So there you have it: <code><a href="http://swtch.com/r.gz">r.gz</a></code> (gzip files all the way down), <code><a href="http://swtch.com/r.tar.gz">r.tar.gz</a></code> (gzipped tar files all the way down), and <code><a href="http://swtch.com/r.zip">r.zip</a></code> (zip files all the way down). I regret that I have been unable to find any programs that insist on decompressing these files recursively, ad infinitum. It would have been fun to watch them squirm, but it looks like much less sophisticated <a href="http://en.wikipedia.org/wiki/Zip_bomb">zip bombs</a> have spoiled the fun. </p> <p class=pp> If you're feeling particularly ambitious, here is <a href="http://swtch.com/rgzip.go">rgzip.go</a>, the <a href="http://golang.org/">Go</a> program that generated these files. I wonder if you can create a zip file that contains a gzipped tar file that contains the original zip file. Ken Thompson suggested trying to make a zip file that contains a slightly larger copy of itself, recursively, so that as you dive down the chain of zip files each one gets a little bigger. (If you do manage either of these, please leave a comment.) </p>
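<p class=pp> To make two of those translation steps concrete, here is a small Go sketch. It is illustrative only: the five-byte stored-block header follows <a href="http://www.ietf.org/rfc/rfc1951.txt">RFC 1951</a>, but the checksum-field offset and the stand-in file bytes are assumptions made up for the demo, not the layout of the real archives (for that, see rgzip.go above). </p> <pre class=indent>
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// literalHeader returns the 5-byte DEFLATE stored-block header that
// plays the role of an L opcode: one byte holding BFINAL=0, BTYPE=00,
// and padding bits, then LEN and its complement NLEN, little-endian.
func literalHeader(n int) []byte {
	return []byte{0x00, byte(n), byte(n &gt;&gt; 8), ^byte(n), ^byte(n &gt;&gt; 8)}
}

// findFixedPoint searches for x such that storing x little-endian at
// offset off makes crc32(file) == x. The CRC state of the unchanging
// prefix is computed once; each trial rehashes only the tail bytes.
func findFixedPoint(file []byte, off int) (uint32, bool) {
	prefix := crc32.Update(0, crc32.IEEETable, file[:off])
	var field [4]byte
	for x := uint64(0); x &lt;= 0xFFFFFFFF; x++ {
		binary.LittleEndian.PutUint32(field[:], uint32(x))
		c := crc32.Update(prefix, crc32.IEEETable, field[:])
		c = crc32.Update(c, crc32.IEEETable, file[off+4:])
		if c == uint32(x) {
			return uint32(x), true
		}
	}
	return 0, false
}

func main() {
	fmt.Printf("% x\n", literalHeader(20)) // L4 scaled by 5: 00 14 00 eb ff
	file := make([]byte, 64)               // stand-in for the real archive bytes
	if x, ok := findFixedPoint(file, len(file)-8); ok {
		fmt.Printf("fixed point: %#08x\n", x)
	}
}
</pre> <p class=lp> Because only the checksum field and the few bytes after it change between trials, trying all four billion candidates stays affordable. </p> <br> <p class=lp><font size=-1>P.S. 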
I can't end the post without sharing my favorite self-reproducing program: the one-line shell script <code>#!/bin/cat</code></font>. </p></p> </div> </div> </div> UTF-8: Bits, Bytes, and Benefits tag:research.swtch.com,2012:research.swtch.com/utf8 2010-03-05T00:00:00-05:00 2010-03-05T00:00:00-05:00 The reasons to switch to UTF-8 <p><p class=pp> UTF-8 is a way to encode Unicode code points&#8212;integer values from 0 through 10FFFF&#8212;into a byte stream, and it is far simpler than many people realize. The easiest way to make it confusing or complicated is to treat it as a black box, never looking inside. So let's start by looking inside. Here it is: </p> <center> <table cellspacing=5 cellpadding=0 border=0> <tr height=10><th colspan=4></th></tr> <tr><th align=center colspan=2>Unicode code points</th><th width=10><th align=center>UTF-8 encoding (binary)</th></tr> <tr height=10><td colspan=4></td></tr> <tr><td align=right>00-7F</td><td>(7 bits)</td><td></td><td align=right>0<i>tuvwxyz</i></td></tr> <tr><td align=right>0080-07FF</td><td>(11 bits)</td><td></td><td align=right>110<i>pqrst</i>&nbsp;10<i>uvwxyz</i></td></tr> <tr><td align=right>0800-FFFF</td><td>(16 bits)</td><td></td><td align=right>1110<i>jklm</i>&nbsp;10<i>npqrst</i>&nbsp;10<i>uvwxyz</i></td></tr> <tr><td align=right valign=top>010000-10FFFF</td><td>(21 bits)</td><td></td><td align=right valign=top>11110<i>efg</i>&nbsp;10<i>hijklm</i> 10<i>npqrst</i>&nbsp;10<i>uvwxyz</i></td> <tr height=10><td colspan=4></td></tr> </table> </center> <p class=lp> The convenient properties of UTF-8 are all consequences of the choice of encoding. </p> <ol> <li><i>All ASCII files are already UTF-8 files.</i><br> The first 128 Unicode code points are the 7-bit ASCII character set, and UTF-8 preserves their one-byte encoding. </li> <li><i>ASCII bytes always represent themselves in UTF-8 files. They never appear as part of other UTF-8 sequences.</i><br> All the non-ASCII UTF-8 sequences consist of bytes with the high bit set, so if you see the byte 0x7A in a UTF-8 file, you can be sure it represents the character <code>z</code>. </li> <li><i>ASCII characters are always represented as themselves in UTF-8 files. They cannot be hidden inside multibyte UTF-8 sequences.</i><br> The ASCII <code>z</code> 01111010 cannot be encoded as a two-byte UTF-8 sequence 11000001 10111010. Code points must be encoded using the shortest possible sequence. A corollary is that decoders must detect long-winded sequences as invalid. In practice, it is useful for a decoder to use the Unicode replacement character, code point FFFD, as the decoding of an invalid UTF-8 sequence rather than stop processing the text. </li> <li><i>UTF-8 is self-synchronizing.</i><br> Let's call a byte of the form 10<i>xxxxxx</i> a continuation byte. Every UTF-8 sequence is a byte that is not a continuation byte followed by zero or more continuation bytes. If you start processing a UTF-8 file at an arbitrary point, you might not be at the beginning of a UTF-8 encoding, but you can easily find one: skip over continuation bytes until you find a non-continuation byte. (The same applies to scanning backward.) </li> <li><i>Substring search is just byte string search.</i><br> Properties 2, 3, and 4 imply that given a string of correctly encoded UTF-8, the only way those bytes can appear in a larger UTF-8 text is when they represent the same code points. So you can use any 8-bit-safe byte-at-a-time search function, like <code>strchr</code> or <code>strstr</code>, to run the search.
</li> <li><i>Most programs that handle 8-bit files safely can handle UTF-8 safely.</i><br> This also follows from Properties 2, 3, and 4. I say &ldquo;most&rdquo; programs, because programs that take apart a byte sequence expecting one character per byte will not behave correctly, but very few programs do that. It is far more common to split input at newline characters, or split whitespace-separated fields, or do other similar parsing around specific ASCII characters. For example, Unix tools like cat, cmp, cp, diff, echo, head, tail, and tee can process UTF-8 files as if they were plain ASCII files. Most operating system kernels should also be able to handle UTF-8 file names without any special arrangement, since the only operations done on file names are comparisons and splitting at <code>/</code>. In contrast, tools like grep, sed, and wc, which inspect arbitrary individual characters, do need modification. </li> <li><i>UTF-8 sequences sort in code point order.</i><br> You can verify this by inspecting the encodings in the table above. This means that Unix tools like join, ls, and sort (without options) don't need to handle UTF-8 specially. </li> <li><i>UTF-8 has no &ldquo;byte order.&rdquo;</i><br> UTF-8 is a byte encoding. It is not little endian or big endian. Unicode defines a byte order mark (BOM) code point FEFF, which is used to determine the byte order of a stream of raw 16-bit values, like UCS-2 or UTF-16. It has no place in a UTF-8 file. Some programs like to write a UTF-8-encoded BOM at the beginning of UTF-8 files, but this is unnecessary (and annoying to programs that don't expect it). </li> </ol> <p class=lp> UTF-8 does give up the ability to do random access using code point indices. Programs that need to jump to the <i>n</i>th Unicode code point in a file or on a line&#8212;text editors are the canonical example&#8212;will typically convert incoming UTF-8 to an internal representation like an array of code points and then convert back to UTF-8 for output, but most programs are simpler when written to manipulate UTF-8 directly. </p> <p class=pp> Programs that make UTF-8 more complicated than it needs to be are typically trying to be too general, not wanting to make assumptions that might not be true of other encodings. But there are good tools to convert other encodings to UTF-8, and it is slowly becoming the standard encoding: even the fraction of web pages written in UTF-8 is <a href="http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html">nearing 50%</a>. UTF-8 was explicitly designed to have these nice properties. Take advantage of them. </p> <p class=pp> For more on UTF-8, see &ldquo;<a href="https://9p.io/sys/doc/utf.html">Hello World or Καλημέρα κόσμε or こんにちは 世界</a>,&rdquo; by Rob Pike and Ken Thompson, and also this <a href="http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt">history</a>. </p> <br> <font size=-1> <p class=lp> Notes: Property 6 assumes the tools do not strip the high bit from each byte. Such mangling was common years ago but is very uncommon now. Property 7 assumes the comparison is done treating the bytes as unsigned, but such behavior is mandated by the ANSI C standard for <code>memcmp</code>, <code>strcmp</code>, and <code>strncmp</code>.
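</p> </font> <p class=pp> As a quick check on the table and on property 5, here is a small Go sketch. The encoder follows the bit patterns above (error checking for surrogates and out-of-range values is omitted), and the search uses an ordinary byte-oriented function: </p> <pre class=indent>
package main

import (
	"fmt"
	"strings"
)

// encode converts one code point to UTF-8 using the table above.
// It omits the checks a strict encoder needs (surrogates, range).
func encode(r rune) []byte {
	switch {
	case r &lt; 0x80:
		return []byte{byte(r)}
	case r &lt; 0x800:
		return []byte{0xC0 | byte(r&gt;&gt;6), 0x80 | byte(r)&amp;0x3F}
	case r &lt; 0x10000:
		return []byte{0xE0 | byte(r&gt;&gt;12), 0x80 | byte(r&gt;&gt;6)&amp;0x3F, 0x80 | byte(r)&amp;0x3F}
	default:
		return []byte{0xF0 | byte(r&gt;&gt;18), 0x80 | byte(r&gt;&gt;12)&amp;0x3F, 0x80 | byte(r&gt;&gt;6)&amp;0x3F, 0x80 | byte(r)&amp;0x3F}
	}
}

func main() {
	fmt.Printf("% x\n", encode('κ')) // ce ba: an 11-bit code point, two bytes

	// Property 5: substring search is just byte string search.
	text := "Καλημέρα κόσμε"
	fmt.Println(strings.Index(text, "κόσμε")) // 17, a byte offset
}
</pre> <p class=lp> <code>strings.Index</code> knows nothing about UTF-8; it compares bytes, and properties 2, 3, and 4 guarantee that is enough. </p>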
</p> Computing History at Bell Labs tag:research.swtch.com,2012:research.swtch.com/bell-labs 2008-04-09T00:00:00-04:00 2008-04-09T00:00:00-04:00 Doug McIlroy’s remembrances <p><p class=pp> In 1997, on his retirement from Bell Labs, <a href="http://www.cs.dartmouth.edu/~doug/">Doug McIlroy</a> gave a fascinating talk about the &ldquo;<a href="https://web.archive.org/web/20081022192943/http://cm.bell-labs.com/cm/cs/doug97.html"><b>History of Computing at Bell Labs</b></a>.&rdquo; Almost ten years ago I transcribed the audio but never did anything with it. The transcript is below. </p> <p class=pp> My favorite parts of the talk are the description of the bi-quinary decimal relay calculator and the description of a team that spent over a year tracking down a race condition bug in a missile detector (reliability was king: today you’d just stamp &ldquo;cannot reproduce&rdquo; and send the report back). But the whole thing contains many fantastic stories. It’s well worth the read or listen. I also like his recollection of programming using cards: &ldquo;It’s the kind of thing you can be nostalgic about, but it wasn’t actually fun.&rdquo; </p> <p class=pp> For more information, Bernard D. Holbrook and W. Stanley Brown’s 1982 technical report &ldquo;<a href="cstr99.pdf">A History of Computing Research at Bell Laboratories (1937-1975)</a>&rdquo; covers the earlier history in more detail. </p> <p><i>Corrections added August 19, 2009. Links updated May 16, 2018.</i></p> <p><i>Update, December 19, 2020.</i> The original audio files disappeared along with the rest of the Bell Labs site some time ago, but I discovered a saved copy on one of my computers: [<a href="mcilroy97history.mp3">MP3</a> | <a href="mcilroy97history.rm">original RealAudio</a>]. I also added a few corrections and notes from Doug McIlroy, dated 2015 [sic].</p> <br> <br> <p class=lp><b>Transcript</b></p> <p class=pp> Computing at Bell Labs is certainly an outgrowth of the <a href="https://web.archive.org/web/20080622172015/http://cm.bell-labs.com/cm/ms/history/history.html">mathematics department</a>, which grew from that first hiring in 1897, G A Campbell. When Bell Labs was formally founded in 1925, what it had been was the engineering department of Western Electric. When it was formally founded in 1925, almost from the beginning there was a math department with Thornton Fry as the department head, and if you look at some of Fry’s work, it turns out that he was fussing around in 1929 with trying to discover information theory. It didn’t actually gel until twenty years later with Shannon.</p> <p class=pp><span style="font-size: 0.7em;">1:10</span> Of course, most of the mathematics at that time was continuous. One was interested in analyzing circuits and propagation. And indeed, this is what led to the growth of computing in Bell Laboratories. The computations could not all be done symbolically. There were not closed form solutions. There was lots of numerical computation done. The math department had a fair stable of computers, which in those days meant people. [laughter]</p> <p class=pp><span style="font-size: 0.7em;">2:00</span> And in the late ’30s, <a href="http://en.wikipedia.org/wiki/George_Stibitz">George Stibitz</a> had an idea that some of the work that they were doing on hand calculators might be automated by using some of the equipment that the Bell System was installing in central offices, namely relay circuits. He went home, and on his kitchen table, he built out of relays a binary arithmetic circuit.
He decided that binary was really the right way to compute. However, when he finally came to build some equipment, he determined that binary to decimal conversion and decimal to binary conversion was a drag, and he didn’t want to put it in the equipment, and so he finally built, in 1939, a relay calculator that worked in decimal, and it worked in complex arithmetic. Do you have a hand calculator now that does complex arithmetic? Ten-digit, I believe, complex computations: add, subtract, multiply, and divide. The I/O equipment was teletypes, so essentially all the stuff to make such machines out of was there. Since the I/O was teletypes, it could be remotely accessed, and there were in fact four stations in the West Street Laboratories of Bell Labs. West Street is down on the left side of Manhattan. I had the good fortune to work there one summer, right next to a district where you’re likely to get bowled over by rolling beeves hanging from racks or tumbling cabbages. The building is still there. It’s called <a href="http://query.nytimes.com/gst/fullpage.html?res=950DE3DB1F38F931A35751C0A96F948260">Westbeth Apartments</a>. It’s now an artists’ colony.</p> <p class=pp><span style="font-size: 0.7em;">4:29</span> Anyway, in West Street, there were four separate remote stations from which the complex calculator could be accessed. It was not time sharing. You actually reserved your time on the machine, and only one of the four terminals worked at a time. In 1940, this machine was shown off to the world at the AMS annual convention, which happened to be held in Hanover at Dartmouth that year, and mathematicians could wonder at remote computing, doing computation on an electromechanical calculator from 300 miles away.</p> <p class=pp><span style="font-size: 0.7em;">5:22</span> Stibitz went on from there to make a whole series of relay machines. Many of them were made for the government during the war. They were named, imaginatively, Mark I through Mark VI. I have read some of his patents. They’re kind of fun. One is a patent on conditional transfer. [laughter] And how do you do a conditional transfer? Well these gadgets were, the relay calculator was run from your fingers, I mean the complex calculator. The later calculators, of course, if your fingers were a teletype, you could perfectly well feed a paper tape in, because that was standard practice. And these later machines were intended really to be run more from paper tape. And the conditional transfer was this: you had two teletypes, and there’s a code that says "time to read from the other teletype". Loops were of course easy to do. You take paper and [laughter; presumably Doug curled a piece of paper to form a physical loop]. These machines never got to the point of having stored programs. But they got quite big. I saw, one of them was here in 1954, and I did see it, behind glass, and if you’ve ever seen these machines in the, there’s one in the Franklin Institute in Philadelphia, and there’s one in the Science Museum in San Jose, you know these machines that drop balls that go wandering sliding around and turning paddle wheels and ringing bells and who knows what. It kind of looked like that. It was a very quiet room, with just a little clicking of relays, which is what a central office used to be like. It was the one air-conditioned room in Murray Hill, I think. This machine ran, the Mark VI, well I think that was the Mark V, the Mark VI actually went to Aberdeen. This machine ran for a good number of years, probably six, eight.
And it is said that it never made an undetected error. [laughter]</p> <p class=pp><span style="font-size: 0.7em;">8:30</span> What that means is that it never made an error that it did not diagnose itself and stop. Relay technology was very very defensive. The telephone switching system had to work. It was full of self-checking, and so were the calculators, so were the calculators that Stibitz made.</p> <p class=pp><span style="font-size: 0.7em;">9:04</span> Arithmetic was done in bi-quinary, a two out of five representation for decimal integers, and if there weren’t exactly two out of five relays activated it would stop. This machine ran unattended over the weekends. People would bring their tapes in, and the operator would paste everybody’s tapes together. There was a beginning of job code on the tape and there was also a time indicator. If the machine ran out of time, it automatically stopped and went to the next job. If the machine caught itself in an error, it backed up to the current job and tried it again. They would load this machine on Friday night, and on Monday morning, all the tapes, all the entries would be available on output tapes.</p> <p class=pp>Question: I take it they were using a different representation for loops and conditionals by then.</p> <p class=pp>Doug: Loops were done actually by they would run back and forth across the tape now, on this machine.</p> <p class=pp><span style="font-size: 0.7em;">10:40</span> Then came the transistor in ’48. At Whippany, they actually had a transistorized computer, which was a respectable minicomputer, a box about this big, running in 1954, it ran from 1954 to 1956 solidly as a test run. The notion was that this computer might fly in an airplane. And during that two-year test run, one diode failed. In 1957, this machine called <a href="http://www.cedmagic.com/history/tradic-transistorized.html">TRADIC</a>, did in fact fly in an airplane, but to the best of my knowledge, that machine was a demonstration machine. It didn’t turn into a production machine. About that time, we started buying commercial machines. It’s wonderful to think about the set of different architectures that existed in that time. The first machine we got was called a <a href="http://www.columbia.edu/acis/history/cpc.html">CPC from IBM</a>. And all it was was a big accounting machine with a very special plugboard on the side that provided an interpreter for doing ten-digit decimal arithmetic, including opcodes for the trig functions and square root.</p> <p class=pp><span style="font-size: 0.7em;">12:30</span> It was also not a computer as we know it today, because it wasn’t stored program, it had twenty-four memory locations as I recall, and it took its program instead of from tapes, from cards. This was not a total advantage. A tape didn’t get into trouble if you dropped it on the floor. [laughter]. CPC, the operator would stand in front of it, and there, you would go through loops by taking cards out, it took human intervention, to take the cards out of the output of the card reader and put them in the ?top?. I actually ran some programs on the CPC ?...?. It’s the kind of thing you can be nostalgic about, but it wasn’t actually fun. [laughter]</p> <p class=pp><span style="font-size: 0.7em;">13:30</span> The next machine was an <a href="http://www.columbia.edu/acis/history/650.html">IBM 650</a>, and here, this was a stored program, with the memory being on drum. There was no operating system for it. It came with a manual: this is what the machine does. 
And Michael Wolontis made an interpreter called the <a href="http://hopl.info/showlanguage2.prx?exp=6497">L1 interpreter</a> for this machine, so you could actually program in, the manual told you how to program in binary, and L1 allowed you to give something like 10 for add and 9 for subtract, and program in decimal instead. And of course that machine required interesting optimization, because it was a nice thing if the next program step were stored somewhere -- each program step had the address of the following step in it, and you would try to locate them around the drum so as to minimize latency. So there were all kinds of optimizers around, but I don’t think Bell Labs made ?...? based on this called "soap" from Carnegie Mellon. That machine didn’t last very long. Fortunately, a machine with core memory came out from IBM in about ’56, the 704. Bell Labs was a little slow in getting one, in ’58. Again, the machine came without an operating system. But it did have Fortran, which really changed the world. It suddenly made it easy to write programs. But the way Fortran came from IBM, it came with a thing called the Fortran Stop Book. This was a list of what happened, a diagnostic would execute the halt instruction, the operator would go read the panel lights and discover where the machine had stopped, you would then go look up in the stop book what that meant. Bell Labs, with George Mealy and Gwen Hanson, made an operating system, and one of the things they did was to bring the stop book to heel. They took the compiler, replaced all the stop instructions with jumps to somewhere, and allowed the program instead of stopping to go on to the next trial. By the time I arrived at Bell Labs in 1958, this thing was running nicely.</p> <p class=pp>[<i>McIlroy comments, 2015</i>: I’m pretty sure I was wrong in saying Mealy and Hanson brought the stop book to heel. They built the OS, but I believe Dolores Leagus tamed Fortran. (Dolores was the most accurate programmer I ever knew. She’d write 2000 lines of code before testing a single line--and it would work.)]</p> <p class=pp><span style="font-size: 0.7em;">16:36</span> Bell Labs continued to be a major player in operating systems. This was called BESYS. BE was the SHARE abbreviation for Bell Labs. Each company that belonged to SHARE, which was the IBM users group, had a two-letter abbreviation. It’s hard to imagine taking all the computer users now and giving them a two-letter abbreviation. BESYS went through many generations, up to BESYS 5, I believe. Each one with innovations. IBM delivered a machine, the 7090, in 1960. This machine had interrupts in it, but IBM didn’t use them. But BESYS did. And that sent IBM back to the drawing board to make it work. [Laughter]</p> <p class=pp><span style="font-size: 0.7em;">17:48</span> Rob Pike: It also didn’t have memory protection.</p> <p class=pp>Doug: It didn’t have memory protection either, and a lot of people actually got IBM to put memory protection in the 7090, so that one could leave the operating system resident in the presence of a wild program, an idea that the PC didn’t discover until, last year or something like that. [laughter]</p> <p class=pp>Big players then, <a href="http://en.wikipedia.org/wiki/Richard_Hamming">Dick Hamming</a>, a name that I’m sure everybody knows, was sort of the numerical analysis guru, and a seer. He liked to make outrageous predictions. He predicted in 1960, that half of Bell Labs was going to be busy doing something with computers eventually. ?...?
exaggerating some ?...? abstract in his thought. He was wrong. Half was a gross underestimate. Dick Hamming retired twenty years ago, and just this June he completed his full twenty-year term in the Navy, which entitles him again to retire from the Naval Postgraduate Institute in Monterey. Stibitz, incidentally, died, I think within the last year. He was doing medical instrumentation at Dartmouth essentially, near the end.</p> <p class=pp>[<i>McIlroy comments, 2015</i>: I’m not sure what exact unintelligible words I uttered about Dick Hamming. When he predicted that half the Bell Labs budget would be related to computing in a decade, people scoffed in terms like &ldquo;that’s just Dick being himself, exaggerating for effect&rdquo;.]</p> <p class=pp><span style="font-size: 0.7em;">20:00</span> Various problems intrigued, besides the numerical problems, which in fact were stock in trade, and were the real justification for buying machines, until at least the ’70s I would say. But some non-numerical problems had begun to tickle the palate of the math department. Even G A Campbell got interested in graph theory, the reason being he wanted to think of all the possible ways you could take the three wires and the various parts of the telephone and connect them together, and try permutations to see what you could do about reducing sidetone by putting things into the various parts of the circuit, and devised every possible way of connecting the telephone up. And that was sort of the beginning of combinatorics at Bell Labs. John Riordan, a mathematician, parlayed this into a major subject. Two problems which are now deemed computing problems have intrigued the math department for a very long time, and those are the minimum spanning tree problem, and the wonderfully ?comment about Joe Kruskal, laughter?</p> <p class=pp><span style="font-size: 0.7em;">21:50</span> And in the 50s Bob Prim and Kruskal, who I don’t think worked at the Labs at that point, invented algorithms for the minimum spanning tree. Somehow or other, computer scientists usually learn these algorithms, one of the two at least, as Dijkstra’s algorithm, but he was a latecomer.</p> <p class=pp>[<i>McIlroy comments, 2015</i>: I erred in attributing Dijkstra’s algorithm to Prim and Kruskal. That honor belongs to yet a third member of the math department: Ed Moore. (Dijkstra’s algorithm is for shortest path, not spanning tree.)]</p> <p class=pp>Another pet was the traveling salesman. There’s been a long list of people at Bell Labs who played with that: Shen Lin and Ron Graham and David Johnson and dozens more, oh and ?...?. And then another problem is the Steiner minimum spanning tree, where you’re allowed to add points to the graph. Every one of these problems grew, actually had a justification in telephone billing. One jurisdiction or another would specify that the way you bill for a private line network was in one jurisdiction by the minimum spanning tree. In another jurisdiction, by the traveling salesman route. NP-completeness wasn’t a word in the vocabulary of lawmakers [laughter]. And the <a href="http://en.wikipedia.org/wiki/Steiner_tree">Steiner problem</a> came up because customers discovered they could beat the system by inventing offices in the middle of Tennessee that had nothing to do with their business, but they could put the office at a Steiner point and reduce their phone bill by adding to the service that the Bell System had to give them.
So all of these problems actually had some justification in billing besides the fun.</p> <p class=pp><span style="font-size: 0.7em;">24:15</span> Come the 60s, we actually started to hire people for computing per se. I was perhaps the third person who was hired with a Ph.D. to help take care of the computers and I’m told that the then director and head of the math department, Hendrik Bode, had said to his people, "yeah, you can hire this guy, instead of a real mathematician, but what’s he gonna be doing in five years?" [laughter]</p> <p class=pp><span style="font-size: 0.7em;">25:02</span> Nevertheless, we started hiring for real in about ’67. Computer science got split off from the math department. I had the good fortune to move into the office that I’ve been in ever since then. Computing began to make, get a personality of its own. One of the interesting people that came to Bell Labs for a while was Hao Wang. Is his name well known? [Pause] One nod. Hao Wang was a philosopher and logician, and we got a letter from him in England out of the blue saying "hey you know, can I come and use your computers? I have an idea about theorem proving." There was theorem proving in the air in the late 50s, and it was mostly pretty thin stuff. Obvious that the methods being proposed wouldn’t possibly do anything more difficult than solve tic-tac-toe problems by enumeration. Wang had a notion that he could mechanically prove theorems in the style of Whitehead and Russell’s great treatise Principia Mathematica in the early part of the century. He came here, learned how to program in machine language, and took all of Volume I of Principia Mathematica -- if you’ve ever hefted Principia, well that’s about all it’s good for, it’s a real good door stop. It’s really big. But it’s theorem after theorem after theorem in propositional calculus. Of course, there’s a decision procedure for propositional calculus, but he was proving them more in the style of Whitehead and Russell. And when he finally got them all coded and put them into the computer, he proved the entire contents of this immense book in eight minutes. This was actually a neat accomplishment. Also that was the beginning of all the language theory. We hired people like <a href="http://www1.cs.columbia.edu/~aho/">Al Aho</a> and <a href="http://infolab.stanford.edu/~ullman/">Jeff Ullman</a>, who probed around every possible model of grammars and syntax; all of the things that are now in the standard undergraduate curriculum, syntax and finite state machines and so on, were pretty well nailed down here in the 60s. Speaking of finite state machines, in the 50s, both Mealy and Moore, who have two of the well-known models of finite state machines, were here.</p> <p class=pp><span style="font-size: 0.7em;">28:40</span> During the 60s, we undertook an enormous development project in the guise of research, which was <a href="http://www.multicians.org/">MULTICS</a>, and the notion of MULTICS was that computing was the public utility of the future. Machines were very expensive, and ?indeed? like you don’t own your own electric generator, you rely on the power company to do generation for you, and it was seen that this was a good way to do computing -- time sharing -- and it was also recognized that shared data was a very good thing.
MIT pioneered this and Bell Labs joined in on the MULTICS project, and this occupied five years of system programming effort, until Bell Labs pulled out, because it turned out that MULTICS was too ambitious for the hardware at the time, and also with 80 people on it was not exactly a research project. But, that led to various people who were on the project, in particular <a href="http://en.wikipedia.org/wiki/Ken_Thompson">Ken Thompson</a> -- right there -- to think about how to -- <a href="http://en.wikipedia.org/wiki/Dennis_Ritchie">Dennis Ritchie</a> and Rudd Canaday were in on this too -- to think about how you might make a pleasant operating system with a little less resources.</p> <p class=pp><span style="font-size: 0.7em;">30:30</span> And Ken found -- this is a story that’s often been told, so I won’t go into very much of unix -- Ken found an old machine cast off in the corner, the <a href="http://en.wikipedia.org/wiki/PDP-7">PDP-7</a>, and put up this little operating system on it, and we had an immense <a href="http://en.wikipedia.org/wiki/GE-600_series">GE635</a> available at the comp center at the time, and I remember as the department head, muscling in to use this little computer to be, to get to be Unix’s first user, customer, because it was so much pleasanter to use this tiny machine than it was to use the big and capable machine in the comp center. And of course the rest of the story is known to everybody and has affected all college campuses in the country.</p> <p class=pp><span style="font-size: 0.7em;">31:33</span> Along with the operating system work, there was a fair amount of language work done at Bell Labs. Often curious off-beat languages. One of my favorites was called <a href="http://hopl.murdoch.edu.au/showlanguage.prx?exp=6937&language=BLODI-B">Blodi</a>, B L O D I, a block diagram compiler by Kelly and Vyssotsky. Perhaps the most interesting early uses of computers, in the sense of being unexpected, were those that came from the acoustics research department, and the Blodi compiler was invented in the acoustics research department for doing digital simulations of sample data systems. DSPs are classic sample data systems, where instead of passing analog signals around, you pass around streams of numerical values. And Blodi allowed you to say here’s a delay unit, here’s an amplifier, here’s an adder, the standard piece parts for a sample data system, and each one was described on a card, with a description of what it’s wired to. It was then compiled into one enormous single straight line loop for one time step. Of course, you had to rearrange the code because one part of the sample data system would feed another, and it produced really very efficient 7090 code for simulating sample data systems. By and large, from that time forth, the acoustics department stopped making hardware. It was much easier to do signal processing digitally than the previous ways, which had been analog. Blodi had an interesting property. It was the only programming language I know where -- this is not my original observation, Vyssotsky said -- where you could take the deck of cards, throw it up the stairs, and pick them up at the bottom of the stairs, feed them into the computer again, and get the same program out.
Blodi, aside from syntax diagnostics, did have one diagnostic when it would fail to compile, and that was "somewhere in your system is a loop that consists of all delays or has no delays" and you can imagine how they handled that.</p> <p class=pp><span style="font-size: 0.7em;">35:09</span> Another interesting programming language of the 60s was <a href="http://www.knowltonmosaics.com/">Ken Knowlton</a>’s <a href="http://beflix.com/beflix.php">Beflix</a>. This was for making movies on something with resolution kind of comparable to 640x480, really coarse, and the programming notion in here was bugs. You put on your grid a bunch of bugs, and each bug carried along some data as baggage, and then you would do things like cellular automata operations. You could program it or you could kind of let it go by itself. If a red bug is next to a blue bug then it turns into a green bug on the following step and so on. <span style="font-size: 0.7em;">36:28</span> He and Lillian Schwartz made some interesting abstract movies at the time. It also did some interesting picture processing. One wonderful picture of a reclining nude, something about the size of that blackboard over there, all made of pixels about a half inch high each with a different little picture in it, picked out for their density, and so if you looked at it close up it consisted of pickaxes and candles and dogs, and if you looked at it far enough away, it was a <a href="http://blog.the-eg.com/2007/12/03/ken-knowlton-mosaics/">reclining nude</a>. That picture got a lot of play all around the country.</p> <p class=pp>Lorinda Cherry: That was with Leon, wasn’t it? That was with <a href="https://en.wikipedia.org/wiki/Leon_Harmon">Leon Harmon</a>.</p> <p class=pp>Doug: Was that Harmon?</p> <p class=pp>Lorinda: ?...?</p> <p class=pp>Doug: Harmon was also an interesting character. He did more things than pictures. I’m glad you reminded me of him. I had him written down here. Harmon was a guy who among other things did a block diagram compiler for writing a handwriting recognition program. I never did understand how his scheme worked, and in fact I guess it didn’t work too well. [laughter] It didn’t do any production ?things? but it was an absolutely immense sample data circuit for doing handwriting recognition. Harmon’s most famous work was trying to estimate the information content in a face. And every one of these pictures which are a cliche now, that show a face digitized very coarsely, go back to Harmon’s <a href="https://web.archive.org/web/20080807162812/http://www.doubletakeimages.com/history.htm">first psychological experiments</a>, when he tried to find out how many bits of picture he needed to try to make a face recognizable. He went around and digitized about 256 faces from Bell Labs and did real psychological experiments asking which faces could be distinguished from other ones. I had the good fortune to have one of the most distinguishable faces, and consequently you’ll find me in freshman psychology texts through no fault of my own.</p> <p class=pp><span style="font-size: 0.7em;">39:15</span> Another thing going on in the 60s was the halting beginning here of interactive computing. And again the credit has to go to the acoustics research department, for good and sufficient reason. They wanted to be able to feed signals into the machine, and look at them, and get them back out.
They bought yet another weird architecture machine called the <a href="http://www.piercefuller.com/library/pb250.html">Packard Bell 250</a>, where the memory elements were <a href="http://en.wikipedia.org/wiki/Delay_line_memory">mercury delay lines</a>.</p> <p class=pp>Question: Packard Bell?</p> <p class=pp>Doug: Packard Bell, same one that makes PCs today.</p> <p class=pp><span style="font-size: 0.7em;">40:10</span> They hung this off of the comp center 7090 and put in a scheme for quickly shipping jobs into the job stream on the 7090. The Packard Bell was the real-time terminal that you could play with and repair stuff, ?...? off the 7090, get it back, and then you could play it. From that grew some graphics machines also, built by ?...? et al. And it was one of the old graphics machines in fact that Ken picked up to build Unix on.</p> <p class=pp><span style="font-size: 0.7em;">40:55</span> Another thing that went on in the acoustics department was synthetic speech and music. <a href="http://csounds.com/mathews/index.html">Max Mathews</a>, who was the director of the department, has long been interested in computer music. In fact since retirement he spent a lot of time with Pierre Boulez in Paris at a wonderful institute with lots of money simply for making synthetic music. He had a language called Music 5. Synthetic speech or, well first of all simply speech processing was pioneered particularly by <a href="http://en.wikipedia.org/wiki/John_Larry_Kelly,_Jr">John Kelly</a>. I remember my first contact with speech processing. It was customary, for the benefit of computer operators, to put a loudspeaker on the low bit of some register on the machine, and normally the operator would just hear kind of white noise. But if you got into a loop, suddenly the machine would scream, and this signal could be used to tell the operator "oh, the machine’s in a loop. Go stop it and go on to the next job." I remember feeding them an Ackermann’s function routine once. [laughter] They were right. It was a silly loop. But anyway. One day, the operators were ?...?. The machine started singing. Out of the blue. &ldquo;Help! I’m caught in a loop.&rdquo; [laughter] And in a broad Texas accent, which was the recorded voice of John Kelly.</p> <p class=pp><span style="font-size: 0.7em;">43:14</span> However. From there Kelly went on to do some speech synthesis. Of course there’s been a lot more speech synthesis work done since, by <span style="font-size: 0.7em;">43:31</span> folks like Cecil Coker, Joe Olive. But they produced a record, which unfortunately I can’t play because records are not modern anymore. And everybody got one in the Bell Labs Record, which is a magazine, contained once a record from the acoustics department, with both speech and music and one very famous combination where the computer played and sang "A Bicycle Built For Two".</p> <p class=pp>?...?</p> <p class=pp><span style="font-size: 0.7em;">44:32</span> At the same time as all this stuff is going on here, needless to say computing is going on in the rest of the Labs. It was about early 1960 when the math department lost its monopoly on computing machines and other people started buying them too, but for switching. The first experiments with switching computers were operational in around 1960. They were planned for several years prior to that; essentially as soon as the transistor was invented, the making of electronic rather than electromechanical switching machines was anticipated.
Part of the saga of the switching machines is cheap memory. These machines had enormous memories -- thousands of words. [laughter] And it was said that the present worth of each word of memory that programmers saved across the Bell System was something like eleven dollars, as I recall. And it was worthwhile to struggle to save some memory. Also, programs were permanent. You were going to load up the switching machine with switching program and that was going to run. You didn’t change it every minute or two. And it would be cheaper to put it in read only memory than in core memory. And there was a whole series of wild read-only memories, both tried and built. The first experimental Essex System had a thing called the flying spot store which was large photographic plates with bits on them and CRTs projecting on the plates and you would detect underneath on the photodetector whether the bit was set or not. That was the program store of Essex. The program store of the first ESS systems consisted of twistors, which I actually am not sure I understand to this day, but they consist of iron wire with a copper wire wrapped around them and vice versa. There were also experiments with an IC type memory called the waffle iron. Then there was a period when magnetic bubbles were all the rage. As far as I know, although microelectronics made a lot of memory, most of the memory work at Bell Labs has not had much effect on ?...?. Nice tries though.</p> <p class=pp><span style="font-size: 0.7em;">48:28</span> Another thing that folks began to work on was the application of (and of course, right from the start) computers to data processing. When you owned equipment scattered through every street in the country, and you have a hundred million customers, and you have bills for a hundred million transactions a day, there’s really some big data processing going on. And indeed in the early 60s, AT&T was thinking of making its own data processing computers solely for billing. Somehow they pulled out of that, and gave all the technology to IBM, and one piece of that technology went into use in high end equipment called tractor tapes. Inch-wide magnetic tapes that were used for a while.</p> <p class=pp><span style="font-size: 0.7em;">49:50</span> By and large, although Bell Labs has participated until fairly recently in data processing in quite a big way, AT&T never really quite trusted the Labs to do it right because here is where the money is. I can recall one occasion when, during a strike, one of the temporary fill-in employees from the Laboratories lost a day’s billing tape in Chicago. And that was a million dollars. And, generally speaking, the money people did not until fairly recently trust Bell Labs to take good care of money, even though they trusted the Labs very well to make extremely reliable computing equipment for switches. The downtime on switches is still spectacular by any industry standards. The design for the first ones was two hours down in 40 years, and the design was met. Great emphasis on reliability and redundancy, testing.</p> <p class=pp><span style="font-size: 0.7em;">51:35</span> Another branch of computing was for the government. The whole Whippany Laboratories [time check] Whippany, where we took on contracts for the government particularly in the computing area in anti-missile defense, missile defense, and underwater sound. Missile defense was a very impressive undertaking.
It was about in the early ’63 time frame when it was estimated the amount of computation to do a reasonable job of tracking incoming missiles would be 30 million floating point operations a second. In the day of the Cray that doesn’t sound like a great lot, but it’s more than your high end PCs can do. And the machines were supposed to be reliable. They designed the machines at Whippany, a twelve-processor multiprocessor, to no specs, enormously rugged, one-watt transistors. This thing in real life performed remarkably well. There were sixty-five missile shots, tests across the Pacific Ocean ?...? and Lorinda Cherry here actually sat there waiting for them to come in. [laughter] And only a half dozen of them really failed. As a measure of the interest in reliability, one of them failed apparently due to processor error. Two people were assigned to look at the dumps, enormous amounts of telemetry and logging information were taken during these tests, which are truly expensive to run. Two people were assigned to look at the dumps. A year later they had not found the trouble. The team was beefed up. They finally decided that there was a race condition in one circuit. They then realized that this particular kind of race condition had not been tested for in all the simulations. They went back and simulated the entire hardware system to see if there was a remote possibility of any similar cases, found twelve of them, and changed the hardware. But to spend over a year looking for a bug is a sign of what reliability meant.</p> <p class=pp><span style="font-size: 0.7em;">54:56</span> Since I’m coming up on the end of an hour, one could go on and on and on,</p> <p class=pp>Crowd: go on, go on. [laughter]</p> <p class=pp><span style="font-size: 0.7em;">55:10</span> Doug: I think I’d like to end up by mentioning a few of the programs that have been written at Bell Labs that I think are most surprising. Of course there are lots of grand programs that have been written.</p> <p class=pp>I already mentioned the block diagram compiler.</p> <p class=pp>Another really remarkable piece of work was <a href="eqn.pdf">eqn</a>, the equation typesetting language, which has been imitated since, by Lorinda Cherry and Brian Kernighan. The notion of taking an auditory syntax, the way people talk about equations, but only talk, this was not borrowed from any written notation before, getting the auditory one down on paper, that was very successful and surprising.</p> <p class=pp>Another of my favorites, and again Lorinda Cherry was in this one, with Bob Morris, was typo. This was a program for finding spelling errors. It didn’t know the first thing about spelling. It would read a document, measure its statistics, and print out the words of the document in increasing order of what it thought the likelihood of that word having come from the same statistical source as the document. The words that did not come from the statistical source of the document were likely to be typos, and now I mean typos as distinct from spelling errors, where you actually hit the wrong key. Those tend to be off the wall, whereas phonetic spelling errors you’ll never find. And this worked remarkably well. Typing errors would come right up to the top of the list.
A really really neat program.</p> <p class=pp><span style="font-size: 0.7em;">57:50</span> Another one of my favorites was by Brenda Baker called <a href="http://doi.acm.org/10.1145/800168.811545">struct</a>, which took Fortran programs and converted them into a structured programming language called Ratfor, which was Fortran with C syntax. This seemed like a possible undertaking, like something you do by the seat of the pants and you get something out. In fact, folks at Lockheed had done things like that before. But Brenda managed to find theorems that said there’s really only one form, a canonical form, into which you can structure a Fortran program, and she did this. It took your Fortran program, completely mashed it, and put it out, almost certainly in a different order than the original GOTO-connected Fortran, without any GOTOs. And the really remarkable thing was that authors of the program, who clearly knew the way they wrote it in the first place, preferred it after it had been rearranged by Brenda. I was astonished at the outcome of that project.</p> <p class=pp><span style="font-size: 0.7em;">59:19</span> Another first that happened around here was by Fred Grampp, who got interested in computer security. One day he decided he would make a program for sniffing the security arrangements on a computer, as a service: Fred would never do anything crooked. [laughter] This particular program did a remarkable job, and founded a whole minor industry within the company. A department was set up to take this idea and parlay it, and indeed ever since there has been some improvement in the way computer centers are managed, at least until we got Berkeley Unix.</p> <p class=pp><span style="font-size: 0.7em;">60:24</span> And the last interesting program that I have time to mention is one by <a href="http://www.cs.jhu.edu/~kchurch/">Ken Church</a>. He was dealing with -- text processing has always been a continuing ?...? of the research, and in some sense it has an application to our business because we’re handling speech, but he got into consulting with the department in North Carolina that has to translate manuals. There are millions of pages of manuals in the Bell System and its successors, and ever since we’ve gone global, these things have had to be translated into many languages.</p> <p class=pp><span style="font-size: 0.7em;">61:28</span> To help in this, he was making tools which would quickly put up on the screen a piece of text and its translation, because a translator, particularly a technical translator, wants to know: the last time we mentioned this word, how was it translated? You don’t want to be creative in translating technical text. You’d like to be able to go back into the archives and pull up examples of translated text. And the neat thing here is the idea for how you align texts in two languages. You’ve got the original, you’ve got the translated one; how do you bring up on the screen the two sentences that go together? And the following scam worked beautifully. This is on western languages. <span style="font-size: 0.7em;">62:33</span> Simply look for common tetragrams, four-letter combinations, between the two, and as best as you can, line them up as nearly linearly with the lengths of the two texts as possible. And this <a href="church-tetragram.pdf">very simple idea</a> works like a storm. Something for nothing. 
I like that.</p> <p class=pp><span style="font-size: 0.7em;">63:10</span> The last thing is one slogan that sort of got started with Unix and is just rife within the industry now. Software tools. We were making software tools in Unix before we knew we were, just like the Molière character was amazed at discovering he’d been speaking prose all his life. [laughter] But then <a href="http://www.amazon.com/-/dp/020103669X">Kernighan and Plauger</a> came along and christened what was going on: making simple, generally useful, compositional programs that do one thing, do it well, and fit together. They called it software tools, wrote a book, and this notion is now abroad in the industry. And it really did all begin up in the little attic room where you [points?] sat for many years writing, up here.</p> <p class=pp> Oh, I forgot: I haven’t used any slides. I’ve brought some, but I don’t like looking at bullets and you wouldn’t either. And I forgot to show you the one exhibit I brought, which I borrowed from Bob Kurshan. When Bell Labs was founded, it had of course some calculating machines, and it had one wonderful computer. This. That was bought in 1918. There’s almost no other computing equipment from any time prior to ten years ago that still exists in Bell Labs. This is an <a href="http://infolab.stanford.edu/pub/voy/museum/pictures/display/2-5-Mechanical.html">integraph</a>. It has two styluses. You trace a curve on a piece of paper with one stylus and the other stylus draws the indefinite integral here. There was somebody in the math department who gave this service to the whole company, with about 24 hours’ turnaround time, calculating integrals. Our recent vice president Arno Penzias, coming from a different background, actually calculated integrals differently: he had a chemical balance, and he cut the curves out of the paper and weighed them. This was bought in 1918, so it’s eighty years old. It used to be shiny metal, it’s a little bit rusty now. But it still works.</p> <p class=pp><span style="font-size: 0.7em;">66:30</span> Well, that’s a once over lightly of a whole lot of things that have gone on at Bell Labs. It’s just such a fun place that, as I said, one could just go on and on. If you’re interested, there actually is a history written. This is only one of about six volumes; <a href="http://www.amazon.com/gp/product/0932764061">this</a> is the one that has the mathematical computer sciences, the kind of things that I’ve mostly talked about here. A few people have copies of them. For some reason, the AT&T publishing house thinks that because they’re history they’re obsolete, and they stopped printing them. [laughter]</p> <p class=pp>Thank you, and that’s all.</p></p> Using Uninitialized Memory for Fun and Profit tag:research.swtch.com,2012:research.swtch.com/sparse 2008-03-14T00:00:00-04:00 2008-03-14T00:00:00-04:00 An unusual but very useful data structure <p><p class=lp> This is the story of a clever trick that's been around for at least 35 years, in which array values can be left uninitialized and then read during normal operations, yet the code behaves correctly no matter what garbage is sitting in the array. Like the best programming tricks, this one is the right tool for the job in certain situations. The sleaziness of uninitialized data access is offset by performance improvements: some important operations change from linear to constant time. 
</p> <p class=pp> Alfred Aho, John Hopcroft, and Jeffrey Ullman's 1974 book <i>The Design and Analysis of Computer Algorithms</i> hints at the trick in an exercise (Chapter 2, exercise 2.12): </p> <blockquote> Develop a technique to initialize an entry of a matrix to zero the first time it is accessed, thereby eliminating the <i>O</i>(||<i>V</i>||<sup>2</sup>) time to initialize an adjacency matrix. </blockquote> <p class=lp> Jon Bentley's 1986 book <a href="http://www.cs.bell-labs.com/cm/cs/pearls/"><i>Programming Pearls</i></a> expands on the exercise (Column 1, exercise 8; <a href="http://www.cs.bell-labs.com/cm/cs/pearls/sec016.html">exercise 9</a> in the Second Edition): </p> <blockquote> One problem with trading more space for less time is that initializing the space can itself take a great deal of time. Show how to circumvent this problem by designing a technique to initialize an entry of a vector to zero the first time it is accessed. Your scheme should use constant time for initialization and each vector access; you may use extra space proportional to the size of the vector. Because this method reduces initialization time by using even more space, it should be considered only when space is cheap, time is dear, and the vector is sparse. </blockquote> <p class=lp> Aho, Hopcroft, and Ullman's exercise talks about a matrix and Bentley's exercise talks about a vector, but for now let's consider just a simple set of integers. </p> <p class=pp> One popular representation of a set of <i>n</i> integers ranging from 0 to <i>m</i> is a bit vector, with 1 bits at the positions corresponding to the integers in the set. Adding a new integer to the set, removing an integer from the set, and checking whether a particular integer is in the set are all very fast constant-time operations (just a few bit operations each). Unfortunately, two important operations are slow: iterating over all the elements in the set takes time <i>O</i>(<i>m</i>), as does clearing the set. If the common case is that <i>m</i> is much larger than <i>n</i> (that is, the set is only sparsely populated) and iterating or clearing the set happens frequently, then it could be better to use a representation that makes those operations more efficient. That's where the trick comes in. </p> <p class=pp> Preston Briggs and Linda Torczon's 1993 paper, &ldquo;<a href="http://citeseer.ist.psu.edu/briggs93efficient.html"><b>An Efficient Representation for Sparse Sets</b></a>,&rdquo; describes the trick in detail. Their solution represents the sparse set using an integer array named <code>dense</code> and an integer <code>n</code> that counts the number of elements in <code>dense</code>. The <i>dense</i> array is simply a packed list of the elements in the set, stored in order of insertion. If the set contains the elements 5, 1, and 4, then <code>n = 3</code> and <code>dense[0] = 5</code>, <code>dense[1] = 1</code>, <code>dense[2] = 4</code>: </p> <center> <img src="http://research.swtch.com/sparse0.png" /> </center> <p class=pp> Together <code>n</code> and <code>dense</code> are enough information to reconstruct the set, but this representation is not very fast. To make it fast, Briggs and Torczon add a second array named <code>sparse</code> which maps integers to their indices in <code>dense</code>. Continuing the example, <code>sparse[5] = 0</code>, <code>sparse[1] = 1</code>, <code>sparse[4] = 2</code>. 
Essentially, the set is a pair of arrays that point at each other: </p> <center> <img src="http://research.swtch.com/sparse0b.png" /> </center> <p class=pp> Adding a member to the set requires updating both of these arrays: </p> <pre class=indent> add-member(i): &nbsp;&nbsp;&nbsp;&nbsp;dense[n] = i &nbsp;&nbsp;&nbsp;&nbsp;sparse[i] = n &nbsp;&nbsp;&nbsp;&nbsp;n++ </pre> <p class=lp> It's not as efficient as flipping a bit in a bit vector, but it's still very fast and constant time. (As written, <code>add-member</code> assumes <code>i</code> is not already in the set; check with <code>is-member</code> first if that isn't guaranteed.) </p> <p class=pp> To check whether <code>i</code> is in the set, you verify that the two arrays point at each other for that element: </p> <pre class=indent> is-member(i): &nbsp;&nbsp;&nbsp;&nbsp;return sparse[i] &lt; n && dense[sparse[i]] == i </pre> <p class=lp> If <code>i</code> is not in the set, then <i>it doesn't matter what <code>sparse[i]</code> is set to</i>: either <code>sparse[i]</code> will be at least <code>n</code> or it will point at a value in <code>dense</code> that doesn't point back at it. Either way, we're not fooled. For example, suppose <code>sparse</code> actually looks like: </p> <center> <img src="http://research.swtch.com/sparse1.png" /> </center> <p class=lp> <code>Is-member</code> knows to ignore entries in <code>sparse</code> that point at or past <code>n</code>, or that point at cells in <code>dense</code> that don't point back, skipping the grayed-out entries: </p> <center> <img src="http://research.swtch.com/sparse2.png" /> </center> <p class=pp> Notice what just happened: <code>sparse</code> can have <i>any arbitrary values</i> in the positions for integers not in the set, those values actually get used during membership tests, and yet the membership test behaves correctly! (This would drive <a href="http://valgrind.org/">valgrind</a> nuts.) </p> <p class=pp> Clearing the set can be done in constant time: </p> <pre class=indent> clear-set(): &nbsp;&nbsp;&nbsp;&nbsp;n = 0 </pre> <p class=lp> Zeroing <code>n</code> effectively clears <code>dense</code> (the code only ever accesses entries in <code>dense</code> with indices less than <code>n</code>), and <code>sparse</code> can be left uninitialized, so there's no need to clear out the old values. </p> <p class=pp> This sparse set representation has one more trick up its sleeve: the <code>dense</code> array allows an efficient implementation of set iteration. </p> <pre class=indent> iterate(): &nbsp;&nbsp;&nbsp;&nbsp;for(i=0; i&lt;n; i++) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;yield dense[i] </pre> <p class=pp> Let's compare the run times of a bit vector implementation against the sparse set: </p> <center> <table> <tr> <td><i>Operation</i> <td align=center width=10> <td align=center><i>Bit Vector</i> <td align=center width=10> <td align=center><i>Sparse set</i> </tr> <tr> <td>is-member <td> <td align=center><i>O</i>(1) <td> <td align=center><i>O</i>(1) </tr> <tr> <td>add-member <td> <td align=center><i>O</i>(1) <td> <td align=center><i>O</i>(1) </tr> <tr> <td>clear-set <td><td align=center><i>O</i>(<i>m</i>) <td><td align=center><i>O</i>(1) </tr> <tr> <td>iterate <td><td align=center><i>O</i>(<i>m</i>) <td><td align=center><i>O</i>(<i>n</i>) </tr> </table> </center> <p class=lp> The sparse set is as fast or faster than bit vectors for every operation. The only problem is the space cost: two words replace each bit. Still, there are times when the speed differences are enough to balance the added memory cost. 
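<p class=pp> For concreteness, here is one way the four operations might look in Go. This is my sketch, not code from Briggs and Torczon's paper; the type and method names are made up, and since Go zeroes newly allocated memory, the sketch cannot literally read uninitialized data. The point survives the translation, though: correctness never depends on the old contents of <code>sparse</code>, so <code>Clear</code> doesn't have to wipe anything. </p> <pre class=indent>
package sparse

// SparseSet is a Briggs-Torczon sparse set of integers in [0, m).
type SparseSet struct {
	dense  []int // packed elements, in insertion order
	sparse []int // sparse[x] = index of x in dense, if x is a member
	n      int   // number of elements currently in the set
}

// New allocates a set for integers in [0, m). Go zeroes the slices,
// but the algorithm would be correct even if they held garbage.
func New(m int) *SparseSet {
	return &SparseSet{dense: make([]int, m), sparse: make([]int, m)}
}

// Has reports whether i is in the set. If i was never added,
// sparse[i] is arbitrary, but then either sparse[i] >= s.n or
// dense[sparse[i]] != i, so the cross-check still rejects it.
func (s *SparseSet) Has(i int) bool {
	j := s.sparse[i]
	return j &lt; s.n && s.dense[j] == i
}

// Add inserts i if it is not already present.
func (s *SparseSet) Add(i int) {
	if s.Has(i) {
		return
	}
	s.dense[s.n] = i
	s.sparse[i] = s.n
	s.n++
}

// Clear empties the set in constant time, leaving stale values
// behind in dense and sparse that Has already knows to ignore.
func (s *SparseSet) Clear() { s.n = 0 }

// Elements returns the members in insertion order.
func (s *SparseSet) Elements() []int { return s.dense[:s.n] }
</pre> <p class=lp> Adding 5, 1, and 4 to a fresh set and ranging over <code>Elements()</code> yields 5, 1, 4, the insertion-order iteration described above. In a language like C, where <code>malloc</code> really does return uninitialized memory, the same structure skips the <i>O</i>(<i>m</i>) cost of zeroing <code>sparse</code> at creation and on every clear. </p>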
Briggs and Torczon point out that liveness sets used during register allocation inside a compiler are usually small and are cleared very frequently, making sparse sets the representation of choice. </p> <p class=pp> Another situation where sparse sets are the better choice is work queue-based graph traversal algorithms. Iteration over sparse sets visits elements in the order they were inserted (above, 5, 1, 4), so that new entries inserted during the iteration will be visited later in the same iteration. In contrast, iteration over bit vectors visits elements in integer order (1, 4, 5), so that new elements inserted during traversal might be missed, requiring repeated iterations. </p> <p class=pp> Returning to the original exercises, it is trivial to change the set into a vector (or matrix) by making <code>dense</code> an array of index-value pairs instead of just indices. Alternately, one might add the value to the <code>sparse</code> array or to a new array. The relative space overhead isn't as bad if you would have been storing values anyway. </p> <p class=pp> Briggs and Torczon's paper implements additional set operations and examines performance speedups from using sparse sets inside a real compiler. </p></p> Play Tic-Tac-Toe with Knuth tag:research.swtch.com,2012:research.swtch.com/tictactoe 2008-01-25T00:00:00-05:00 2008-01-25T00:00:00-05:00 The only winning move is not to play. <p><p class=lp>Section 7.1.2 of the <b><a href="http://www-cs-faculty.stanford.edu/~knuth/taocp.html#vol4">Volume 4 pre-fascicle 0A</a></b> of Donald Knuth's <i>The Art of Computer Programming</i> is titled &#8220;Boolean Evaluation.&#8221; In it, Knuth considers the construction of a set of nine boolean functions telling the correct next move in an optimal game of tic-tac-toe. In a footnote, Knuth tells this story:</p> <blockquote><p class=lp>This setup is based on an exhibit from the early 1950s at the Museum of Science and Industry in Chicago, where the author was first introduced to the magic of switching circuits. The machine in Chicago, designed by researchers at Bell Telephone Laboratories, allowed me to go first; yet I soon discovered there was no way to defeat it. Therefore I decided to move as stupidly as possible, hoping that the designers had not anticipated such bizarre behavior. In fact I allowed the machine to reach a position where it had two winning moves; and it seized <i>both</i> of them! Moving twice is of course a flagrant violation of the rules, so I had won a moral victory even though the machine had announced that I had lost.</p></blockquote> <p class=lp> That story alone is fairly amusing. But turning the page, the reader finds a quotation from Charles Babbage's <i><a href="http://onlinebooks.library.upenn.edu/webbin/book/lookupid?key=olbp36384">Passages from the Life of a Philosopher</a></i>, published in 1864:</p> <blockquote><p class=lp>I commenced an examination of a game called &#8220;tit-tat-to&#8221; ... to ascertain what number of combinations were required for all the possible variety of moves and situations. I found this to be comparatively insignificant. ... A difficulty, however, arose of a novel kind. When the automaton had to move, it might occur that there were two different moves, each equally conducive to his winning the game. ... Unless, also, some provision were made, the machine would attempt two contradictory motions.</p></blockquote> <p class=lp> The only real winning move is not to play.</p></p> Crabs, the bitmap terror! 
tag:research.swtch.com,2012:research.swtch.com/crabs 2008-01-09T00:00:00-05:00 2008-01-09T00:00:00-05:00 A destructive, pointless violation of the rules <p><p class=lp>Today, window systems seem as inevitable as hierarchical file systems, a fundamental building block of computer systems. But it wasn't always that way. This paper could only have been written in the beginning, when everything about user interfaces was up for grabs.</p> <blockquote><p class=lp>A bitmap screen is a graphic universe where windows, cursors and icons live in harmony, cooperating with each other to achieve functionality and esthetics. A lot of effort goes into making this universe consistent, the basic law being that every window is a self contained, protected world. In particular, (1) a window shall not be affected by the internal activities of another window. (2) A window shall not be affected by activities of the window system not concerning it directly, i.e. (2.1) it shall not notice being obscured (partially or totally) by other windows or obscuring (partially or totally) other windows, (2.2) it shall not see the <i>image</i> of the cursor sliding on its surface (it can only ask for its position).</p> <p class=pp> Of course it is difficult to resist the temptation to break these rules. Violations can be destructive or non-destructive, useful or pointless. Useful non-destructive violations include programs printing out an image of the screen, or magnifying part of the screen in a <i>lens</i> window. Useful destructive violations are represented by the <i>pen</i> program, which allows one to scribble on the screen. Pointless non-destructive violations include a magnet program, where a moving picture of a magnet attracts the cursor, so that one has to continuously pull away from it to keep working. The first pointless, destructive program we wrote was <i>crabs</i>.</p> </blockquote> <p class=lp>As the crabs walk over the screen, they leave gray behind, &#8220;erasing&#8221; the apps underfoot:</p> <blockquote><img src="http://research.swtch.com/crabs1.png"> </blockquote> <p class=lp> For the rest of the story, see Luca Cardelli's &#8220;<a style="font-weight: bold;" href="http://lucacardelli.name/Papers/Crabs.pdf">Crabs: the bitmap terror!</a>&#8221; (6.7MB). Additional details in &#8220;<a href="http://lucacardelli.name/Papers/Crabs%20%28History%20and%20Screen%20Dumps%29.pdf">Crabs (History and Screen Dumps)</a>&#8221; (57.1MB).</p></p>