research!rsc

Thoughts and links about programming, by

RSS

Reproducible, Verifiable, Verified Builds
(Go & Versioning, Part 5)
Posted on Wednesday, February 21, 2018. PDF

Once both Go developers and tools share a vocabulary around package versions, it's relatively straightforward to add support in the toolchain for reproducible, verifiable, and verified builds. In fact, the basics are already in the vgo prototype.

Since people sometimes disagree about the exact definitions of these terms, let's establish some basic terminology. For this post:

Vgo delivers reproducible builds by default. The resulting binaries are verifiable, in that they record versions of the exact source code that went into the build. And it is possible to configure your repository so that users rebuilding your software verify that their builds match yours, using cryptographic hashes, no matter how they obtain the dependencies.

Reproducible Builds

At the very least, we want to make sure that when you build my program, the build system decides to use the same versions of the code. Minimal version selection delivers this property by default. The go.mod file alone is enough to uniquely determine which module versions should be used for the build (assuming dependencies are available), and that decision is stable even as new versions of a module are introduced into the ecosystem. This differs from most other systems, which adopt new versions automatically and need to be constrained to yield reproducible builds. I covered this in the minimal version selection post, but it's an important, subtle detail, so I'll try to give a short reprise here.

To make this concrete, let's look at a few real packages from Cargo, Rust's package manager. To be clear, I am not picking on Cargo. I think Cargo is an example of the current state of the art in package managers, and there's much to learn from it. If we can make Go package management as smooth as Cargo's, I'll be happy. But I also think that it is worth exploring whether we would benefit from choosing a different default when it comes to version selection.

Cargo prefers maximum versions in the following sense. Over at crates.io, the latest toml is 0.4.5 as I write this post. It lists a dependency on serde 1.0 or later; the latest serde is 1.0.27. If you start a new project and add a dependency on toml 0.4.1 or later, Cargo has a choice to make. According to the constraints, any of 0.4.1, 0.4.2, 0.4.3, 0.4.4, or 0.4.5 would be acceptable. All other things being equal, Cargo prefers the newest acceptable version, 0.4.5. Similarly, any of serde 1.0.0 through 1.0.27 are acceptable, and Cargo chooses 1.0.27. These choices change as new versions are introduced. If serde 1.0.28 is released tonight and I add toml 0.4.5 to a project tomorrow, I'll get 1.0.28 instead of 1.0.27. As described so far, Cargo's builds are not repeatable. Cargo's (entirely reasonable) answer to this problem is to have not just a constraint file (the manifest, Cargo.toml) but also a list of the exact artifacts to use in the build (the lock file, Cargo.lock). The lock file stops future upgrades; once it is written, your build stays on serde 1.0.27 even when 1.0.28 is released.

In contrast, minimal version selection prefers the minimum allowed version, which is the exact version requested by some go.mod in the project. That answer does not change as new versions are added. Given the same choices in the Cargo example, vgo would select toml 0.4.1 (what you requested) and then serde 1.0 (what toml requested). Those choices are stable, without a lock file. This is what I mean when I say that vgo's builds are reproducible by default.

Verifiable Builds

Go binaries have long included a string indicating the version of Go they were built with. Last year I wrote a tool rsc.io/goversion that fetches that information from a given executable or tree of executables. For example, on my Ubuntu Linux laptop, I can look to see which system utilities are implemented in Go:

$ go get -u rsc.io/goversion
$ goversion /usr/bin
/usr/bin/containerd go1.8.3
/usr/bin/containerd-shim go1.8.3
/usr/bin/ctr go1.8.3
/usr/bin/go go1.8.3
/usr/bin/gofmt go1.8.3
/usr/bin/kbfsfuse go1.8.3
/usr/bin/kbnm go1.8.3
/usr/bin/keybase go1.8.3
/usr/bin/snap go1.8.3
/usr/bin/snapctl go1.8.3
$ 

Now that the vgo prototype understands module versions, it includes that information in the final binary too, and the new goversion -m flag prints it back out. Using our “hello, world” program from the tour:

$ go get -u rsc.io/goversion
$ goversion ./hello
./hello go1.10
$ goversion -m hello
./hello go1.10
    path  github.com/you/hello
    mod   github.com/you/hello  (devel)
    dep   golang.org/x/text     v0.0.0-20170915032832-14c0d48ead0c
    dep   rsc.io/quote          v1.5.2
    dep   rsc.io/sampler        v1.3.0
$ 

The main module, supposedly github.com/you/hello, has no version information, because it's the local development copy, not a specific version we downloaded. But if instead we build a command directly from a versioned module, then the listing does report versions for all modules:

$ vgo build -o hello2 rsc.io/hello
vgo: resolving import "rsc.io/hello"
vgo: finding rsc.io/hello (latest)
vgo: adding rsc.io/hello v1.0.0
vgo: finding rsc.io/hello v1.0.0
vgo: finding rsc.io/quote v1.5.1
vgo: downloading rsc.io/hello v1.0.0
$ goversion -m ./hello2
./hello2 go1.10
    path  rsc.io/hello
    mod   rsc.io/hello       v1.0.0
    dep   golang.org/x/text  v0.0.0-20170915032832-14c0d48ead0c
    dep   rsc.io/quote       v1.5.2
    dep   rsc.io/sampler     v1.3.0
$ 

When we do integrate versions into the main Go toolchain, we will add APIs to access this information from inside a running binary, just like runtime.Version provides access to the more limited Go version information.

For the purpose of attempting to reconstruct the binary, the information listed by goversion -m suffices: put the versions into a go.mod file and build the target named on the path line. But if the result is not the same binary, you might wonder about ways to narrow down what's different. What changed?

When vgo downloads each module, it computes a hash of the file tree corresponding to that module. That hash is also included in the binary, alongside the version information, and goversion -mh prints it:

$ goversion -mh ./hello
hello go1.10
    path  github.com/you/hello
    mod   github.com/you/hello  (devel)
    dep   golang.org/x/text     v0.0.0-20170915032832-14c0d48ead0c  h1:qgOY6WgZOaTkIIMiVjBQcw93ERBE4m30iBm00nkL0i8=
    dep   rsc.io/quote          v1.5.2                              h1:w5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
    dep   rsc.io/sampler        v1.3.1                              h1:F0c3J2nQCdk9ODsNhU3sElnvPIxM/xV1c/qZuAeZmac=
$ goversion -mh ./hello2
hello go1.10
    path  rsc.io/hello
    mod   rsc.io/hello       v1.0.0                              h1:CDmhdOARcor1WuRUvmE46PK91ahrSoEJqiCbf7FA56U=
    dep   golang.org/x/text  v0.0.0-20170915032832-14c0d48ead0c  h1:qgOY6WgZOaTkIIMiVjBQcw93ERBE4m30iBm00nkL0i8=
    dep   rsc.io/quote       v1.5.2                              h1:w5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
    dep   rsc.io/sampler     v1.3.0                              h1:7uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
$ 

The h1: prefix indicates which hash is being reported. Today, there is only “hash 1,” a SHA-256 hash of a list of files and the SHA-256 hashes of their contents. If we need to update to a new hash later, the prefix will help us tell old from new hashes.

I must stress that these hashes are self-reported by the build system. If someone gives you a binary with certain hashes in its build information, there's no guarantee they are accurate. They are very useful information supporting a later verification, not a signature that can be trusted by themselves.

Verified Builds

An author distributing a program in source form might want to let users verify that they are building it with exactly the expected dependencies. We know vgo will make the same decisions about which versions of dependencies to use, but there is still the problem of mapping a version like v1.5.2 to an actual source tree. What if the author of v1.5.2 changes the tag to point at a different file tree? What if a malicious middlebox intercepts the download request and delivers a different zip file? What if the user has accidentally edited the source files in the local copy of v1.5.2? The vgo prototype supports this kind of verification too.

The final form may be somewhat different, but if you create a file named go.modverify next to go.mod, then builds will keep that file up-to-date with known hashes for specific versions of modules:

$ echo >go.modverify
$ vgo build
$ tcat go.modverify  # go get rsc.io/tcat, or use cat
golang.org/x/text  v0.0.0-20170915032832-14c0d48ead0c  h1:qgOY6WgZOaTkIIMiVjBQcw93ERBE4m30iBm00nkL0i8=
rsc.io/quote       v1.5.2                              h1:w5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
rsc.io/sampler     v1.3.0                              h1:7uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
$ 

The go.modverify file is a log of the hash of all versions ever encountered: lines are only added, never removed. If we update rsc.io/sampler to v1.3.1, then the log will now contain hashes for both versions:

$ vgo get rsc.io/sampler@v1.3.1
$ tcat go.modverify
golang.org/x/text  v0.0.0-20170915032832-14c0d48ead0c  h1:qgOY6WgZOaTkIIMiVjBQcw93ERBE4m30iBm00nkL0i8=
rsc.io/quote       v1.5.2                              h1:w5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
rsc.io/sampler     v1.3.0                              h1:7uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
rsc.io/sampler     v1.3.1                              h1:F0c3J2nQCdk9ODsNhU3sElnvPIxM/xV1c/qZuAeZmac=
$ 

When go.modverify exists, vgo checks that all downloaded modules used in a given build are consistent with entries already in the file. For example, if we change the first digit of the rsc.io/quote hash from w to v:

$ vgo build
vgo: verifying rsc.io/quote v1.5.2: module hash mismatch
    downloaded:   h1:w5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
    go.modverify: h1:v5fcysjrx7yqtD/aO+QwRjYZOKnaM9Uh2b40tElTs3Y=
$ 

Or suppose we fix that one but then modify the v1.3.0 hash. Now our build succeeds, because v1.3.0 is not being used by the build, so its line is (correctly) ignored. But if we try to downgrade to v1.3.0, then the build verification will correctly begin failing:

$ vgo build
$ vgo get rsc.io/sampler@v1.3.0
vgo: verifying rsc.io/sampler v1.3.0: module hash mismatch
    downloaded:   h1:7uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
    go.modverify: h1:8uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
$ 

Developers who want to ensure that others rebuild their program with exactly the same sources they did can store a go.modverify in their repository. Then others building using the same repo will automatically get verified builds. For now, only the go.modverify in the top-level module of the build applies. But note that go.modverify lists all dependencies, including indirect dependencies, so the whole build is verified.

The go.modverify feature helps detect unexpected mismatches between downloaded dependencies on different machines. It compares the hashes in go.modverify against hashes computed and saved at module download time. It is also useful to check that downloaded modules have not changed on the local machine since it was downloaded. This is less about security from attacks and more about avoiding mistakes. For example, because source file paths appear in stack traces, it's common to open those files when debugging. If you accidentally (or, I suppose, intentionally) modify the file during the debugging session, it would be nice to be able to detect that later. The vgo verify command does this:

$ go get -u golang.org/x/vgo  # fixed a bug, sorry! :-)
$ vgo verify
all modules verified
$ 

If a source file changes, vgo verify notices:

$ echo >>$GOPATH/src/v/rsc.io/quote@v1.5.2/quote.go
$ vgo verify
rsc.io/quote v1.5.2: dir has been modified (/Users/rsc/src/v/rsc.io/quote@v1.5.2)
$ 

If we restore the file, all is well:

$ gofmt -w $GOPATH/src/v/rsc.io/quote@v1.5.2/quote.go
$ vgo verify
all modules verified
$ 

If cached zip files are modified after download, vgo verify notices that too, although I can't plausibly explain how that might happen:

$ zip $GOPATH/src/v/cache/rsc.io/quote/@v/v1.5.2.zip /etc/resolv.conf
  adding: etc/resolv.conf (deflated 36%)
$ vgo verify
rsc.io/quote v1.5.2: zip has been modified (/Users/rsc/src/v/cache/rsc.io/quote/@v/v1.5.2.zip)
$ 

Because vgo keeps the original zip file after unpacking it, if vgo verify decides that only one of the zip file and the directory tree have been modified, it could even print a diff of the two.

What's Next?

This is implemented already in vgo. You can try it out and use it. As with the rest of vgo, feedback about what doesn't work right (or works great) is appreciated.

The functionality presented here is more the start of something than a finished feature. A cryptographic hash of the file tree is a building block. The go.modverify built on top of it checks that developers all build a particular module with precisely the same dependencies, but there's no verification when downloading a new version of a module (unless someone else already added it to go.modverify), nor is there any sharing of expected hashes between modules.

The exact details of how to fix those two shortcomings are not obvious. It may make sense to allow some kind of cryptographic signatures of the file tree, and to verify that an upgrade finds a version signed with the same key as the previous version. Or it may make sense to adopt an approach along the lines of The Update Framework (TUF), although using their network protocols directly is not practical. Or, instead of using per-repo go.modverify logs, it might make sense to establish some kind of shared global log, a bit like Certificate Transparency, or to use a public identity server like Upspin. There are many avenues we might explore, but all this is getting a little ahead of ourselves. For now we are focused on successfully integrating versioning into the go command.