Go command package management integration

[This is a doc shared with the Go package management group in June 2017.]

Go command + package management integration

Russ Cox

June 2017

Abstract

The Go community has put significant effort into exploring approaches to package management over the past four or more years, from godep, which Keith Rarick started in spring 2013, to the recent dependency management working group’s dep command, led by Sam Boyer. The Go team left package management out of the standard toolchain and encouraged this experimentation largely because we knew we had no expertise here, and we believed that the community would be better able to lead the way. Just as we integrated the community-proposed idea of “vendor” directories into the standard toolchain a few years ago, we now believe it is time to integrate package management into the standard toolchain.

This doc sketches pieces of a tentative plan for integrating package management, as pioneered by dep and its predecessors, into the go command. This is not a complete design document. It enumerates some of the high-level decisions and implications, in order to collect feedback about those. Nearly everything here is subject to change based on feedback from the community and experience using these tools. Certainly many decisions are missing as well.

Goal

The overall goal is to integrate package management as seamlessly as possible into the existing go command and standard toolchain, so that most users reap the benefit of versioned packages without making an effort to do so. A consequence is that, instead of introducing a separate, parallel command set, we want to redefine existing command idioms and workflows to take advantage of package management. For example, instead of having new commands that are “go get for versioned code”, we want to migrate known idioms like “go get” and “go get -u” to work sensibly in a versioned environment.

Semantic versioning

The approach to package management described here depends on semantic versioning (semver.org), as do package managers for most languages today.

Semantic versioning is a particular format for and interpretation of source code version numbers. Semantic versions are always of the form X.Y.Z where X, Y, and Z are decimal numbers. If two versions of a package have the same X.Y, differing only in Z (for example 1.2.3 vs 1.2.4), the only changes in the newer version should be bug fixes. If two versions of a package have the same X but differ in Y (for example 1.2.3 vs 1.3.1), there may also be new features or other important changes in the newer version, but only backwards-compatible ones. If two versions of a package differ in X, there are no guarantees: the newer package may have backwards-incompatible changes and in fact may not resemble the older package in any way other than having the same name (for Go, the same import path).

By the semantics of the version numbers, it should always be safe to replace version X.Y1.Z1 of a package with version X.Y2.Z2 of the same package provided Y2 > Y1 or else both Y2 == Y1 and Z2 >= Z1. When X = 0, different (yet to be determined) rules apply. Go’s approach to package management will require that package authors honor these semantics.
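The replacement rule above can be sketched as a small Go function. This is only an illustration: the names Version, Parse, and SafeToReplace are hypothetical, and because the rules for X = 0 are yet to be determined, the sketch assumes that a v0 version is never a safe replacement.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Version is a parsed semantic version X.Y.Z.
type Version struct {
	Major, Minor, Patch int
}

// Parse parses a string of the form "X.Y.Z" (without a leading "v").
func Parse(s string) (Version, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return Version{}, fmt.Errorf("version %q: want X.Y.Z", s)
	}
	var n [3]int
	for i, p := range parts {
		v, err := strconv.Atoi(p)
		if err != nil || v < 0 {
			return Version{}, fmt.Errorf("version %q: bad component %q", s, p)
		}
		n[i] = v
	}
	return Version{n[0], n[1], n[2]}, nil
}

// SafeToReplace reports whether version from may be replaced by version to:
// the X must match, and either Y2 > Y1, or Y2 == Y1 and Z2 >= Z1.
// Assumption: X = 0 is treated as never safe, since its rules are undecided.
func SafeToReplace(from, to Version) bool {
	if from.Major != to.Major || from.Major == 0 {
		return false
	}
	if to.Minor != from.Minor {
		return to.Minor > from.Minor
	}
	return to.Patch >= from.Patch
}

func main() {
	from, _ := Parse("1.2.3")
	for _, s := range []string{"1.2.4", "1.3.1", "1.2.2", "2.0.0"} {
		to, err := Parse(s)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%v -> %v: %v\n", from, to, SafeToReplace(from, to))
	}
}
```

Under this sketch, 1.2.3 may be replaced by 1.2.4 or 1.3.1, but not by 1.2.2 (older) or 2.0.0 (different X).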

Unfortunately, semantic versioning fails to capture “how breaking” a change is: removal of or a breaking change to some little-used feature that is expected to affect very few existing clients is treated the same as a complete API rewrite. Even for the mostly-non-breaking case, semantic versioning demands that X be incremented, so that if the old version is 1.2.3, the new version must be 2.0.0, because the new version is not strictly backwards compatible with 1.2.3. The expectation of Go’s package management system is that increments of X will typically be for this kind of change and that, for the most part, package authors will not break large numbers of clients without very good reason. Also, package authors might choose to wait until a large number of these small breaking changes have accumulated before issuing them all together as a single 2.0.0, instead of allowing the sequence to force the creation of 3.0.0, 4.0.0, and so on.

Versioned source code

The go command to date has not made any attempt to track versions of source code, and certainly not to allow having two different versions active in a workspace at a time. That needs to change: we might want to reproducibly build two different commands, one that uses v1.2 of a package and one that needs v1.3 or even v3.4 of a package, and we want those two separate builds to be possible in a single workspace.

The unit of versioning is the version control repository (for example, a Git repo). Version numbers are recorded as version control tags named for semantic versions: v1.2.3, where all three numbers are required (the tag must be “v1.2.0” or “v1.0.0”, never “v1.2” or “v1”). Version control tags are expected not to be redefined after being published: once the v1.2.3 tag has been published to mean a particular commit, it should not be redefined to point at a different commit. (The lock file, described below, can double-check that tags don’t move, so that the situation can be detected.)

Versioned snapshots will be stored, unpacked, in GOPATH/vers/repo/path(v1.2.3), which may contain subdirectories, like GOPATH/vers/repo/path(v1.2.3)/sub/dir. The source code is not in GOPATH/src because today code in GOPATH/src is mutable and importable using the path to the directory below src; the versioned code in contrast is immutable and not importable using that kind of path.

The go command needs a syntax for referring to a specific version of a package, so that users can install or test a specific version of a package or binary. The command-line syntax will be repo/path(v1.2.3), or more generally repo/path/sub/dir(v1.2.3): the version is always last, to avoid interrupting the import path repo/path/sub/dir.

Although specific versions must always be fully-qualified, as in v1.2.3, it is allowed on the command line to specify shorter versions, like v1.2 or v1; such versions expand to the latest known such version. “Known” usually means on the local system; only “go get” consults the network, and only in certain circumstances roughly matching today’s behavior (TBD).
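As a rough sketch of that expansion, assuming the set of known versions is already available as a list of tag strings (the function name Expand and its helpers are hypothetical, not part of the go command):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits "v1", "v1.2", or "v1.2.3" into its numeric components.
func parse(v string) ([]int, bool) {
	if !strings.HasPrefix(v, "v") {
		return nil, false
	}
	parts := strings.Split(v[1:], ".")
	if len(parts) > 3 {
		return nil, false
	}
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return nil, false
		}
		nums[i] = n
	}
	return nums, true
}

// hasPrefix reports whether the three-component nums begins with prefix.
func hasPrefix(nums, prefix []int) bool {
	for i, p := range prefix {
		if nums[i] != p {
			return false
		}
	}
	return true
}

// less reports whether a orders before b, comparing component by component.
func less(a, b []int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// Expand returns the latest fully-qualified version in known that
// matches query, where query may be shortened to "v1" or "v1.2".
func Expand(query string, known []string) (string, bool) {
	q, ok := parse(query)
	if !ok {
		return "", false
	}
	best := ""
	var bestNums []int
	for _, k := range known {
		nums, ok := parse(k)
		if !ok || len(nums) != 3 || !hasPrefix(nums, q) {
			continue // skip malformed or non-matching tags
		}
		if bestNums == nil || less(bestNums, nums) {
			best, bestNums = k, nums
		}
	}
	return best, bestNums != nil
}

func main() {
	known := []string{"v1.1.4", "v1.2.0", "v1.2.10", "v1.2.9", "v2.0.1"}
	for _, q := range []string{"v1", "v1.2", "v2", "v3"} {
		v, ok := Expand(q, known)
		fmt.Println(q, v, ok)
	}
}
```

Note that the comparison is numeric, so v1.2.10 is later than v1.2.9 even though it sorts earlier as a string.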

For example:

go get rsc.io/md2html(v1.2)

go install rsc.io/md2html(v1.2)

go test golang.org/x/net/http2(v3.4.5)
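Splitting such an argument into its import path and version is straightforward because the version always comes last. A minimal sketch, with a hypothetical helper name (SplitVersion) and error wording that are not part of any existing tool:

```go
package main

import (
	"fmt"
	"strings"
)

// SplitVersion splits a command-line argument such as
// "rsc.io/md2html(v1.2)" into the import path and the version.
// An argument without a parenthesized suffix has an empty version.
func SplitVersion(arg string) (string, string, error) {
	if !strings.HasSuffix(arg, ")") {
		if strings.ContainsAny(arg, "()") {
			return "", "", fmt.Errorf("%q: version must come last", arg)
		}
		return arg, "", nil
	}
	i := strings.LastIndex(arg, "(")
	if i <= 0 {
		return "", "", fmt.Errorf("%q: missing import path or '('", arg)
	}
	path, version := arg[:i], arg[i+1:len(arg)-1]
	if strings.ContainsAny(path+version, "()") || !strings.HasPrefix(version, "v") {
		return "", "", fmt.Errorf("%q: malformed version suffix", arg)
	}
	return path, version, nil
}

func main() {
	args := []string{
		"rsc.io/md2html(v1.2)",
		"golang.org/x/net/http2(v3.4.5)",
		"rsc.io/md2html",
		"rsc.io(v1)/md2html", // rejected: version interrupts the path
	}
	for _, arg := range args {
		path, version, err := SplitVersion(arg)
		fmt.Printf("%q %q %v\n", path, version, err)
	}
}
```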

Import statements in source code will not include versions. Matching most other language package managers as well as dep, information about version requirements is stored in manifest files (see below). This reduces the amount of code that must be changed to switch to a new version of a dependency.

Versioned source servers

Many companies using Go are interested in being able to host source code packages on their own servers, instead of fetching them directly from the original source (for example, GitHub). This allows vetting of the code and insulates against those sources being removed or going down. We will define a simple configuration mechanism for opting in to the use of such a server, an HTTP-based API for that server, and a reference implementation that serves from a collection of snapshot archives in a local file system.

In the long term, we would like to investigate the possibility of running a public server that would at least serve metadata about versions and dependencies of all known packages, and optionally actual package snapshots. That may enable important performance improvements, as it seems to in Rust’s cargo. However, it is also possible that good performance can be had without such a central server. In any event, whether and how to have a public server is a decision for later, after the rest of the package management system is implemented, working, and in use.

Source code dependents

The go command also needs a syntax for referring to a package as used by a given target. That syntax is target//dependency, with two slashes.

For example, rsc.io/md2html(v1.2) may import github.com/russross/blackfriday, which in turn imports github.com/shurcooL/sanitized_anchor_name. It may be that a default build of blackfriday would use the latest sanitized_anchor_name, v2.3.4. Running “go test .../blackfriday” would test blackfriday with sanitized_anchor_name(v2.3.4). But perhaps that version is, for some reason, incompatible with md2html, so md2html disallows use of that version. Then, within the md2html(v1.2) build, the compilation of blackfriday will use sanitized_anchor_name v2.3.3. The go command needs a way to refer to that compilation of blackfriday. It is:

rsc.io/md2html(v1.2)//github.com/russross/blackfriday

as in:

go test rsc.io/md2html(v1.2)//github.com/russross/blackfriday

This tests blackfriday as built for inclusion in md2html(v1.2). Note that the equivalent today would be for rsc.io/md2html to have a vendor directory containing specific versions of blackfriday and sanitized_anchor_name, and then the command would be:

go test rsc.io/md2html(v1.2)/vendor/github.com/russross/blackfriday

The double-slash syntax is, in some sense, the equivalent vendored path without the word vendor, because the code is no longer vendored. It’s important that the syntax be restricted to the command line and be invalid for import paths: one cannot import either one directly, even today.

It is not expected that users will type these commands every day, but it is important that the commands are possible to type. Just like ... cannot match /vendor/ (as of Go 1.9), it also cannot match //. One result is that it is possible to test all dependencies of md2html(v1.2) as built for that command by using

go test rsc.io/md2html(v1.2)//...

(This is analogous to go test rsc.io/md2html/vendor/... today, assuming that md2html vendors all its dependencies.)
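A command-line argument in this form can be separated at the double slash. The sketch below uses a hypothetical helper, SplitTarget; how nested forms (more than one //) should behave is not specified here, so the sketch simply splits at the first occurrence.

```go
package main

import (
	"fmt"
	"strings"
)

// SplitTarget splits "target//dependency" at the first double slash.
// An argument without "//" names an ordinary package, so the returned
// dependency is empty.
func SplitTarget(arg string) (string, string, error) {
	target, dep, found := strings.Cut(arg, "//")
	if !found {
		return arg, "", nil
	}
	if target == "" || dep == "" {
		return "", "", fmt.Errorf("%q: empty target or dependency", arg)
	}
	return target, dep, nil
}

func main() {
	args := []string{
		"rsc.io/md2html(v1.2)//github.com/russross/blackfriday",
		"rsc.io/md2html(v1.2)//...",
		"rsc.io/md2html(v1.2)",
	}
	for _, arg := range args {
		target, dep, err := SplitTarget(arg)
		fmt.Printf("target=%q dependency=%q err=%v\n", target, dep, err)
	}
}
```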

Resolving versions

To load information about a package, the go command must, for every import in the package’s source code, decide which version of the imported package to use.

The package management systems for most other languages allow packages to declare compatibility or incompatibility with specific versions of their dependencies and then use some kind of solver - now often an off-the-shelf SAT solver - to search for a solution. This is an NP-complete task, and when no satisfying assignment can be found, there is in general no decent explanation to give the user. Instead of attempting to solve and to present a user interface in terms of an NP-complete problem, the go command will use a much more limited algorithm that decides versions without any general search or backtracking, making use of the semantics of semantic versioning. In particular, it will assume that v1.2.3 can always be substituted for earlier v1 versions, such as v1.1.4 and v1.2.2, always choosing the latest version within a major version family (such as v1); and it will allow building a project in which v1 and v2 of a package can both be linked into the same binary and are treated as separate packages.

The default algorithm matches the current go command: start with a specific version of a target package, for each of its dependencies use the latest tagged version of that package (or in the absence of any version information use the latest commit, as today), and then resolve imports in those packages, recursing until the entire import graph has been resolved. It is expected that, as today, the Go community will keep changes to packages mainly backwards compatible, so that no explicit version selection will be needed in many or even most packages: the default algorithm will continue to be good enough.

The default algorithm can be overridden by a manifest file, described more below. The manifest file can specify the following directives:

In this build, import “x.com/y/z” means v1.2.3+ (but before v2).
In this build, import “x.com/y/z” means v1.2.3+ or v2.1+ (multiple major versions ok).
In this build, “x.com/y/z” v1.2.3 comes from a named local directory.
In this build, when building “f.io/g/h”, <insert any of the previous directives>
Exclude “x.com/y/z” v1.2.4 from consideration in this build (it’s known to be buggy).

That is, the manifest can specify a minimum version of an import, it can specify that a particular import must be loaded from a local directory (to allow experimenting with modification to that import), and it can specify directives limited to a subset of the overall build. (For now we are ignoring the exact syntax of these directives and just using prose descriptions.)

The default algorithm’s choices are refined by the directives in the manifest. Where before the algorithm would use the latest version of a given import, here it uses the latest version allowed by the manifest (that is, the latest version with an acceptable major version number), making sure that the version is late enough (is at least as great as the one required).

If target t depends through some chain of imports on package u, and package u has a manifest, then u’s manifest applies only to the packages built as dependencies of u. For example, t’s manifest may specify that import “x.com/y/z” means v1.2.3+ while u’s may specify that the import means v2.1+. If so, then the resulting binary will have two copies of x.com/y/z, a v1 for use by most of the program and a v2 for use by u and its dependencies: by default, inner manifests override outer ones for better modularity. If u’s manifest said instead that either v1.3.4+ or v2.1+ were acceptable, however, then u’s build would use the v1 (at least v1.3.4) used by the rest of t’s build.

It is possible for manifests in a given build to conflict. If t and u both depend directly on w v1.2 (the latest version of w) and w v1.2 imports x.com/y/z, the two manifests conflict: one says to compile w using x.com/y/z v1.2.3 while the other says to compile w using x.com/y/z v2.1. In this case an error explaining the situation will be reported. To fix this, an outer manifest can override an inner one provided it does so explicitly. In this example, the outer manifest would need to say

In this build, when building “u”, “x.com/y/z” means v1.2.3+

and then u’s manifest’s request for v2.1+ would be ignored.

The exclusion in the final directive form is a global exclusion: it says that x.com/y/z v1.2.4 is known to be buggy and simply must never be chosen. Version resolution proceeds as if v1.2.4 does not exist at all, under any circumstances. This differs from satisfiability-based approaches, which would declare v1.2.4 usable with some versions of the importing package but not others, leading to the complex (and possibly incompletely specified) searches that become NP-complete. In contrast, the approach described here takes the point of view that “buggy is buggy.” If v1.2.4 has a bug, it should be ignored entirely, not used under some circumstances but not others.
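To make the per-import choice concrete, here is a minimal sketch of selecting a version for one import given a single minimum-version directive and a set of global exclusions. It needs no search or backtracking: it is one pass over the available versions. The names (Version, Select) are hypothetical, and the sketch deliberately leaves out the multiple-major-version directive, local directories, and nested manifests.

```go
package main

import "fmt"

// Version is a fully-qualified semantic version.
type Version struct{ Major, Minor, Patch int }

func (v Version) String() string {
	return fmt.Sprintf("v%d.%d.%d", v.Major, v.Minor, v.Patch)
}

// Less reports whether v is an earlier version than w.
func (v Version) Less(w Version) bool {
	if v.Major != w.Major {
		return v.Major < w.Major
	}
	if v.Minor != w.Minor {
		return v.Minor < w.Minor
	}
	return v.Patch < w.Patch
}

// Select returns the latest available version that has the same major
// version as min, is at least min, and is not excluded. Excluded
// versions are treated as if they did not exist at all.
func Select(available []Version, min Version, excluded map[Version]bool) (Version, bool) {
	var best Version
	found := false
	for _, v := range available {
		if excluded[v] || v.Major != min.Major || v.Less(min) {
			continue
		}
		if !found || best.Less(v) {
			best, found = v, true
		}
	}
	return best, found
}

func main() {
	available := []Version{{1, 2, 2}, {1, 2, 3}, {1, 2, 4}, {2, 1, 0}}
	min := Version{1, 2, 3} // "x.com/y/z means v1.2.3+ (but before v2)"

	v, ok := Select(available, min, nil)
	fmt.Println(v, ok)

	// Exclude v1.2.4 from consideration: it is known to be buggy.
	v, ok = Select(available, min, map[Version]bool{{1, 2, 4}: true})
	fmt.Println(v, ok)
}
```

With no exclusions the sketch picks v1.2.4; once v1.2.4 is excluded it falls back to v1.2.3, and v2.1.0 is never considered because the directive allows only v1.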

The resolution algorithm also reads the top-level lock file as a hint about what versions to use throughout the program build. The exact version resolutions listed in the lock file will be used as long as they do not conflict with the manifest. If they do conflict with the manifest, the go command re-resolves everything - under the assumption that the manifest has itself been updated - and writes a new lock file. If no lock file exists, a build will write a lock file for package main but not for non-main packages, following experience in Rust that lock files make sense for commands (for reproducible builds) but not for libraries.

Manifest format

The manifest is a file in the directory of a package being built (not necessarily the repository root), named Go.deps.

Users are expected to edit the manifest using their favorite text editor, so the format must be easy to edit. This requirement excludes JSON, which is too finicky about comma placement and has no syntax for comments, and also XML. It also suggests having a fairly line-oriented syntax, so that diffs make sense.

Tools that need version information are not expected to read the manifest directly; instead they are expected to invoke the go command to get a build description (see below). So the format need not be easy to parse with multiple programs. That means there is not a compelling reason to shoehorn the format into TOML. In particular, TOML is unfortunate because an import path like rsc.io/md2html is not a valid TOML key, so all import paths would need to be quoted, and all values must also be quoted unconditionally, which makes editing more finicky than it really needs to be.

At the same time, we want the manifest format to be simple to explain. This excludes YAML and its 84-page printed spec, at least in full generality. (It’s also unclear to me, despite skimming through that spec, whether an unquoted path like rsc.io/md2html is a valid YAML key.)

It’s possible that a subset of YAML would be okay, as long as we can use unquoted paths as keys. The structure of TOML/YAML is nice compared to git’s ini-based .gitconfig.

More investigation is needed here.

Lock file format

The lock file is a file in the directory of a package being built, named Go.depslock.

Users are not expected to edit the lock file, ever, but lock files for main packages are expected to be checked in to version control, so it is important that diffs of lock files make sense. As a result the lock file should be line-based and should avoid redundant information (for example, if the whole build uses x.com/y/z v1.2.3, that fact need not be repeated for every different package that imports x.com/y/z).

Because the lock file is for reproducible builds, it may make sense for the lock file to give a hash of the source code for a given version, whether that’s a version control revision or a separately-computed hash of all the source files at that revision.

It might be nice if the lock file could have the same format as the manifest file, just longer and more detailed, provided the above requirements can still be satisfied.

More investigation is needed here as well.

Build descriptions

As mentioned above, tools that need to know what version of code is being built (for example, guru) are not expected to read the manifest or the lock file. Instead they are expected to ask the go command for a “build description” for a given target, perhaps using a new option to “go list”. That description will explain, for every package in the transitive closure of the import graph of the target package, what files comprise that package (and where they are) and what each import path in that package resolves to.

We hope that this kind of build description will also be supported by other build tools, specifically Google’s Bazel, so that tools like guru can work in both worlds. (NOTE: See also related discussion at https://github.com/rust-lang/cargo/issues/3815.)

Alan Donovan and Michael Matloob are thinking about exactly what the build descriptions should look like and how to integrate them into existing tools like guru.

Relationship and compatibility with dep

The most recent work on Go package management is dep, built primarily by Sam Boyer with input from the package management working group. The integration into the go command follows many of the ideas embraced and prototyped in dep, such as applying semantic versions and a manifest and a lock file (to be clear, dep is itself benefiting from experience with these in other systems such as Rust’s cargo and its predecessors). But dep has different design constraints than the package management integration described here, and so the details are necessarily different.

Because dep was designed to work with an unmodified go toolchain, dep cannot change any of the existing go commands. For example, the design of go command integration above updates “go get” to be version-aware, and it changes commands like “go build,” “go test,” and even “go list” to resolve package versions as part of preparing to do their work. In contrast, dep cannot change these commands, so it must introduce an alternate command to fetch code (dep init or dep ensure), and it must introduce a new command to re-resolve package versions (dep init) before a build. By changing the go command itself, the package management integration described here aims to avoid requiring users to learn these kinds of new steps.

Dep must also work in terms of creating vendor subtrees, so its semantics are limited to what is possible using vendor subtrees. For example, the design of go command integration above allows an import of a given import path to resolve to v1 in part of the build and v2 in another part of the build, so that both v1 and v2 are included in the final binary. We believe that this situation is bound to arise in large programs and must be possible to support. In contrast, dep itself cannot support having v1 and v2 of a package in a single build, because that situation cannot be expressed in terms of a vendor directory, and ultimately dep must compile its view of the world into a vendor subtree in order to invoke an unmodified go command. By changing the go command itself, the package management integration described here aims to lift this restriction.

Although the two are necessarily different in detail, it is an explicit goal to allow library authors to publish packages compatible with both dep and the go command integration and to allow developers to continue to use dep during a transition period, with the eventual goal that everyone is using the go command integration. The basic approach is that dep and the go command integration either agree or coexist on each detail of the data they consume. In particular, they agree about the format of version tags (v1.2.3 as a repository tag pointing at a specific commit), and they coexist as far as manifest and lock files, by using different file names. It is possible to publish a library that works with both systems by publishing pairs of manifest and lock files. We expect that a tool will be written to convert between the two formats to aid in this transition.

Effect on the golang.org/x repositories

The golang.org/x repositories should start tagging versions and publishing manifests.

Effect on the standard library

The effect of package management on the standard library is unclear. We may want to provide an automated mechanism for accessing versioned pieces of the standard library, so that a program could opt in to net/http from Go 1.10 without updating the rest of the system. That’s future work and out of scope for now.

Appendix: Cached builds

The feasibility of the system described above depends on updating the go command’s support for caching build artifacts.

The current go command build system assumes a direct correspondence between an import path and a directory containing the source code for that import path, and then a similarly direct correspondence between the source code directory name and the canonical file where the compiled form of that source code is kept (“installed”). Using a combination of modification times and partial content information, the go command detects when the compiled form needs to be rebuilt. For example, import “net/http” expects source code in $GOROOT/src/net/http and a compiled package in $GOROOT/pkg/$GOOS_$GOARCH/net/http.a.

This system is already straining: the $GOOS_$GOARCH in the compiled package is meant to allow cross-compilation, and there is also $GOOS_$GOARCH_race for code built for the race detector, and $GOOS_$GOARCH_msan for code built for msan, but there is no distinction for sub-architecture settings like GOARM and GO386, nor are additional compiler flags recorded, leading to either missed or unnecessary rebuilds in the event of install collisions. Vendoring put additional strain on the system: import paths are now first translated into “fully-qualified import paths”, so that import “net/http” might be read as import “x.com/y/z/vendor/net/http” in a given source file, and then the longer path is used for the source and compiled package lookups.

The model of the build here is essentially the same as that of Unix’s make, and in the end it makes simplifying assumptions that are no longer reasonable. In particular, it assumes that there’s only one way to build a given set of source code, or at least that the number of ways can be enumerated. This simply isn’t true anymore in modern development practices.

The go command needs to move to a more general build artifact cache, like in Google’s Bazel, that can keep track of the result of compiling a given set of source code in a variety of ways: with different compiler flags, for different architecture targets, and – to enable support for package management – with different versions of its dependencies as part of different builds. In addition to enabling package management, this more precise caching will fix at least a dozen or so open issues in the go command.
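The core idea of such a cache is that the key is derived from everything that affects the output, rather than from a file name. The sketch below is an assumption-laden illustration of that idea only, not the prototype mentioned in this appendix; the Action type, its fields, and the use of SHA-256 are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Action describes one compilation whose result could be cached.
type Action struct {
	ImportPath string
	GOOS       string
	GOARCH     string
	Flags      []string // compiler flags, sub-architecture settings, and so on
	SourceHash string   // digest of the package's own source files
	DepKeys    []string // cache keys of the compiled dependencies
}

// Key returns a cache key covering every input of the action, so that
// two builds share a cached result only if all inputs match.
func (a Action) Key() string {
	h := sha256.New()
	// %q quotes each field, so adjacent fields cannot run together.
	fmt.Fprintf(h, "pkg %q\n", a.ImportPath)
	fmt.Fprintf(h, "target %q %q\n", a.GOOS, a.GOARCH)
	fmt.Fprintf(h, "flags %q\n", a.Flags)
	fmt.Fprintf(h, "source %q\n", a.SourceHash)
	fmt.Fprintf(h, "deps %q\n", a.DepKeys)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	base := Action{
		ImportPath: "f.io/g/h",
		GOOS:       "linux",
		GOARCH:     "arm",
		Flags:      []string{"GOARM=7"},
		SourceHash: "hash-of-h-sources",
		DepKeys:    []string{"key-of-x.com/y/z-v1.2.3"},
	}
	// Same source, but built against a different version of a dependency.
	other := base
	other.DepKeys = []string{"key-of-x.com/y/z-v2.1.0"}

	fmt.Println(base.Key())
	fmt.Println(other.Key())
}
```

Because dependency keys feed into the key of the importing package, the same source compiled against v1 and v2 of a dependency yields two distinct cache entries, which is what building with multiple versions in one workspace requires.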

The details of build caching are out of scope for this doc, but I’ve prototyped it and know that it can be made to work. There will be a separate doc to examine the details of build caching, and it will likely take multiple releases to transition fully to builds based on such a cache. For now just assume that it exists and works.