research!rsc

Thoughts and links about programming, by

RSS

Go's Package Name Space
Posted on Tuesday, February 2, 2010.

Go organizes programs into individual pieces called packages. A package gets to pick a short name for itself, like vector, even if the import statement must use a longer path like "container/vector" to name the file where the compiled code is installed. The early Go compilers used the package name as a unique identifier during linking, so that vector's New function could be distinguished from list's New. In the final binary, one was vector.New and the other list.New. As we started to fill out the standard library, it became clear that we needed to do something about managing the package name space: if multiple packages tried to be package vector, their symbols would collide in the linker. For a while we considered segmenting the name space, reserving lower-case names for standard packages and upper-case names for local packages. (Since package names and object file names are conventionally the same, one reason not to do this is that it would require a case-sensitive file system.)

Other languages simply use longer names. Both Java and Python tie the name to the directory in which the package is found, as in com.java.google.WebServer for the code in com/java/google/WebServer.class. In practice this leads to unnecessarily long identifiers, something Go tries to avoid. It also ties the name to a particular mechanism for finding code: a file system. One of the reasons that import paths are string constants in Go is so that it is easy to substitute other notations, like URLs.

Last spring, during a long discussion about how to divide up the package name space, Robert Griesemer cut the Gordian knot by suggesting that we allow multiple packages to choose a single name and fix the tool chain to cope. The import statement already allows introducing a local alias for the package during the import, so there's no linguistic reason package names have to be unique. We all agreed that this was the right approach, but we weren't sure how to implement it. Other considerations, like the open source release, took priority during most of 2009, but we recently returned to the problem.

Ultimately, the linker needs some unique name for each symbol in the program; the fundamental problem caused by deciding that package names won't be unique is to find another source of uniqueness that fits into the tool chain well.

The best approach* seems to be to use the package's import path as the unique identifier, since it must uniquely identify the package in the import statement already. Then container/vector's New is container/vector.New. But! When you're compiling a package, how does the compiler know what the package's import path will be? The package statement just says vector, and while every compilation that imports "container/vector" knows the import path, the compilation of vector itself does not, because compilation is handled separately from installing the binary in its final, importable location.

Last week I changed the gc compiler suite to do this. My solution to the import path question was to introduce a special name syntax that refers to “this package's import path.” Because the import paths are string literals in the Go compiler metadata, I chose the empty string—""—as the self-reference name. Thus, in the object file for package vector, the local symbol New is written "".New. When the linker reads the object file, it knows what import path it used to find the file. It substitutes that path for the "", producing, in this case, the unique name container/vector.New.

Not embedding a package's final installed location in its object file makes the object files easy to move and duplicate. For example, consider this trivial package:

package seq
var n int
func Next() int {
    n++
    return n
}

It's valid for a Go program to import the same path multiple times using different local names, but all the names end up referring to the same package:

package main

import (
    "fmt"
    s "seq" // changed to "seq1" later
    t "seq"
)

func main() {
    fmt.Println(s.Next(), s.Next(), t.Next(), t.Next())
}

prints 1 2 3 4, because it all four calls are to the same Next function:

$ 6g seq.go
$ 6g -I. main.go
$ 6l -L. main.6
$ 6.out
1 2 3 4
$ 

But if we change one of the imports to say "seq1" and then merely copy the "seq" binary to "seq1", we've created a distinct package, using lowly cp instead of a compiler:

$ cp seq.6 seq1.6
$ ed main.go
120
/seq
 s "seq"
s/seq/seq1
 s "seq1"
wq
121
$ 6g -I. main.go
$ 6l -L. main.6
$ 6.out
1 2 1 2
$ 

Now the s.Next calls refer to seq1.6's Next, while the t.Next calls refer to seq.6's Next. Duplicating the object actually duplicated the code. This is very different from the behavior of a traditional C compiler and linker.

A digression: the explicit "". prefix is not strictly necessary. It would be cleaner if the linker treated every symbol as needing to be qualified by the import path, so that all the "". could be dropped. But occasionally it's important to be able to break the rules, for example to define a symbol that is logically in one package be implemented in another. For example, the implementation of unsafe.Reflect is actually in the binary for package runtime, because that's where all the interface manipulation code lives:

$ 6nm pkg/darwin_amd64/runtime.a|grep Reflect
iface.6: T unsafe.Reflect
$

Another reason to use an explicit prefix is to admit names with no prefix at all, as would be generated by legacy C code. Otherwise, what should C's printf be in? If the linker enforced a strict boundary between packages, both of these examples would be impossible. Most of the time that would be a good thing, but systems languages do not have the luxury of stopping at “most of the time.” Last October, a few weeks before the public release of Go, I changed the linker to insert import path qualifiers on all names during linking, but it was too disruptive a change to commit before the release. Last week's implementation, which allows for semipermeable package boundaries, is a much better fit for Go.

This week Ian Lance Taylor is working on eliminating the global package name space assumption in gccgo. He'd like to avoid making changes to the linker, which rules out introducing a “this package” notation like "". Gccgo must be able to write objects that know their own import paths, which means gccgo must know the import path at compile time. But how? There will be a new gccgo command line option, and the build system will simply tell the compiler what the import path is.

In retrospect, I wonder if the effort of "" in the gc tool chain was justified compared to adding an option. The gc implementation is easier to use, but it's not clear how important that will be. Time will tell.


* An alternative approach would be to generate a random identifier each time the compiler is invoked and to use it for the package compiled by that run. When other packages import the compiled package, they can read the identifier and use it to generate references to that package's symbols. The most glaring problem with this approach is that the symbol names you'd see while debugging would be ugly, like mangled C++ names but worse. Another problem is that it would break aggressive incremental compilation: if fmt gets recompiled, all packages that import it would have to be recompiled to pick up the new identifier, even if the external interface hadn't changed. It would be nice to avoid those recompilations, especially in large programs.

(Comments originally posted via Blogger.)

  • Brian Slesinsky (February 3, 2010 9:04 AM) Thanks for the clear explanation! However, I have to make a few corrections about Java: it isn't true that Java packages tie a name to "a particular mechanism for finding code: a file system."

    Java classes are often loaded from jar files, which in the case of applets are loaded over the network, so there's no reason why the machine running a JVM need have a traditional filesystem at all. It can cache jars and .class files using any convenient mechanism. In addition, Android and GWT take Java source as input but don't generate .class files at all.

    I think this shows the flexibility of the Java naming scheme. Java transparently supports loading code over the network without changing source code (in particular, import statements). It would be a mistake to put network identifiers into import statements because it would require editing source code in order to change where source code or object code are located. Instead, the strategy for finding source code is left to the build system where it can be conveniently overridden using developer-specific flags.

    Also, while there's a strong convention for locating Java source files in a particular directory hierarchy, this isn't strictly necessary. I wrote a Java source code indexer that scans a large directory tree for cross-references in Java source files, and it works find regardless of where the source files are located. Cross-references can be found simply by looking at the contents of Java source files for package names and import statements. This isn't true in the scheme you describe here, which seems unfortunate.