Names

Every programmer has a variable naming philosophy. This is mine:

A name's length should not exceed its information content. For a local variable, the name i conveys as much information as index or idx and is quicker to read. Similarly, i and j are a better pair of names for index variables than i1 and i2 (or, worse, index1 and index2), because they are easier to tell apart when skimming the program. Global names must convey relatively more information, because they appear in a larger variety of contexts. Even so, a short, precise name can say more than a long-winded one: compare acquire and take_ownership. Make every name tell.

The information content metric gives a quantitative argument against long-winded names: they're simply inefficient. I internalized this metric years ago but only realized this phrasing of it recently, perhaps because I have been looking at too much Java code.

Go's Package Name Space

Go organizes programs into individual pieces called packages. A package gets to pick a short name for itself, like vector, even if the import statement must use a longer path like "container/vector" to name the file where the compiled code is installed. The early Go compilers used the package name as a unique identifier during linking, so that vector's New function could be distinguished from list's New. In the final binary, one was vector.New and the other list.New. As we started to fill out the standard library, it became clear that we needed to do something about managing the package name space: if multiple packages tried to be package vector, their symbols would collide in the linker. For a while we considered segmenting the name space, reserving lower-case names for standard packages and upper-case names for local packages. (Since package names and object file names are conventionally the same, one reason not to do this is that it would require a case-sensitive file system.)

Other languages simply use longer names. Both Java and Python tie the name to the directory in which the package is found, as in com.java.google.WebServer for the code in com/java/google/WebServer.class. In practice this leads to unnecessarily long identifiers, something Go tries to avoid. It also ties the name to a particular mechanism for finding code: a file system. One of the reasons that import paths are string constants in Go is so that it is easy to substitute other notations, like URLs.

Last spring, during a long discussion about how to divide up the package name space, Robert Griesemer cut the Gordian knot by suggesting that we allow multiple packages to choose a single name and fix the tool chain to cope. The import statement already allows introducing a local alias for the package during the import, so there's no linguistic reason package names have to be unique. We all agreed that this was the right approach, but we weren't sure how to implement it. Other considerations, like the open source release, took priority during most of 2009, but we recently returned to the problem.

Ultimately, the linker needs some unique name for each symbol in the program; the fundamental problem caused by deciding that package names won't be unique is to find another source of uniqueness that fits into the tool chain well.

The best approach* seems to be to use the package's import path as the unique identifier, since it must uniquely identify the package in the import statement already. Then container/vector's New is container/vector.New. But! When you're compiling a package, how does the compiler know what the package's import path will be? The package statement just says vector, and while every compilation that imports "container/vector" knows the import path, the compilation of vector itself does not, because compilation is handled separately from installing the binary in its final, importable location.

Last week I changed the gc compiler suite to do this. My solution to the import path question was to introduce a special name syntax that refers to “this package's import path.” Because the import paths are string literals in the Go compiler metadata, I chose the empty string—""—as the self-reference name. Thus, in the object file for package vector, the local symbol New is written "".New. When the linker reads the object file, it knows what import path it used to find the file. It substitutes that path for the "", producing, in this case, the unique name container/vector.New.

Not embedding a package's final installed location in its object file makes the object files easy to move and duplicate. For example, consider this trivial package:

package seq
var n int
func Next() int {
    n++
    return n
}

It's valid for a Go program to import the same path multiple times using different local names, but all the names end up referring to the same package:

package main

import (
    "fmt"
    s "seq" // changed to "seq1" later
    t "seq"
)

func main() {
    fmt.Println(s.Next(), s.Next(), t.Next(), t.Next())
}

prints 1 2 3 4, because it all four calls are to the same Next function:

$ 6g seq.go
$ 6g -I. main.go
$ 6l -L. main.6
$ 6.out
1 2 3 4
$ 

But if we change one of the imports to say "seq1" and then merely copy the "seq" binary to "seq1", we've created a distinct package, using lowly cp instead of a compiler:

$ cp seq.6 seq1.6
$ ed main.go
120
/seq
 s "seq"
s/seq/seq1
 s "seq1"
wq
121
$ 6g -I. main.go
$ 6l -L. main.6
$ 6.out
1 2 1 2
$ 

Now the s.Next calls refer to seq1.6's Next, while the t.Next calls refer to seq.6's Next. Duplicating the object actually duplicated the code. This is very different from the behavior of a traditional C compiler and linker.

A digression: the explicit "". prefix is not strictly necessary. It would be cleaner if the linker treated every symbol as needing to be qualified by the import path, so that all the "". could be dropped. But occasionally it's important to be able to break the rules, for example to define a symbol that is logically in one package be implemented in another. For example, the implementation of unsafe.Reflect is actually in the binary for package runtime, because that's where all the interface manipulation code lives:

$ 6nm pkg/darwin_amd64/runtime.a|grep Reflect
iface.6: T unsafe.Reflect
$

Another reason to use an explicit prefix is to admit names with no prefix at all, as would be generated by legacy C code. Otherwise, what should C's printf be in? If the linker enforced a strict boundary between packages, both of these examples would be impossible. Most of the time that would be a good thing, but systems languages do not have the luxury of stopping at “most of the time.” Last October, a few weeks before the public release of Go, I changed the linker to insert import path qualifiers on all names during linking, but it was too disruptive a change to commit before the release. Last week's implementation, which allows for semipermeable package boundaries, is a much better fit for Go.

This week Ian Lance Taylor is working on eliminating the global package name space assumption in gccgo. He'd like to avoid making changes to the linker, which rules out introducing a “this package” notation like "". Gccgo must be able to write objects that know their own import paths, which means gccgo must know the import path at compile time. But how? There will be a new gccgo command line option, and the build system will simply tell the compiler what the import path is.

In retrospect, I wonder if the effort of "" in the gc tool chain was justified compared to adding an option. The gc implementation is easier to use, but it's not clear how important that will be. Time will tell.


* An alternative approach would be to generate a random identifier each time the compiler is invoked and to use it for the package compiled by that run. When other packages import the compiled package, they can read the identifier and use it to generate references to that package's symbols. The most glaring problem with this approach is that the symbol names you'd see while debugging would be ugly, like mangled C++ names but worse. Another problem is that it would break aggressive incremental compilation: if fmt gets recompiled, all packages that import it would have to be recompiled to pick up the new identifier, even if the external interface hadn't changed. It would be nice to avoid those recompilations, especially in large programs.

Generating Good Syntax Errors

One of the great religious debates in compiler writing is whether you should use parser generators like yacc and its many descendants or write parsers by hand, usually using recursive descent. On the one hand, using parser generators means you have a precise definition of the language that you are parsing, and a program does most of the grunt work for you. On the other hand, the proponents of hand-written recursive descent parsers argue that parser generators are overkill, that parsers are easy enough to write by hand, and that the result is easier to understand, more efficient, and can give better error messages when presented with syntactically invalid programs.

Like in most religious debates, the choice of side seems to be determined by familiarity more than anything else. I knew how to use yacc before I knew how to write a parser by hand, which put me solidly on the parser generator side of the fence. Now that I do know how to apply both techniques, I'd still rather use a parser generator. In fact, I've worked on projects where I've written parser generators rather than write a parser by hand. Good notation counts for a lot, as we shall see.

In Coders at Work, Ken Thompson talks to Peter Seibel about yacc and lex:

Seibel: And are there development tools that just make you happy to program?

Thompson: I love yacc. I just love yacc. It just does exactly what you want done. Its complement, lex, is horrible. It does nothing you want done.

Seibel: Do you use it anyway or do you write your lexers by hand?

Thompson: I write my lexers by hand. Much easier.

Many of the objections raised by the hand-written parser camp are similar to Thompson's objection to lex—the tool doesn't do what you want—but there's no fundamental reason a tool can't. For example, the spurious conflicts that an LALR(1) algorithm produces are definitely hard to understand, but if you replace it with LR(1) or GLR(1), you get a more useful tool. And if you do learn how to work within the LALR(1) algorithm, even yacc is very powerful.

The objection to parser generates that seems to resonate most is that generators like yacc produce inadequate error messages, little more than “syntax error.” Better error messages were one of the key benefits hoped for when g++ converted from a yacc-based C++ parser to a hand-written one (and to be fair, C++ syntax is nearly impossible to parse with any tool; the many special cases cry out for hand-written code). Here the objection seems harder to work around: the parser internally gets compiled into an automaton—usually a big table of numbers—that moves from state to state as it processes input tokens. If at some point it can't find a next state to go to, it reports an error. How could you possibly turn that into a good message?

Clinton Jeffery showed how in a paper published in ACM TOPLAS in 2003 titled “Generating LR Syntax Error Messages from Examples.” As you can guess from the title, the idea is to introduce a training stage after the parser is generated but before it is used for real. In that stage, you feed examples of syntax errors into the parser and look at what state it ends up in when it detects the error, and maybe also what the next input token is. If the automaton hits an error in that state (and with that input token) during real use, you can issue the message appropriate to the example. The important part is that the parser states are basically an abstract summary of the surrounding context, so it's reasonable to treat errors in a particular state with a single message. For example, suppose you find that this Go program

package main

import (
 "fmt",
 "math"
)

causes a syntax error at the comma, in state 76. In the parser, that state means that you're in the middle of an import block and expecting to see a string constant. The state abstracts that context, so you'd end up in the same state if you were parsing this erroneous program:

package fmt

import ( "strings"; "testing", "xgb" )

Having told the parser that the appropriate error message for the first program is “unexpected , during import block,” the parser will use the same message for the second program.

It's an elegant idea, and the implementation can be kept on the side, without adding any complexity to the grammar itself. I heard about this idea through the grapevine years ago (in 2000, I think) but had never gotten a chance to try it until this week.

There are three Go parsers: Ian Lance Taylor wrote a hand-written recursive descent parser in C++ for gccgo, Robert Griesemer wrote a hand-written recursive descent parser in Go (import "go/parser") for godoc and gofmt, and Ken Thompson wrote a yacc-based parser in C for the gc compiler suite.

On Monday night, I implemented Jeffery's idea in the gc compiler suite. The gc compilers use bison, the GNU implementation of yacc. Bison doesn't support this kind of error management natively, but it is not hard to adapt. Changing bison would require distributing a new version of bison; instead, my implementation post-processes bison's output.

The examples are in a new file go.errors. Because the lexer is written by hand, it's not available in the simulation, so the examples are token name sequences rather than actual program fragments. In token lists, the program above and its corresponding error message are:

% loadsys package LIMPORT '(' LLITERAL import_package import_there ','
"unexpected , during import block",

Understanding the tokens on the first line requires knowing what the various grammar token names mean, but they basically mimic the syntax, and this tool is targeted at the people working on the grammar anyway. An awk program loads bison's tables from its debugging dump y.output and then processes the go.errors file, replacing each line like the above with the number of the offending state and input token. That produces a table that can be linked into the compiler and consulted when a syntax error is encountered. If the state and input token match an entry in the table, the better error is used instead of a plain syntax error.

That was an outsized amount of work to close one bug, but now it's easy to add better messages for other common situations. For example, the fact that only the short := declaration form is allowed in for initializers sometimes trips up new Go programmers. When presented with this program:

package main

import "fmt"

func main() {
 fmt.Printf("hello, world\n")
 for var i = 0; i < 10; i++ {
  fmt.Println(i)
 }
}

the compiler used to just say there was a syntax error on line 7, but now it explains:

$ 6g x.go
x.go:7: syntax error: var declaration not allowed in for initializer

because of this stanza in go.errors:

% loadsys package imports LFUNC LNAME '(' ')' '{' LFOR LVAR LNAME '=' LNAME
"var declaration not allowed in for initializer",

Gccgo and gofmt give more traditional token-based errors:

$ gccgo x.go
x.go:7:6: error: expected operand
...

$ gofmt x.go
x.go:7:6: expected operand, found 'var'

What's interesting isn't that neither has bothered to handle this as a special case yet. What's interesting is to consider the work required to do so. Changing either would require understanding the corresponding hand-written parser well enough to find the right place to put the special case. With the example-based approach, you just write an example, and the tool figures out where in the parser it needs to go.

RSS | Atom

mailing list

About


Blogger Template derived from one by Blogcrowds