Showing posts with label bit twiddling. Show all posts
Showing posts with label bit twiddling. Show all posts

[ next | prev | up ]

Using Uninitialized Memory for Fun and Profit

This is the story of a clever trick that's been around for at least 35 years, in which array values can be left uninitialized and then read during normal operations, yet the code behaves correctly no matter what garbage is sitting in the array. Like the best programming tricks, this one is the right tool for the job in certain situations. The sleaziness of uninitialized data access is offset by performance improvements: some important operations change from linear to constant time.

Alfred Aho, John Hopcroft, and Jeffrey Ullman's 1974 book The Design and Analysis of Computer Algorithms hints at the trick in an exercise (Chapter 2, exercise 2.12):

Develop a technique to initialize an entry of a matrix to zero the first time it is accessed, thereby eliminating the O(||V||2) time to initialize an adjacency matrix.

Jon Bentley's 1986 book Programming Pearls expands on the exercise (Column 1, exercise 8; exercise 9 in the Second Edition):

One problem with trading more space for less time is that initializing the space can itself take a great deal of time. Show how to circumvent this problem by designing a technique to initialize an entry of a vector to zero the first time it is accessed. Your scheme should use constant time for initialization and each vector access; you may use extra space proportional to the size of the vector. Because this method reduces initialization time by using even more space, it should be considered only when space is cheap, time is dear, and the vector is sparse.

Aho, Hopcroft, and Ullman's exercise talks about a matrix and Bentley's exercise talks about a vector, but for now let's consider just a simple set of integers.

One popular representation of a set of n integers ranging from 0 to m is a bit vector, with 1 bits at the positions corresponding to the integers in the set. Adding a new integer to the set, removing an integer from the set, and checking whether a particular integer is in the set are all very fast constant-time operations (just a few bit operations each). Unfortunately, two important operations are slow: iterating over all the elements in the set takes time O(m), as does clearing the set. If the common case is that m is much larger than n (that is, the set is only sparsely populated) and iterating or clearing the set happens frequently, then it could be better to use a representation that makes those operations more efficient. That's where the trick comes in.

Preston Briggs and Linda Torczon's 1993 paper, “An Efficient Representation for Sparse Sets,” describes the trick in detail. Their solution represents the sparse set using an integer array named dense and an integer n that counts the number of elements in dense. The dense array is simply a packed list of the elements in the set, stored in order of insertion. If the set contains the elements 5, 1, and 4, then n = 3 and dense[0] = 5, dense[1] = 1, dense[2] = 4:

Together n and dense are enough information to reconstruct the set, but this representation is not very fast. To make it fast, Briggs and Torczon add a second array named sparse which maps integers to their indices in dense. Continuing the example, sparse[5] = 0, sparse[1] = 1, sparse[4] = 2. Essentially, the set is a pair of arrays that point at each other:

Adding a member to the set requires updating both of these arrays:

add-member(i):
    dense[n] = i
    sparse[i] = n
    n++

It's not as efficient as flipping a bit in a bit vector, but it's still very fast and constant time.

To check whether i is in the set, you verify that the two arrays point at each other for that element:

is-member(i):
    return sparse[i] < n && dense[sparse[i]] == i

If i is not in the set, then it doesn't matter what sparse[i] is set to: either sparse[i] will be bigger than n or it will point at a value in dense that doesn't point back at it. Either way, we're not fooled. For example, suppose sparse actually looks like:

Is-member knows to ignore members of sparse that point past n or that point at cells in dense that don't point back, ignoring the grayed out entries:

Notice what just happened: sparse can have any arbitrary values in the positions for integers not in the set, those values actually get used during membership tests, and yet the membership test behaves correctly! (This would drive valgrind nuts.)

Clearing the set can be done in constant time:

clear-set():
    n = 0

Zeroing n effectively clears dense (the code only ever accesses entries in dense with indices less than n), and sparse can be uninitialized, so there's no need to clear out the old values.

This sparse set representation has one more trick up its sleeve: the dense array allows an efficient implementation of set iteration.

iterate():
    for(i=0; i<n; i++)
        yield dense[i]

Let's compare the run times of a bit vector implementation against the sparse set:

Operation Bit Vector Sparse set
is-member O(1) O(1)
add-member O(1) O(1)
clear-set O(m) O(1)
iterate O(m) O(n)

The sparse set is as fast or faster than bit vectors for every operation. The only problem is the space cost: two words replace each bit. Still, there are times when the speed differences are enough to balance the added memory cost. Briggs and Torczon point out that liveness sets used during register allocation inside a compiler are usually small and are cleared very frequently, making sparse sets the representation of choice.

Another situation where sparse sets are the better choice is work queue-based graph traversal algorithms. Iteration over sparse sets visits elements in the order they were inserted (above, 5, 1, 4), so that new entries inserted during the iteration will be visited later in the same iteration. In contrast, iteration over bit vectors visits elements in integer order (1, 4, 5), so that new elements inserted during traversal might be missed, requiring repeated iterations.

Returning to the original exercises, it is trivial to change the set into a vector (or matrix) by making dense an array of index-value pairs instead of just indices. Alternately, one might add the value to the sparse array or to a new array. The relative space overhead isn't as bad if you would have been storing values anyway.

Briggs and Torczon's paper implements additional set operations and examines performance speedups from using sparse sets inside a real compiler.

Rotating Hashes

Long ago, before hash tables (aka scatter storage files) were a fundamental data type like int, programmers had to implement them by hand. (Nowadays, manually-implemented hash tables are rarely seen outside of historical reenactments like undergraduate algorithms classes and C programs.) Back then, there was fierce debate about how to handle hash collisions.

One option, called direct chaining, makes each table slot a linked list head, and then all the entries that hash to that slot are kept in the linked list pointed to by the slot. (Bonus points if you reorder the lists using the move to front heuristic.)

Another option, called linear scanning or open addressing, is to try table entry h, and then if that is full, try h+1, h+2, and so on until an empty slot is found. This only works well if the table is sized so that it has some large fraction (say, 1/4) of unused slots.

Direct chaining and linear scanning are still the most common collision resolution methods, but in the 1960s, more were explored. Still another option is to use non-linear scanning, so that each data item produces a sequence of hash values h1, h2, h3, and those table slots are tried in turn. Two objects with the same h1 won't necessarily have the same h2, unlike in linear scanning. But hashing is a relatively slow operation. Is it worth the time and code to compute all those extra hash functions?

This sets the stage for Doug McIlroy's 1963 letter to the Communications of the ACM Pracniques page, describing a trick by Vic Vyssotsky:

A VARIANT METHOD OF FILE SEARCHING

I would like to publicize a trick due to V. A. Vyssotsky for improving the efficiency of scatter storage files.

Vyssotsky's idea, used in a remarkably short and elegant code for the IBM 7090, is to continue random searching after the first probe fails:

(1) From a data item, compute a pseudo-random address in the file.
(2) If the item is in this place, or if the place is empty, the job is done.
(3) If another item is there, probe again with a new address computed by another pseudo-random function, until the stock of pseudo-random functions has been exhausted.
(4) As a last resort, when all pseudo-random functions are exhausted, do a linear search of the whole file for the item or a hole, whichever comes first.

In Vyssotsky's program, the later random functions are trivially obtained from the first one. He computes an n-bit function of which only the first m bits are used as an address. Later, random addresses are obtained by rotating the same n bits. After n probes he gives up and resorts to linear search.

More familiar methods of scatter storage employ chaning or linear search for later probes. Chaining is the most efficient method in terms of average number of probes per item but pays a price in storage space for the chaining information. Linear search is expensive in time. Table I, comparing average number of probes per item for these methods, is based on popular uniformity and independence assumptions:

TABLE I. Average Number of Probes per Item
Load Factor Chaining [1] Linear Search [2] Random Search
p 1 + p/2 1 + ½p/(1-p) -p-1 log(1-p)
.50 1.25 1.5 1.39
.75 1.38 2.5 1.83
.90 1.45 5.5 2.56

Chaining and random search have a clear edge. If storage requirements permit, chaining is superior. Vyssotsky's random search pays an attractively low time penalty to achieve ultimate storage utilization in pressing situations.

A typical 7090 application used a file of two-word items. Chaining would add an extra word or, with extra coding complexity to handle items of nonintegral word length, an extra half word per item. Table II compares lookup times for various file schemes given a certain amount of storage:

TABLE II. Lookup Times for Various File Schemes
Load Factor with
3-word items
Chaining
3-word
Chaining
2½-word
Linear
2-word
Random
2-word
0.6 1.3 1.25 1.33 1.30
0.8 1.4 1.33 1.57 1.49
1.0 1.5 1.42 2.00 1.65
1.2 overflow 1.50 3.00 2.01
1.4 overflow overflow 6.00 2.90

References:

1. Johnson, L. R. Indirect chaining method for addressing on secondary keys. Comm. ACM 4 (1961), 218-223.

2. Schay, G. and Spruth, W. G. Analysis of a file addressing method. Comm. ACM 5 (1962), 459-462.


M. D. McIlroy
Bell Telephone Laboratories, Inc.
Murray Hill, N. J.

(Posting may be intermittent for the rest of March.)

Superoptimizer

In 1987, Henry Massalin invented the “superoptimizer,” a program that finds the shortest possible instruction sequence for a given function. The superoptimizer works by sheer dumb luck: it checks whether every possible instruction sequence up to a given length produces the desired answer. The sequences that pass this test are not guaranteed to be an exact match for the function, but they usually are and can be checked analytically by a human.

As an example of what a superoptimizer can find, here is an Intel x86 instruction sequence that starts with a signed 16-bit number in ax and ends with the sign of that number (-1, 0, or 1) in dx.

cwd
neg ax
adc dx, dx

Massalin gives this example in his 1987 paper, “Superoptimizer: a look at the smallest program” (subscription required, sadly).

Superoptimizers are useful for compiler writers, but perhaps more as a source of amusement than as a time-saving device. One important way that they can be useful is to find sequences that avoid conditional branches, which get more and more expensive as processor pipelines deepen. Torbjörn Granlund and Richard Kenner used a superoptimizer to eliminate branches in gcc's output; see their paper “Eliminating Branches using a Superoptimizer and the GNU C Compiler.”

The Wikipedia entry for Superoptimization gives links to implementations.

Division via Multiplication

Division is the most expensive integer operation you can ask of your CPU. A well-known optimization rewrites division by constant powers of two as bit shifts: if x is an unsigned integer, then x>>8 is a fast way to compute x/256.

Division by constants that aren't powers of two are a little harder. Jim Blinn dedicated his November 1995 column (“Three Wrongs Make a Right”; subscription required) in IEEE Computer Graphics and Applications to tricks for computing x/255 for 16-bit x (a common computation during alpha compositing). Since 1/255 is approximately the same as 257/65536, you can get pretty close to x/255 by computing x*257/65536 (aka (x*257)>>16)). 257/65536 is a tiny bit bigger smaller than 1/255, so the approximation is sometimes too small by 1. Adding 257/65536 to nudge the values slightly upward results in a perfect approximation: for unsigned 16-bit x, x/255 = (x*257+257)>>16. No division!

In 1991, Robert Alverson described how to determine these approximations for the general case of arbitrary integer constants (“Integer Division Using Reciprocals”). Alverson was on the team building the 64-bit Tera Computer; they used the technique as the hardware implementation of integer division.

The Tera Computer didn't exactly catch on, but a few years later, Torbjörn Granlund and Peter Montgomery, inspired by Alverson's paper, implemented the technique in the GNU C compiler (gcc) for division by integer constants (“Division by Invariant Integers Using Multiplication”). Gcc still uses this technique, as do other compilers. Gcc rewrites a 32-bit unsigned x/255 into (x*0x80808081)>>39 (the intermediate values are 64-bit integers).

Henry S. Warren's delightfully bit-twiddly book Hacker's Delight devotes 46 pages to this topic. His web site includes a magic number computer.

Bitblt

In the 1980s, bitmap graphics used a powerful primitive called bitblt. Bitblt combined a rectangle of a source image with a similarly-sized rectangle in a destination image using a boolean function and replaced the destination rectangle with the boolean result.

Dan Ingalls invented bitblt for the Alto workstation developed at Xerox PARC. Rob Pike and Bart Locanthi used bitblt as the graphics primitive and the name for the Blit, the first graphical terminal for Unix.

At the 1984 SIGGRAPH conference, Ingalls, Leo Guibas (also at Xerox PARC), and Pike gave a course on the history, use, and implementation of bitblt. The course notes from 1984 (pdf, 68 pages, 5MB) have a wealth of information about bitblt, including area-filling, bitmap rotation, bitmap magnification, implementation via on-the-fly code generation, line-drawing, text manipulation, and more.