Posted on Wednesday, March 12, 2008.
Long ago, before hash tables (aka scatter storage files)
were a fundamental data type like
programmers had to implement them by hand. (Nowadays, manually-implemented
hash tables are rarely seen outside of historical reenactments like undergraduate
algorithms classes and C programs.)
Back then, there was fierce debate about how to handle hash collisions.
One option, called direct chaining, makes each table slot a linked list head, and then all the entries that hash to that slot are kept in the linked list pointed to by the slot. (Bonus points if you reorder the lists using the move to front heuristic.)
Another option, called linear scanning or open addressing, is to try table entry h, and then if that is full, try h+1, h+2, and so on until an empty slot is found. This only works well if the table is sized so that it has some large fraction (say, 1/4) of unused slots.
Direct chaining and linear scanning are still the most common collision resolution methods, but in the 1960s, more were explored. Still another option is to use non-linear scanning, so that each data item produces a sequence of hash values h1, h2, h3, and those table slots are tried in turn. Two objects with the same h1 won't necessarily have the same h2, unlike in linear scanning. But hashing is a relatively slow operation. Is it worth the time and code to compute all those extra hash functions?
This sets the stage for Doug McIlroy's 1963 letter to the Communications of the ACM Pracniques page, describing a trick by Vic Vyssotsky:
A VARIANT METHOD OF FILE SEARCHING
I would like to publicize a trick due to V. A. Vyssotsky for improving the efficiency of scatter storage files.
Vyssotsky's idea, used in a remarkably short and elegant code for the IBM 7090, is to continue random searching after the first probe fails:
(1) From a data item, compute a pseudo-random address in the file.
(2) If the item is in this place, or if the place is empty, the job is done.
(3) If another item is there, probe again with a new address computed by another pseudo-random function, until the stock of pseudo-random functions has been exhausted.
(4) As a last resort, when all pseudo-random functions are exhausted, do a linear search of the whole file for the item or a hole, whichever comes first.
In Vyssotsky's program, the later random functions are trivially obtained from the first one. He computes an n-bit function of which only the first m bits are used as an address. Later, random addresses are obtained by rotating the same n bits. After n probes he gives up and resorts to linear search.
More familiar methods of scatter storage employ chaning or linear search for later probes. Chaining is the most efficient method in terms of average number of probes per item but pays a price in storage space for the chaining information. Linear search is expensive in time. Table I, comparing average number of probes per item for these methods, is based on popular uniformity and independence assumptions:
TABLE I. Average Number of Probes per Item
Load Factor Chaining  Linear Search  Random Search p 1 + p/2 1 + ½p/(1-p) -p-1 log(1-p) .50 1.25 1.5 1.39 .75 1.38 2.5 1.83 .90 1.45 5.5 2.56
Chaining and random search have a clear edge. If storage requirements permit, chaining is superior. Vyssotsky's random search pays an attractively low time penalty to achieve ultimate storage utilization in pressing situations.
A typical 7090 application used a file of two-word items. Chaining would add an extra word or, with extra coding complexity to handle items of nonintegral word length, an extra half word per item. Table II compares lookup times for various file schemes given a certain amount of storage:
TABLE II. Lookup Times for Various File Schemes
Load Factor with
0.6 1.3 1.25 1.33 1.30 0.8 1.4 1.33 1.57 1.49 1.0 1.5 1.42 2.00 1.65 1.2 overflow 1.50 3.00 2.01 1.4 overflow overflow 6.00 2.90
1. Johnson, L. R. Indirect chaining method for addressing on secondary keys. Comm. ACM 4 (1961), 218-223.
2. Schay, G. and Spruth, W. G. Analysis of a file addressing method. Comm. ACM 5 (1962), 459-462.
M. D. McIlroy
Bell Telephone Laboratories, Inc.
Murray Hill, N. J.
(Posting may be intermittent for the rest of March.)