Sets and Maps
Hashing
Function Objects
Sorting Vectors and Deques
A hash table is conceptually a collection of (key, value) pairs.
The two main operations on a hash table h are insertion and lookup:
h.insert(pair(key, value));   or   h[key] = value;
value = h[key];
A hash table should guarantee that these two operations are very efficient (at least on average).
In particular, if the hash table contains N keys (and associated values) we want these operations to be O(1).
There are two main structures used for implementing a hash table.
Open addressing: using arrays
Separate chaining: an array of linked lists
To achieve O(1) time for put and get, a hash function is applied to the key to produce an index into the structure at which to store the (key, value) pair.
Keys will be product codes; values will be quantity on hand.
Pairs to be inserted into the hash table:
(14913, 20)  (15930, 10)  (16918, 50)  (17020, 19)
We will put these into a hash table using open addressing and storing the values in an array.
The success of hashing depends on how full the array is. As a rule of thumb, one should try to keep the array no more than 75% full.
We will use size 7 for this example.
The hash function should map each key to the array index range 0 - 6. An easy way to do this is to use the remainder on division by 7: hash(key) = key % 7;
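As a sketch, this hash function can be written directly in C++ (the table size of 7 and the name hashKey are just for this example):

```cpp
#include <cassert>

// Hash by remainder on division by the table size (7 in this example).
int hashKey(int key) {
    const int TABLE_SIZE = 7;
    return key % TABLE_SIZE;   // index in the range 0..6
}
```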
Inserting the first three pairs (14913, 20), (15930, 10), (16918, 50) we get:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6

Index  Entry
0      null
1      null
2      null
3      (14913, 20)
4      null
5      (15930, 10)
6      (16918, 50)
To lookup the quantity for product code 15930, we just apply the hash function again:
Lookup quantity for 15930:
hash(15930) = 15930 % 7 = 5
At index 5 we get (15930, 10), so the quantity is 10.
Now, when we try to insert the fourth pair, (17020, 19), we get a collision:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3   <-- collision

Index  Entry
0      null
1      null
2      null
3      (14913, 20)   <-- (17020, 19) also hashes here
4      null
5      (15930, 10)
6      (16918, 50)
One simple way to resolve collisions in an array implementation is to use linear probing. That is, just try the next position:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3

Index  Entry
0      null
1      null
2      null
3      (14913, 20)
4      (17020, 19)   <-- goes in next available spot
5      (15930, 10)
6      (16918, 50)

How many comparisons must be made now to look up key 14913? What about key 17020?

Where would (14920, 40) go? hash(14920) = 14920 % 7 = 3
Answer: position 0; "next available" just wraps around to the beginning of the array if necessary.
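The insertion procedure just described can be sketched as follows (a minimal open-addressing insert with linear probing; the table layout, the EMPTY_KEY sentinel of -1, and the function names are assumptions for this sketch, and the table-full check is omitted):

```cpp
#include <cassert>
#include <utility>
#include <vector>

const int TABLE_SIZE = 7;
const int EMPTY_KEY  = -1;   // sentinel marking an unused slot (assumption)

// The table: an array of (key, value) pairs, all slots initially empty.
std::vector<std::pair<int, int>> table(TABLE_SIZE, {EMPTY_KEY, 0});

// Insert with linear probing: start at hash(key) and advance one slot
// at a time, wrapping around to the start of the array, until an
// empty slot is found. Assumes the table is not full.
void insert(int key, int value) {
    int i = key % TABLE_SIZE;
    while (table[i].first != EMPTY_KEY) {
        i = (i + 1) % TABLE_SIZE;   // next position, with wraparound
    }
    table[i] = {key, value};
}
```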
So inserting all these values gives:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
hash(14920) = 14920 % 7 = 3

Index  Entry
0      (14920, 40)   *
1      null
2      null
3      (14913, 20)
4      (17020, 19)   *
5      (15930, 10)
6      (16918, 50)

Pairs marked * are not in their hashed positions.

Key     Probes to look up
14913   1
15930   1
16918   1
17020   2
14920   5
        --
Total   10

For the 5 keys in the table, the average number of probes (comparisons) to look up a value that is present is 10/5 = 2.
The mod operator does a reasonably good job of distributing random keys, and this helps avoid collisions.
When collisions occur, clusters can form so that instead of being able to put the key at the next position, many positions must be skipped over because they are occupied.
There are several techniques that help avoid clustering to some extent:
If a collision occurs for a key with hash(key) = k, the positions to try are (mod table size, of course):

Method             Next positions
linear probing     k + 1, k + 2, k + 3, ...
quadratic probing  k + 1, k + 4, k + 9, ...
double hashing     k + j, k + 2j, k + 3j, ...

(For double hashing, a second hash function, hash2, is used to compute the skip value, j = hash2(key).)
There are problems with all these.
Linear probing can cause clusters.
Quadratic probing avoids primary clustering, but can fail to find a spot if the table is more than 50% full.
Double hashing requires a second hash function and so is likely to be a bit slower to compute than quadratic probing.
The general formulation of which index to use for the i-th probe of value x is

    hash(x) + f(i)   for i = 0, 1, ...   where f(0) = 0.

Linear probing:    f(i) = i
Quadratic probing: f(i) = i^2
Double hashing:    f(i) = i * hash2(x), where hash2 is a second hash function.

This formulation tells us which location to probe on the i-th probe, relative to the original hash location. Alternatively, we can express each method in terms of how many locations to advance for each successive probe.
If the i-th probe has just been made at some location k, how much do we add to k to get the location for the (i+1)-th probe?
Answer: f(i+1) - f(i).

So, from the location of the i-th probe:

Probing method | f(i)         | Skip value f(i+1) - f(i) | Skip sequence
---------------|--------------|--------------------------|--------------
Linear         | i            | 1                        | 1, 1, 1, ...  (just skip to the next location)
Quadratic      | i^2          | 2i + 1                   | 1, 3, 5, 7, ...  (each skip is 2 more than the previous skip)
Double hashing | i * hash2(x) | hash2(x)                 | hash2(x), hash2(x), ...  (each skip is equal to hash2(x))
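The probe formulas above can be sketched in code; the function names and the fixed table size M = 7 are assumptions for illustration:

```cpp
#include <cassert>

const int M = 7;   // table size (assumption for this sketch)

// Index probed on the i-th attempt for a key whose home slot is
// home = hash(x): each method computes (home + f(i)) % M.
int probeLinear(int home, int i)            { return (home + i) % M; }
int probeQuadratic(int home, int i)         { return (home + i * i) % M; }
int probeDouble(int home, int i, int skip)  { return (home + i * skip) % M; }
// For double hashing, skip = hash2(x) for some second hash function hash2.
```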
For quadratic probing, the probe sequence can fail to find an empty entry even if empty entries exist.
This can be avoided provided the table size is a prime number and the table is no more than half full.
For double hashing, it must be guaranteed that the second hash function never returns 0 for any key.
Since the second hash function is only used to determine the skip value, a value of 0 would mean to not skip, but to keep trying the original location!
A typical hash function for strings (Horner's rule with multiplier 31 over the character codes):

int hash(const string& s) {
    int result = 0;
    for (int i = 0; i < s.size(); i++) {
        result = 31 * result + (int) s[i];
    }
    return result;
}
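For example (a self-contained sketch that restates the function so it can run on its own; the loop index uses string::size_type to avoid a signed/unsigned comparison warning):

```cpp
#include <cassert>
#include <string>
using std::string;

// The string hash: Horner's rule with multiplier 31 over the
// character codes of s.
int hash(const string& s) {
    int result = 0;
    for (string::size_type i = 0; i < s.size(); i++) {
        result = 31 * result + (int) s[i];
    }
    return result;
}
```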
For each hash value, keep a linked list of the pairs whose keys map to that value.
Using the same example as for linear probing:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
hash(14920) = 14920 % 7 = 3

Chain  Entries
0      null
1      null
2      null
3      --> (14913, 20) --> (17020, 19) --> (14920, 40)
4      null
5      --> (15930, 10)
6      --> (16918, 50)

Collisions are all on the same list. So looking up 14920 requires only 3 probes instead of 5; unlike linear probing, no keys that hash to a different value are involved.

Key     Probes to look up
14913   1
15930   1
16918   1
17020   2
14920   3
        --
Total   8

Average number of probes (comparisons) to look up a value that is present: 8/5 = 1.6.

In general, the number of probes for a successful search is about 1 + (λ/2), where the load factor for separate chaining is λ = N/M. In this case, λ = 5/7 and 1 + (λ/2) = 1.4 (to one decimal place). The number of probes for an insertion or an unsuccessful search is about λ.
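A minimal sketch of separate chaining for this example (the container layout and function names are assumptions, not a full hash table implementation):

```cpp
#include <cassert>
#include <list>
#include <utility>
#include <vector>

const int M = 7;   // number of chains

// Separate chaining: an array (vector) of linked lists of (key, value)
// pairs; each key goes on the chain at index key % M.
std::vector<std::list<std::pair<int, int>>> chains(M);

void insert(int key, int value) {
    chains[key % M].push_back({key, value});
}

// Number of probes (list nodes examined) to find key; 0 if absent.
int probesToFind(int key) {
    int probes = 0;
    for (const auto& p : chains[key % M]) {
        probes++;
        if (p.first == key) return probes;
    }
    return 0;
}
```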
If separate chaining is used for a hash table that contains N keys and M chains, and we want the average number of probes to be no more than 2, how many separate chains, M, should we have? The answer to this question depends on two factors: (1) the number of keys in the table (2) the hash function; that is, on how uniformly the keys are distributed among the chains.
What is the average length of the chains? For N keys and M chains, the average chain length is N/M This is the load factor of a hash table with separate chaining. All keys on a given chain have the same hash value. However, it is not necessarily true that the average number of probes is equal to the expressions just given.
Here the average chain length is N/M = 2 (N = 10 keys, M = 5 chains). But suppose the hash function is "bad": for example, suppose all the keys end in 0 and the hash function is just the key mod 5. Then all keys map to chain 0:

Chain  Length
0      10
1      0
2      0
3      0
4      0

The average chain length is 2. For a successful find:

Key    Probes
1st    1
2nd    2
3rd    3
...    ...
10th   10
       --
Total  55

Average number of probes for a successful find = 55/10 = 5.5.
Average number of probes for an unsuccessful find is 10 (again assuming all keys map to chain 0).
Here the average chain length is still N/M = 2. But now suppose the hash function is "good": for example, suppose the keys are uniformly distributed so that all chains have length 2. Then:

Chain  Length
0      2
1      2
2      2
3      2
4      2

The average chain length is 2, and for a successful find:

Key    Probes
1st    1
2nd    2
3rd    1
4th    2
...    ...
9th    1
10th   2
       --
Total  15

Average number of probes for a successful find = 15/10 = 1.5.
Suppose we reduce the load factor to be 1 instead of 2. What happens to the two previous examples: bad hash and good hash?
Here the average chain length is N/M = 1. With the "bad" hash function, increasing the number of chains (to M = 10) means:

Chain  Length
0      10
1      0
2      0
...    ...
9      0

The average chain length is now 1 instead of 2, but for a successful find:

Key    Probes
1st    1
2nd    2
3rd    3
...    ...
10th   10
       --
Total  55

The average number of probes for a successful find is still 55/10 = 5.5.
Here the average chain length is N/M = 1. With the "good" hash function, increasing the number of chains means:

Chain  Length
0      1
1      1
2      1
...    ...
9      1

The average chain length is now 1 instead of 2, and for a successful find:

Key    Probes
1st    1
2nd    1
3rd    1
...    ...
10th   1
       --
Total  10

The average number of probes for a successful find is slightly better: 10/10 = 1.
Generally, the load factor for separate chaining should be 1; that is, the number of chains should be about equal to the number of keys.

However, with a "bad" hash function, in the worst case (all keys colliding on one chain), the total number of probes for successful finds of N keys is

    1 + 2 + 3 + ... + N = N(N+1)/2

and so the average number of probes in this worst case for a successful find is (N+1)/2.

For a "good" hash function that uniformly distributes the keys over the chains, with a load factor of 1, the average number of probes for a successful find is about 1.
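The worst-case arithmetic can be checked directly (a trivial sketch; the function name is an assumption):

```cpp
#include <cassert>

// Worst case for a "bad" hash: all N keys land on one chain, so the
// total probes over all successful finds is 1 + 2 + ... + N = N(N+1)/2.
int worstCaseTotalProbes(int n) {
    return n * (n + 1) / 2;
}
```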
Suppose the table size in separate chaining is 8. If all the keys happen to be even, then using the hash function
If key is even, then hash(key) = key % tablesize = key % 8 will always be even also. So none of the odd indices will be used; that is, half the table entries will never be used.
In general, if all the keys and the tablesize have a common divisor, then key % tablesize will be divisible by that divisor too, and so the other indices will not be used. Moral: use a prime number for the tablesize.
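A quick way to see the effect (a sketch; the key range 0..100 is arbitrary):

```cpp
#include <cassert>
#include <set>

// Count how many distinct indices key % tablesize can produce when
// every key is even.
int distinctIndices(int tablesize) {
    std::set<int> used;
    for (int key = 0; key <= 100; key += 2) {   // even keys only
        used.insert(key % tablesize);
    }
    return (int) used.size();
}
```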
Deletions from a hash table with separate chaining are straightforward.
Open addressing is trickier. Consider deleting key 17020 from this hash table and then lookup 14920:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
hash(14920) = 14920 % 7 = 3

Index  Entry
0      (14920, 40)   *
1      null
2      null
3      (14913, 20)
4      null          <-- (17020, 19) deleted
5      (15930, 10)
6      (16918, 50)

Pairs marked * are not in their hashed positions.

To look up 14920, we look for it at positions 3, 4, ... until it is found or a null entry is encountered. At position 4 we find null, so it appears that 14920 is not in the table. But it is. What is the problem?
To handle deletions, we need three possible states for each entry:
1. empty (available for insertions)
2. occupied
3. deleted (available for insertions)
Deleted entries are available for insertions; that is, they are treated just like empty positions for insertions.
Deleted entries are treated similarly to occupied entries for searches, in that the search must keep going.
What would happen if a hash table was filled completely and then each of the items deleted?
All the items would be marked deleted. So when would a search terminate?
When searching for a key that is not in the hash table, the search must continue until:
a. an empty entry is found, or
b. all entries have been examined.
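The three-state scheme can be sketched as follows (a minimal lazy-deletion table with linear probing; the names and layout are assumptions, and duplicate keys are not handled):

```cpp
#include <cassert>
#include <vector>

// Lazy deletion: every slot is EMPTY, OCCUPIED, or DELETED.
enum State { EMPTY, OCCUPIED, DELETED };

struct Slot {
    State state = EMPTY;
    int key = 0;
    int value = 0;
};

const int M = 7;
std::vector<Slot> table(M);

// Insert with linear probing; DELETED slots are reused just like
// EMPTY ones. Assumes the table is not full and keys are distinct.
void insert(int key, int value) {
    int i = key % M;
    while (table[i].state == OCCUPIED) {
        i = (i + 1) % M;
    }
    table[i].state = OCCUPIED;
    table[i].key = key;
    table[i].value = value;
}

// A search keeps probing past DELETED slots; it stops at an EMPTY
// slot or after every slot has been examined.
bool find(int key) {
    int i = key % M;
    for (int probes = 0; probes < M; probes++) {
        if (table[i].state == EMPTY) return false;
        if (table[i].state == OCCUPIED && table[i].key == key) return true;
        i = (i + 1) % M;
    }
    return false;
}

// Delete by marking the slot DELETED rather than emptying it.
void erase(int key) {
    int i = key % M;
    for (int probes = 0; probes < M; probes++) {
        if (table[i].state == EMPTY) return;
        if (table[i].state == OCCUPIED && table[i].key == key) {
            table[i].state = DELETED;
            return;
        }
        i = (i + 1) % M;
    }
}
```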
Comparing the data types below, with the assumption that collisions do not degrade the hash table:

Time requirement of each operation if the ADT contains N elements:

Operation           HashTable    AvlTree
insert(x)           1            ?
find(x)             1            ?
enumerate in order  N * log(N)   ?
findMax()           N            ?
findMin()           N            ?
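The C++ standard library mirrors this comparison: std::unordered_map is a hash table, while std::map is a balanced search tree kept in key order (red-black rather than AVL, but with the same O(log N) guarantees):

```cpp
#include <cassert>
#include <map>
#include <unordered_map>

// std::unordered_map: O(1) average insert/find, but no ordering of keys.
// std::map: O(log N) operations, keys kept in sorted order, so in-order
// enumeration, findMin, and findMax are cheap.
std::unordered_map<int, int> hashTable = {{14913, 20}, {15930, 10}, {16918, 50}};
std::map<int, int>           avlLike   = {{14913, 20}, {15930, 10}, {16918, 50}};
```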