Suppose we will need these operations (and only these) on data
to be stored and all operations must have the indicated running
time restriction:
- insert(x) - must be O(log(N)), preferably O(1)
- contains(x) - must be O(1)
A class is to be written with these public methods. A private data
member will store the actual data and be used to implement the
public methods.
Which type could be used for the private data member?
- LinkedList
- ArrayList
- Binary Search Tree
- AVL Tree
- Min Heap
- Max Heap
How fast can we make both the insert and the lookup?
O( log(N) )?
O( 1 )?
A hash table is conceptually a collection of (key,
value) pairs. (Like a Map).
The two main operations on a hash table h are:
h.put(key, value)
value = h.get(key)
A hash table should guarantee that these two operations are very
efficient (at least on average).
In particular, if the hash table contains N keys (and
associated values) we want these operations to be O(1).
There are two main structures used for implementing a hash table:
- Open addressing, using arrays
- Separate chaining: an array of linked lists
To achieve O(1) time for put and get, a hash function is applied to
the key to produce an index in the structure at which to store the
(key, value) pair.
Keys will be 'Product codes'
Values will be 'quantity on hand'
Pairs to be inserted into a hash:
(14913, 20)
(15930, 10)
(16918, 50)
(17020, 19)
We will put these into a hash table using open addressing and storing
the values in an array.
The success of hashing depends on how full the array is. As a rule of
thumb, one should try to keep the array no more than 75% full.
We will use size 7 for this example.
The hash function should map each key to the array index range:
0 - 6.
An easy way to do this is to use the remainder on division by 7:
hash(key) = key % 7;
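As a quick check, this hash function can be written out directly (the class name here is illustrative):

```java
public class HashDemo
{
    static final int TABLE_SIZE = 7;

    // Maps any non-negative key to an index in the range 0..6
    static int hash(int key)
    {
        return key % TABLE_SIZE;
    }
}
```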
Inserting the first three pairs (14913, 20), (15930, 10), (16918, 50)
we get:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
-----------
0 null
1 null
2 null
3 (14913, 20)
4 null
5 (15930, 10)
6 (16918, 50)
To lookup the quantity for product code 15930, we just apply the hash
function again:
Lookup quantity for 15930: hash(15930) = 15930 % 7 = 5
At index 5 we get (15930, 10), so the quantity is 10.
Now when we try to insert the fourth pair, (17020, 19), we get a collision:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
-----------
0 null
1 null
2 null
3 (14913, 20) (17020, 19)
4 null
5 (15930, 10)
6 (16918, 50)
One simple way to resolve collisions in an array implementation is to
use linear probing. That is, just try the next position:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
-----------
0 null
1 null
2 null
3 (14913, 20) (17020, 19) goes in next available spot
4 (17020, 19)
5 (15930, 10)
6 (16918, 50)
How many comparisons must be made now, to look up key 14913?
What about key 17020?
Where would (14920, 40) go? hash(14920) = 14920 % 7 = 3
Answer: Position 0; that is, next available just wraps around to the
beginning of the array if necessary.
So inserting all these values gives:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
hash(14920) = 14920 % 7 = 3
-----------
0 (14920, 40)
1 null
2 null
3 (14913, 20)
4 (17020, 19)
5 (15930, 10)
6 (16918, 50)
Two of the pairs, (17020, 19) and (14920, 40), are not in their hashed positions:
Key Number of Probes to Lookup
14913 1
15930 1
16918 1
17020 2
14920 5
------
10
For the 5 keys in the table
Average #probes (comparisons) to lookup a value which is present
10/5 = 2
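The open-addressing insert and lookup described above can be sketched as follows. This is a simplified sketch, not a full implementation: the class name is illustrative, it stores int keys and values only, and it omits resizing and the case where the table is completely full.

```java
public class LinearProbeTable
{
    private int[] keys;
    private int[] values;
    private boolean[] occupied;

    public LinearProbeTable(int size)
    {
        keys = new int[size];
        values = new int[size];
        occupied = new boolean[size];
    }

    public void put(int key, int value)
    {
        int i = key % keys.length;
        // Linear probing: try successive positions, wrapping around to index 0
        while (occupied[i] && keys[i] != key)
        {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        values[i] = value;
        occupied[i] = true;
    }

    // Returns -1 if the key is absent
    public int get(int key)
    {
        int i = key % keys.length;
        // Probe until the key is found or an empty slot ends the search
        while (occupied[i])
        {
            if (keys[i] == key)
            {
                return values[i];
            }
            i = (i + 1) % keys.length;
        }
        return -1;
    }
}
```

Inserting the five example pairs into a table of size 7 reproduces the layout shown above, with 17020 at index 4 and 14920 wrapped around to index 0.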
The mod operator does a reasonably good job of distributing random
keys, and this helps avoid collisions.
When collisions occur, clusters can form so that
instead of being able to put the key at the next position,
many positions must be skipped over because they are
occupied by keys with a different hash value.
There are several techniques that seem to help to avoid clustering to
some extent:
If a collision occurs for a key with hash(key) = k, the positions to try are
(mod table size, of course):
Method Next Position
linear probing k + 1, k + 2, k + 3, ...
quadratic probing k + 1, k + 4, k + 9, ...
double hashing k + j, k + 2j, k + 3j ...
(A second hash function, hash2 is used
to compute the skip value, j = hash2(key) )
There are problems with all these.
Linear probing can cause clusters.
Quadratic probing avoids primary clustering, but can fail to find a
spot if the table is more than 50% full.
Double hashing requires a second hash function and so is likely to be
a bit slower to compute than quadratic probing.
-
For quadratic probing, the probe sequence can fail to find
an empty entry even if empty entries exist.
This can be avoided provided:
- The size of the table is a prime number, and
- The number of keys is always less than half the table size. (This
will occasionally require increasing the table size and rehashing all
the keys.)
-
For double hashing, it must be guaranteed that the second
hash function never returns 0 for any key.
Since the second hash function is only used to determine the skip
value, a value of 0 would mean to not skip, but to keep trying the
original location!
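A common way to guarantee a nonzero skip is a second hash function of the form R - (key % R) for some prime R smaller than the table size; the result is always between 1 and R. The quadratic and double-hashing probe sequences can then be sketched as follows (the class name and the choice R = 5 are illustrative, matching the size-7 example):

```java
public class ProbeSequences
{
    static final int TABLE_SIZE = 7;  // a prime, per the quadratic-probing condition
    static final int R = 5;           // a prime smaller than the table size

    // Second hash function: R - (key % R) is always in 1..R, never 0
    static int hash2(int key)
    {
        return R - (key % R);
    }

    // i-th position tried under quadratic probing: k + i^2 (mod table size)
    static int quadraticProbe(int key, int i)
    {
        return (key % TABLE_SIZE + i * i) % TABLE_SIZE;
    }

    // i-th position tried under double hashing: k + i*j, where j = hash2(key)
    static int doubleHashProbe(int key, int i)
    {
        return (key % TABLE_SIZE + i * hash2(key)) % TABLE_SIZE;
    }
}
```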
Recall the general principle about the Java hashCode()
method inherited from Object:
If x.equals(y) returns true, then x.hashCode() should equal y.hashCode()
The Java String class overrides both the equals and
the hashCode methods. The first method returns true if
two Strings represent the same sequence of characters, not just if
they are identical as objects. So hashCode was rewritten so that two
equal strings return equal hash values.
But to be a "good" hash function, we also would like to have two
different String values return different hash values.
So a "good" hash function for String should involve all the
characters in the string and it should not give the same value for
permutations of the same string.
E.g., "bat" and "tab" should return different hash values.
Here is what is advertised in the Java API to be the implementation of hashCode() in the
String class:
public int hashCode()
{
    int result = 0;
    // Horner's rule: result = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    for (int i = 0; i < this.length(); i++)
    {
        result = 31 * result + (int) this.charAt(i);
    }
    return result;
}
For each hash value, keep a linked list (also called a
chain) of the pairs whose keys map to that value.
When inserting a new key, first compute its hash value, then
insert the key at the beginning of the list for that hash value.
When searching for a key, first compute its hash value, then
search the list for that hash value.
Using the same example as for linear probing:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
hash(14920) = 14920 % 7 = 3
-----------
0 null
1 null
2 null
3 -----> (14913, 20) --> (17020,19) --> (14920,40)
4 null
5 -----> (15930, 10)
6 -----> (16918, 50)
Collisions are all on the same list. So lookup of 14920 only requires
3 probes instead of 5; unlike linear probing, no values that hash to a
different index are involved.
Key Number of Probes to Lookup
14913 1
15930 1
16918 1
17020 2
14920 3
------
8
Average #probes (comparisons) to lookup a value which is present
8/5 = 1.6
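The chained put and get described above can be sketched with Java's built-in LinkedList. As before, this is a simplified sketch with illustrative names, storing int keys and values only.

```java
import java.util.LinkedList;

public class ChainedTable
{
    private static class Pair
    {
        int key, value;
        Pair(int key, int value) { this.key = key; this.value = value; }
    }

    private LinkedList<Pair>[] chains;

    @SuppressWarnings("unchecked")
    public ChainedTable(int size)
    {
        chains = new LinkedList[size];
        for (int i = 0; i < size; i++)
        {
            chains[i] = new LinkedList<Pair>();
        }
    }

    public void put(int key, int value)
    {
        // Insert at the beginning of the chain for this hash value
        chains[key % chains.length].addFirst(new Pair(key, value));
    }

    // Returns -1 if the key is absent
    public int get(int key)
    {
        // Only the one chain for this hash value is searched
        for (Pair p : chains[key % chains.length])
        {
            if (p.key == key)
            {
                return p.value;
            }
        }
        return -1;
    }
}
```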
The load factor for separate chaining is the average chain length.
Ideally, the load factor should be small, e.g. 1.0 or less.
The default load factor for Java's Hash classes is 0.75
N = 10 Keys, M = 5 chains, Load Factor=2; bad hash function
Here the average chain length is N/M = 2.
But suppose the hash function is "bad".
For example, suppose all the keys end in 0 and the hash function is
just the key mod 5.
Then all keys map to chain 0.
chain length
0 10
1 0
2 0
3 0
4 0
The average chain length is λ = 2 and 1 + λ/2 is also 2.
For a successful find
key probes
1-st 1
2nd 2
3rd 3
... ...
10th 10
------ -----
total 55
Average #probes for successful find = 55/10 = 5.5
Average #probes for an unsuccessful find is 10 (again assuming all
keys map to chain 0).
N=10 keys, M=5 chains; Load Factor = 2; good hash function
Here the average chain length is still N/M = 2.
But we now suppose the hash function is "good".
For example, suppose the keys are uniformly distributed so that all
chains have length 2.
Then
chain length
0 2
1 2
2 2
3 2
4 2
The average chain length is 2, and
For a successful find
key probes
1-st 1
2nd 2
3rd 1
4th 2
5th 1
6th 2
... ...
9th 1
10th 2
------ -----
total 15
Average #probes for successful find = 15/10 = 1.5
Suppose we reduce the load factor to be 1 instead of 2 by doubling the
number of chains.
What happens to the two previous examples: bad hash and good hash?
N = 10 Keys, M = 10 chains, Load Factor=1; bad hash function
Here the average chain length is N/M = 1.
With the "bad" hash function, increasing the chains means:
chain length
0 10
1 0
2 0
3 0
4 0
5 0
... ...
9 0
The average chain length is now 1 instead of 2, but
For a successful find
key probes
1-st 1
2nd 2
3rd 3
... ...
10th 10
------ -----
total 55
Average #probes for successful find is still = 55/10 = 5.5
N = 10 Keys, M = 10 chains, Load Factor=1; good hash function
Here the average chain length is N/M = 1.
With the "good" hash function, increasing the chains means:
chain length
0 1
1 1
2 1
3 1
4 1
5 1
... ...
9 1
The average chain length is now 1 instead of 2, and
For a successful find
key probes
1-st 1
2nd 1
3rd 1
... ...
10th 1
------ -----
total 10
Average #probes for successful find is slightly better = 10/10 = 1
For separate chaining, we can make the load factor smaller and
smaller by increasing the number of chains.
Keys will be spread out further on different chains.
This randomness is not necessarily good for memory cache
performance. Accessing one chain may bring a portion of the
chain array into the L1 cache, but the next chain accessed may
not be in the same cache block, resulting in a cache miss.
This randomness may therefore result in worse performance because of
increased cache misses.
Moral: Don't make the number of chains excessively large.
Deletions from a hash table with separate chaining are
straightforward.
Open addressing is trickier. Consider deleting key
17020 from this hash table and then
lookup 14920:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
hash(14920) = 14920 % 7 = 3
-----------
0 (14920, 40)
1 null
2 null
3 (14913, 20)
4 null (17020, 19) deleted
5 (15930, 10)
6 (16918, 50)
The pair (14920, 40) is not in its hashed position.
To lookup 14920, we look for 14920 at positions 3, 4, ... until found
or until a null entry is encountered.
It appears that 14920 is not in the table. But it is.
What is the problem?
What is the solution?
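The problem is that emptying the slot breaks the probe chain: the lookup stops at the null in position 4 before ever reaching 14920. One standard solution is lazy deletion: mark the slot with a "tombstone" instead of emptying it, so lookups probe past it while insertions may reuse it. A sketch of this idea (one possible design, with illustrative names):

```java
public class ProbeSlots
{
    // Per-slot state: DELETED (a "tombstone") is distinct from EMPTY
    enum State { EMPTY, OCCUPIED, DELETED }

    private State[] state;
    private int[] keys;

    public ProbeSlots(int size)
    {
        state = new State[size];
        keys = new int[size];
        for (int i = 0; i < size; i++)
        {
            state[i] = State.EMPTY;
        }
    }

    // Linear-probing insert; a DELETED slot may be reused
    public void insert(int key)
    {
        int i = key % keys.length;
        while (state[i] == State.OCCUPIED)
        {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        state[i] = State.OCCUPIED;
    }

    // Deletion marks the slot rather than emptying it
    public void delete(int key)
    {
        int i = key % keys.length;
        while (state[i] != State.EMPTY)
        {
            if (state[i] == State.OCCUPIED && keys[i] == key)
            {
                state[i] = State.DELETED;
                return;
            }
            i = (i + 1) % keys.length;
        }
    }

    // Lookup probes past DELETED slots; only an EMPTY slot ends the search
    public boolean contains(int key)
    {
        int i = key % keys.length;
        while (state[i] != State.EMPTY)
        {
            if (state[i] == State.OCCUPIED && keys[i] == key)
            {
                return true;
            }
            i = (i + 1) % keys.length;
        }
        return false;
    }
}
```

With this scheme, deleting 17020 from the example table leaves a tombstone at index 4, and the lookup for 14920 correctly continues past it to index 0.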
To use the ideas above, we first need to produce a hash code:
an integer associated with the object to be inserted or
searched for.
If the keys are already integers this is not a problem, but what about
other types such as String?
But in Java, every class type inherits the hashCode()
method from Object.
Does this method need to be overridden? Why or why not?
As noted earlier, the String class does override hashCode().
The hashCode() method gives an integer, but this integer might be
negative. So before using it as an index for the open addressing table
or the table of chains, the value must be adjusted:
int getIndex(Object x, int tableSize)
{
    int location = x.hashCode() % tableSize;
    if ( location < 0 )
    {
        location += tableSize;
    }
    return location;
}
Consider a new class type that you will use as the key for insertions into a HashSet or
HashMap. For example:
public class TokenPos
{
private int level;
private String token;
...
public boolean equals(Object other) {...}
public int hashCode() {...}
...
}
How/why should we implement equals and
hashCode for this class?
public class TokenPos
{
private int level;
private String token;
...
    public boolean equals(Object other) {
        if ( ! (other instanceof TokenPos) ) {
            return false;
        } else {
            TokenPos tp = (TokenPos) other;
            return token.equals(tp.token) && level == tp.level;
        }
    }
    ...
}
Problem with equals and hashCode
Suppose we override the equals method from Object but do not
override the hashCode method.
Consider this code:
TokenPos t1, t2;
HashSet<TokenPos> hs = new HashSet<TokenPos>();
t1 = new TokenPos("count", 3);
t2 = new TokenPos("count", 3);
hs.add(t1);
if ( hs.contains(t2) ) {
System.out.println("YES");
} else {
System.out.println("NO");
}
What will be printed? Why?
The contract between equals and hashCode
If equals is overridden by a class, then hashCode
should also be overridden, and this condition (the contract)
should be satisfied for any variables x and y of the class:
if x.equals(y) is true, then x.hashCode() should == y.hashCode()
Overriding hashCode
public class TokenPos
{
private int level;
private String token;
...
    public boolean equals(Object other) {
        if ( ! (other instanceof TokenPos) ) {
            return false;
        } else {
            TokenPos tp = (TokenPos) other;
            return token.equals(tp.token) && level == tp.level;
        }
    }

    public int hashCode() {
        // Combines the same fields as equals, so equal objects get equal hash values
        return token.hashCode() + level;
    }
    ...
}
Many other possibilities for hashCode will also satisfy the
contract (though some cause far more collisions than others), e.g.
- return 42;
- return level;
- return token.hashCode();
- return token.hashCode() + 1;
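With both methods overridden, the earlier HashSet example behaves as expected. A minimal self-contained check (the constructor signature is an assumption based on the earlier snippet):

```java
import java.util.HashSet;

public class TokenPosDemo
{
    static class TokenPos
    {
        private int level;
        private String token;

        TokenPos(String token, int level)
        {
            this.token = token;
            this.level = level;
        }

        @Override
        public boolean equals(Object other)
        {
            if (!(other instanceof TokenPos)) return false;
            TokenPos tp = (TokenPos) other;
            return token.equals(tp.token) && level == tp.level;
        }

        @Override
        public int hashCode()
        {
            // Same fields as equals, so equal objects hash alike
            return token.hashCode() + level;
        }
    }

    static boolean demo()
    {
        HashSet<TokenPos> hs = new HashSet<TokenPos>();
        hs.add(new TokenPos("count", 3));
        // Found because equals AND hashCode agree for equal objects
        return hs.contains(new TokenPos("count", 3));
    }
}
```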
Hash tables can provide O(1) time for these operations
- insert(x)
- delete(x)
- contains(x)
but O(1) performance requires...
- The hash function is good and avoids many collisions
- The load factor (for open addressing or for separate
chaining) is maintained at an acceptable value.
The following operations are not well supported by hash
tables:
- findMin() or findMax() - that is, find the minimum or maximum key
- print or iterate in sorted order
In fact, hash tables don't really require that the keys be
comparable, and so the findMin and findMax functions are
typically omitted.
- 3.4.1
- 3.4.5
- 3.4.10 (but with integer keys: 1, 9, 17, 2, 10 and a table
of size 8) Use linear probing first, and then "Python"
probing. What is the average number of probes to get each of
these 5 keys after they have been inserted?