CSC301 May01


Contents

  1. Which Data Structure
  2. Example Problem 1
  3. Hash Tables
  4. Hash Table Implementation
  5. Example
  6. Example (continued)
  7. Collisions
  8. Resolving Collisions
  9. Example (continued)
  10. Clusters
  11. Caveats
  12. A Sample hash function for Java Strings
  13. The String implementation of the hashCode method
  14. Separate Chaining
  15. Example
  16. Load Factor
  17. Example 1
  18. Example 2
  19. Reducing the Load Factor
  20. Example 1a
  21. Example 2a
  22. Table Size, Load Factor and Caches
  23. Deletions
  24. Hashing Functions for Objects
  25. Writing a Class for use with Hash tables (Java)
  26. Writing the equals method
  27. Problem with equals and hashCode
  28. The contract between equals and hashCode
  29. Overriding hashCode
  30. Summary
  31. Problems

Which Data Structure[1] [top]

Suppose we need these operations (and only these) on the data to be stored, and each operation must meet the indicated running-time restriction:

  1. insert(x) - must be O(log(N)), preferably O(1)
  2. contains(x) - must be O(1)

A class is to be written with these public methods. A private data member will store the actual data and be used to implement the public methods.

Which type could be used for the private data member?

Example Problem 1[2] [top]

How fast can we make both the insert and the lookup?

O( log(N) )?

O( 1 )?

Hash Tables[3] [top]

A hash table is conceptually a collection of (key, value) pairs. (Like a Map).

The two main operations on a hash table h are:

        h.put(key, value)
        value = h.get(key)
      

A hash table should guarantee that these two operations are very efficient (at least on average).

        In particular, if the hash table contains N keys (and
        associated values) we want these operations to be O(1). 
      

Hash Table Implementation[4] [top]

There are two main structures used for implementing a hash table.

        Open addressing using arrays
        Separate chaining: an array of linked lists.
      

To achieve O(1) time for put and get,

        a hash function is applied to the key to produce an index in the
        structure at which to store the (key, value) pair.
      

Example[5] [top]

        Keys will be 'Product codes'
        Values will be 'quantity on hand'

        Pairs to be inserted into a hash:
        (14913, 20)
        (15930, 10)
        (16918, 50)
        (17020, 19)
      

We will put these into a hash table using open addressing and storing the values in an array.

The success of hashing depends on how full the array is. As a rule of thumb, one should try to keep the array no more than 75% full.

We will use size 7 for this example.

        The hash function should map each key to the array index range: 
        0 - 6.

        An easy way to do this is to use the remainder on division by 7:

        hash(key) = key % 7;
      

Example (continued)[6] [top]

Inserting the first three pairs (14913, 20), (15930, 10), (16918, 50) we get:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6

        -----------
        0 null
        1 null
        2 null
        3 (14913, 20)
        4 null
        5 (15930, 10)
        6 (16918, 50)
      

To lookup the quantity for product code 15930, we just apply the hash function again:

        Lookup quantity for 15930: hash(15930) = 15930 % 7 = 5
        At index 5 we get (15930, 10), so the quantity is 10.
      

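The example above can be sketched in Java (a minimal illustration with no collision handling yet; the class and method names are made up for this sketch, and 0 is used to mark an empty slot, which works here because product codes are nonzero):

```java
// Minimal open-addressing sketch of the example above.
// No collision handling yet -- that is covered in the next sections.
public class SimpleHash {
    static final int SIZE = 7;
    static long[] keys = new long[SIZE];   // 0 means "empty slot"
    static int[] values = new int[SIZE];

    static int hash(long key) {
        return (int) (key % SIZE);         // hash(key) = key % 7
    }

    static void put(long key, int value) {
        int i = hash(key);
        keys[i] = key;
        values[i] = value;
    }

    static Integer get(long key) {
        int i = hash(key);
        return keys[i] == key ? values[i] : null;
    }

    public static void main(String[] args) {
        put(14913, 20);
        put(15930, 10);
        put(16918, 50);
        System.out.println(get(15930));    // the example's lookup: quantity 10
    }
}
```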
Collisions[7] [top]

Now when we try to insert the fourth pair, (17020, 19), we get a collision:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6
        hash(17020) = 17020 % 7 = 3
        -----------
        0 null
        1 null
        2 null
        3 (14913, 20) (17020, 19)
        4 null
        5 (15930, 10)
        6 (16918, 50)

      

Resolving Collisions[8] [top]

One simple way to resolve collisions in an array implementation is to use linear probing. That is, just try the next position:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6
        hash(17020) = 17020 % 7 = 3
        -----------
        0 null
        1 null
        2 null
        3 (14913, 20) (17020, 19) goes in next available spot
        4 (17020, 19)
        5 (15930, 10)
        6 (16918, 50)


        How many comparisons must be made now, to look up key 14913?
        What about key 17020?
        Where would (14920, 40) go? hash(14920) = 14920 % 7 = 3
        Answer: Position 0; that is, next available just wraps around to the
        beginning of the array if necessary.
      
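Linear probing as described above can be sketched like this (an illustrative sketch, not library code; again 0 marks an empty slot, which is safe for these nonzero product codes):

```java
// Open addressing with linear probing: on a collision, try the next slot,
// wrapping around to the start of the array if necessary.
public class LinearProbeTable {
    static final int SIZE = 7;
    static long[] keys = new long[SIZE];   // 0 marks an empty slot
    static int[] values = new int[SIZE];

    static int hash(long key) {
        return (int) (key % SIZE);
    }

    static void put(long key, int value) {
        int i = hash(key);
        while (keys[i] != 0 && keys[i] != key) {
            i = (i + 1) % SIZE;            // next position, with wraparound
        }
        keys[i] = key;
        values[i] = value;
    }

    static Integer get(long key) {
        int i = hash(key);
        while (keys[i] != 0) {             // stop at the first empty slot
            if (keys[i] == key) return values[i];
            i = (i + 1) % SIZE;
        }
        return null;
    }

    public static void main(String[] args) {
        long[] ks = {14913, 15930, 16918, 17020, 14920};
        int[]  vs = {20, 10, 50, 19, 40};
        for (int j = 0; j < ks.length; j++) put(ks[j], vs[j]);
        System.out.println(get(14920));    // 40, found after wrapping to index 0
    }
}
```

Inserting the five example keys with this sketch reproduces the table above: 17020 lands at index 4 and 14920 wraps around to index 0.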

Example (continued)[9] [top]

So inserting all these values gives:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6
        hash(17020) = 17020 % 7 = 3
        hash(14920) = 14920 % 7 = 3
        -----------
        0 (14920, 40)
        1 null
        2 null
        3 (14913, 20) 
        4 (17020, 19)
        5 (15930, 10)
        6 (16918, 50)

        Pairs (17020, 19) and (14920, 40) are not in their hashed positions:

        Key     Number of Probes to Lookup
        14913         1
        15930         1
        16918         1
        17020         2
        14920         5
        ------
        10

        For the 5 keys in the table

        Average #probes (comparisons) to lookup a value which is present

        10/5 = 2
      

Clusters[10] [top]

The mod operator does a reasonably good job of distributing random keys and this helps avoid collisions.

When collisions occur, clusters can form so that instead of being able to put the key at the next position, many positions must be skipped over because they are occupied by keys with a different hash value.

There are several techniques that seem to help to avoid clustering to some extent:

        If a collision occurs for a key with hash(key) = k, the positions to try are
        (mod table size, of course):

        Method            Next Position
        linear probing          k + 1, k + 2, k + 3, ...
        quadratic probing       k + 1, k + 4, k + 9, ...
        double hashing          k + j, k + 2j, k + 3j ...
        (A second hash function, hash2 is used 
        to compute the skip value, j = hash2(key) ) 
      

There are problems with all these.

Linear probing can cause clusters.

Quadratic probing avoids primary clustering, but can fail to find a spot if the table is more than 50% full.

Double hashing requires a second hash function and so is likely to be a bit slower to compute than quadratic probing.
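The three probe sequences in the table can be written as index formulas (an illustrative sketch; i is the probe number, j the skip value from the second hash function, and m the table size):

```java
// The i-th alternative slot tried after a collision at index k,
// for each of the three collision-resolution strategies above.
public class ProbeSequences {
    static int linear(int k, int i, int m)            { return (k + i) % m; }
    static int quadratic(int k, int i, int m)         { return (k + i * i) % m; }
    static int doubleHash(int k, int i, int j, int m) { return (k + i * j) % m; }

    public static void main(String[] args) {
        int m = 7, k = 3;
        for (int i = 1; i <= 3; i++)
            System.out.printf("i=%d  linear=%d  quadratic=%d%n",
                              i, linear(k, i, m), quadratic(k, i, m));
    }
}
```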

Caveats[11] [top]

A Sample hash function for Java Strings[12] [top]

Recall the general principle about the Java hashCode() method inherited from Object:

        If x.equals(y) returns true, then x.hashCode() should equal y.hashCode()
      

The Java String class overrides both the equals and the hashCode methods. Its equals method returns true if two Strings represent the same sequence of characters, not just if they are the same object. So hashCode was rewritten so that two equal strings return equal hash values.

But to be a "good" hash function, we also would like to have two different String values return different hash values.

So a "good" hash function for String should involve all the characters in the string and it should not give the same value for permutations of the same string.

E.g., "bat" and "tab" should return different hash values.

The String implementation of the hashCode method[13] [top]

Here is what is advertised in the Java API to be the implementation of hashCode() in the String class:


        public int hashCode()
        {
          int result = 0;

          for (int i = 0; i < this.length(); i++)
          {
            result = 31 * result + (int) this.charAt(i);
          }
          return result;
        }

      
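As a quick check of the earlier point about permutations, the actual String hashCode values of "bat" and "tab" do differ:

```java
// Equal strings get equal hash codes, but permutations of the same
// characters get different ones, because each character is weighted
// by a power of 31 depending on its position.
public class StringHashDemo {
    public static void main(String[] args) {
        System.out.println("bat".hashCode());  // 97301
        System.out.println("tab".hashCode());  // 114581
    }
}
```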

Separate Chaining[14] [top]

For each hash value, keep a linked list (also called a chain) of the pairs whose keys map to that value.

When inserting a new key, first compute its hash value, then insert the key at the beginning of the list for that hash value.

When searching for a key, first compute its hash value, then search the list for that hash value.
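The scheme just described might be sketched as follows (names are illustrative; pairs are inserted at the head of the chain, as stated above):

```java
import java.util.LinkedList;

// Separate-chaining sketch: an array of linked lists ("chains") of
// (key, value) pairs.
public class ChainedTable {
    static class Pair {
        long key; int value;
        Pair(long key, int value) { this.key = key; this.value = value; }
    }

    static final int SIZE = 7;
    @SuppressWarnings("unchecked")
    static LinkedList<Pair>[] chains = new LinkedList[SIZE];

    static int hash(long key) { return (int) (key % SIZE); }

    static void put(long key, int value) {
        int i = hash(key);
        if (chains[i] == null) chains[i] = new LinkedList<>();
        chains[i].addFirst(new Pair(key, value));  // insert at head of chain
    }

    static Integer get(long key) {
        LinkedList<Pair> chain = chains[hash(key)];
        if (chain == null) return null;
        for (Pair p : chain)
            if (p.key == key) return p.value;
        return null;
    }

    public static void main(String[] args) {
        put(14913, 20);
        put(17020, 19);   // also hashes to 3; goes on the same chain
        System.out.println(get(17020));  // 19
    }
}
```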

Example[15] [top]

Using the same example as for linear probing:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6
        hash(17020) = 17020 % 7 = 3
        hash(14920) = 14920 % 7 = 3
        -----------
        0 null
        1 null
        2 null
        3 -----> (14913, 20) --> (17020,19) --> (14920,40)
        4 null
        5 -----> (15930, 10)
        6 -----> (16918, 50)

        Collisions are all on the same list. So lookup of 14920 only requires
        3 probes instead of 5; unlike linear probing, no keys that hash to a
        different value are involved.

        Key     Number of Probes to Lookup
        14913         1
        15930         1
        16918         1
        17020         2
        14920         3
        ------
        8

        Average #probes (comparisons) to lookup a value which is present

        8/5 = 1.6

      

Load Factor[16] [top]

The load factor for separate chaining is the average chain length.

Ideally, the load factor should be small, e.g. 1.0 or less.

The default load factor for Java's Hash classes is 0.75

Example 1[17] [top]

        N = 10 Keys, M = 5 chains, Load Factor=2; bad hash function

        Here the average chain length is N/M = 2.

        But suppose the hash function is "bad". 

        For example, suppose all the keys end in 0 and the hash function is
        just the key mod 5.  

        Then all keys map to chain 0.

        chain          length
        0              10
        1               0
        2               0
        3               0
        4               0

        The average chain length is λ = 2 and  1 + λ/2 is also 2.

        For a successful find

        key            probes
        1-st             1
        2nd              2
        3rd              3
        ...              ...
        10th             10
        ------          -----
        total           55

        Average #probes for successful find = 55/10 = 5.5

        Average #probes for an unsuccessful find is 10 (again assuming all
        keys map to chain 0).
      

Example 2[18] [top]

        N=10 keys, M=5 chains; Load Factor = 2; good hash function

        Here the average chain length is still N/M = 2.

        But we now suppose the hash function is "good". 

        For example, suppose the keys are uniformly distributed so that all
        chains have length 2.


        Then 

        chain          length
        0               2
        1               2
        2               2
        3               2
        4               2

        The average chain length is 2, and

        For a successful find

        key            probes
        1-st             1
        2nd              2
        3rd              1
        4th              2
        5th              1
        6th              2
        ...              ...
        9th              1
        10th             2
        ------          -----
        total           15

        Average #probes for successful find = 15/10 = 1.5

      

Reducing the Load Factor[19] [top]

        Suppose we reduce the load factor to be 1 instead of 2 by doubling the
        number of chains.

        What happens to the two previous examples: bad hash and good hash?
      

Example 1a[20] [top]

        N = 10 Keys, M = 10 chains, Load Factor=1; bad hash function

        Here the average chain length is N/M = 1.

        With the "bad" hash function, increasing the chains means:


        chain          length
        0              10
        1               0
        2               0
        3               0
        4               0
        5               0
        ...             ...
        9               0

        The average chain length is now 1 instead of 2, but

        For a successful find

        key            probes
        1-st             1
        2nd              2
        3rd              3
        ...              ...
        10th             10
        ------          -----
        total           55

        Average #probes for successful find is still = 55/10 = 5.5

      

Example 2a[21] [top]

        N = 10 Keys, M = 10 chains, Load Factor=1; good hash function

        Here the average chain length is N/M = 1.

        With the "good" hash function, increasing the chains means:


        chain          length
        0               1
        1               1
        2               1
        3               1
        4               1
        5               1
        ...             ...
        9               1

        The average chain length is now 1 instead of 2, and

        For a successful find

        key            probes
        1-st             1
        2nd              1
        3rd              1
        ...              ...
        10th              1
        ------          -----
        total           10

        Average #probes for successful find is slightly better = 10/10 = 1

      

Table Size, Load Factor and Caches[22] [top]

For separate chaining, we can make the load factor smaller and smaller by increasing the number of chains.

Keys will be spread out further on different chains.

This randomness is not necessarily good for memory cache performance. Accessing one chain may bring a portion of the chain array into the L1 cache, but the next chain accessed may not be in the same cache block, resulting in a cache miss.

This randomness may result in worse performance because of increased cache misses.

Moral: Don't make the number of chains excessively large.

Deletions[23] [top]

Deletions from a hash table with separate chaining are straightforward.

Open addressing is trickier. Consider deleting key 17020 from this hash table and then lookup 14920:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6
        hash(17020) = 17020 % 7 = 3
        hash(14920) = 14920 % 7 = 3
        -----------
        0 (14920, 40)
        1 null
        2 null
        3 (14913, 20) 
        4 null        (17020, 19) deleted
        5 (15930, 10)
        6 (16918, 50)

        The pair (14920, 40) is not in its hashed position.

        To lookup 14920, we look for 14920 at positions  3, 4, ... until found
        or until a null entry is encountered.

        It appears that 14920 is not in the table. But it is. 
        What is the problem?
        What is the solution?
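One common way out (a sketch of lazy deletion, offered here as one possible answer to the questions above, not necessarily the intended one) is to mark deleted slots with a tombstone value, so that probe sequences continue past them:

```java
// Lazy deletion for open addressing: a deleted slot becomes a TOMBSTONE
// rather than EMPTY, so searches that probed through it still find keys
// stored beyond it. Sentinel values are safe here because product codes
// are positive.
public class TombstoneTable {
    static final int SIZE = 7;
    static final long EMPTY = 0, TOMBSTONE = -1;
    static long[] keys = new long[SIZE];
    static int[] values = new int[SIZE];

    static int hash(long key) { return (int) (key % SIZE); }

    static void put(long key, int value) {
        int i = hash(key);
        while (keys[i] != EMPTY && keys[i] != TOMBSTONE && keys[i] != key)
            i = (i + 1) % SIZE;
        keys[i] = key;
        values[i] = value;
    }

    static void delete(long key) {
        int i = hash(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) { keys[i] = TOMBSTONE; return; }
            i = (i + 1) % SIZE;
        }
    }

    static Integer get(long key) {
        int i = hash(key);
        while (keys[i] != EMPTY) {          // tombstones do NOT stop the search
            if (keys[i] == key) return values[i];
            i = (i + 1) % SIZE;
        }
        return null;
    }

    public static void main(String[] args) {
        long[] ks = {14913, 15930, 16918, 17020, 14920};
        int[]  vs = {20, 10, 50, 19, 40};
        for (int j = 0; j < ks.length; j++) put(ks[j], vs[j]);
        delete(17020);                       // tombstone at index 4
        System.out.println(get(14920));      // still found: 40
    }
}
```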
        
      

Hashing Functions for Objects[24] [top]

To use the ideas above, we first need to produce a hash code: an integer associated with the object to be inserted or searched for.

If the keys are already integers this is not a problem, but what about other types such as String?

In Java, every class type inherits the hashCode() method from Object.

Does this method need to be overridden? Why or why not?

As noted earlier, the String class does override hashCode().

The hashCode() function gives an integer. But this integer might be negative. So before using it as an index for the open addressing table or the table of chains, this value must be modified:

        int getIndex(Object x, int tblSize)
        {
          int location = x.hashCode() % tblSize;
          if ( location < 0 )
          {
            location += tblSize;
          }
          return location;
        }
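As an aside, since Java 8 the same nonnegative index can be computed with Math.floorMod (the example string below is a well-known one whose hashCode is negative):

```java
// Math.floorMod always returns a result with the sign of the divisor,
// so the manual "add the table size if negative" fix-up is unnecessary.
public class IndexDemo {
    static int getIndex(Object x, int tblSize) {
        return Math.floorMod(x.hashCode(), tblSize);
    }

    public static void main(String[] args) {
        // "polygenelubricants" has hashCode Integer.MIN_VALUE (negative)
        System.out.println(getIndex("polygenelubricants", 7));
    }
}
```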
      

Writing a Class for use with Hash tables (Java)[25] [top]

Consider a new class type that you will use as the key for insertions into a HashSet or HashMap. For example:

        public class TokenPos
        {
	   private int level;
	   private String token;
	   ...
	   public boolean equals(Object other) {...}
	   public int hashCode() {...}
	   ...
        }
      

How/why should we implement equals and hashCode for this class?

Writing the equals method[26] [top]

        public class TokenPos
        {
	   private int level;
	   private String token;
	   ...
           public boolean equals(Object other) {
             if ( ! (other instanceof TokenPos ) ) {
                return false;
             } else {
               TokenPos tp = (TokenPos) other;
               return token.equals(tp.token) && level == tp.level;
             }
           }
           ...
        }
    

Problem with equals and hashCode[27] [top]

Suppose we override the equals method from Object but do not override the hashCode method.

Consider this code:

      TokenPos t1, t2;
      HashSet<TokenPos> hs = new HashSet<TokenPos>();
      t1 = new TokenPos("count", 3);
      t2 = new TokenPos("count", 3);

      hs.add(t1);
      if ( hs.contains(t2) ) {
         System.out.println("YES");
      } else {
         System.out.println("NO");
      }
    

What will be printed? Why?

The contract between equals and hashCode[28] [top]

If equals is overridden by a class, then hashCode should also be overridden, and this condition (the contract) should be satisfied for any variables x and y of the class:

      if  x.equals(y) is true, then x.hashCode() should == y.hashCode()
    

Overriding hashCode[29] [top]

        public class TokenPos
        {
	   private int level;
	   private String token;
	   ...
           public boolean equals(Object other) {
             if ( ! (other instanceof TokenPos ) ) {
                return false;
             } else {
               TokenPos tp = (TokenPos) other;
               return token.equals(tp.token) && level == tp.level;
             }
           }

           public int hashCode() {
              return token.hashCode() + level;
           }
	   ...
        }
    

Many other possibilities for hashCode will satisfy the contract.
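For instance, one common pattern (shown here with the same illustrative TokenPos fields) multiplies by a prime, mirroring String.hashCode:

```java
// An alternative hashCode for the TokenPos sketch above, mixing the
// fields with the prime multiplier 31 (as String.hashCode does).
public class TokenPos {
    private int level;
    private String token;

    public TokenPos(String token, int level) {
        this.token = token;
        this.level = level;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof TokenPos)) return false;
        TokenPos tp = (TokenPos) other;
        return token.equals(tp.token) && level == tp.level;
    }

    @Override
    public int hashCode() {
        return 31 * token.hashCode() + level;
    }

    public static void main(String[] args) {
        TokenPos a = new TokenPos("count", 3);
        TokenPos b = new TokenPos("count", 3);
        // equal objects produce equal hash codes: the contract holds
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```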

Summary[30] [top]

Hash tables can provide O(1) time for these operations

  1. insert(x)
  2. delete(x)
  3. contains(x)

but O(1) performance requires a good hash function and a sufficiently low load factor.

The following operations are not well supported by hash tables:

  1. findMin() or findMax() - that is, find the minimum or maximum key
  2. print or iterate in sorted order

In fact, hash tables don't require that the keys be comparable, so the findMin and findMax functions are typically omitted.

Problems[31] [top]

  1. 3.4.1
  2. 3.4.5
  3. 3.4.10 (but with integer keys: 1, 9, 17, 2, 10 and a table of size 8) Use linear probing first, and then "Python" probing. What is the average number of probes to get each of these 5 keys after they have been inserted?