CSC301 May01


Contents

  1. Which Data Structure
  2. Example Problem 1
  3. Hash Tables
  4. Hash Table Implementation
  5. Example
  6. Example (continued)
  7. Collisions
  8. Resolving Collisions
  9. Example (continued)
  10. Clusters
  11. Caveats
  12. A Sample hash function for Java Strings
  13. The String implementation of the hashCode method
  14. Separate Chaining
  15. Example
  16. Load Factor
  17. Example 1
  18. Example 2
  19. Reducing the Load Factor
  20. Example 1a
  21. Example 2a
  22. Table Size, Load Factor and Caches
  23. Deletions
  24. Hashing Functions for Objects
  25. Writing a Class for use with Hash tables (Java)
  26. Writing the equals method
  27. Problem with equals and hashCode
  28. The contract between equals and hashCode
  29. Overriding hashCode
  30. Summary
  31. Problems

Which Data Structure[1] [top]

Suppose we need these operations (and only these) on the data to be stored, and each operation must meet the indicated running-time restriction:

  1. insert(x) - must be O(log(N)), preferably O(1)
  2. contains(x) - must be O(1)

A class is to be written with these public methods. A private data member will store the actual data and be used to implement the public methods.

Which type could be used for the private data member?

Example Problem 1[2] [top]

How fast can we make both the insert and the lookup?

O( log(N) )?

O( 1 )?

Hash Tables[3] [top]

A hash table is conceptually a collection of (key, value) pairs. (Like a Map).

The two main operations on a hash table h are:

        h.put(key, value)
        value = h.get(key)
      

A hash table should guarantee that these two operations are very efficient (at least on average).

        In particular, if the hash table contains N keys (and
        associated values) we want these operations to be O(1). 
      

Hash Table Implementation[4] [top]

There are two main structures used for implementing a hash table.

        Open addressing using arrays
        Separate chaining: an array of linked lists.
      

To achieve O(1) time for put and get,

        a hash function is applied to the key to produce an index in the
        structure at which to store the (key, value) pair.
      

Example[5] [top]

        Keys will be 'Product codes'
        Values will be 'quantity on hand'

        Pairs to be inserted into a hash:
        (14913, 20)
        (15930, 10)
        (16918, 50)
        (17020, 19)
      

We will put these into a hash table using open addressing and storing the values in an array.

The success of hashing depends on how full the array is. As a rule of thumb, one should try to keep the array no more than 75% full.

We will use size 7 for this example.

        The hash function should map each key to the array index range: 
        0 - 6.

        An easy way to do this is to use the remainder on division by 7:

        hash(key) = key % 7;
      

Example (continued)[6] [top]

Inserting the first three pairs (14913, 20), (15930, 10), (16918, 50) we get:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6

        -----------
        0 null
        1 null
        2 null
        3 (14913, 20)
        4 null
        5 (15930, 10)
        6 (16918, 50)
      

To lookup the quantity for product code 15930, we just apply the hash function again:

        Lookup quantity for 15930: hash(15930) = 15930 % 7 = 5
        At index 5 we get (15930, 10), so the quantity is 10.
      

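The example above can be sketched in Java (a minimal illustration with no collision handling yet; the class and method names are made up for this sketch, and 0 is used to mark an empty slot, which works here because product codes are nonzero):

```java
// Minimal open-addressing sketch of the example above.
// No collision handling yet -- that is covered in the next sections.
public class SimpleHash {
    static final int SIZE = 7;
    static long[] keys = new long[SIZE];   // 0 means "empty slot"
    static int[] values = new int[SIZE];

    static int hash(long key) {
        return (int) (key % SIZE);         // hash(key) = key % 7
    }

    static void put(long key, int value) {
        int i = hash(key);
        keys[i] = key;
        values[i] = value;
    }

    static Integer get(long key) {
        int i = hash(key);
        return keys[i] == key ? values[i] : null;
    }

    public static void main(String[] args) {
        put(14913, 20);
        put(15930, 10);
        put(16918, 50);
        System.out.println(get(15930));    // the example's lookup: quantity 10
    }
}
```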
Collisions[7] [top]

Now when we try to insert the fourth pair, (17020, 19), we get a collision:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6
        hash(17020) = 17020 % 7 = 3
        -----------
        0 null
        1 null
        2 null
        3 (14913, 20) (17020, 19)
        4 null
        5 (15930, 10)
        6 (16918, 50)

      

Resolving Collisions[8] [top]

One simple way to resolve collisions in an array implementation is to use linear probing. That is, just try the next position:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6
        hash(17020) = 17020 % 7 = 3
        -----------
        0 null
        1 null
        2 null
        3 (14913, 20) (17020, 19) goes in next available spot
        4 (17020, 19)
        5 (15930, 10)
        6 (16918, 50)


        How many comparisons must be made now, to look up key 14913?
        What about key 17020?
        Where would (14920, 40) go? hash(14920) = 14920 % 7 = 3
        Answer: Position 0; that is, next available just wraps around to the
        beginning of the array if necessary.
      
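Linear probing as described above can be sketched like this (an illustrative sketch, not library code; again 0 marks an empty slot, which is safe for these nonzero product codes):

```java
// Open addressing with linear probing: on a collision, try the next slot,
// wrapping around to the start of the array if necessary.
public class LinearProbeTable {
    static final int SIZE = 7;
    static long[] keys = new long[SIZE];   // 0 marks an empty slot
    static int[] values = new int[SIZE];

    static int hash(long key) {
        return (int) (key % SIZE);
    }

    static void put(long key, int value) {
        int i = hash(key);
        while (keys[i] != 0 && keys[i] != key) {
            i = (i + 1) % SIZE;            // next position, with wraparound
        }
        keys[i] = key;
        values[i] = value;
    }

    static Integer get(long key) {
        int i = hash(key);
        while (keys[i] != 0) {             // stop at the first empty slot
            if (keys[i] == key) return values[i];
            i = (i + 1) % SIZE;
        }
        return null;
    }

    public static void main(String[] args) {
        long[] ks = {14913, 15930, 16918, 17020, 14920};
        int[]  vs = {20, 10, 50, 19, 40};
        for (int j = 0; j < ks.length; j++) put(ks[j], vs[j]);
        System.out.println(get(14920));    // 40, found after wrapping to index 0
    }
}
```

Inserting the five example keys with this sketch reproduces the table above: 17020 lands at index 4 and 14920 wraps around to index 0.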

Example (continued)[9] [top]

So inserting all these values gives:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6
        hash(17020) = 17020 % 7 = 3
        hash(14920) = 14920 % 7 = 3
        -----------
        0 (14920, 40)
        1 null
        2 null
        3 (14913, 20) 
        4 (17020, 19)
        5 (15930, 10)
        6 (16918, 50)

        Pairs (17020, 19) and (14920, 40) are not in their hashed positions:

        Key     Number of Probes to Lookup
        14913         1
        15930         1
        16918         1
        17020         2
        14920         5
        ------
        10

        For the 5 keys in the table

        Average #probes (comparisons) to lookup a value which is present

        10/5 = 2
      

Clusters[10] [top]

The mod operator does a reasonably good job of distributing random keys and this helps avoid collisions.

When collisions occur, clusters can form so that instead of being able to put the key at the next position, many positions must be skipped over because they are occupied by keys with a different hash value.

There are several techniques that seem to help to avoid clustering to some extent:

        If a collision occurs for a key with hash(key) = k, the positions to try are
        (mod table size, of course):

        Method            Next Position
        linear probing          k + 1, k + 2, k + 3, ...
        quadratic probing       k + 1, k + 4, k + 9, ...
        double hashing          k + j, k + 2j, k + 3j ...
        (A second hash function, hash2 is used 
        to compute the skip value, j = hash2(key) ) 
      

There are problems with all these.

Linear probing can cause clusters.

Quadratic probing avoids primary clustering, but can fail to find a spot if the table is more than 50% full.

Double hashing requires a second hash function and so is likely to be a bit slower to compute than quadratic probing.
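The three probe sequences in the table can be written as index formulas (an illustrative sketch; i is the probe number, j the skip value from the second hash function, and m the table size):

```java
// The i-th alternative slot tried after a collision at index k,
// for each of the three collision-resolution strategies above.
public class ProbeSequences {
    static int linear(int k, int i, int m)            { return (k + i) % m; }
    static int quadratic(int k, int i, int m)         { return (k + i * i) % m; }
    static int doubleHash(int k, int i, int j, int m) { return (k + i * j) % m; }

    public static void main(String[] args) {
        int m = 7, k = 3;
        for (int i = 1; i <= 3; i++)
            System.out.printf("i=%d  linear=%d  quadratic=%d%n",
                              i, linear(k, i, m), quadratic(k, i, m));
    }
}
```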

Caveats[11] [top]

A Sample hash function for Java Strings[12] [top]

Recall the general principle about the Java hashCode() method inherited from Object:

        If x.equals(y) returns true, then x.hashCode() should equal y.hashCode()
      

The Java String class overrides both the equals and the hashCode methods. Its equals method returns true if two Strings represent the same sequence of characters, not just if they are the same object. So hashCode was rewritten so that two equal strings return equal hash values.

But to be a "good" hash function, we also would like to have two different String values return different hash values.

So a "good" hash function for String should involve all the characters in the string and it should not give the same value for permutations of the same string.

E.g., "bat" and "tab" should return different hash values.

The String implementation of the hashCode method[13] [top]

Here is what is advertised in the Java API to be the implementation of hashCode() in the String class:


        public int hashCode()
        {
          int result = 0;

          for (int i = 0; i < this.length(); i++)
          {
            result = 31 * result + (int) this.charAt(i);
          }
          return result;
        }

      
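As a quick check of the earlier point about permutations, the actual String hashCode values of "bat" and "tab" do differ:

```java
// Equal strings get equal hash codes, but permutations of the same
// characters get different ones, because each character is weighted
// by a power of 31 depending on its position.
public class StringHashDemo {
    public static void main(String[] args) {
        System.out.println("bat".hashCode());  // 97301
        System.out.println("tab".hashCode());  // 114581
    }
}
```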

Separate Chaining[14] [top]

For each hash value, keep a linked list (also called a chain) of the pairs whose keys map to that value.

When inserting a new key, first compute its hash value, then insert the key at the beginning of the list for that hash value.

When searching for a key, first compute its hash value, then search the list for that hash value.
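The scheme just described might be sketched as follows (names are illustrative; pairs are inserted at the head of the chain, as stated above):

```java
import java.util.LinkedList;

// Separate-chaining sketch: an array of linked lists ("chains") of
// (key, value) pairs.
public class ChainedTable {
    static class Pair {
        long key; int value;
        Pair(long key, int value) { this.key = key; this.value = value; }
    }

    static final int SIZE = 7;
    @SuppressWarnings("unchecked")
    static LinkedList<Pair>[] chains = new LinkedList[SIZE];

    static int hash(long key) { return (int) (key % SIZE); }

    static void put(long key, int value) {
        int i = hash(key);
        if (chains[i] == null) chains[i] = new LinkedList<>();
        chains[i].addFirst(new Pair(key, value));  // insert at head of chain
    }

    static Integer get(long key) {
        LinkedList<Pair> chain = chains[hash(key)];
        if (chain == null) return null;
        for (Pair p : chain)
            if (p.key == key) return p.value;
        return null;
    }

    public static void main(String[] args) {
        put(14913, 20);
        put(17020, 19);   // also hashes to 3; goes on the same chain
        System.out.println(get(17020));  // 19
    }
}
```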

Example[15] [top]

Using the same example as for linear probing:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6
        hash(17020) = 17020 % 7 = 3
        hash(14920) = 14920 % 7 = 3
        -----------
        0 null
        1 null
        2 null
        3 -----> (14913, 20) --> (17020,19) --> (14920,40)
        4 null
        5 -----> (15930, 10)
        6 -----> (16918, 50)

        Collisions are all on the same list. So lookup of 14920 only requires
        3 probes instead of 5; unlike linear probing, no keys that hash to a
        different value are involved.

        Key     Number of Probes to Lookup
        14913         1
        15930         1
        16918         1
        17020         2
        14920         3
        ------
        8

        Average #probes (comparisons) to lookup a value which is present

        8/5 = 1.6

      

Load Factor[16] [top]

The load factor for separate chaining is the average chain length.

Ideally, the load factor should be small, e.g. 1.0 or less.

The default load factor for Java's Hash classes is 0.75

Example 1[17] [top]

        N = 10 Keys, M = 5 chains, Load Factor=2; bad hash function

        Here the average chain length is N/M = 2.

        But suppose the hash function is "bad". 

        For example, suppose all the keys end in 0 and the hash function is
        just the key mod 5.  

        Then all keys map to chain 0.

        chain          length
        0              10
        1               0
        2               0
        3               0
        4               0

        The average chain length is λ = 2 and  1 + λ/2 is also 2.

        For a successful find

        key            probes
        1-st             1
        2nd              2
        3rd              3
        ...              ...
        10th             10
        ------          -----
        total           55

        Average #probes for successful find = 55/10 = 5.5

        Average #probes for an unsuccessful find is 10 (again assuming all
        keys map to chain 0).
      

Example 2[18] [top]

        N=10 keys, M=5 chains; Load Factor = 2; good hash function

        Here the average chain length is still N/M = 2.

        But we now suppose the hash function is "good". 

        For example, suppose the keys are uniformly distributed so that all
        chains have length 2.


        Then 

        chain          length
        0               2
        1               2
        2               2
        3               2
        4               2

        The average chain length is 2, and

        For a successful find

        key            probes
        1-st             1
        2nd              2
        3rd              1
        4th              2
        5th              1
        6th              2
        ...              ...
        9th              1
        10th             2
        ------          -----
        total           15

        Average #probes for successful find = 15/10 = 1.5

      

Reducing the Load Factor[19] [top]

        Suppose we reduce the load factor to be 1 instead of 2 by doubling the
        number of chains.

        What happens to the two previous examples: bad hash and good hash?
      

Example 1a[20] [top]

        N = 10 Keys, M = 10 chains, Load Factor=1; bad hash function

        Here the average chain length is N/M = 1.

        With the "bad" hash function, increasing the chains means:


        chain          length
        0              10
        1               0
        2               0
        3               0
        4               0
        5               0
        ...             ...
        9               0

        The average chain length is now 1 instead of 2, but

        For a successful find

        key            probes
        1-st             1
        2nd              2
        3rd              3
        ...              ...
        10th             10
        ------          -----
        total           55

        Average #probes for successful find is still = 55/10 = 5.5

      

Example 2a[21] [top]

        N = 10 Keys, M = 10 chains, Load Factor=1; good hash function

        Here the average chain length is N/M = 1.

        With the "good" hash function, increasing the chains means:


        chain          length
        0               1
        1               1
        2               1
        3               1
        4               1
        5               1
        ...             ...
        9               1

        The average chain length is now 1 instead of 2, and

        For a successful find

        key            probes
        1-st             1
        2nd              1
        3rd              1
        ...              ...
        10th              1
        ------          -----
        total           10

        Average #probes for successful find is slightly better = 10/10 = 1

      

Table Size, Load Factor and Caches[22] [top]

For separate chaining, we can make the load factor smaller and smaller by increasing the number of chains.

Keys will be spread out further on different chains.

This randomness is not necessarily good for memory cache performance. Accessing one chain may bring a portion of the chain array into the L1 cache, but the next chain accessed may not be in the same cache block, resulting in a cache miss.

This randomness may result in worse performance because of increased cache misses.

Moral: Don't make the number of chains excessively large.

Deletions[23] [top]

Deletions from a hash table with separate chaining are straightforward.

Open addressing is trickier. Consider deleting key 17020 from this hash table and then lookup 14920:

        hash(14913) = 14913 % 7 = 3
        hash(15930) = 15930 % 7 = 5
        hash(16918) = 16918 % 7 = 6
        hash(17020) = 17020 % 7 = 3
        hash(14920) = 14920 % 7 = 3
        -----------
        0 (14920, 40)
        1 null
        2 null
        3 (14913, 20) 
        4 null        (17020, 19) deleted
        5 (15930, 10)
        6 (16918, 50)

        The pair (14920, 40) is not in its hashed position.

        To lookup 14920, we look for 14920 at positions  3, 4, ... until found
        or until a null entry is encountered.

        It appears that 14920 is not in the table. But it is. 
        What is the problem?
        What is the solution?
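One common way out (a sketch of lazy deletion, offered here as one possible answer to the questions above, not necessarily the intended one) is to mark deleted slots with a tombstone value, so that probe sequences continue past them:

```java
// Lazy deletion for open addressing: a deleted slot becomes a TOMBSTONE
// rather than EMPTY, so searches that probed through it still find keys
// stored beyond it. Sentinel values are safe here because product codes
// are positive.
public class TombstoneTable {
    static final int SIZE = 7;
    static final long EMPTY = 0, TOMBSTONE = -1;
    static long[] keys = new long[SIZE];
    static int[] values = new int[SIZE];

    static int hash(long key) { return (int) (key % SIZE); }

    static void put(long key, int value) {
        int i = hash(key);
        while (keys[i] != EMPTY && keys[i] != TOMBSTONE && keys[i] != key)
            i = (i + 1) % SIZE;
        keys[i] = key;
        values[i] = value;
    }

    static void delete(long key) {
        int i = hash(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) { keys[i] = TOMBSTONE; return; }
            i = (i + 1) % SIZE;
        }
    }

    static Integer get(long key) {
        int i = hash(key);
        while (keys[i] != EMPTY) {          // tombstones do NOT stop the search
            if (keys[i] == key) return values[i];
            i = (i + 1) % SIZE;
        }
        return null;
    }

    public static void main(String[] args) {
        long[] ks = {14913, 15930, 16918, 17020, 14920};
        int[]  vs = {20, 10, 50, 19, 40};
        for (int j = 0; j < ks.length; j++) put(ks[j], vs[j]);
        delete(17020);                       // tombstone at index 4
        System.out.println(get(14920));      // still found: 40
    }
}
```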
        
      

Hashing Functions for Objects[24] [top]

To use the ideas above, we first need to produce a hash code: an integer associated with the object to be inserted or searched for.

If the keys are already integers this is not a problem, but what about other types such as String?

In Java, every class type inherits the hashCode() method from Object.

Does this method need to be overridden? Why or why not?

As noted earlier, the String class does override hashCode().

The hashCode() function gives an integer. But this integer might be negative. So before using it as an index for the open addressing table or the table of chains, this value must be modified:

        int getIndex(Object x, int tblSize)
        {
          int location = x.hashCode() % tblSize;
          if ( location < 0 )
          {
            location += tblSize;
          }
          return location;
        }
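As an aside, since Java 8 the same nonnegative index can be computed with Math.floorMod (the example string below is a well-known one whose hashCode is negative):

```java
// Math.floorMod always returns a result with the sign of the divisor,
// so the manual "add the table size if negative" fix-up is unnecessary.
public class IndexDemo {
    static int getIndex(Object x, int tblSize) {
        return Math.floorMod(x.hashCode(), tblSize);
    }

    public static void main(String[] args) {
        // "polygenelubricants" has hashCode Integer.MIN_VALUE (negative)
        System.out.println(getIndex("polygenelubricants", 7));
    }
}
```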
      

Writing a Class for use with Hash tables (Java)[25] [top]

Consider a new class type that you will use as the key for insertions into a HashSet or HashMap. For example:

        public class TokenPos
        {
	   private int level;
	   private String token;
	   ...
	   public boolean equals(Object other) {...}
	   public int hashCode() {...}
	   ...
        }
      

How/why should we implement equals and hashCode for this class?

Writing the equals method[26] [top]

        public class TokenPos
        {
	   private int level;
	   private String token;
	   ...
           public boolean equals(Object other) {
             if ( ! (other instanceof TokenPos ) ) {
                return false;
             } else {
               TokenPos tp = (TokenPos) other;
               return token.equals(tp.token) && level == tp.level;
             }
           }
           ...
        }
    

Problem with equals and hashCode[27] [top]

Suppose we override the equals method from Object but do not override the hashCode method.

Consider this code:

      TokenPos t1, t2;
      HashSet<TokenPos> hs = new HashSet<TokenPos>();
      t1 = new TokenPos("count", 3);
      t2 = new TokenPos("count", 3);

      hs.add(t1);
      if ( hs.contains(t2) ) {
         System.out.println("YES");
      } else {
         System.out.println("NO");
      }
    

What will be printed? Why?

The contract between equals and hashCode[28] [top]

If equals is overridden by a class, then hashCode should also be overridden, and this condition (the contract) should be satisfied for any variables x and y of the class:

      if  x.equals(y) is true, then x.hashCode() should == y.hashCode()
    

Overriding hashCode[29] [top]

        public class TokenPos
        {
	   private int level;
	   private String token;
	   ...
           public boolean equals(Object other) {
             if ( ! (other instanceof TokenPos ) ) {
                return false;
             } else {
               TokenPos tp = (TokenPos) other;
               return token.equals(tp.token) && level == tp.level;
             }
           }

           public int hashCode() {
              return token.hashCode() + level;
           }
	   ...
        }
    

Many other possibilities for hashCode will satisfy the contract.
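For instance, one common pattern (shown here with the same illustrative TokenPos fields) multiplies by a prime, mirroring String.hashCode:

```java
// An alternative hashCode for the TokenPos sketch above, mixing the
// fields with the prime multiplier 31 (as String.hashCode does).
public class TokenPos {
    private int level;
    private String token;

    public TokenPos(String token, int level) {
        this.token = token;
        this.level = level;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof TokenPos)) return false;
        TokenPos tp = (TokenPos) other;
        return token.equals(tp.token) && level == tp.level;
    }

    @Override
    public int hashCode() {
        return 31 * token.hashCode() + level;
    }

    public static void main(String[] args) {
        TokenPos a = new TokenPos("count", 3);
        TokenPos b = new TokenPos("count", 3);
        // equal objects produce equal hash codes: the contract holds
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```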

Summary[30] [top]

Hash tables can provide O(1) time for these operations

  1. insert(x)
  2. delete(x)
  3. contains(x)

but O(1) performance requires a good hash function and a sufficiently low load factor.

The following operations are not well supported by hash tables:

  1. findMin() or findMax() - that is, find the minimum or maximum key
  2. print or iterate in sorted order

In fact, hash tables don't require that the keys be comparable, so the findMin and findMax functions are typically omitted.

Problems[31] [top]

  1. 3.4.1
  2. 3.4.5
  3. 3.4.10 (but with integer keys: 1, 9, 17, 2, 10 and a table of size 8) Use linear probing first, and then "Python" probing. What is the average number of probes to get each of these 5 keys after they have been inserted?