Sets and Maps
Hashing
Function Objects
Sorting Vectors and Deques
A hash table is conceptually a collection of (key, value) pairs.
The two main operations on a hash table h are insertion and lookup:
h.insert(pair(key, value));   or   h[key] = value;
value = h[key];
A hash table should guarantee that these two operations are very efficient (at least on average).
In particular, if the hash table contains N keys (and associated values) we want these operations to be O(1).
There are two main structures used for implementing a hash table.
Open addressing: using arrays
Separate chaining: an array of linked lists
To achieve O(1) time for put and get, a hash function is applied to the key to produce an index into the structure at which to store the (key, value) pair.
Keys will be product codes; values will be quantity on hand.
Pairs to be inserted into the hash table:
(14913, 20)  (15930, 10)  (16918, 50)  (17020, 19)
We will put these into a hash table using open addressing and storing the values in an array.
The success of hashing depends on how full the array is. As a rule of thumb, one should try to keep the array no more than 75% full.
We will use size 7 for this example.
The hash function should map each key to the array index range 0 - 6. An easy way to do this is to use the remainder on division by 7: hash(key) = key % 7;
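As a sketch, this hash function can be written directly in C++ (the table size of 7 and the name hashKey are just for this example):

```cpp
#include <cassert>

// Hash by remainder on division by the table size (7 in this example).
int hashKey(int key) {
    const int TABLE_SIZE = 7;
    return key % TABLE_SIZE;   // index in the range 0..6
}
```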
Inserting the first three pairs (14913, 20), (15930, 10), (16918, 50) we get:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6

Index  Entry
0      null
1      null
2      null
3      (14913, 20)
4      null
5      (15930, 10)
6      (16918, 50)
To lookup the quantity for product code 15930, we just apply the hash function again:
Lookup quantity for 15930:
hash(15930) = 15930 % 7 = 5
At index 5 we get (15930, 10), so the quantity is 10.
Now, when we try to insert the fourth pair, (17020, 19), we get a collision:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3   <-- collision

Index  Entry
0      null
1      null
2      null
3      (14913, 20)   <-- (17020, 19) also hashes here
4      null
5      (15930, 10)
6      (16918, 50)
One simple way to resolve collisions in an array implementation is to use linear probing. That is, just try the next position:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3

Index  Entry
0      null
1      null
2      null
3      (14913, 20)
4      (17020, 19)   <-- goes in next available spot
5      (15930, 10)
6      (16918, 50)

How many comparisons must be made now to look up key 14913? What about key 17020?

Where would (14920, 40) go? hash(14920) = 14920 % 7 = 3
Answer: position 0; "next available" just wraps around to the beginning of the array if necessary.
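The insertion procedure just described can be sketched as follows (a minimal open-addressing insert with linear probing; the table layout, the EMPTY_KEY sentinel of -1, and the function names are assumptions for this sketch, and the table-full check is omitted):

```cpp
#include <cassert>
#include <utility>
#include <vector>

const int TABLE_SIZE = 7;
const int EMPTY_KEY  = -1;   // sentinel marking an unused slot (assumption)

// The table: an array of (key, value) pairs, all slots initially empty.
std::vector<std::pair<int, int>> table(TABLE_SIZE, {EMPTY_KEY, 0});

// Insert with linear probing: start at hash(key) and advance one slot
// at a time, wrapping around to the start of the array, until an
// empty slot is found. Assumes the table is not full.
void insert(int key, int value) {
    int i = key % TABLE_SIZE;
    while (table[i].first != EMPTY_KEY) {
        i = (i + 1) % TABLE_SIZE;   // next position, with wraparound
    }
    table[i] = {key, value};
}
```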
So inserting all these values gives:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
hash(14920) = 14920 % 7 = 3

Index  Entry
0      (14920, 40)   *
1      null
2      null
3      (14913, 20)
4      (17020, 19)   *
5      (15930, 10)
6      (16918, 50)

Pairs marked * are not in their hashed positions.

Key     Probes to look up
14913   1
15930   1
16918   1
17020   2
14920   5
        --
Total   10

For the 5 keys in the table, the average number of probes (comparisons) to look up a value that is present is 10/5 = 2.
The mod operator does a reasonably good job of distributing random keys, and this helps avoid collisions.
When collisions occur, clusters can form so that instead of being able to put the key at the next position, many positions must be skipped over because they are occupied.
There are several techniques that help avoid clustering to some extent:
If a collision occurs for a key with hash(key) = k, the positions to try are (mod table size, of course):

Method             Next positions
linear probing     k + 1, k + 2, k + 3, ...
quadratic probing  k + 1, k + 4, k + 9, ...
double hashing     k + j, k + 2j, k + 3j, ...

(For double hashing, a second hash function, hash2, is used to compute the skip value, j = hash2(key).)
There are problems with all these.
Linear probing can cause clusters.
Quadratic probing avoids primary clustering, but can fail to find a spot if the table is more than 50% full.
Double hashing requires a second hash function and so is likely to be a bit slower to compute than quadratic probing.
The general formulation of which index to use for the i-th probe of value x is

    hash(x) + f(i)   for i = 0, 1, ...   where f(0) = 0.

Linear probing:    f(i) = i
Quadratic probing: f(i) = i^2
Double hashing:    f(i) = i * hash2(x), where hash2 is a second hash function.

This formulation tells us which location to probe on the i-th probe, relative to the original hash location. Alternatively, we can express each method in terms of how many locations to advance for each successive probe.
If the i-th probe has just been made at some location k, how much do we add to k to get the location for the (i+1)-th probe?
Answer: f(i+1) - f(i).

So, from the location of the i-th probe:

Probing method | f(i)         | Skip value f(i+1) - f(i) | Skip sequence
---------------|--------------|--------------------------|--------------
Linear         | i            | 1                        | 1, 1, 1, ...  (just skip to the next location)
Quadratic      | i^2          | 2i + 1                   | 1, 3, 5, 7, ...  (each skip is 2 more than the previous skip)
Double hashing | i * hash2(x) | hash2(x)                 | hash2(x), hash2(x), ...  (each skip is equal to hash2(x))
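The probe formulas above can be sketched in code; the function names and the fixed table size M = 7 are assumptions for illustration:

```cpp
#include <cassert>

const int M = 7;   // table size (assumption for this sketch)

// Index probed on the i-th attempt for a key whose home slot is
// home = hash(x): each method computes (home + f(i)) % M.
int probeLinear(int home, int i)            { return (home + i) % M; }
int probeQuadratic(int home, int i)         { return (home + i * i) % M; }
int probeDouble(int home, int i, int skip)  { return (home + i * skip) % M; }
// For double hashing, skip = hash2(x) for some second hash function hash2.
```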
For quadratic probing, the probe sequence can fail to find an empty entry even if empty entries exist.
This can be avoided provided the table size is a prime number and the table is no more than half full.
For double hashing, it must be guaranteed that the second hash function never returns 0 for any key.
Since the second hash function is only used to determine the skip value, a value of 0 would mean to not skip, but to keep trying the original location!
A typical hash function for strings (Horner's rule with multiplier 31 over the character codes):

int hash(const string& s) {
    int result = 0;
    for (int i = 0; i < s.size(); i++) {
        result = 31 * result + (int) s[i];
    }
    return result;
}
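For example (a self-contained sketch that restates the function so it can run on its own; the loop index uses string::size_type to avoid a signed/unsigned comparison warning):

```cpp
#include <cassert>
#include <string>
using std::string;

// The string hash: Horner's rule with multiplier 31 over the
// character codes of s.
int hash(const string& s) {
    int result = 0;
    for (string::size_type i = 0; i < s.size(); i++) {
        result = 31 * result + (int) s[i];
    }
    return result;
}
```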
For each hash value, keep a linked list of the pairs whose keys map to that value.
Using the same example as for linear probing:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
hash(14920) = 14920 % 7 = 3

Chain  Entries
0      null
1      null
2      null
3      --> (14913, 20) --> (17020, 19) --> (14920, 40)
4      null
5      --> (15930, 10)
6      --> (16918, 50)

Collisions are all on the same list. So looking up 14920 requires only 3 probes instead of 5; unlike linear probing, no keys that hash to a different value are involved.

Key     Probes to look up
14913   1
15930   1
16918   1
17020   2
14920   3
        --
Total   8

Average number of probes (comparisons) to look up a value that is present: 8/5 = 1.6.

In general, the number of probes for a successful search is about 1 + (λ/2), where the load factor for separate chaining is λ = N/M. In this case, λ = 5/7 and 1 + (λ/2) = 1.4 (to one decimal place). The number of probes for an insertion or an unsuccessful search is about λ.
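A minimal sketch of separate chaining for this example (the container layout and function names are assumptions, not a full hash table implementation):

```cpp
#include <cassert>
#include <list>
#include <utility>
#include <vector>

const int M = 7;   // number of chains

// Separate chaining: an array (vector) of linked lists of (key, value)
// pairs; each key goes on the chain at index key % M.
std::vector<std::list<std::pair<int, int>>> chains(M);

void insert(int key, int value) {
    chains[key % M].push_back({key, value});
}

// Number of probes (list nodes examined) to find key; 0 if absent.
int probesToFind(int key) {
    int probes = 0;
    for (const auto& p : chains[key % M]) {
        probes++;
        if (p.first == key) return probes;
    }
    return 0;
}
```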
If separate chaining is used for a hash table that contains N keys and M chains, and we want the average number of probes to be no more than 2, how many separate chains, M, should we have? The answer to this question depends on two factors: (1) the number of keys in the table (2) the hash function; that is, on how uniformly the keys are distributed among the chains.
What is the average length of the chains? For N keys and M chains, the average chain length is N/M This is the load factor of a hash table with separate chaining. All keys on a given chain have the same hash value. However, it is not necessarily true that the average number of probes is equal to the expressions just given.
Here the average chain length is N/M = 2 (N = 10 keys, M = 5 chains). But suppose the hash function is "bad": for example, suppose all the keys end in 0 and the hash function is just the key mod 5. Then all keys map to chain 0:

Chain  Length
0      10
1      0
2      0
3      0
4      0

The average chain length is 2. For a successful find:

Key    Probes
1st    1
2nd    2
3rd    3
...    ...
10th   10
       --
Total  55

Average number of probes for a successful find = 55/10 = 5.5.
Average number of probes for an unsuccessful find is 10 (again assuming all keys map to chain 0).
Here the average chain length is still N/M = 2. But now suppose the hash function is "good": for example, suppose the keys are uniformly distributed so that all chains have length 2. Then:

Chain  Length
0      2
1      2
2      2
3      2
4      2

The average chain length is 2, and for a successful find:

Key    Probes
1st    1
2nd    2
3rd    1
4th    2
...    ...
9th    1
10th   2
       --
Total  15

Average number of probes for a successful find = 15/10 = 1.5.
Suppose we reduce the load factor to be 1 instead of 2. What happens to the two previous examples: bad hash and good hash?
Here the average chain length is N/M = 1. With the "bad" hash function, increasing the number of chains (to M = 10) means:

Chain  Length
0      10
1      0
2      0
...    ...
9      0

The average chain length is now 1 instead of 2, but for a successful find:

Key    Probes
1st    1
2nd    2
3rd    3
...    ...
10th   10
       --
Total  55

The average number of probes for a successful find is still 55/10 = 5.5.
Here the average chain length is N/M = 1. With the "good" hash function, increasing the number of chains means:

Chain  Length
0      1
1      1
2      1
...    ...
9      1

The average chain length is now 1 instead of 2, and for a successful find:

Key    Probes
1st    1
2nd    1
3rd    1
...    ...
10th   1
       --
Total  10

The average number of probes for a successful find is slightly better: 10/10 = 1.
Generally, the load factor for separate chaining should be 1; that is, the number of chains should be about equal to the number of keys.

However, with a "bad" hash function, in the worst case (all keys colliding on one chain), the total number of probes for successful finds of N keys is

    1 + 2 + 3 + ... + N = N(N+1)/2

and so the average number of probes in this worst case for a successful find is (N+1)/2.

For a "good" hash function that uniformly distributes the keys over the chains, with a load factor of 1, the average number of probes for a successful find is about 1.
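The worst-case arithmetic can be checked directly (a trivial sketch; the function name is an assumption):

```cpp
#include <cassert>

// Worst case for a "bad" hash: all N keys land on one chain, so the
// total probes over all successful finds is 1 + 2 + ... + N = N(N+1)/2.
int worstCaseTotalProbes(int n) {
    return n * (n + 1) / 2;
}
```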
Suppose the table size in separate chaining is 8. If all the keys happen to be even, then using the hash function
If key is even, then hash(key) = key % tablesize = key % 8 will always be even also. So none of the odd indices will be used; that is, half the table entries will never be used.
In general, if all the keys and the tablesize have a common divisor, then key % tablesize will be divisible by that divisor too, and so the other indices will not be used. Moral: use a prime number for the tablesize.
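A quick way to see the effect (a sketch; the key range 0..100 is arbitrary):

```cpp
#include <cassert>
#include <set>

// Count how many distinct indices key % tablesize can produce when
// every key is even.
int distinctIndices(int tablesize) {
    std::set<int> used;
    for (int key = 0; key <= 100; key += 2) {   // even keys only
        used.insert(key % tablesize);
    }
    return (int) used.size();
}
```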
Deletions from a hash table with separate chaining are straightforward.
Open addressing is trickier. Consider deleting key 17020 from this hash table and then lookup 14920:
hash(14913) = 14913 % 7 = 3
hash(15930) = 15930 % 7 = 5
hash(16918) = 16918 % 7 = 6
hash(17020) = 17020 % 7 = 3
hash(14920) = 14920 % 7 = 3

Index  Entry
0      (14920, 40)   *
1      null
2      null
3      (14913, 20)
4      null          <-- (17020, 19) deleted
5      (15930, 10)
6      (16918, 50)

Pairs marked * are not in their hashed positions.

To look up 14920, we look for it at positions 3, 4, ... until it is found or a null entry is encountered. At position 4 we find null, so it appears that 14920 is not in the table. But it is. What is the problem?
To handle deletions, we need three possible states for each entry:
1. empty (available for insertions)
2. occupied
3. deleted (available for insertions)
Deleted entries are available for insertions; that is, they are treated just like empty positions for insertions.
Deleted entries are treated similarly to occupied entries for searches, in that the search must keep going.
What would happen if a hash table was filled completely and then each of the items deleted?
All the items would be marked deleted. So when would a search terminate?
When searching for a key that is not in the hash table, the search must continue until:
a. an empty entry is found, or
b. all entries have been examined.
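The three-state scheme can be sketched as follows (a minimal lazy-deletion table with linear probing; the names and layout are assumptions, and duplicate keys are not handled):

```cpp
#include <cassert>
#include <vector>

// Lazy deletion: every slot is EMPTY, OCCUPIED, or DELETED.
enum State { EMPTY, OCCUPIED, DELETED };

struct Slot {
    State state = EMPTY;
    int key = 0;
    int value = 0;
};

const int M = 7;
std::vector<Slot> table(M);

// Insert with linear probing; DELETED slots are reused just like
// EMPTY ones. Assumes the table is not full and keys are distinct.
void insert(int key, int value) {
    int i = key % M;
    while (table[i].state == OCCUPIED) {
        i = (i + 1) % M;
    }
    table[i].state = OCCUPIED;
    table[i].key = key;
    table[i].value = value;
}

// A search keeps probing past DELETED slots; it stops at an EMPTY
// slot or after every slot has been examined.
bool find(int key) {
    int i = key % M;
    for (int probes = 0; probes < M; probes++) {
        if (table[i].state == EMPTY) return false;
        if (table[i].state == OCCUPIED && table[i].key == key) return true;
        i = (i + 1) % M;
    }
    return false;
}

// Delete by marking the slot DELETED rather than emptying it.
void erase(int key) {
    int i = key % M;
    for (int probes = 0; probes < M; probes++) {
        if (table[i].state == EMPTY) return;
        if (table[i].state == OCCUPIED && table[i].key == key) {
            table[i].state = DELETED;
            return;
        }
        i = (i + 1) % M;
    }
}
```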
Comparing the data types below, with the assumption that collisions do not degrade the hash table:

Time requirement of each operation if the ADT contains N elements:

Operation           HashTable    AvlTree
insert(x)           1            ?
find(x)             1            ?
enumerate in order  N * log(N)   ?
findMax()           N            ?
findMin()           N            ?
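The C++ standard library mirrors this comparison: std::unordered_map is a hash table, while std::map is a balanced search tree kept in key order (red-black rather than AVL, but with the same O(log N) guarantees):

```cpp
#include <cassert>
#include <map>
#include <unordered_map>

// std::unordered_map: O(1) average insert/find, but no ordering of keys.
// std::map: O(log N) operations, keys kept in sorted order, so in-order
// enumeration, findMin, and findMax are cheap.
std::unordered_map<int, int> hashTable = {{14913, 20}, {15930, 10}, {16918, 50}};
std::map<int, int>           avlLike   = {{14913, 20}, {15930, 10}, {16918, 50}};
```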