CSC403 Nov14

slide version

single file version

Contents

  1. Sorting by Comparison of Elements
  2. Bubble Sort for an array of size 3
  3. Bubble Sort Decision Tree
  4. Minimum number of Comparisons
  5. Minimum height of a binary tree with L leaves
  6. log(N!)
  7. Sorting without Comparisons
  8. Key Index Counting Example
  9. Compute the Counts
  10. Compute the Index
  11. Insert the Elements
  12. Insert the Elements (next 2 elements)
  13. Result after the First Pass
  14. The count after the Second Pass
  15. LSD - Sorting Strings
  16. Sorting Variable Length Strings
  17. LSD Problem
  18. MSD Algorithm
  19. MSD Problem
  20. Problems
  21. Trie Data Structure
  22. The Node class for TrieST
  23. Trie insertion
  24. Trie put Method
  25. Performance for Trie Implemented Symbol Tables
  26. Symbol Table Implementations
  27. Frequency Counter Execution I
  28. Frequency Counter Execution II
  29. Additional Trie Methods
  30. The keys() method for TrieST
  31. The Trie get Method
  32. Problems

Sorting by Comparison of Elements[1] [top]

Every sorting algorithm that sorts by comparing elements with each other can be represented by a decision tree.

For example, consider the bubble sort method:

public class Bubble
{
  public static <E extends Comparable<E>> void sort(E[] a)
  {
    for(int len = a.length; len > 1; len--) {
      for(int i = 0; i < len - 1; i++) {
	if (a[i].compareTo(a[i + 1]) > 0) {
	  E tmp = a[i];
	  a[i] = a[i + 1];
	  a[i + 1] = tmp;
        }
      }
    }
}

Bubble Sort for an array of size 3[2] [top]

For simplicity, assume there are no duplicates and consider an array of size 3:

      A[0] = v1
      A[1] = v2
      A[2] = v3
    

where the three values v1, v2, v3 are the values 1, 2, 3 in some order

Bubble sort must sort them so that the result is

      A[0] = 1
      A[1] = 2
      A[2] = 3
    

for each possible ordering v1, v2, v3 of 1, 2, 3.

Which of the values {v1, v2, v3} are compared by the first comparison of Bubble.sort?

Bubble Sort Decision Tree[3] [top]

For A[0] = v1, A[1] = v1, A[2] = v1, bubble sort will make the comparisons indicated by :.

Bubble sort swaps values if they are out of order. This determines the next comparison.

Assumming no duplicates, there are only 2 possible results of each comparison. So we get a binary tree.

The red edges are never true, so some comparisons are not necessary.

However, we must have 6 leaf nodes corresponding to the 6 possible orderings of {v1, v2, v3}

Minimum number of Comparisons[4] [top]

What is the minimum number of comparisons needed to sort N items?

Bubble sort sometimes makes unnecessary comparisons.

So maybe the height is smaller for some other algorithms.

We need to know the minimum height of a decision tree knowing only that it must have N! leaves, but not assumming any particular sorting algorithm.

Minimum height of a binary tree with L leaves[5] [top]

  Height       maximum number of leaves            Tree
    0                   1                           x

    1                   2                           n
                                                   / \
                                                  x   x

    2                   4                            n
                                                   /   \
                                                  n     n
                                                 / \   / \
                                                x   x x   x

    3                   8                            ...

    h                   2h
   
    

So if a binary tree has L leaves, and is of height h, then L must be less than or equal to the maximum number of leaves possible for a tree of that height:

      
 L <= 2h

or
 
 log(L) <= h
    

If a binary tree has L leaves, the height must be at least log(L).

log(N!)[6] [top]

 log(N!) = log(1) + log(2) + ... + log(N)
	

If we keep only the last N/2 terms

 log(N!) > log(N/2) + log(N/2 + 1) + ... + log(N)
    

Now replace all the terms by log(N/2) and the sum is even smaller:

 log(N!) > log(N/2) + log(N/2) + ... + log(N/2) = 1/2 * Nlog(N/2)
    

Since log(N/2) = log(N) - 1,

      log(N!) > 1/2 * Nlog(N) - 
    

Sorting without Comparisons[7] [top]

The best sorting algorithm is said to be Nlog(N), but this is counting compares between elements.

Some sorting methods don't use any element comparisons!

One such method uses "key index counting".

Key Index Counting Example[8] [top]

Consider sorting the 3 digit numbers:

      
 412
 012
 101
 432
 302
 201
 410
    

The idea to make 3 passes (in this case, since these are 3 digit numbers) starting with the least significant digit (LSD) on the first pass.

After the k-th pass, the elements will be in sorted order if only the last k digits are considered.

Compute the Counts[9] [top]

On the k-th pass Set count[d + 1] = the number of occurrences of digit d in position k (starting at the least significant digit position - 0)

So count[1] = number of occurrences of digit 0.

This means count[0] will always be 0.

      a               count      
 0   412           0    0
 1   012           1    1
 2   101           2    2
 3   432           3    4
 4   302           4    0
 5   201           5    0
 6   410           6    0
		   .    .
		   .    .
		   9    0
    

Compute the Index[10] [top]

Now update count array so that count[d] is the index in the array where the first number with digit d should go in sorted order for the position for this pass.

Just update count like this starting with d = 0:

 count[d] = count[d] + count[d - 1]
    
                       old    updated
      a               count   count
 0   412           0    0       0
 1   012           1    1       1
 2   101           2    2       3
 3   432           3    4       7
 4   302           4    0       7
 5   201           5    0       7
 6   410           6    0       7
		   .    .       .
		   .    .       .
		   9    0       7
    

Note that index 7 is out of bounds. But for this pass the position is the least significant digit (position 0) and only the digits 0, 1, and 2 occur.

Insert the Elements[11] [top]

Copy the values from the array a to an auxilliary array, aux, at the index indicated by the count array and update that index for the digit position for this pass.

                       old    updated
      a               count   count           aux
 0   412           0    0       0	  0   
 1   012           1    1       1	  1   
 2   101           2    2       3	  2   
 3   432           3    4       7	  3   412
 4   302           4    0       7	  4   
 5   201           5    0       7	  5   
 6   410           6    0       7	  6   
		   .    .       .
		   .    .       .
		   9    0       7
    

412 is copied to the position indicated by count[2], which is position 3.

count[2] should now be incremented to 4, to be ready for the next element whose digit is 2.

Insert the Elements (next 2 elements)[12] [top]

Copy the values from the array a to an auxilliary array, aux, at the index indicated by the count array and update that index for the digit position for this pass.

                       old    updated
      a               count   count           aux
 0   412           0    0       0	  0   
 1   012           1    1       12	  1   101
 2   101           2    2       345	  2   
 3   432           3    4       7	  3   412
 4   302           4    0       7	  4   012
 5   201           5    0       7	  5   
 6   410           6    0       7	  6   
		   .    .       .
		   .    .       .
		   9    0       7
    

Result after the First Pass[13] [top]

              old    updated
      a               count   count           aux
 0   412           0    0       01	  0   410
 1   012           1    1       123	  1   101
 2   101           2    2       34567	  2   201
 3   432           3    4       7	  3   412
 4   302           4    0       7	  4   012
 5   201           5    0       7	  5   432
 6   410           6    0       7	  6   302
		   .    .       .
		   .    .       .
		   9    0       7
    

For the last step, copy aux back into a.

The count after the Second Pass[14] [top]

              
      a               count   
 0   410           0    0     
 1   101           1    3      
 2   201           2    3
 3   412           3    0     
 4   012           4    1
 5   432           5    0     
 6   302           6    0     
		   .    .     
		   .    .     
		   9    0     
    

LSD - Sorting Strings[15] [top]

Instead of 10 digits, LSD uses 256 possible characters for each position of Strings.

It is appropriate for sorting strings that are all the same length; that is, for strings of fixed length

A bigger the character set, will require a bigger count array.

LSD uses on the order of

      7WN + 3WR
    

array accesses where

 W is the (fixed) length of the strings
 N is the number of strings to sort
 R is the number of characters in the character set.
    

For N social security numbers without dashes

 W = 9
 R = ?

So

 7WN + 3WR = 63N + 270    (for R = 10; just the digits 0 - 9)

 7WN + 3WR = 63N + 6912   (for R = 256; all 256 ascii characters)
    

Is LSD faster than Java's Arrays.sort for Strings?

Sorting Variable Length Strings[16] [top]

MSD (most significant 'digit') sorts variable length strings by considering the most significant character (left most) first.

Read section 4.1 for LSD details and for MSD to be discussed next time.

But is MSD faster than Arrays.sort for Strings?

LSD Problem[17] [top]

For the following fixed length keys with alphabet

      {a, b, c, d, e, f, g}
    

use the LSD algorithm to sort on the least significant digit.

(Should repeat for the next position)

      
      ca
      bd
      ef
      ga
      ag
      ba
      
    

MSD Algorithm[18] [top]

Sorts on the most significant character first!

Result is groups with the same first character, but within each group, not sorted.

So in EACH group do it again.

How many groups? Potentially as many groups as possible characters; e.g 256 for ASCII.

MSD Problem[19] [top]

For the following fixed length keys with alphabet

      {a, b, c, d, e, f, g}
    

use the LSD algorithm to sort on the least significant digit.

(Should repeat for the next position)

      
      ace
      ab
      b
      cg
      cd
      c
    

Problems[20] [top]

Not turned in!

Trie Data Structure[21] [top]

The Node class for TrieST[22] [top]

The private Node class


private static class Node<V> 
{
  private V val;
  private final Node<V>[] next = new Node[R]; // e.g. R = 256
}

Trie insertion[23] [top]

Insert these strings into a trie

      she sells seashells by the shore
    

How do you search for a string?

How long is a path for a hit? (i.e., key is found)

How long is a path for a miss?

Trie put Method[24] [top]

public void put(String key, V val) {
  root = put(root, key, val, 0);
}

private Node<V> put(Node<V> x, String key, V val, int d) {
  if (x == null) x = new Node<V>();
  if (d == key.length()) {
    x.val = val;
    return x;
  }
  char c = key.charAt(d);
  x.next[c] = put(x.next[c], key, val, d+1);
  return x;
}

Performance for Trie Implemented Symbol Tables[25] [top]

Try the Frequency Counter Program with different symbol table implementations.

Input Files
Alphabets

Symbol Table Implementations[26] [top]

Frequency Counter Execution I[27] [top]

Input File: leipzig1M.txt - has long words

Minimum word length: 8 (more than 3,000,000 of these)

Words are sequence of letters (converted to lower case).

So input contains long words.

Symbol Table Time to 'get' every distinct word.
BST
RedBlackBST
TrieST (Alphabet size: 256)
LinearProbingHashST
LowerCaseTrieST (Alphabet size: 26)

Frequency Counter Execution II[28] [top]

Input File: largeT.txt - has words of length 6

Minimum word length: 0 (1,000,000 such words)

Words are sequence of digits.

So input contains long words.

Symbol Table Time to 'get' every distinct word.
BST
RedBlackBST
TrieST (Alphabet size: 256)
LinearProbingHashST
DigitsTrieST (Alphabet size: 10)

Additional Trie Methods[29] [top]

The keys() method for TrieST[30] [top]

A String key in a TrieST is not stored in in a single node.

The node structure does not contain a String (or a char) member!

So iterating through the Strings (in alphabetical order) is a bit different.


public Iterable<String> keys()
{
  Queue<String> q = new Queue<String>();
  collect(root, "", q);
  return q;
}

private void collect(Node<V> x, String key, Queue<String> queue) 
{
  if (x == null) return;

  if (x.val != null) {
    queue.enqueue(key);
  }

  for (int c = 0; c < TrieST.R; c++) {
    collect(x.next[c], key + (char) c, queue);
  }
}

Egad! How long does this take to execute?

The Trie get Method[31] [top]

public V get(final String key) {
  final Node<V> x = get(root, key, 0);
  if (x == null) return null;
  return x.val;
}

private Node<V> get(final Node<V> x, final String key, final int d) {
  if (x == null) return null;
  if (d == key.length()) return x;
  final char c = key.charAt(d);
  return get(x.next[c], key, d+1);
}

Problems[32] [top]