Every sorting algorithm that sorts by comparing elements with each other can be represented by a decision tree.

For example, consider the bubble sort method:

public class Bubble
{
  public static <E extends Comparable<E>> void sort(E[] a)
  {
    for(int len = a.length; len > 1; len--) {
      for(int i = 0; i < len - 1; i++) {
	if (a[i].compareTo(a[i + 1]) > 0) {
	  E tmp = a[i];
	  a[i] = a[i + 1];
	  a[i + 1] = tmp;
        }
      }
    }
}

Bubble Sort for an array of size 3 [2] [top]

For simplicity, assume there are no duplicates and consider an array of size 3:

      A[0] = v₁
      A[1] = v₂
      A[2] = v₃

where the three values v₁, v₂, v₃ are the values 1, 2, 3 in some order

Bubble sort must sort them so that the result is

      A[0] = 1
      A[1] = 2
      A[2] = 3

for each possible ordering v₁, v₂, v₃ of 1, 2, 3.

Which of the values {v₁, v₂, v₃} are compared by the first comparison of Bubble.sort?

Bubble Sort Decision Tree [3] [top]

For A[0] = v₁, A[1] = v₁, A[2] = v₁, bubble sort will make the comparisons indicated by :.

Bubble sort swaps values if they are out of order. This determines the next comparison.

Assumming no duplicates, there are only 2 possible results of each comparison. So we get a binary tree.

The red edges are never true, so some comparisons are not necessary.

However, we must have 6 leaf nodes corresponding to the 6 possible orderings of {v₁, v₂, v₃}

Minimum number of Comparisons [4] [top]

What is the minimum number of comparisons needed to sort N items?

Any sorting algorithm that uses comparisons corresponds to a decision tree.
The decision tree must have N! leaves corresponding to all the possible orderings of N items.
The height of the tree gives the most number of comparisons the algorithm will have to make.

Bubble sort sometimes makes unnecessary comparisons.

So maybe the height is smaller for some other algorithms.

We need to know the minimum height of a decision tree knowing only that it must have N! leaves, but not assumming any particular sorting algorithm.

Minimum height of a binary tree with L leaves [5] [top]

  Height       maximum number of leaves            Tree
    0                   1                           x

    1                   2                           n
                                                   / \
                                                  x   x

    2                   4                            n
                                                   /   \
                                                  n     n
                                                 / \   / \
                                                x   x x   x

    3                   8                            ...

    h                   2^h

So if a binary tree has L leaves, and is of height h, then L must be less than or equal to the maximum number of leaves possible for a tree of that height:

      
 L <= 2^h

or
 
 log(L) <= h

If a binary tree has L leaves, the height must be at least log(L).

log(N!)[6] [top]

 log(N!) = log(1) + log(2) + ... + log(N)

If we keep only the last N/2 terms

 log(N!) > log(N/2) + log(N/2 + 1) + ... + log(N)

Now replace all the terms by log(N/2) and the sum is even smaller:

 log(N!) > log(N/2) + log(N/2) + ... + log(N/2) = 1/2 * Nlog(N/2)

Since log(N/2) = log(N) - 1,

      log(N!) > 1/2 * Nlog(N) -

Sorting without Comparisons [7] [top]

The best sorting algorithm is said to be Nlog(N), but this is counting compares between elements.

Some sorting methods don't use any element comparisons!

One such method uses "key index counting".

Key Index Counting Example [8] [top]

Consider sorting the 3 digit numbers:

The idea to make 3 passes (in this case, since these are 3 digit numbers) starting with the least significant digit (LSD) on the first pass.

After the k-th pass, the elements will be in sorted order if only the last k digits are considered.

Compute the Counts [9] [top]

On the k-th pass Set count[d + 1] = the number of occurrences of digit d in position k (starting at the least significant digit position - 0)

So count[1] = number of occurrences of digit 0.

This means count[0] will always be 0.

      a               count      
 0   412           0    0
 1   012           1    1
 2   101           2    2
 3   432           3    4
 4   302           4    0
 5   201           5    0
 6   410           6    0
		   .    .
		   .    .
		   9    0

Compute the Index [10] [top]

Now update count array so that count[d] is the index in the array where the first number with digit d should go in sorted order for the position for this pass.

Just update count like this starting with d = 0:

 count[d] = count[d] + count[d - 1]

                       old    updated
      a               count   count
 0   412           0    0       0
 1   012           1    1       1
 2   101           2    2       3
 3   432           3    4       7
 4   302           4    0       7
 5   201           5    0       7
 6   410           6    0       7
		   .    .       .
		   .    .       .
		   9    0       7

Note that index 7 is out of bounds. But for this pass the position is the least significant digit (position 0) and only the digits 0, 1, and 2 occur.

Insert the Elements [11] [top]

Copy the values from the array a to an auxilliary array, aux, at the index indicated by the count array and update that index for the digit position for this pass.

                       old    updated
      a               count   count           aux
 0   412           0    0       0	  0   
 1   012           1    1       1	  1   
 2   101           2    2       3	  2   
 3   432           3    4       7	  3   412
 4   302           4    0       7	  4   
 5   201           5    0       7	  5   
 6   410           6    0       7	  6   
		   .    .       .
		   .    .       .
		   9    0       7

412 is copied to the position indicated by count[2], which is position 3.

count[2] should now be incremented to 4, to be ready for the next element whose digit is 2.

Insert the Elements (next 2 elements)[12] [top]

Copy the values from the array a to an auxilliary array, aux, at the index indicated by the count array and update that index for the digit position for this pass.

                       old    updated
      a               count   count           aux
 0   412           0    0       0	  0   
 1   012           1    1       12	  1   101
 2   101           2    2       345	  2   
 3   432           3    4       7	  3   412
 4   302           4    0       7	  4   012
 5   201           5    0       7	  5   
 6   410           6    0       7	  6   
		   .    .       .
		   .    .       .
		   9    0       7

Result after the First Pass [13] [top]

              old    updated
      a               count   count           aux
 0   412           0    0       01	  0   410
 1   012           1    1       123	  1   101
 2   101           2    2       34567	  2   201
 3   432           3    4       7	  3   412
 4   302           4    0       7	  4   012
 5   201           5    0       7	  5   432
 6   410           6    0       7	  6   302
		   .    .       .
		   .    .       .
		   9    0       7

For the last step, copy aux back into a.

The count after the Second Pass [14] [top]

              
      a               count   
 0   410           0    0     
 1   101           1    3      
 2   201           2    3
 3   412           3    0     
 4   012           4    1
 5   432           5    0     
 6   302           6    0     
		   .    .     
		   .    .     
		   9    0

LSD - Sorting Strings [15] [top]

Instead of 10 digits, LSD uses 256 possible characters for each position of Strings.

It is appropriate for sorting strings that are all the same length; that is, for strings of fixed length

A bigger the character set, will require a bigger count array.

LSD uses on the order of

      7WN + 3WR

array accesses where

 W is the (fixed) length of the strings
 N is the number of strings to sort
 R is the number of characters in the character set.

For N social security numbers without dashes

 W = 9
 R = ?

So

 7WN + 3WR = 63N + 270    (for R = 10; just the digits 0 - 9)

 7WN + 3WR = 63N + 6912   (for R = 256; all 256 ascii characters)

Is LSD faster than Java's Arrays.sort for Strings?

Sorting Variable Length Strings [16] [top]

MSD (most significant 'digit') sorts variable length strings by considering the most significant character (left most) first.

Read section 4.1 for LSD details and for MSD to be discussed next time.

But is MSD faster than Arrays.sort for Strings?

LSD Problem [17] [top]

For the following fixed length keys with alphabet

      {a, b, c, d, e, f, g}

use the LSD algorithm to sort on the least significant digit.

(Should repeat for the next position)

MSD Algorithm [18] [top]

Sorts on the most significant character first!

Result is groups with the same first character, but within each group, not sorted.

So in EACH group do it again.

How many groups? Potentially as many groups as possible characters; e.g 256 for ASCII.

MSD Problem [19] [top]

For the following fixed length keys with alphabet

      {a, b, c, d, e, f, g}

use the LSD algorithm to sort on the least significant digit.

(Should repeat for the next position)

Problems [20] [top]

Not turned in!

5.1.2 (LSD - fixed length)
5.1.3 (MSD - all same length)
5.1.5 (MSD - variable length)

Trie Data Structure [21] [top]

The Node class for TrieST [22] [top]

The private Node class


private static class Node<V> 
{
  private V val;
  private final Node<V>[] next = new Node[R]; // e.g. R = 256
}

Trie insertion [23] [top]

Insert these strings into a trie

      she sells seashells by the shore

How do you search for a string?

How long is a path for a hit? (i.e., key is found)

How long is a path for a miss?

Trie put Method [24] [top]

public void put(String key, V val) {
  root = put(root, key, val, 0);
}

private Node<V> put(Node<V> x, String key, V val, int d) {
  if (x == null) x = new Node<V>();
  if (d == key.length()) {
    x.val = val;
    return x;
  }
  char c = key.charAt(d);
  x.next[c] = put(x.next[c], key, val, d+1);
  return x;
}

Performance for Trie Implemented Symbol Tables [25] [top]

Effect of the alphabet size
Effect of string length of keys

Try the Frequency Counter Program with different symbol table implementations.

Input Files

leipzig1M.txt - has long words
largeT.txt - has only short words (6 characters)

Alphabets

Extended Ascii - 256 characters
Lower case letters - 26
Digits - 10

Symbol Table Implementations [26] [top]

BST
RedBlackBST
TrieST (with Extended Ascii)
LinearProbingHashST
LowerCaseTrieST (lower case letters)
DigitTrieST (digits)

Frequency Counter Execution I [27] [top]

Input File: leipzig1M.txt - has long words

Minimum word length: 8 (more than 3,000,000 of these)

Words are sequence of letters (converted to lower case).

So input contains long words.

Symbol Table	Time to 'get' every distinct word.
BST
RedBlackBST
TrieST (Alphabet size: 256)
LinearProbingHashST
LowerCaseTrieST (Alphabet size: 26)

Frequency Counter Execution II [28] [top]

Input File: largeT.txt - has words of length 6

Minimum word length: 0 (1,000,000 such words)

Words are sequence of digits.

So input contains long words.

Symbol Table	Time to 'get' every distinct word.
BST
RedBlackBST
TrieST (Alphabet size: 256)
LinearProbingHashST
DigitsTrieST (Alphabet size: 10)

Additional Trie Methods [29] [top]

Iterable<String> keysWithPrefix(String pre)
Iterable<String> keysThatMatch(String pattern)

The keys() method for TrieST [30] [top]

A String key in a TrieST is not stored in in a single node.

The node structure does not contain a String (or a char) member!

So iterating through the Strings (in alphabetical order) is a bit different.


public Iterable<String> keys()
{
  Queue<String> q = new Queue<String>();
  collect(root, "", q);
  return q;
}

private void collect(Node<V> x, String key, Queue<String> queue) 
{
  if (x == null) return;

  if (x.val != null) {
    queue.enqueue(key);
  }

  for (int c = 0; c < TrieST.R; c++) {
    collect(x.next[c], key + (char) c, queue);
  }
}

Egad! How long does this take to execute?

The Trie get Method [31] [top]

public V get(final String key) {
  final Node<V> x = get(root, key, 0);
  if (x == null) return null;
  return x.val;
}

private Node<V> get(final Node<V> x, final String key, final int d) {
  if (x == null) return null;
  if (d == key.length()) return x;
  final char c = key.charAt(d);
  return get(x.next[c], key, d+1);
}

CSC403 Nov14

slide version

single file version

Contents

Sorting by Comparison of Elements [1] [top]

Bubble Sort for an array of size 3 [2] [top]

Bubble Sort Decision Tree [3] [top]

Minimum number of Comparisons [4] [top]

Minimum height of a binary tree with L leaves [5] [top]

log(N!)[6] [top]

Sorting without Comparisons [7] [top]

Key Index Counting Example [8] [top]

Compute the Counts [9] [top]

Compute the Index [10] [top]

Insert the Elements [11] [top]

Insert the Elements (next 2 elements)[12] [top]

Result after the First Pass [13] [top]

The count after the Second Pass [14] [top]

LSD - Sorting Strings [15] [top]

Sorting Variable Length Strings [16] [top]

LSD Problem [17] [top]

MSD Algorithm [18] [top]

MSD Problem [19] [top]

Problems [20] [top]

Trie Data Structure [21] [top]

The Node class for TrieST [22] [top]

Trie insertion [23] [top]

Trie put Method [24] [top]

Performance for Trie Implemented Symbol Tables [25] [top]

Input Files

Alphabets

Symbol Table Implementations [26] [top]

Frequency Counter Execution I [27] [top]

Frequency Counter Execution II [28] [top]

Additional Trie Methods [29] [top]

The keys() method for TrieST [30] [top]

The Trie get Method [31] [top]

Problems [32] [top]

CSC403 Nov14

slide version

single file version

Contents

Sorting by Comparison of Elements[1] [top]

Bubble Sort for an array of size 3[2] [top]

Bubble Sort Decision Tree[3] [top]

Minimum number of Comparisons[4] [top]

Minimum height of a binary tree with L leaves[5] [top]

log(N!)[6] [top]

Sorting without Comparisons[7] [top]

Key Index Counting Example[8] [top]

Compute the Counts[9] [top]

Compute the Index[10] [top]

Insert the Elements[11] [top]

Insert the Elements (next 2 elements)[12] [top]

Result after the First Pass[13] [top]

The count after the Second Pass[14] [top]

LSD - Sorting Strings[15] [top]

Sorting Variable Length Strings[16] [top]

LSD Problem[17] [top]

MSD Algorithm[18] [top]

MSD Problem[19] [top]

Problems[20] [top]

Trie Data Structure[21] [top]

The Node class for TrieST[22] [top]

Trie insertion[23] [top]

Trie put Method[24] [top]

Performance for Trie Implemented Symbol Tables[25] [top]

Input Files

Alphabets

Symbol Table Implementations[26] [top]

Frequency Counter Execution I[27] [top]

Frequency Counter Execution II[28] [top]

Additional Trie Methods[29] [top]

The keys() method for TrieST[30] [top]

The Trie get Method[31] [top]

Problems[32] [top]

Sorting by Comparison of Elements [1] [top]

Bubble Sort for an array of size 3 [2] [top]

Bubble Sort Decision Tree [3] [top]

Minimum number of Comparisons [4] [top]

Minimum height of a binary tree with L leaves [5] [top]

Sorting without Comparisons [7] [top]

Key Index Counting Example [8] [top]

Compute the Counts [9] [top]

Compute the Index [10] [top]

Insert the Elements [11] [top]

Result after the First Pass [13] [top]

The count after the Second Pass [14] [top]

LSD - Sorting Strings [15] [top]

Sorting Variable Length Strings [16] [top]

LSD Problem [17] [top]

MSD Algorithm [18] [top]

MSD Problem [19] [top]

Problems [20] [top]

Trie Data Structure [21] [top]

The Node class for TrieST [22] [top]

Trie insertion [23] [top]

Trie put Method [24] [top]

Performance for Trie Implemented Symbol Tables [25] [top]

Symbol Table Implementations [26] [top]

Frequency Counter Execution I [27] [top]

Frequency Counter Execution II [28] [top]

Additional Trie Methods [29] [top]

The keys() method for TrieST [30] [top]

The Trie get Method [31] [top]

Problems [32] [top]