Every sorting algorithm that sorts by comparing elements with
each other can be represented by a decision tree.
For example, consider the bubble sort method:
public class Bubble
{
public static <E extends Comparable<E>> void sort(E[] a)
{
for(int len = a.length; len > 1; len--) {
for(int i = 0; i < len - 1; i++) {
if (a[i].compareTo(a[i + 1]) > 0) {
E tmp = a[i];
a[i] = a[i + 1];
a[i + 1] = tmp;
}
}
}
}
For simplicity, assume there are no duplicates and consider an
array of size 3:
A[0] = v1
A[1] = v2
A[2] = v3
where the three values v1, v2,
v3 are the values 1, 2, 3 in some order
Bubble sort must sort them so that the result is
A[0] = 1
A[1] = 2
A[2] = 3
for each possible ordering v1, v2,
v3 of 1, 2, 3.
Which of the values {v1, v2,
v3} are compared by the first comparison of Bubble.sort?
For A[0] = v1, A[1] = v1, A[2] =
v1, bubble sort will make the comparisons indicated
by :.
Bubble sort swaps values if
they are out of order. This determines the next comparison.
Assumming no duplicates, there are only 2 possible results of
each comparison. So we get a binary tree.
The red edges are never true, so
some comparisons are not necessary.
However, we must have 6 leaf nodes corresponding to the 6
possible orderings of {v1, v2, v3}

What is the minimum number of comparisons needed to sort N
items?
Any sorting algorithm that uses comparisons corresponds to a
decision tree.
The decision tree must have N! leaves corresponding to all
the possible orderings of N items.
-
The height of the tree gives the most number of comparisons the
algorithm will have to make.
Bubble sort sometimes makes unnecessary comparisons.
So maybe the height is smaller for some other algorithms.
We need to know the minimum height of a decision tree knowing
only that it must have N! leaves, but not assumming any particular
sorting algorithm.
Height maximum number of leaves Tree
0 1 x
1 2 n
/ \
x x
2 4 n
/ \
n n
/ \ / \
x x x x
3 8 ...
h 2h
So if a binary tree has L leaves, and is of height h, then L
must be less than or equal to the maximum number of leaves
possible for a tree of that height:
L <= 2h
or
log(L) <= h
If a binary tree has L leaves, the height
must be at least log(L).
log(N!) = log(1) + log(2) + ... + log(N)
If we keep only the last N/2 terms
log(N!) > log(N/2) + log(N/2 + 1) + ... + log(N)
Now replace all the terms by log(N/2) and the sum is even
smaller:
log(N!) > log(N/2) + log(N/2) + ... + log(N/2) = 1/2 * Nlog(N/2)
Since log(N/2) = log(N) - 1,
log(N!) > 1/2 * Nlog(N) -
The best sorting algorithm is said to be Nlog(N), but this is
counting compares between elements.
Some sorting methods don't use any element comparisons!
One such method uses "key index counting".
Consider sorting the 3 digit numbers:
412
012
101
432
302
201
410
The idea to make 3 passes (in this case, since these are 3
digit numbers) starting with the least significant digit (LSD) on
the first pass.
After the k-th pass, the elements will be in sorted order if
only the last k digits are considered.
On the k-th pass Set count[d + 1] = the number of occurrences of digit d in
position k (starting at the least significant digit position - 0)
So count[1] = number of occurrences of digit 0.
This means count[0] will always be 0.
a count
0 412 0 0
1 012 1 1
2 101 2 2
3 432 3 4
4 302 4 0
5 201 5 0
6 410 6 0
. .
. .
9 0
Now update count array so that count[d] is the index in the
array where the first number with digit d should go in sorted
order for the position for this pass.
Just update count like this starting with d = 0:
count[d] = count[d] + count[d - 1]
old updated
a count count
0 412 0 0 0
1 012 1 1 1
2 101 2 2 3
3 432 3 4 7
4 302 4 0 7
5 201 5 0 7
6 410 6 0 7
. . .
. . .
9 0 7
Note that index 7 is out of bounds. But for this pass the
position is the least significant digit (position 0) and only the
digits 0, 1, and 2 occur.
Copy the values from the array a to an auxilliary array, aux, at the index
indicated by the count array and update that index for the digit
position for this pass.
old updated
a count count aux
0 412 0 0 0 0
1 012 1 1 1 1
2 101 2 2 3 2
3 432 3 4 7 3 412
4 302 4 0 7 4
5 201 5 0 7 5
6 410 6 0 7 6
. . .
. . .
9 0 7
412 is copied to the position indicated by count[2],
which is position 3.
count[2] should now be incremented to 4, to be ready for the next
element whose digit is 2.
Copy the values from the array a to an auxilliary array, aux, at the index
indicated by the count array and update that index for the digit
position for this pass.
old updated
a count count aux
0 412 0 0 0 0
1 012 1 1 12 1 101
2 101 2 2 345 2
3 432 3 4 7 3 412
4 302 4 0 7 4 012
5 201 5 0 7 5
6 410 6 0 7 6
. . .
. . .
9 0 7
old updated
a count count aux
0 412 0 0 01 0 410
1 012 1 1 123 1 101
2 101 2 2 34567 2 201
3 432 3 4 7 3 412
4 302 4 0 7 4 012
5 201 5 0 7 5 432
6 410 6 0 7 6 302
. . .
. . .
9 0 7
For the last step, copy aux back into a.
a count
0 410 0 0
1 101 1 3
2 201 2 3
3 412 3 0
4 012 4 1
5 432 5 0
6 302 6 0
. .
. .
9 0
Instead of 10 digits, LSD uses 256 possible characters for each position
of Strings.
It is appropriate for sorting strings that are all the same
length; that is, for strings of fixed length
A bigger the character set, will require a bigger count
array.
LSD uses on the order of
7WN + 3WR
array accesses where
W is the (fixed) length of the strings
N is the number of strings to sort
R is the number of characters in the character set.
For N social security numbers without dashes
W = 9
R = ?
So
7WN + 3WR = 63N + 270 (for R = 10; just the digits 0 - 9)
7WN + 3WR = 63N + 6912 (for R = 256; all 256 ascii characters)
Is LSD faster than Java's Arrays.sort for Strings?
MSD (most significant 'digit') sorts variable length strings by
considering the most significant character (left most) first.
Read section 4.1 for LSD details and for MSD to be discussed
next time.
But is MSD faster than Arrays.sort for Strings?
For the following fixed length keys with alphabet
{a, b, c, d, e, f, g}
use the LSD algorithm to sort on the least significant
digit.
(Should repeat for the next position)
ca
bd
ef
ga
ag
ba
Sorts on the most significant character first!
Result is groups with the same first character, but within each
group, not sorted.
So in EACH group do it again.
How many groups? Potentially as many groups as possible
characters; e.g 256 for ASCII.
For the following fixed length keys with alphabet
{a, b, c, d, e, f, g}
use the LSD algorithm to sort on the least significant
digit.
(Should repeat for the next position)
ace
ab
b
cg
cd
c
Not turned in!
- 5.1.2 (LSD - fixed length)
- 5.1.3 (MSD - all same length)
- 5.1.5 (MSD - variable length)
The private Node class
private static class Node<V>
{
private V val;
private final Node<V>[] next = new Node[R]; // e.g. R = 256
}
Insert these strings into a trie
she sells seashells by the shore
How do you search for a string?
How long is a path for a hit? (i.e., key is found)
How long is a path for a miss?
public void put(String key, V val) {
root = put(root, key, val, 0);
}
private Node<V> put(Node<V> x, String key, V val, int d) {
if (x == null) x = new Node<V>();
if (d == key.length()) {
x.val = val;
return x;
}
char c = key.charAt(d);
x.next[c] = put(x.next[c], key, val, d+1);
return x;
}
- Effect of the alphabet size
- Effect of string length of keys
Try the Frequency Counter Program with different symbol table
implementations.
Input Files
- leipzig1M.txt - has long words
- largeT.txt - has only short words (6 characters)
Alphabets
- Extended Ascii - 256 characters
- Lower case letters - 26
- Digits - 10
- BST
- RedBlackBST
- TrieST (with Extended Ascii)
- LinearProbingHashST
- LowerCaseTrieST (lower case letters)
- DigitTrieST (digits)
Input File: leipzig1M.txt - has long words
Minimum word length: 8 (more than 3,000,000 of these)
Words are sequence of letters (converted to lower case).
So input contains long words.
Symbol Table |
Time to 'get' every distinct word. |
BST |
|
RedBlackBST |
|
TrieST (Alphabet size: 256) |
|
LinearProbingHashST |
|
LowerCaseTrieST (Alphabet size: 26) |
|
Input File: largeT.txt - has words of length 6
Minimum word length: 0 (1,000,000 such words)
Words are sequence of digits.
So input contains long words.
Symbol Table |
Time to 'get' every distinct word. |
BST |
|
RedBlackBST |
|
TrieST (Alphabet size: 256) |
|
LinearProbingHashST |
|
DigitsTrieST (Alphabet size: 10) |
|
- Iterable<String> keysWithPrefix(String pre)
- Iterable<String> keysThatMatch(String pattern)
A String key in a TrieST is not stored in in a single node.
The node structure does not contain a String (or a char)
member!
So iterating through the Strings (in alphabetical order) is a
bit different.
public Iterable<String> keys()
{
Queue<String> q = new Queue<String>();
collect(root, "", q);
return q;
}
private void collect(Node<V> x, String key, Queue<String> queue)
{
if (x == null) return;
if (x.val != null) {
queue.enqueue(key);
}
for (int c = 0; c < TrieST.R; c++) {
collect(x.next[c], key + (char) c, queue);
}
}
Egad! How long does this take to execute?
public V get(final String key) {
final Node<V> x = get(root, key, 0);
if (x == null) return null;
return x.val;
}
private Node<V> get(final Node<V> x, final String key, final int d) {
if (x == null) return null;
if (d == key.length()) return x;
final char c = key.charAt(d);
return get(x.next[c], key, d+1);
}
- 5.2.3
- Write a a non-recursive get method for TrieST