Topics

 1. Graphs
    1.2.  Graph Definitions
    1.3.  Uses of Graphs
    1.4.  Representing Graphs as Data Structures
    1.5.  Complexity for graphs
    1.6.  Shortest Path - Unweighted Graph: BFS
    1.7.  Checking for a Connected Graph - DFS
    1.8.  Alternate Representations of Graphs
    1.9.  A Graph interface
    1.10. Some General Graph Properties
    1.11. Algorithms
    1.12. Number of edges in a connected graph
    1.13. Minimum Weight Path Algorithm (Correctness)

Summary Comparision of Balanced BSTs versus Hash Tables (Using separate chaining)

Both balanced binary search trees and hash tables provide efficient insertion, lookup and deletion. One would expect similarities in functionality, but some differences in performance.

There are some other differences as well. In what follows Tree will mean balanced binary search tree.

Tree implementation and efficiency

Trees

C++ set, map, multiset, and multimap
classes are all implemented as balanced binary search trees. 

Performance: This guarantees O(log(n)) performance for many
operations. 

Restriction: But note that this means there must be some
ordering on the keys for the purpose of insertions, finds,
etc. These classes provide two ways for the user to specify what
ordering it should use. Using the set class as an
example, 

   a. If a set is constructed with the
   no-argument constructor, set(), there is a
   default comparison used, namely <.
   This set  object will assume that the keys implement
   the < operation and will use it for determining the ordering 
   on the keys.

   b. If a set is constructed with a function
   object, then the set object will use that function
   object to determine the ordering on the keys.

Hash Table implementation and efficiency

Hash Table

A common implementation of a hash table uses an array of lists of key
value pairs.

    This implementation of a hash table is called "separate chaining". 

    That term comes from the fact that all the (key,value) pairs in
    any one list (also called a "chain" or a "bucket") have the same
    hash value for their keys. And also this means that (key,value)
    pairs on separate chains have different hash values. 

Performance: The insertion, find, etc. operations are O(1).

Restrictions: Note it is not assumed that there is any ordering
on the keys. However, the O(1) performance depends on
   
   a. the load factor, and also
   b. a good hash function which spreads the keys relatively
   uniformly over the chains. 

Notes:
1. The key to performance for the Hash Table is keeping the length of the
chains short and independent of N (i.e., chain length = O(1)).

2. The catch is that in order to keep the length of the chains short,
you must have lots of chains.

3. This is similar to how the height of the balanced tree determines the
efficiency of an avl tree. For a balanced binary search tree, the height
is O(log(n)). 

4. For a Hash Table, if the load factor (definition below)
determines the average chain length.

Load Factor for separate chaining hash tables

Load Factor

Load Factor: The load factor for a hash table using
separate chaining is largest ratio N/M allowed, where N is the number
of (key,value) pairs in the hash table and M is the number of chains. 

So the load factor is the average chain length. 

Capacity: The capacity of a Hash Table is defined to be M,
the number of chains. 

Growth: If insertion of an additional (key, value) pair will
make make N/M be larger than the chosen load factor, the capacity of
the Hash Table is roughly doubled in order to maintain the load factor. 

  This is somewhat similar to the way a Vector automatically grows. In
  the case of a Vector, all the elements in the Vector must be copied
  when the storage is doubled. 

Growth Costs: When a Hash Table must grow, in addition to
allocating the new memory, all the elements must be rehashed
and inserted into the correct chain. 

Default Load Factor: The default load factor for Hash Table is
0.75. So
  a. the number of keys, is not allowed to be more than 75%
  of the number of chains
  b. the average chain length is 0.75.

Setting the load factor and the initial capacity

Load Factor/Initial Capacity

The initial capacity and the load factor can be set when the Hash Table
is constructed. 

If the maximum capacity is known at construction time, it can be used
as the initial capacity and thus avoid having to grow and rehash. 

Restriction: On the other hand using a capacity that is much
larger than needed for the desired load factor is not good. 

The reason is that the time to iterate through the keys is
proportional to the capacity regardless of how many keys are in the
Hash Table.

Graph Definitions

Graph Definitions

Graph = (V,E) 
    V = a set of vertices
    E = a set of edges

Edge  = a pair (u,v) of vertices or 

Weight edge = pair of vertices with numeric weight: (u,v,wt)

A graph is directed if the edges are ordered pairs.

Vertex degree:
  For an undirected graph, the degree of a vertex is the number of
  edges that have the vertex as an "end point".

  For a directed graph, we can define both the "in degree" and the
  "out degree" of a vertex in an obvious way.

Digraph = a directed graph

Path = a sequence w₁, w₂, ... w_N
       such that (w_i, w_i+1) is in E
       for i = 1,2,...,N-1

Path length = number of edges (i.e., N-1); 
              A path of length 0 is allowed.

Loop = a path with one edge of the form (v,v). 
       Graphs will generally be assumed to exclude loops.

Simple Path = a path such that all vertices are distinct, except the
              first and last are allowed to be the same.

In a Directed Graph
  Cycle = a path of length at least 1 such that w₁ = w_N
  Simple Cycle = simple path that is also a cycle

In an Undirected Graph
  Cycle = a path of length at least 1 such that w₁ = w_N
          such that all edges are distinct.
  Simple Cycle = (same) a simple path that is also a cycle  

Directed Acyclic Graph (DAG) = A directed graph with no cycles.

Connected undirected graph 
       = There is a path from every vertex to every other vertex.

Strongly connected directed graph 
       = There is a path from every vertex to every other
         vertex. (same as connected for an undirected graph)

Weakly connected directed graph
       = A directed graph that is not strongly connected, but is
         connected when considered as an undirected graph.

Complete graph = There is an edge between every pair of vertices.

Properties of Graphs

If a graph has been generated to model some problem, we are often confronted with needing to know various properties of the particular graph. Some such properties include:

Questions: What properties does a give graph have?

   a. Are 2 particular vertices connected? is the graph connected?
   b. What is the shortest path between 2 given vertices
   c. What is the cheapest network (spanning tree)
   d. Are there any cycles?
   e. Is the graph planar? (Can it be drawn in 2 dimensions without
      crossing edges?

Uses of Graphs

Some uses of graphs to model:

Graphs used to model

   a networks (computer, transportation)
   b. circuit layout (cost as well as planar?)
   c. dependencies (any cycles?)
   d. manufacturing processes (critical path)
   e. games - each node is a state of the game (search)

The number of edges in a graph

In general the relation between the number of edges and the number of vertices in a graph can vary so that one cannot determine the number of edges by knowing the number of vertices.

Number of Edges for an Undirected Graph

  ∑ degree(w_i) = 2|E|

where the sum is over all vertices.

Number of Edges for a Directed Graph

   ∑ out_degree(w_i) = |E|

where the sum is again over all vertices.

Representing Graphs

There are two basic ways to represent directed graphs. Both ways also work for undirected graphs with no modification required (although some space could be eliminated).

Two Ways of Representing Graphs

 1. adjacency matrix. A |V| x |V| sized matrix, A

    For unweighted graph, 

	A[i][j] is 1 if there is an edge from vertex i to vetex j
                   0 otherwise.

   For weighted graph,

	A[i][j] is the weight of the edge if there is an edge from
                   vertex i to vetex j 
                   otherwise some chosen value not to be confused with
                   a weight (e.g., largest or smallest possible int
                   value)

   (Note for an undirected graph, the matrix A represents the same
    edge twice since A[i][j] will be equal to A[j][i].)



 2. adjacency list. An |V| sized array of lists. 

       A[i] = list of vertices j for which there is an edge from i to
              j.
  
    (Note for an undirected graph, every edge is represented twice.)

Complexity for Graphs

Some Complexity Requirements

1. Adjacency Matrix Representation
   space = O(|V|²)

   find all vertices adjacent to a given vertex w = O(|V|)

   The number of edges could range from 0 (no edges) to |V|(|V|+1)/2 (for
   a complete graph). 

   If there are few edges, many entries will be 0, but will still have
   to be examined when searching for adjacent vertices.

2. Adjacency List Representation
   space = O(|V| + |E|)

   find all vertices adjacent to a given vertex w = O(degree(w))
   where degree(w) is the (out) degree of vertex w.

   degree(w) <= |E| for every w. In fact, for an undirected graph

   ∑ degree(w_i) = 2|E|

   where the sum is over all the vertices. So the average degree of a
   vertex in an undirected graph will be

	2|E|/|V|

   For a complete graph with |E| = |V|(|V| + 1)/2,

        2|E|/|V| <= |V| + 1 = O(|V|) again.

   But taking the stree grid as an example with intersections as the
   vertices, the degree of each vertex would typically be 4,
   independent of the number of vertices.

   So if there are lots of edges, i.e.

	|E| = O(|V|²)

   then the adjancency matrix is appropriate.

   But in many applications, this is not the case and the adjacency
   list representation can perform better for operations that must
   examine adjacent vertices.

Shortest Path - Unweighted Graph

We usually want the shortest path in a weighted graph. However, for an unweighted graph the shortest path length from a given vertex to some target vertex can be found by performing a breadth first search.

That is, first visit all vertices that are on paths at length 1 away. In a tree this would be like first visiting all nodes that are at depth 1 from the root.

Then visit all vertices that on paths at length 2, etc. Stopping the first time the target vertex is reached will give the shortest path.

Conceptually a queue is needed, and we must "mark" each vertex as being visited. Below is pseudo code for breadth first search to visit every node (once) in a connected graph. If the graph is not connected, just find another vertex that is not marked, and call BFS for that vertex until all vertices are marked.

Breadth First Search starting a vertex u
  int BFS(vertex u)
  {
    queue q = new queue();
    mark(u);
    q.insert(u);
    while( !q.empty())
    {
      v = q.remove();
      visit(v);
      for each vertex w adjacent to v
      {
        if (not marked(w))
        {
          mark(w);	
          q.insert(w);
        }
      }
    }
  }

Connected Graphs and DFS

To check if a graph is connected (for any two vertices, there is a path from the first to the second), one can use the depth first search model. Starting at any vertex, a depth first search of a connected graph would reach every vertex.

Here is the basic model of the depth first search method. Variations on it provide solutions to many kinds of problems.

void DFS(vertex v)
{
    visit(v);
    foreach w adjacent to v
    {
      if ( !visited(w) ) {
          DFS(w);
    }
}

Topological Sort

This algorithm is applicable to a directed graph whose edges represent dependencies. It outputs the vertices in an order that respects the dependencies.

The idea is to first output vertices with no dependencies; that is, with in-degree = 0.

Once a vertex has been output, we can remove it from the graph and decrease the in-degree of all its adjacent vertices.

Repeat the process until no more vertices remain with in-degree 0.

If all vertices have been output, then the output order is one of perhaps several possible topological sorted orders.

If all vertices have not been output, then a cycle exists and topological sort is not possible.

Minimum Cost Spanning Tree

A minimum cost spanning tree of a weighted graph with N vertices is a connected subgraph with N-1 edges and no cycles connecting all vertices whose total weight is minimal among all such subgraphs.

An way to find a minimum cost spanning tree is to use a minPriorityQueue, to inserted the weighted edges.

Remove edges and see if the edge should be added to the solution. It should be unless it would cause a cycle. In that case just skip it. Stop when N-1 edges have been added.

An abstract Graph class

Here is a modest abstract Graph class:


template <typename vertexType>
class GraphADT
{
public: 
  virtual void addEdge(const vertexType& v1, const vertexType& v2) = 0;
  virtual int vertices() = 0;
  virtual int edges() = 0;

  virtual list<vertexType>::iterator dfs(const vertexType& start) = 0;
  virtual list<vertexType>::iterator bfs(const vertexType& vstart) = 0;
}

An Implementation of the abstract class GraphADT

Here is part of an con of the GraphADT interface:

template <typename vertexType>
class Graph : public GraphADT<vertexType>
{
  
  boolean directed;
  map<vertexType, list<vertexType> > adjList;
public:
  Graph()
  {
    directed = false;
  }
  Graph(bool dir)
  {
    directed = dir;
  }

  
  void addEdge(const vertexType& v1, const vertexType& v2)
    throws(invalid_argument)
  {
    addOneEdge(v1,v2);
    if ( !directed ) {
      addOneEdge(v2,v1);
    } else {
      addVertex(v2);
    }
  }

  int vertices() 
  {
    return adjList.size();
  }
  public int edges()
  {
    map<vertexType>::iterator it = adjList.begin();
    int e = 0;
    while( it != adjList.end() )
      {
	e += (it->second).size();
      }
    return e;
  }
  ... 

}

The AddOneEdge method

  bool addOneEdge(const vertexType& v1, const vertexType& v2)
  {
    // Preconditions: 
    //  v1 == v2 should be false (Loops not allowed)
    if ( v1 == v2) ) {
      throw 
	invalid_argument(
             "Edge cannot have equal endpoints.");   
    }
    bool added = true;
    if ( adjList.find(v1)) == adjList.end() ) {
       adjList[v1] = list<vertexType>();
       adjList[v1].push_back(v2);
    } else {
       list<vertexType>::iterator p = 
       find(adjList[v1].begin(), adjList[v1].end(), v1);
       if ( p == adjList[v1].end() ) {
          adjList[v1].push_back(v2);
       } else {
          added = false;
       }
    }
    return added;
  }

Graph class (continued)

The public dfs method
  iterator dfs(const vertexType& start)
  {
    list<vertexType> result;
    map<vertexType, bool> visited;
    if ( adjList.find(start) != adjList.end() ) {
      map<vertexType, list<vertexType> >::iterator it = adjList.begin();
      while( it != adjList.end() )
	{
	  visited[it->first] = false;
	}
      dfsList(result, visited, start);
    }
    return result.iterator();
  }

The private dfsList method

This method follows the general outline of the depth first search with these concrete changes:

 void dfsList(list<vertexType> r, map<vertexType, bool> visited,
const vertexType& start)


1. A vertex is of a type determined by the user, not necessarily an
integer. 

2. The map, visited, consists of (key, value) pairs where
key is a vertex and value is a bool.

3. start is the vertex where the depth first search
begins.

4. Visiting a vertex means adding the vertex to the list
r.

The bfs method

The bfs method follows the general outline of the (non-recursive) breadth first search model with concrete additions similar to the dfsList method

Number of edges in a connected graph

Property A. For a connected undirected graph G = (V, E) with |V|
vertices and |E|, 

           |E| >= |V| - 1.

This is easy if every degree(w_i) >= 2 for every vertex
w_i:

Recall:  2|E| = degree(w₁) + degree(w₂) + ... + degree(w_|V|)

With the assumption on the degrees, the right sides is >=

                2 + 2 + ... + 2 = 2|V|

So	
                2|E| >= 2|V|

and              |E| >= |V|

That is, the number of edges is >= the number of vertices in this
case.

Number of edges in a connected graph (cont.)

On the other hand, if a connected graph G = (V,E) has some vertex w with
degree(w) = 1, then removing w and the edge incident to w from G gives
a graph G1 = (V1, E1) with

(a)	|V1| = |V| - 1
(b)     |E1| = |E| - 1
(c)     G1 is connected

Assuming G1 satisfied property A, then we would have

        |E1| >= |V1| - 1

and this would show that G also has property A:

	|E| = |E1| + 1 >= (|V1| - 1) + 1 = |V1| = |V| - 1


Using these observations, one can easily prove property A by induction
on the number of vertices in the graph.

Minimum Weight Path Algorithm (Correctness)

Minimum cost path from some vertex u to all other verices

// For each vertex v, the minimum cost of all paths from u to v will
// be calculated and stored in d(v)

Initialize S, the "selected set of vertices" to be { u }
Set d(v) = weight of edge (u,v) or else INF if no such edge exists

for(int i = 1; i <= |V| - 1; i++)
{
  pick a w in V - S with minimum d(w)
  Add w to S
  for each v in V - S
  {
    if there is an edge (w,v) {
       d(v) = minimum( d(v),  d(w) + Weight of edge(w,v)  )
    }
  }
}

Correctness of the Algorithm

The algorithm correctly computes the minimum 
path weight from vertex u to all other vertices.

Let S_k be the set of vertices in S after the k-th iteration
of the loop.

Let d_k(v) be the value of the function d at vertex v,
computed after the k-th iteration of the loop.

The proof of correctness is by induction on the number of iterations.

Let P_k be the assertion:

   (a) For each v in S_k d_k(v) is the minimum
       cost of all possible paths in the graph from u to v.

   (b) For each w in V - S_k, d_k(w) is the
       minimum cost of all paths from u to w which lie entirely in S
       except for the end vertex w.

Basis Case

Basis case
We show that P_k is true for each k = 0, 1, 2, ..., |V|-1

(1) P₀. This refers to the S after 0 iterations; that is,
                   after the initialization only.

    (a) S₀ = { u } Since the only vertex in this set is u
        and d₀(u) is defined to be 0, (a) is correct.

    (b) For each w in V - S₀, the only possible path from u
        to w lying in S (except for w) would have to be an edge from u
        to w. Since d₀(w) is defined to be the weight of
        this edge, (b) is also correct.

Induction Step

Induction Step

(2) Now suppose that P_k is true for some k, 0<= k < |V|-1
    We need to show P_k+1 is also true.

    Let w_k+1 be the vertex chosen on the k+1-th iteration.
    Then S_k+1 = S_k ∪ { w_k+1}
   
    (a) By part (b) of P_k, d_k(w) was the minimum
        cost of all paths from u to w_k+1 lying in
        S_k. And the algorithm just sets d_k+1(w)
        equal to this same value since w was the chosen vertex on th
        k+1-th iteration. We must now show that this value is the
        smallest among all paths in the graph, not just the
        ones lying in S.

        But if there were a path from u to w_k+1 with
        smaller weight, that path would have to go through some vertex
        not in S_k+1. Let v be the first vertex along this
        path from u that is not in S_k+1. Then 

        d_k+1(w_k+1) = d_k(v) + Cost of path from v to w

        Since there are no negative weights, the cost of the path from
        v to w is > 0 and so  

            d_k(v) < d_k(w_k+1)

        But this contradicts the choice of w_k+1 on the
        k+t-th iteration. It was supposed to be the vertex with minimum
        d_k value.

        This contradiction shows that (a) is satisfied.

Induction Step (continued, part b)

Part (b)

    (b) If a vertex is in V - S_k+1, then it is also in V -
        S_k. So d_k(v) is already the smallest
        weight of all paths lying in S_k. We need to show
        that d_k+1(v) is the smallest weight of all paths
        lying in S_k+1.

        But a path lying in S_k+1 either contains
        w_k+1 or not. If it does not, then the path lies in
        S_k and we already know that the value
        d_k(v) is minimum for such paths.

        On the other hand, if the path lying in S_k+1 does
        contain w_k+1 and it has smaller weight, then it
        must be a path from u to w_k+1 followed by an edge from
        w_k+1 to v. (If it went back into S_k after
        passing through w_k+1 and then on to v, we could
        shorten the path by going directly to this vertex in
        S_k, omitting w_k+1) 

	The update calculation takes care of these two possibilites,
        to calculate the smaller value of either possible type of path:

	d_k+1(v) = 

        minimum( d_k(v), d_k(w_k+1) + cost of edge from w_k+1 to v)

        So (b) is also correct for P_k+1