Spring 2024

5/14/2024 updated

CSC 578 HW 5: Backprop Hyperparameters



Introduction

As in HW#3, you make required modifications to the NNDL book code and write a small application. The objective of the assignment is to enhance your understanding of some of the hyperparameters of neural networks.

The original NNDL code "network2.py" hard-codes several things, including the sigmoid activation function, L2 regularization, and the input data format (for MNIST). In this assignment, we make the code more general so that it implements various hyperparameters and accepts them as (literal) parameters of the network.

For the application code, you run systematic experiments that test various combinations of hyperparameter values.

Overview

You develop code following these steps:

  1. Run the startup code provided (a slightly modified NNDL network code and a test code) in Jupyter and ensure your environment is compatible.
  2. Make required modifications to the network code.
  3. Create your own Jupyter notebook application that uses your modified network code.

Part 1: Initial tests of application notebook

Download the network definition code NN578_network2.ipynb (html) and the test application code 578hw5_Check1.ipynb along with its html file (if you are using your own local machine, modify the ipynb files on your own).  You also need the data and network files: iris.csv, iris-423.dat, iris4-20-7-3.dat (the same files used in HW#3), and iris-train-2.csv and iris-test-2.csv (new in this assignment).  Run all cells in the network notebook and the test application notebook.  Execution should succeed, and you should see the same output as shown in the provided html.


Part 2: MODIFICATIONS to be made in the network code

Intro

For this part, you will need to make modifications to the Network Code, adding some hyperparameters and modifying some functions.

  • (A) Hyper-parameters:
    • Cost functions
      • QuadraticCost, CrossEntropy, LogLikelihood
    • Activation functions
      • Sigmoid, Tanh, ReLU, LeakyReLU, Softmax
    • Regularizations
      • L1 and L2
    • Dropout rate
  • (B) Functions to modify (minimally):
    • set_model_parameters()
    • feedforward()
    • backprop()
    • update_mini_batch()
    • total_cost()

Note that, depending on how you write your code, you may need to modify other functions to implement Dropout.

(A) Hyperparameters

Model architecture is passed in as sizes in the constructor as before.  Other parameters are set by two functions, set_model_parameters() and set_compile_parameters(), using values passed in as keyword arguments:

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network, for example [2, 3, 1].
        The biases and weights are initialized in a separate function.
        Model parameters are set here."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()

    def set_model_parameters(self, cost=CrossEntropyCost, act_hidden=Sigmoid,
                             act_output=None):
        self.cost=cost
        self.act_hidden = act_hidden
        if act_output == None:
            self.act_output = self.act_hidden
        else:
            self.act_output = act_output
        
    def set_compile_parameters(self, regularization=None, lmbda=0.0,
                               dropoutpercent=0.0):
        """Function for setting compilation hyperparameters."""
        self.regularization = regularization
        self.lmbda = lmbda
        self.dropoutpercent = dropoutpercent
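
For example, after constructing a network you configure it with these two functions. Here is a minimal usage sketch (assuming the [4, 20, 7, 3] architecture suggested by the provided iris4-20-7-3.dat file, the Tanh and Softmax activation classes you implement below, and the string form 'L2' for the regularization option):

    net = Network([4, 20, 7, 3])
    net.set_model_parameters(cost=CrossEntropyCost, act_hidden=Tanh, act_output=Softmax)
    net.set_compile_parameters(regularization='L2', lmbda=1.0, dropoutpercent=0.3)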

1. cost

  • This hyperparameter argument specifies the cost function.
  • Options are 'QuadraticCost', 'CrossEntropy' or 'LogLikelihood'.
  • Each one must be implemented as a Python class. The class-function scheme is explained later in this document (see the Class Function notes below).
  • The class should have two static functions: fn(), which computes the function itself (the cost, used during evaluation), and derivative(), which computes the function's derivative (used to compute the error during learning).

    1. QuadraticCost is fully implemented already in the starter code, as shown below (and you do not need to modify it).
    2. For CrossEntropy and LogLikelihood, partial/skeleton code is written. You replace the line 'pass' and write your code.
    class QuadraticCost(object):
        @staticmethod
        def fn(a, y):
            """Return the cost associated with an output ``a`` and desired output ``y``."""
            return 0.5*np.linalg.norm(y-a)**2
    
        @staticmethod
        def derivative(a, y):
            """Return the first derivative of the function."""
            return -(y-a)
    
    class CrossEntropyCost(object):
        @staticmethod
        def fn(a, y):
            """Return the cost associated with an output ``a`` and desired output
            ``y``.  Note that np.nan_to_num is used to ensure numerical
            stability.  In particular, if both ``a`` and ``y`` have a 1.0
            in the same slot, then the expression (1-y)*np.log(1-a)
            returns nan.  The np.nan_to_num ensures that that is converted
            to the correct value (0.0)."""
            return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))
    
        @staticmethod
        def derivative(a, y):
            """Return the first derivative of the function."""
            ###
            ### FILL IN HERE
            ###
            pass
  • All cost functions receive a (the activation) and y (the target output), which come from a single data instance and are represented as column vectors.
  • fn() returns a scalar, while derivative() returns a column vector (containing the cost derivative for each node in the output layer; no multiplication by the derivative of the activation function).
  • NOTES on LogLikelihood:
    1. You have written the function (fn()) in HW#3.
    2. For the derivative formula, see Slide 14 in Lecture note #4 (Optimizations).  Basically, you return a column vector of zeros except for the node whose target output is 1 -- its value should be -1/(the activation of that node).
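
A minimal sketch of that derivative, following the description above (the class name here mirrors QuadraticCost/CrossEntropyCost -- adjust it to match the skeleton code; np.argmax is used to find the target node, assuming y is a one-hot column vector):

    class LogLikelihoodCost(object):
        @staticmethod
        def derivative(a, y):
            """Column vector of zeros, except -1/a at the node whose target output is 1."""
            d = np.zeros(a.shape)
            idx = np.argmax(y)       # index of the node whose target output is 1
            d[idx] = -1.0 / a[idx]   # -1 / activation of that node
            return d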

2. act_hidden

  • This parameter specifies the activation function for the nodes on all hidden layers (EXCLUDING the output layer).
  • Parameter options are 'Sigmoid', 'Tanh', 'ReLU', 'LeakyReLU' and 'Softmax'.
  • Each one must be implemented as a Python class (see the Class Function notes below). The class should have two functions: a static method fn(), which computes the function itself (the node activation value), and a class method derivative(), which computes the function's derivative (used to compute the error during learning).

    1. Sigmoid is fully implemented already in the starter code, as shown below (and you do not need to modify it).
    2. Softmax is partially written. You must add fn().
    3. For Tanh, ReLU and LeakyReLU, skeleton code is written. You replace the line 'pass' and write your code.
    4. For LeakyReLU, look at the formula (of the function) on Slide 8 in Lecture note #4 (Optimizations).  Use alpha = 0.3 (which you can hard-code if you like).  Note that in the slide the variable z is meant to be a scalar, but in your code z (in both fn(z) and derivative(z)) is a vector/array.  One reference is TensorFlow's tf.keras.layers.LeakyReLU.  Figure out the derivative of the function by yourself as well.
    class Sigmoid(object):    
        @staticmethod
        def fn(z):
            """The sigmoid function."""
            return 1.0/(1.0+np.exp(-z))
    
        @classmethod
        def derivative(cls,z):
            """Derivative of the sigmoid function."""
            return cls.fn(z)*(1-cls.fn(z))
    
    class Softmax(object):
        @staticmethod
        # Parameter z is an array of shape (len(z), 1).
        def fn(z):
            """The softmax of vector z."""
            ###
            ### FILL IN HERE
            ###
            pass
    
        @classmethod
        def derivative(cls,z):
            """Derivative of the softmax.  
            IMPORTANT: The derivative is an N*N matrix.
            """
            a = cls.fn(z) # obtain the softmax vector
            return np.diagflat(a) - np.dot(a, a.T)
    

3. act_output

  • This parameter specifies the activation function for nodes on the output layer.
  • Parameter options are 'Sigmoid' and 'Softmax'.  For simplicity, you do not have to check whether other functions are passed in (in this assignment).

4. regularization

  • This parameter specifies the regularization method.
  • Parameter options are 'L2', 'L1' and None.
  • If the parameter value is None, you do NOT apply regularization at all.
  • The selected method is applied to all hidden layers and the output layer.
  • You can implement them in any way you like, for example as function classes or as inline if-else conditionals. For the definitions/formulas and explanation, see Slides 24, 25, 28, and 29 of Lecture note #4 (Optimizations).

    IMPORTANT NOTE: The start-up code has L2 hard-coded in (unchanged from the original NNDL code for this part). You make necessary changes to the code by yourself to incorporate the two methods.

  • The regularization is relevant at two places in the backprop algorithm:

    1. During training, when weights are adjusted at the end of a mini-batch – the function update_mini_batch().
    2. During evaluation, when the cost is computed – the function total_cost().

    NOTE: Both the L2 and L1 methods use the hyperparameter lmbda, which is passed in to set_compile_parameters() and stored in self.lmbda.
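
    As a rough illustration only (not the required implementation, and assuming the string options 'L1'/'L2'), the two places might branch on self.regularization like this, where eta is the learning rate, n is the size of the data set, w is a weight matrix, and nw is its accumulated gradient:

      # In update_mini_batch(), per weight matrix:
      if self.regularization == 'L2':
          w = (1 - eta * (self.lmbda / n)) * w - (eta / len(mini_batch)) * nw
      elif self.regularization == 'L1':
          w = w - eta * (self.lmbda / n) * np.sign(w) - (eta / len(mini_batch)) * nw
      else:  # regularization is None
          w = w - (eta / len(mini_batch)) * nw

      # In total_cost(), after summing the per-instance cost into `cost` (here n = len(data)):
      if self.regularization == 'L2':
          cost += 0.5 * (self.lmbda / n) * sum(np.linalg.norm(w) ** 2 for w in self.weights)
      elif self.regularization == 'L1':
          cost += (self.lmbda / n) * sum(np.sum(np.abs(w)) for w in self.weights)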

5. dropoutpercent

  • This parameter specifies the percentage of dropout.
  • The value is between 0 and 1. For example, 0.4 means 40% of the nodes on a layer are dropped (or made inaccessible).
  • Dropout randomly sets a fraction of the units in a layer to 0 at each update during training, which helps prevent overfitting.
  • Assume the same dropout percentage is applied to all hidden layers. Dropout should NOT be applied to the input or output layer.
  • You can implement the parameter in any way you like. You make necessary changes to the code by yourself, wherever needed.
  • Many dropout schemes have been proposed in neural networks. For this assignment, you implement the following scheme.
    1. Dropout is applied during the training phase only. No dropout is applied during the testing/evaluation phase.
    2. Use the same dropout nodes/mask for all instances within one mini-batch. That means you have to store somewhere which nodes were retained/dropped. Think about it and implement it in your own way.
    3. Scale the output values of the layer. This scheme is explained at this site. In particular, the following code is very useful. The first line generates a dropout mask (u1) and the second line applies the mask to the activation of a hidden layer (h1) during the forward-propagation phase in the backprop function.

      # Dropout training, notice the scaling of 1/p
      u1 = np.random.binomial(1, p, size=h1.shape) / p
      h1 *= u1
      

      Then, during the backward-propagation phase, you apply the mask to the delta of the hidden layer (dh1). This is necessary because the dropout mask is applied as an additional multiplier after the activation function; it essentially becomes a constant coefficient of the activation function (i.e., c*a(z)), and so it shows up in the derivative of that function: if \(f(z) = c \cdot a(z)\), then \(f'(z) = c \cdot a'(z)\).

      dh1 *= u1 
      

      More IMPORTANT NOTES:

      • The variable p above is the ratio of nodes to RETAIN, not to remove. So essentially, p = 1 - self.dropoutpercent.
      • To select the nodes to retain/drop, you MUST use the random.sample() method (from Python's standard library, not NumPy), NOT np.random.binomial(), in order to guarantee that exactly the proportion p of nodes is selected (a minimal sketch follows these notes).
      • During forward propagation, dropout should be applied to/after the activation, NOT to z (the weighted sum); e.g., sigmoid(0) = 0.5, which is neither correct nor convenient here.
      • During backward propagation, dropout should be applied to the delta (the error at a given hidden layer).
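
      Here is one possible way (a sketch, not the only valid implementation) to build such a mask with random.sample() so that exactly the retained proportion p of the nodes is selected, and to apply it with the 1/p scaling described above:

        import random  # Python's standard library, not NumPy

        p = 1 - self.dropoutpercent                   # proportion of nodes to RETAIN
        n_nodes = activation.shape[0]                 # number of nodes in this hidden layer
        keep = random.sample(range(n_nodes), int(round(n_nodes * p)))  # exactly p retained
        mask = np.zeros((n_nodes, 1))
        mask[keep] = 1.0 / p                          # retained nodes are scaled by 1/p
        activation *= mask                            # forward pass: apply to the activation
        # Store `mask` so the same one is reused for every instance in the mini-batch,
        # and later multiply it into the delta of this layer during backward propagation.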

(*) Class Function notes

Static or class functions in a class can be called through the class name. When a class name is bound to an instance variable, you can invoke a specific static/class function of the class by prefixing the call with the instance variable. For example, here is a line in the function backprop():

a_prime = (self.act_output).derivative(zs[-1])

So, whichever class self.act_output is bound to, that class's derivative() function is invoked.
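
For instance, as a toy illustration using the Sigmoid class shown earlier:

    act = Sigmoid                # bind the class itself (not an instance) to a variable
    z = np.array([[0.0], [2.0]])
    print(act.fn(z))             # invokes Sigmoid.fn(z)
    print(act.derivative(z))     # invokes Sigmoid.derivative(z)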

(B) Functions to modify

1. Tanh for act_output (illegal)

In set_model_parameters(), the original code allows any activation function for the output layer.  Change it so that if Tanh is passed for act_output, the function prints the error message 'Error: Tanh cannot be used for output layer.  Changing to Sigmoid..' and falls back to Sigmoid; implement exactly that behavior in the function.
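
One way to implement this check (a minimal sketch; keep the error message exactly as specified, and note that only the explicitly required Tanh case is handled here):

    def set_model_parameters(self, cost=CrossEntropyCost, act_hidden=Sigmoid,
                             act_output=None):
        self.cost = cost
        self.act_hidden = act_hidden
        if act_output is None:
            self.act_output = self.act_hidden
        elif act_output is Tanh:
            print('Error: Tanh cannot be used for output layer.  Changing to Sigmoid..')
            self.act_output = Sigmoid
        else:
            self.act_output = act_output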

2. Other functions

In addition, make appropriate modifications wherever needed in the code. It is up to you to figure out where and what to change.

Whatever you did, describe and explain it in the documentation. You may not get full points if the modifications you made are not explained sufficiently.


Part 3: Your Own Application Code

Write Jupyter Notebook code to run the experiments shown in the tables below. Ensure your code works for each experiment, and show the output in a Jupyter notebook.

For each experiment, the expected output is given in the right-most (OUTPUT) column. Note that those results were generated on Google Colab; if you are running on a PC, you may see some minute differences.

  act_hidden act_output cost regularization lmbda dropout OUTPUT
1 Sigmoid Sigmoid CrossEntropy (default) (default) (default) Out-C-1.txt
2 Sigmoid Softmax CrossEntropy (default) (default) (default) Out-C-2.txt
3 Sigmoid Softmax LogLikelihood (default) (default) (default) Out-C-3.txt
4 ReLU Softmax LogLikelihood (default) (default) (default) Out-C-4.txt
4-2 LeakyReLU Softmax LogLikelihood (default) (default) (default) Out-C-4-2.txt
5 Tanh Sigmoid Quadratic (default) (default) (default) Out-C-5.txt
5-2 Tanh Tanh Quadratic (default) (default) (default) Out-C-5-2.txt (*) This should produce the same result as Experiment 5, with error message.
6 Sigmoid Sigmoid CrossEntropy L2 1.0 (default) Out-C-6.txt
6-2 LeakyReLU Softmax LogLikelihood L2 1.0 (default) Out-C-6-2.txt
7 Sigmoid Sigmoid CrossEntropy L1 1.0 (default) Out-C-7.txt
7-2 Sigmoid Sigmoid CrossEntropy (default) 1.0 (default) Out-C-7-2.txt (*) This should produce the same result as Experiment 1.

Use iris4-20-7-3.dat for the experiments below. Note that results may vary for these experiments because the dropped nodes are chosen randomly.

  act_hidden act_output cost regularization lmbda dropout OUTPUT
8 Sigmoid Softmax CrossEntropy (default) (default) 0.3 Out-C-8.txt
9 ReLU Softmax LogLikelihood (default) (default) 0.3 Out-C-9.txt
10 Sigmoid Softmax CrossEntropy L1 1.0  0.3 Out-C-10.txt (*) This should NOT produce the same result as Experiment 8.

Note that these are the absolute minimum set of experiments. Although not required for this homework, you should try other combinations of hyperparameters to test your code more thoroughly.
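
For example, one experiment row might be run roughly as follows. This is a sketch only: the load_network()/load_data() helpers, the data files used, and the SGD argument values are placeholders, so use the loading code and training call that the starter notebook 578hw5_Check1.ipynb actually uses, and the LogLikelihood class name from your network code:

    # Hypothetical sketch for Experiment 3 (Sigmoid hidden, Softmax output, LogLikelihood).
    net = load_network("iris-423.dat")
    train_data = load_data("iris-train-2.csv")
    test_data = load_data("iris-test-2.csv")
    net.set_model_parameters(cost=LogLikelihoodCost, act_hidden=Sigmoid, act_output=Softmax)
    net.set_compile_parameters()   # defaults: regularization=None, lmbda=0.0, dropoutpercent=0.0
    net.SGD(train_data, 30, 10, 0.5, evaluation_data=test_data)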


Submission

  1. Completed network file "NN578_network2.ipynb" and its html/pdf version, and the application Notebook file (which you started from 578hw5_Check1.ipynb) and its html/pdf version
    IMPORTANT: Make sure the application files show all OUTPUTs.  We will look at them when we grade.
    Also, be sure to add your name, course/section number and the assignment name at the top of all code files.
  2. Documentation.
    • In pdf or docx format.
    • A minimum of 2.5 pages (i.e., two and a half pages filled, and some more).
    • Write as much as you can. I consider terse answers insufficient and therefore won't give full credit for them when I grade.
    • Create a presentable document. Don't make me work hard to find the information I asked for. (There are a lot of these to read.)  Remember it's not to impress me; it's for YOUR exercise.
    • Content should include:
      • Your name, course/section number and the assignment name at the top of the file.
      • Then, in separate, labeled sections, include:
        • Experiment results
          Whether or not your results matched or came close to the given results. If your results were different, describe the discrepancies and what you speculate caused them.
        • Implementation
          Explain how you implemented each of the requirements, and state for each one whether it is Complete (you wrote the code and verified that it works), Not attempted (you didn't get to it), or Partial (you have some code but it does not work completely; explain why). Give as detailed an explanation as possible. In particular, be sure to explain everything you did to implement Dropout correctly.
        • Reflections
          Your reaction to and reflection on this assignment overall (e.g., difficulty level, challenges you had).

DO NOT Zip the files – Submit files separately.