Spring 2024
CSC 578 HW 4: Written exercises + Keras Experiment
- Graded out of 8 points.
- Do all questions.
- DO NOT include the question in your submission. Just number your answers.
1. Code for forward propagation
In the code NN578_network.ipynb (or its html version), there are two places where forward propagation is performed: the function feedforward(), and a block of code in the function backprop(). The former is called from the function evaluate() to obtain the network output/activation for a given test/validation instance (using the learned weights), while the latter is the first phase of the backpropagation algorithm.
def feedforward(self, a):
    """Return the output of the network if ``a`` is input.
    Note this function is called during evaluation; not during training/backprop."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a) + b)
    return a
def backprop(self, x, y):
    ....
    # forward pass
    activation = x
    activations = [x]  # list to store all the activations, layer by layer
    zs = []            # list to store all the z vectors, layer by layer
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
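For reference, a minimal sketch (not part of NN578_network.ipynb) that exercises both snippets on a hypothetical toy network of sizes [3, 4, 2] with random weights, mirroring the variable names used above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy network: sizes [3, 4, 2], column-vector activations,
# weights/biases shaped as in NN578_network.ipynb.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
biases = [rng.standard_normal((y, 1)) for y in sizes[1:]]
weights = [rng.standard_normal((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal((3, 1))  # one toy input instance

# Snippet 1: the feedforward() style -- keeps only the current activation.
a = x
for b, w in zip(biases, weights):
    a = sigmoid(np.dot(w, a) + b)

# Snippet 2: the backprop() style -- also records every z and activation.
activation = x
activations = [x]
zs = []
for b, w in zip(biases, weights):
    z = np.dot(w, activation) + b
    zs.append(z)
    activation = sigmoid(z)
    activations.append(activation)

print(np.allclose(a, activations[-1]))  # True: same forward result
print(len(zs), len(activations))        # 2 stored z vectors, 3 stored activations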
Both of them do essentially the same task but are written differently. Why is that? Could backprop() call feedforward() to propagate the input (x) forward (as is, without making any changes in feedforward())? Explain fully and in detail.
2. Add comments to a code snippet
For the following code snippet (from the function backprop() in NN578_network.ipynb, or its html version), explain what each line is doing.
1: delta = self.cost_derivative(activations[-1], y) * \
2:         sigmoid_prime(zs[-1])
3: nabla_b[-1] = delta
4: nabla_w[-1] = np.dot(delta, activations[-2].transpose())
- For each of the 3 lines (you can count lines 1 and 2 as one line), explain in detail what the line is doing, meaning-wise. Explain what it is trying to accomplish. Do NOT just say what is happening syntactically (for example "a is multiplied by b" -- you will get 0 points for such answers).
- Include these points in your explanation (REQUIRED):
  - Why is the first line using * but the third line is using np.dot()?
  - Can you change the order of the two operands in lines 1 and 2 (i.e., from "self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])" to "sigmoid_prime(zs[-1]) * self.cost_derivative(activations[-1], y)") and get the same expected/correct result? Why or why not?
  - Why is .transpose() used in the last line? Is it really needed? Would it work without it (i.e., just "activations[-2]")?
  - Can you change the order of the two arguments in line 4 (i.e., from "np.dot(delta, activations[-2].transpose())" to "np.dot(activations[-2].transpose(), delta)") and get the same expected/correct result? Why or why not?
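As a sanity check, here is a minimal sketch (assuming the same hypothetical [3, 4, 2] toy network as in the sketch under Question 1, and writing self.cost_derivative out as activations[-1] - y, the quadratic-cost form used in the NNDL code) that prints the shapes of the arrays involved in lines 1-4:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Hypothetical toy shapes for a [3, 4, 2] network with column-vector activations.
rng = np.random.default_rng(0)
activations = [rng.standard_normal((n, 1)) for n in (3, 4, 2)]
zs = [rng.standard_normal((n, 1)) for n in (4, 2)]
y = rng.standard_normal((2, 1))

delta = (activations[-1] - y) * sigmoid_prime(zs[-1])         # lines 1-2
nabla_b_last = delta                                          # line 3
nabla_w_last = np.dot(delta, activations[-2].transpose())     # line 4

print(delta.shape)         # (2, 1)
print(nabla_b_last.shape)  # (2, 1)
print(nabla_w_last.shape)  # (2, 4) -- same shape as the last weight matrix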
3. [Extra Credit] Backpropagation of Error
This question is NOT required. Answer it only if you are interested in getting extra credit.
Another question on the function backprop(). The following block of code propagates the error backward (from the output layer to the input layer), continuing the code snippet in the previous question.
# Note that the variable l in the loop below is used a little
# differently to the notation in Chapter 2 of the book. Here,
# l = 1 means the last layer of neurons, l = 2 is the
# second-last layer, and so on.
for l in range(2, self.num_layers):
    z = zs[-l]
    sp = sigmoid_prime(z)
    delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
    nabla_b[-l] = delta
    nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())

The current code accomplishes the task with a single loop, during which delta for layer -l (computed from the delta of the layer -l+1 above it) and the corresponding nabla_b[-l] and nabla_w[-l] are computed, one layer at a time (going backwards).
But could it possibly be written such that the deltas for ALL inner layers are computed first (probably by using a loop), and then used to compute nabla_w and nabla_b for every layer (by using another loop, going backward)? Explain in detail.
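For reference, a minimal sketch (with a hypothetical num_layers = 4) that only illustrates the negative-indexing convention described in the comment above:

# Hypothetical example: num_layers = 4, so the loop runs for l = 2, 3.
num_layers = 4
for l in range(2, num_layers):
    # zs[-l] / nabla_b[-l] / nabla_w[-l] belong to the l-th layer counted from the end;
    # weights[-l + 1] is the weight matrix one layer closer to the output.
    print(f"l = {l}:  zs[{-l}], weights[{-l + 1}], activations[{-l - 1}]")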
4. A Problem on the Cross-entropy cost function in NNDL 3
This problem is asking you to show that the σ′(z) term vanishes for the bias nodes as well, in particular in ∂C/∂b. Show the derivation for biases, comparing the quadratic and the cross-entropy cost functions, similar to the one shown on Slide 17 in Lecture note #4 (Optimizations). You must show a trace of the derivation steps, not just the final formula (REQUIRED).
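For reference (background, not the answer), here are the two cost functions being compared, written in the single-output-neuron notation of NNDL Chapter 3 with a = σ(z) and z = wx + b:

    C_{\mathrm{quad}} = \frac{(y - a)^2}{2},
    \qquad
    C_{\mathrm{CE}} = -\bigl[\, y \ln a + (1 - y)\,\ln(1 - a) \,\bigr]

Your derivation should trace how ∂C/∂b comes out differently in the two cases.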
5. Alternative syntax
For the following code snippet from the function SGD(), give an alternative syntax that accomplishes the same result.
mini_batches = [
    training_data[k : k + mini_batch_size]
    for k in range(0, n, mini_batch_size)
]
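For reference, a minimal sketch with a hypothetical toy training_data of ten integers, showing the mini-batch partition that any alternative syntax should reproduce:

# Hypothetical toy data standing in for the real (x, y) training pairs.
training_data = list(range(10))
mini_batch_size = 3
n = len(training_data)

mini_batches = [
    training_data[k : k + mini_batch_size]
    for k in range(0, n, mini_batch_size)
]
print(mini_batches)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]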
- Propose a syntax that is as different from the original as you can. Efficiency is not the main concern, but your alternative code should still be reasonably efficient.
- Do NOT make the change in the previous network file from HW#3 and submit it.
- Create a NEW example code in Jupyter Notebook that demonstrates that your code produces the desired output. Submit the notebook file and its html/pdf file, separately from the files for Question 6, with the submission.
6. TensorFlow/Keras tutorial
Use the TensorFlow (& Keras) tutorial Basic classification: Classify images of clothing to learn how to use Keras to create and train a neural network model (in particular, a Multi-layer Feed-Forward network in this exercise). The task here is Fashion MNIST, a "drop-in replacement" for MNIST. That is, it's got the same structure, but different types of images.
6.1 Preliminary notes
- You can run the code directly on Google Colab by going to the Basic classification tutorial link above, then clicking the Run in Google Colab button near the top. That will open up a copy of the notebook in Colab which you can modify. Alternatively, you can download the notebook (with the Download notebook button) and run it on a different cloud platform or on your own machine, but then you'll have to ensure that the version of TensorFlow is consistent with what's in the notebook. If you prefer running locally on your machine, you need to install TensorFlow and Keras by yourself and configure your system.
- Running the code: If you can run your code with hardware acceleration (GPU or TPU), it will greatly increase the speed! With Colab, you can do this with Runtime > Change Runtime Type > Hardware Accelerator > GPU or TPU. There is a limit on how long you can use these, and it's not a published formula, but it will make your training much faster. You might want to disconnect your session in between training runs to avoid hitting the limit. You can also sign up for "Colab Pro" (under settings). I think it's just $10/month. It provides faster hardware and more relaxed runtime restrictions.
- You may modify any line in the tutorial code to do experiments. However, the following must NOT be changed (otherwise your code won't work at all, or won't work well); a minimal model that satisfies these constraints is sketched right after this list:
  - In the model, the first layer has to be a Flatten layer with input_shape=(28, 28), and the last layer has to be a Dense layer with 10 nodes.
  - In compile(), the loss function has to be sparse categorical cross-entropy, as shown below, because the target (y) in the Fashion MNIST data is an integer (0 through 9 in this case) instead of a binarized "one-hot" vector (FYI, in which case we would have to use categorical cross-entropy instead). Also, the parameter from_logits=True indicates that the network's output value for each target class (for multi-class targets) is a logit, i.e., a raw unnormalized score to which no softmax has been applied:
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
- To look up the descriptions of Keras hyperparameters, see the Keras Documentation, for example the Guide on the Sequential model and the API pages on models, layers, etc.
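For reference, a minimal sketch of a model and compile() call that satisfy the constraints above; it mirrors the tutorial's baseline model (one hidden Dense layer of 128 ReLU units), and the hidden layer and optimizer are choices you are free to experiment with:

import tensorflow as tf

# Baseline model as in the tutorial: Flatten -> Dense(128, relu) -> Dense(10).
# The first and last layers, and the loss below, must stay as required above.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)          # 10 output nodes, no softmax (logits)
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)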
6.2 Objectives
The goal of this exercise is twofold:
- To run various model configurations and hyperparameter values to empirically examine their effects in learning; and
- To experience searching for an optimal model/hyperparameter setting for neural networks (i.e., a journey to tune the model). Read the NNDL Chapter 3, section How to choose a neural network's hyper-parameters as a guide.
6.3 Experiment Description:
- First, experiment with whether adding another hidden layer improves performance. Run the original model (with one hidden layer) and another model (that you create) in which a hidden Dense layer of size 32 is added (right before the output layer). Report the results.
- Using the model that produced the better result, experiment with the following hyperparameters (example settings for all three are sketched at the end of this section):
  - 1. Learning rate (set through a Keras Optimizer -- look at the first cell on the page)
  - 2. Regularization methods (set through a Keras Regularizer -- look at the first cell on the page)
  - 3. Dropout (for example, add just one such layer before the output layer)
- For the purpose of this assignment, experiment with the hyperparameters sequentially: try different values for one parameter (recording the results), choose the best value (and fix the parameter at that value), then move on to experiment with the next parameter (going from 1 down to 3 above).
IMPORTANT NOTE: The hyperparameter tuning exercise in this question is meant to be a big-picture, intuitive one (rather than a brute-force parameter optimization process). For the learning rate, you should try values of different magnitudes (as shown on Slide 36 in lecture note #4, Optimizations). The other parameters do not have as wide a range as the learning rate, but pick a few values that are far apart.
Another note: the "best parameter value" can be defined from various perspectives, e.g. classification accuracy (with respect to the training or test set?), loss value (again, on the training or test set?), learning stability, speed of learning/convergence, and degree of overfitting. Be sure to explain in the report how you chose the best values.
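For reference, a minimal sketch of how each of the three hyperparameters can be set in Keras; the numeric values (0.001, 0.001, 0.2) are placeholders for illustration, not recommended settings:

import tensorflow as tf

# 1. Learning rate: set through the optimizer passed to compile().
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)   # placeholder value

# 2. Regularization: attach a kernel regularizer to a Dense layer.
regularized_dense = tf.keras.layers.Dense(
    128, activation='relu',
    kernel_regularizer=tf.keras.regularizers.l2(0.001)      # placeholder value
)

# 3. Dropout: a single Dropout layer placed right before the output layer.
dropout = tf.keras.layers.Dropout(0.2)                      # placeholder rate

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    regularized_dense,
    dropout,
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)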
6.4 Submission Code:
- Create a clean and streamlined submission code. Do NOT submit the entire tutorial code as is.
- Delete cells that are not directly relevant to this question, such as the 'Explore the data' section, visualization of the data, and the various prediction-verification code. You should also remove any text/markdown cells that are there for the purpose of the tutorial, replacing them with your own comments.
- Write your name, the course and section number in which you are registered, and the assignment name/number (e.g. "CSC 578 HW#4").
- Poorly formatted code will receive some point deduction.
6.5 Write-up Report:
- Write at least 1.5 pages on your experiments and results, including your comments on the value you chose as the best one for each parameter.
- Also include the following in the report (REQUIRED):
  - Your expectations for the various values of those parameters before the experiment, and your reactions to the results (as you had expected or otherwise, and possible reasons why).
  - A chart or table of the experiment results, plus visualization of the results (similar to HW#3; training vs. validation, for different parameter values). A minimal plotting sketch is given after this list.
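For reference, a minimal plotting sketch; it assumes (as one hypothetical choice) that you train with history = model.fit(train_images, train_labels, epochs=10, validation_split=0.1), so that history.history contains both 'accuracy' and 'val_accuracy'. Adapt it to whatever quantities you decide to report.

import matplotlib.pyplot as plt

def plot_accuracy(history, title):
    # 'history' is the object returned by model.fit(...)
    plt.plot(history.history['accuracy'], label='training accuracy')
    plt.plot(history.history['val_accuracy'], label='validation accuracy')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.title(title)
    plt.legend()
    plt.show()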
Submission
Submit two document files and two notebooks, each notebook with its html/pdf version:
- Documents
- One document file containing your answers to Q1 through Q5.
- One document file which is the write-up of Q6.
- Notebooks
- Two Jupyter notebook files, for Q5 and Q6, and their respective html/pdf files. The files (notebook as well as html/pdf) must show ALL outputs (REQUIRED).
IMPORTANT: Write your name, class/section, and the assignment number at the top of EACH source code file and the write-up file. Files without that information will receive a score of 0.
DO NOT Zip the files. Submit files separately.