PART 1 : UNDERSTANDING NEURAL NETWORKS USING AN EXAMPLE

Angad Sandhu · Published in The Startup · Dec 21, 2020 · 10 min read

Information after looking at Neural Networks

I am TIRED!!! No really, hear me out.

I have read and watched articles, videos, books and blog posts about the mathematical workings of Neural Networks, the algebra, the matrix multiplication, and on and on…

But none of the above-mentioned resources give me the clear understanding that I want. So now I am going to try to give you the full big picture. The code comes later; first we will focus on mastering the basics of a Neural Network and thereby build our foundation to master Deep Learning itself.

“How will we do this?” you ask. Simple: with an example.

Design of Our Neural Network

The example I want to take is a simple 3-layer NN (not counting the input layer), where the input and output layers have a single node each, and the first and second hidden layers have 4 and 3 nodes respectively.

NN Architecture

Now, before we move on, we first need to define some initial variables and components. These are:

- An input layer [x]
- An arbitrary number of hidden layers (2 for our case)
- An output layer [ŷ]
- A set of weights and biases between each layer, W and b
- An activation function for each hidden layer (Sigmoid)
- A loss function, L = ( y - ŷ )² (squared error)
- A learning rate (alpha = 1.0 for our case)
- An optimization algorithm (Gradient Descent)
- A number of iterations ( i = 2 for our case )

The sizes of the weight matrix and bias matrix between every two layers are determined automatically by the number of nodes in each of those layers.

But basically you can say that

  • The weight matrix of layer ‘ i ’ ( Wi ) has shape (number of nodes in layer i, number of nodes in layer i-1)
  • The bias matrix of layer ‘ i ’ ( bi ) has shape (number of nodes in layer i, 1)

All of these matrices are initialized as random floats sampled from a univariate normal (Gaussian) distribution with mean 0 and variance 1. This basically means that each element of the matrix is a random value centred around 0, typically falling somewhere between -3 and 3.

NOTE: INSTEAD OF TAKING A SINGLE BIAS FOR EACH LAYER, I HAVE TAKEN A BIAS FOR EACH NODE.
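As a concrete sketch of this setup, here is how the initialization described above might look in NumPy (np.random.randn samples from exactly that mean-0, variance-1 normal distribution; the walkthrough below instead starts from the fixed values shown in the "Neural Network Variables" figure, so the screenshot numbers stay easy to follow):

import numpy as np

layer_sizes = [1, 4, 3, 1]  # input, hidden layer 1, hidden layer 2, output

# one weight matrix and one bias vector per layer (the input layer has neither)
W = {i: np.random.randn(layer_sizes[i], layer_sizes[i - 1])  # (nodes in layer i, nodes in layer i-1)
     for i in range(1, len(layer_sizes))}
b = {i: np.random.randn(layer_sizes[i], 1)                   # (nodes in layer i, 1): one bias per node
     for i in range(1, len(layer_sizes))}

print(W[1].shape, b[1].shape)  # (4, 1) (4, 1)
print(W[2].shape, b[2].shape)  # (3, 4) (3, 1)
print(W[3].shape, b[3].shape)  # (1, 3) (1, 1)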

Shapes of W & b for each layer

Let's take the weights and biases to be initialized as:

Neural Network Variables

We have also taken very simple sample data: two inputs [2, 3] and their corresponding outputs [1, 0]. This means that this is a classifier that should, in theory, take in numbers and return binary values.

Specifically, 1.0 for an even number and 0.0 for an odd number. Hence this is an Odd/Even classifier.

Forward Propagation [1st iteration — 1st example]

Forward prop is basically the use of the input to get an output (not necessarily a correct one). During training we mostly do not care about the output generated, as we are “training” the network to produce correct outputs when we feed it an input in the future.

We take the outputs generated by the untrained network, compare them to the correct output, and then nudge our Weights and Biases to subsequently get better and more accurate outputs. Going through our calculations:

Calculating Z1

For our first computation, we take the dot product between W1 and Z0 (i.e. 2, which is basically our first training example) and then add a bias to the resulting matrix. The bias is added so that a node can still produce a useful, non-trivial value even if W1 or Z0 happen to be 0.

Calculating A1

Next, we enter our resultant values of Z1 into a SIGMOID squashing function, element by element. What this does is keep our output in the range of 0 to 1.

The Sigmoid function is represented by the Greek letter sigma, σ(x). The math inside the sigmoid function is:

y = σ(x)
or
y = 1 / ( 1 + e^(-x) )

This function is applied element by element, i.e. each value in the matrix is fed into the function and the result is stored in the new matrix.
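As a minimal sketch of these two steps in NumPy (reusing the W and b dictionaries from the earlier snippet; the exact numbers will differ from the screenshots, which start from the fixed values in the figures):

import numpy as np

def sigmoid(x):
    # element-wise squashing into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

Z0 = np.array([[2.0]])      # our first training example, i.e. the input layer
Z1 = W[1] @ Z0 + b[1]       # dot product with the weights, plus one bias per node -> shape (4, 1)
A1 = sigmoid(Z1)            # squashed activations of the first hidden layer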

Calculating Z2 & A2

Using A1 from the last calculation, we get Z2 and then calculate A2 in the same way.

Calculating Z3 & A3

We will repeat this process of multiplying, adding and squashing till we get A3, which is the output of the last layer of our model and hence our predicted output, ŷ. As our actual correct output is y, we will now calculate the loss between these two values.
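In code, the whole feed-forward pass is just this multiply-add-squash pattern repeated once per layer. A compact sketch, reusing sigmoid, W and b from the snippets above:

def forward(x, W, b):
    # returns the pre-activations Z and activations A of every layer
    Z, A = {}, {0: x}
    for i in range(1, len(W) + 1):
        Z[i] = W[i] @ A[i - 1] + b[i]   # multiply and add
        A[i] = sigmoid(Z[i])            # squash
    return Z, A

Z, A = forward(np.array([[2.0]]), W, b)
y_hat = A[3]                            # the predicted output ŷ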

Calculating Loss [1st iteration — 1st example]

The model output that we received, i.e. ŷ, is now compared with our actual output y to see exactly how “off the mark” our Neural Network was in predicting this output.

Calculating Loss

We observe that 0.9968 and 1.0 are pretty close to each other, hence the loss is pretty low. Thus, using the value of the loss, we can judge how accurate our model is.

Here we use the squared-error loss defined above, but using other loss functions (e.g. logistic loss) might be more beneficial.
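In code, this step is a one-liner with the squared-error loss (using y_hat from the forward pass sketched earlier):

y = np.array([[1.0]])             # correct output for the first example (2 is even)
loss = np.sum((y - y_hat) ** 2)   # squared error: small when ŷ is close to y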

Backward Propagation [1st iteration — 1st example]

Backward prop, as the name suggests, is basically the use of the loss to correct and update our weights and biases. Unlike forward prop, where we start from the input, here we start from the last layer, where the loss was calculated, and move back towards the input, one gradient-descent step at a time.

During backprop we try to find ‘dZ’, which is basically shorthand for the partial derivative of the loss with respect to Z. A derivative gives us the slope, i.e. how much one parameter varies with respect to another.

Each layer that has weights and biases also has a dZ (e.g. dZ1, dZ2, dZ3). All of these are also matrices with the same shape as their respective Z [i.e. shape of Z1 == shape of dZ1].

Calculating dZ3

The calculation of the dZ of the last layer is a little tricky and different from the rest, as we have to differentiate our loss function, whereas all the other dZ calculations follow a similar pattern.

Notice the presence of (ŷ - y) here: the further apart these two values are, the more the value of dZ3 gets altered. As the value of our output is pretty close to the answer we want, the value of dZ3 will be very small, making sure our weights and biases do not change much.

We also see the symbol σ’(x), which is different from σ(x): the apostrophe represents the derivative of the sigmoid function. The math inside the derivative of the sigmoid is:

y = σ'(x)
or
y = σ(x) * ( 1 - σ(x) )
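In code, σ’(x) and the error term of the output layer could be sketched like this (assuming the squared-error loss above; its derivative with respect to ŷ is 2(ŷ - y), and the constant 2 is often dropped or folded into the learning rate, so the screenshots may show it without the 2):

def sigmoid_prime(x):
    # derivative of the sigmoid: σ'(x) = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

dZ3 = 2 * (y_hat - y) * sigmoid_prime(Z[3])   # output-layer error term, same shape as Z3
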
Calculating dZ2 & dZ1

Finally, we calculate the derivative of the loss with respect to the weights and biases (i.e. their degree of change) and then nudge our Weights and Biases to subsequently get better and more accurate outputs.

But instead of differentiating the loss from scratch every time, we use the values of dZ to get dW and db, which is done using the chain rule, as sketched below.
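A sketch of how that chain rule plays out, continuing with the Z, A and dZ3 computed above (this is the standard backprop recursion; the transposes are what make the matrix shapes line up):

# propagate the error term back through the hidden layers
dZ2 = (W[3].T @ dZ3) * sigmoid_prime(Z[2])   # shape (3, 1), same as Z2
dZ1 = (W[2].T @ dZ2) * sigmoid_prime(Z[1])   # shape (4, 1), same as Z1

# gradients of the weights and biases for each layer
dW = {i: dz @ A[i - 1].T for i, dz in {1: dZ1, 2: dZ2, 3: dZ3}.items()}
db = {1: dZ1, 2: dZ2, 3: dZ3}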

Update Weights, Biases [1st iteration — 1st example]

Now, dot multiplying dZ with the (transposed) A of the previous layer gets us dW, and db is simply equal to dZ. These values are used for updating our weights and biases.

We can also see a value called Alpha here, which is multiplied with dW and db. This value is the LR (Learning Rate) of our model, a hyperparameter that defines how fast we want to optimize. A small Alpha will result in small incremental changes, but will also be more accurate.

After we have calculated our dW and db and multiplied the LR into them, we subtract them from the respective original weights and biases and replace the old values of W and b with the newly calculated ones.
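The update step itself is then plain gradient descent with the learning rate alpha, using the dW and db from the previous sketch:

alpha = 1.0                      # the Learning Rate from our setup

for i in W:
    W[i] = W[i] - alpha * dW[i]  # step against the gradient
    b[i] = b[i] - alpha * db[i]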

Updating W3 & b3
Updating W2 & b2
Updating W1 & b1

Now that we have calculated all our new values, we start our training process again, but this time with our updated weights and biases and also a new input/output example.
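Tying all of the sketches above together, the whole training procedure is just a loop over iterations and examples. A compact, hypothetical version (reusing forward, sigmoid_prime, W and b from earlier; its numbers will differ from the screenshots, which start from the figure's fixed values):

inputs, targets = [2.0, 3.0], [1.0, 0.0]   # our two training pairs: even -> 1.0, odd -> 0.0
alpha, iterations = 1.0, 2

for it in range(iterations):
    for x_val, y_val in zip(inputs, targets):
        x, y = np.array([[x_val]]), np.array([[y_val]])

        # forward pass, backprop, then the parameter update shown above
        Z, A = forward(x, W, b)
        dZ = {3: 2 * (A[3] - y) * sigmoid_prime(Z[3])}
        for i in (2, 1):
            dZ[i] = (W[i + 1].T @ dZ[i + 1]) * sigmoid_prime(Z[i])
        for i in W:
            W[i] = W[i] - alpha * (dZ[i] @ A[i - 1].T)
            b[i] = b[i] - alpha * dZ[i]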

Neural Network Variables

Forward Propagation [1st iteration — 2nd example]

Similar to our first example, we will now implement the feed-forward step, this time with our input as an odd number (3.0).

Calculating Z1 & A1
Calculating Z2 & A2
Calculating Z3 & A3

Now, as we can see, A3 here is our output, which is still very close to 1.0. BUT, for this example our correct output was not 1.0 but 0.0, which means our model has severely misclassified our odd input as even.

Calculating Loss [1st iteration — 2nd example]

We observe that 0.9992 and 0.0 are very far from each other, hence the loss is pretty high. Thus, our weights and biases will be changed severely.

Calculating Loss

Backward Propagation [1st iteration — 2nd example]

As shown earlier we will move on to backprop to get the new dZ values.

Calculating dZ3
Calculating dZ2 & dZ1

If we notice carefully, we observe that this time the dZ values are larger and positive (+VE). This means the direction of the matrix update is changing.

Update Weights, Biases [1st iteration — 2nd example]

As the values of the dZs are +ve this time around, the values of our weights and biases will be decreased, to better match our input data.

Updating W3 & b3
Updating W2 & b2
Updating W1 & b1

So we have successfully updated all values of W and b, and we can see that these are all lower than our original initialization of 0.5.

Hence, these are our weight and bias values after a full iteration, and now we will move on to the next iteration with the same examples as last time. So the input and output are 2.0 and 1.0 respectively, again.

Neural Network Variables

Forward Propagation [2nd iteration — 1st example]

As explained in the examples above, you now have plenty of practice in the feedforward department, so we will calculate our model output again.

Calculating Z1 & A1
Calculating Z2 & A2
Calculating Z3 & A3

This time around our output A3 is 0.9968, which is pretty similar to our last output for this example but a little lower in value; hence we can see the general trend that after 20~25 iterations the prediction will balance out at around 0.5 (i.e. the model effectively guesses, about 50% accuracy).

Calculating Loss [2nd iteration — 1st example]

Hence, we repeat the above steps to get our loss and then update all our values.

Calculating Loss

Conclusion

As we can see we have a working Neural Network that updates variable values using the training data to get a fully trained model.

This is the basic functioning of a Neural Net, the only difference being that an actual industrial-level architecture may have hundreds of layers and thousands of nodes, and will be trained for tens of thousands of iterations. But, no problem 😁 as we will have our trusty framework APIs to do this for us.

Some ways we can make our model better are:

  • Choosing a different/better loss function
  • Increasing our data (we used only 2 examples, whereas usually training data includes millions of lines of these examples)
  • Increasing/Decreasing our Layers or Nodes
  • Tuning our Alpha (hyperparameter) to better fit our data.

Now we move on to writing all this mathematical logic in Python code.
