We have a dataset that represents each sample and the class it belongs to by the pair $(x, y)$. The objective of a classification problem is to develop a model which, when given $x$, can correctly predict $y$.

In the previous post, we modelled the posterior probability $p(y \mid x)$ with a linear projection. In this post, we will model the posterior probabilities with a three-layer neural network. A standard feed-forward neural network is a cascade of non-linear functions, and hence is the obvious next step after modelling the posterior with linear functions.

A fully connected layer of a neural network is an affine projection of the data, $z = Wx + b$, followed by a pointwise non-linear function, $a = g(z)$.

A neural network layer

The layers are cascaded together so that the output of one becomes the input of the next. The output of the final layer is passed through a softmax function so that the final projection looks like a probability distribution over the classes.

Finally, the difference between the actual class and the predicted class is measured by the cross-entropy loss function

L(y,s) = -\sum_i y_i\log s_i
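As a quick numeric illustration (the label and prediction vectors here are made up), for a one-hot $y$ the loss reduces to the negative log probability assigned to the true class:

```python
import numpy as np

# hypothetical values: a one-hot label and a predicted distribution
y = np.array([0.0, 1.0, 0.0])
s = np.array([0.2, 0.7, 0.1])

# cross-entropy: only the true class's predicted probability contributes
L = -np.sum(y * np.log(s))
print(L)  # -log(0.7) ≈ 0.357
```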

The pointwise non-linear function used in the first and second layers is a vector-valued function called relu, which is defined as

\operatorname{relu}(z)_i = \begin{cases} z_i &z_i>0\\0&z_i\leq 0 \end{cases}

Code for forward computation

The complete functional representation of the network in its forward computation is as follows.

\begin{aligned}
z_1 &= W_1x+b_1 \\
a_1 &= \operatorname{relu}(z_1)\\
z_2 &= W_2a_1+b_2 \\
a_2 &= \operatorname{relu}(z_2)\\
z_3 &= W_3a_2+b_3 \\
s &= \operatorname{softmax}(z_3)
\end{aligned}

To implement this, we can start off by first implementing some convenience functions for matrix-vector multiplication, vector-vector addition, computing the non-linearities, etc.

## forward ops
import numpy as np

def relu(z):
  a = z.copy()
  a[a < 0] = 0.0  # zero out the negative entries
  return a

def matmul(x,W):
  return np.matmul(x,W.T)

def bias_add(xW,b):
  return xW+b

def softmax(z):
  M = np.max(z,axis=-1,keepdims=True)
  e = np.exp(z-M) # normalisation trick so
                  # that largest of z is 0
  sigma = e.sum(axis=-1,keepdims=True)
  s = e/sigma
  return s

def log_loss(y,s,eps=1e-12):
  # mean cross-entropy over the batch;
  # eps guards against log(0)
  return -(y*np.log(s+eps)).sum(axis=-1).mean()
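A quick sanity check of the softmax above (restated here so the snippet is standalone): each row sums to one, and subtracting the per-row maximum does not change the result, which is why the trick avoids overflow for large inputs.

```python
import numpy as np

def softmax(z):
  M = np.max(z, axis=-1, keepdims=True)
  e = np.exp(z - M)
  return e / e.sum(axis=-1, keepdims=True)

z = np.array([[1.0, 2.0, 3.0],
              [10.0, 10.0, 10.0]])
s = softmax(z)
print(s.sum(axis=-1))                      # each row sums to 1
print(np.allclose(s, softmax(z + 100.0)))  # True: shifting z changes nothing
```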

With the help of these convenience functions, it is now easy to implement the forward computation, or prediction step, of the network. The following code (and the convenience functions above) is written to handle a batch of data instead of the single sample used in the equations.

def forward_pass(x_b,y_b,PARAMS):
  ACTS = {}

  ACTS['a0'] = x_b
  ACTS['z1'] = bias_add(matmul(ACTS['a0'],PARAMS['W1']),PARAMS['b1'])
  ACTS['a1'] = relu(ACTS['z1'])

  ACTS['z2'] = bias_add(matmul(ACTS['a1'],PARAMS['W2']),PARAMS['b2'])
  ACTS['a2'] = relu(ACTS['z2'])
  ACTS['z3'] = bias_add(matmul(ACTS['a2'],PARAMS['W3']),PARAMS['b3'])
  ACTS['s']  = softmax(ACTS['z3'])
  ACTS['y']  = y_b
  ACTS['L']  = log_loss(ACTS['y'],ACTS['s'])

  return ACTS

Learning the parameters and the back-propagation algorithm

The loss function tells us how good or bad the predictions are for each data sample. To improve the network's predictions, or to make it learn, we have to adjust its parameters so that the loss is minimised.

If we get the derivative of the loss with respect to each of the parameters, we can tune the parameters using the gradient descent algorithm to minimise the loss. \theta_{new} = \theta - \alpha \frac{\partial L}{\partial \theta} But the gradients are not that straightforward to compute: the output of one parameterised operation depends on the outputs of others. To find derivatives in such situations, we use the chain rule of derivatives. For example, if there is a chain of dependency like $u \leftarrow w \leftarrow v$, then,

\frac{\partial u}{\partial v} = \frac{\partial u}{\partial w}\frac{\partial w}{\partial v}

The chain rule can be generalised as

\frac{\partial u}{\partial v} = \sum_{w\in parents(u)}\frac{\partial u}{\partial w}\frac{\partial w}{\partial v}
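The chain rule can be checked numerically on a small made-up chain, say $w = 3v$ and $u = w^2$, so that $\partial u/\partial v = 2w \cdot 3 = 18v$:

```python
# made-up chain of dependency: v -> w -> u
v = 2.0
w = lambda v: 3.0 * v
u = lambda w: w ** 2

# chain rule: du/dv = (du/dw)(dw/dv) = 2w * 3 = 18v
analytic = 2.0 * w(v) * 3.0

# central finite difference as an independent check
h = 1e-6
numeric = (u(w(v + h)) - u(w(v - h))) / (2 * h)
print(analytic, numeric)  # both ≈ 36.0
```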

The dependency of the parameters of the neural network is given by the network's architecture. For example, the loss's dependency on the last layer's weight $W_3$, and the resulting derivative, are:

\begin{aligned}
L &\leftarrow s \leftarrow z_3 \leftarrow W_3 \\
\frac{\partial L}{\partial W_3} &= \frac{\partial L}{\partial s} \frac{\partial s}{\partial z_3} \frac{\partial z_3}{\partial W_3}
\end{aligned}

Using the same method, we can figure out all derivatives.

\begin{aligned}
L & \dashleftarrow a_i \leftarrow W_i & \frac{\partial L}{\partial W_i} &= \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial W_i} \\
L & \dashleftarrow a_i \leftarrow b_i & \frac{\partial L}{\partial b_i} &= \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial b_i}
\end{aligned}

One particular thing to note here is the following dependency.

L \dashleftarrow a_{i+1} \leftarrow a_i \dashleftarrow a_0

What this dependency implies is that computing $\frac{\partial L}{\partial a_i}$ requires computing $\frac{\partial L}{\partial a_{i+1}}$ first. It is as if the gradient of a layer depends on the derivatives of all the layers after it: the gradients propagate from the last layer back to the first. This way of computing gradients is the back-propagation algorithm.

Closed forms of the various derivative terms are given below.

\begin{aligned}
\frac{\partial L}{\partial W_i} &= \frac{\partial L}{\partial z_i} a_{i-1}^T \\
\frac{\partial L}{\partial b_i} &= \frac{\partial L}{\partial z_i} \\
\frac{\partial L}{\partial z_3} &= s-y \\
\left[\frac{\partial \operatorname{relu}(z)}{\partial z}\right]_{pq} &= \begin{cases} 1 &\operatorname{relu}(z)_p > 0~and~p=q\\ 0 & otherwise \end{cases}
\end{aligned}

See appendix 1 for the derivation of the first derivative and see this post for the derivation of the third.
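The third identity, $\frac{\partial L}{\partial z_3} = s - y$, is easy to verify with finite differences. The snippet below (a standalone sketch with made-up values; softmax and the loss are redefined locally) compares the numerical gradient of the softmax-plus-cross-entropy composition against $s - y$:

```python
import numpy as np

def softmax(z):
  e = np.exp(z - z.max())
  return e / e.sum()

rng = np.random.default_rng(0)
z3 = rng.normal(size=5)           # made-up logits
y = np.zeros(5); y[2] = 1.0       # one-hot label

loss = lambda z: -np.sum(y * np.log(softmax(z)))

# numerical gradient, one coordinate at a time
h = 1e-6
num = np.zeros(5)
for i in range(5):
  d = np.zeros(5); d[i] = h
  num[i] = (loss(z3 + d) - loss(z3 - d)) / (2 * h)

print(np.allclose(num, softmax(z3) - y, atol=1e-5))  # True
```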

Code for gradient computation

def relu_derv(z):
  t = z.copy()
  t[t>0] = 1.0
  t[t<=0]= 0.0
  return t

def backward_pass(ACTS,PARAMS):
  GRAD = {}
  B = ACTS['a0'].shape[0]
  dz3_da2 = PARAMS['W3']
  da2_dz2 = relu_derv(ACTS['z2'])
  dz2_dW2 = ACTS['a1']
  dz2_da1 = PARAMS['W2']
  da1_dz1 = relu_derv(ACTS['z1'])
  dz1_dW1 = ACTS['a0']
  dl_dz3  = (ACTS['s']-ACTS['y'])
  dl_da2  =,dz3_da2)
  dl_dz2  = np.multiply(dl_da2,da2_dz2)
  dl_da1  =,dz2_da1)
  dl_dz1  = np.multiply(dl_da1,da1_dz1)
  dl_dW3 = (1/B)*,ACTS['a2'])
  dl_db3 = (1/B)*dl_dz3.sum(axis=0)
  dl_dW2 = (1/B)*,ACTS['a1'])
  dl_db2 = (1/B)*dl_dz2.sum(axis=0)
  dl_dW1 = (1/B)*,ACTS['a0'])
  dl_db1 = (1/B)*dl_dz1.sum(axis=0)

  GRAD['dl_dW3'] = dl_dW3
  GRAD['dl_db3'] = dl_db3
  GRAD['dl_dW2'] = dl_dW2
  GRAD['dl_db2'] = dl_db2
  GRAD['dl_dW1'] = dl_dW1
  GRAD['dl_db1'] = dl_db1
  return GRAD

Training using Gradient descent

PARAMS = init_params()
for idx in range(10000):
  x_b,y_b     = next(get_batch)
  ACTS = forward_pass(x_b, y_b, PARAMS)
  GRAD        = backward_pass(ACTS, PARAMS)

  ## gradient updates  
  PARAMS['W3'] = PARAMS['W3'] - lr*GRAD['dl_dW3']
  PARAMS['b3'] = PARAMS['b3'] - lr*GRAD['dl_db3']
  PARAMS['W2'] = PARAMS['W2'] - lr*GRAD['dl_dW2']
  PARAMS['b2'] = PARAMS['b2'] - lr*GRAD['dl_db2']
  PARAMS['W1'] = PARAMS['W1'] - lr*GRAD['dl_dW1']
  PARAMS['b1'] = PARAMS['b1'] - lr*GRAD['dl_db1']
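`init_params` is defined in the linked repo rather than shown above; a minimal sketch, assuming MNIST-like layer sizes of 784→128→64→10 (the hidden sizes here are made up), could look like:

```python
import numpy as np

def init_params(sizes=(784, 128, 64, 10), seed=0):
  # hypothetical initialiser: small random weights, zero biases.
  # W_i has shape (n_out, n_in) to match matmul(x, W.T) above.
  rng = np.random.default_rng(seed)
  PARAMS = {}
  for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:]), start=1):
    PARAMS['W%d' % i] = rng.normal(0.0, 0.01, size=(n_out, n_in))
    PARAMS['b%d' % i] = np.zeros(n_out)
  return PARAMS
```

Breaking the symmetry between units with small random weights is the important part; starting all weights at zero would give every hidden unit the same gradient.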

That is it. The following plot shows the decrease in loss when the network is trained on the MNIST dataset.

Test accuracy was 92.37%.

Complete Code

The complete code is here. Do give stars to the repo if you like what you see.

Appendix 1

Consider the following system, where $W$ is a matrix, $a$ and $z$ are vectors, and $f$ is some function such that $l$ is a scalar.

\begin{aligned}
z &= Wa \\
l &= f(z)
\end{aligned}

The quantity $\frac{\partial z}{\partial W}$ is a 3-dimensional tensor whose nonzero entries are given by

\frac{\partial z_j}{\partial W_{j,i}} = a_i

Every other coordinate of this tensor is $0$.

But, owing to the special structure of this 3-dimensional tensor, the quantity $\frac{\partial l}{\partial W}$ collapses to a simple outer product

\frac{\partial l}{\partial W} = \frac{\partial l}{\partial z}a^T

More details about computing tensor-tensor derivatives can be found here.