In the post about Linear Discriminant Analysis, we saw how to implement a classifier using a generative assumption. Let's now build a linear classifier with a discriminative assumption.

Softmax Classifier

Imagine we have a dataset of pairs (x, y), where x is a data point and y indicates the class x belongs to. For deriving the LDA classifier, we had modeled the class conditional density P(x|y) as a Gaussian and derived the posterior probabilities P(y|x) from it. Here, we will directly model the posterior P(y|x) with a linear function. Since the posterior directly models what class a data point belongs to, there is not much left to do after that to get a classifier.

But modelling P(y|x) with only a linear projection w^Tx has some problems. There is no easy way to restrict the outputs to always fall in [0, 1], nor to assure that they sum to 1.

We want a projection of the data such that it forms a proper probability distribution over the classes. Since this is near impossible with only a linear function, we stack a parameter-free non-linear transformation on top of the linear transformation.

Softmax is a vector-valued function defined over a vector z as

\operatorname{softmax}(z)_k = \frac{\operatorname{exp}[z_k]}{\sum_j\operatorname{exp}[z_j]}

Softmax preserves the relative ordering of its input, i.e. the larger input coordinates get the larger output values. Softmax also squashes the values to lie in the range (0, 1) and makes their sum equal to 1.

\sum_k \frac{\operatorname{exp}[z_k]}{\sum_j\operatorname{exp}[z_j]} = \frac{\sum_k \operatorname{exp}[z_k]}{\sum_j\operatorname{exp}[z_j]} = 1
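As a quick illustration, here is a minimal numpy sketch (the input vector is made up) showing both properties: the ordering of the inputs is preserved and the outputs form a probability distribution. This is the naive version; a numerically stable variant is discussed later in the post.

import numpy as np

def softmax(z):
  e = np.exp(z)           # exponentiate each coordinate
  return e / e.sum()      # normalise so the outputs sum to 1

z = np.array([2.0, 1.0, 0.1])   # made-up scores
s = softmax(z)
print(s)                        # ~[0.66, 0.24, 0.10], largest input -> largest output
print(s.sum())                  # 1.0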

So our classifier is

P(y|x) = \operatorname{softmax}(w^Tx)

Derivative of the Softmax function

We will need the derivative of the softmax function later on, so let's figure out what it is. We can begin by writing softmax in a concise form.

s_k = \frac{e_k}{\Sigma}

where e_k = \operatorname{exp}[z_k] and \Sigma = \sum_j\operatorname{exp}[z_j]. Noting that \partial e_k/\partial z_k = e_k, \partial e_k/\partial z_p = 0 for p \neq k, and \partial \Sigma/\partial z_p = e_p, we can easily derive the derivative of the softmax function as follows.

\text{when $p \neq k$:} \\ \frac{\partial s_k}{\partial z_p} &= e_k\left[ \frac{-1}{\Sigma^2} e_p\right] = -s_ks_p \\ \text{when $p = k$:} \\ \frac{\partial s_k}{\partial z_p} &= \frac{e_k \Sigma - e_p e_k}{\Sigma^2} = s_k-s_ps_k \\ \text{in general:} \\ \frac{\partial s_k}{\partial z_p} &= s_k(\delta_{kp} - s_p)

\delta_{kp} is the Kronecker delta function, which is 1 only when k = p and 0 otherwise.
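As a sanity check, here is a small sketch (the test point is arbitrary) comparing the formula s_k(\delta_{kp} - s_p) against a finite-difference estimate of the derivatives.

import numpy as np

def softmax(z):
  e = np.exp(z - z.max())
  return e / e.sum()

z = np.array([0.5, -1.2, 2.0])            # arbitrary test point
s = softmax(z)
analytic = np.diag(s) - np.outer(s, s)    # J[k,p] = s_k * (delta_kp - s_p)

eps = 1e-6
numeric = np.zeros((3, 3))
for p in range(3):
  dz = np.zeros(3)
  dz[p] = eps
  numeric[:, p] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.abs(analytic - numeric).max())   # should be tiny (~1e-10)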

Estimating Model Parameters using Likelihood

Now that we have a complete model of the classifier, P(y|x) = \operatorname{softmax}(w^Tx), all that remains is to estimate the model's parameter w from the dataset. We can begin with the likelihood of the model explaining the training data.

L(w) = \prod_x \prod_k P(k|x;w)^{y_k} = \prod_x \prod_k \operatorname{softmax}(w_k^Tx)^{y_k}

Likelihood gives a measure of how well the model explains the data for a given parameter w. To get the optimum value for the parameter, all we have to do is find the value of w which maximises the likelihood.

The likelihood function is a bit difficult to work with on its own, so we take the negative of the log of the likelihood function instead. Since log is a strictly increasing function, maximising the likelihood is the same as maximising its log (or minimising its negative), and all products get converted to sums and all exponents get converted to products.

-\log L(w) = E(w) = -\sum_x \sum_k y_k \log s_k

All we have to do to get the best parameter is to minimise the negative log likelihood (thereby maximising the likelihood).

w_{opt} = \operatorname{argmin}_w E(w)

Note: If we think of y as the true probability distribution of the data over all the classes and s as our model's prediction of this distribution, then E(w) is the cross entropy between the two distributions.
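In code, the cross entropy for a one-hot label simply picks out the negative log of the probability assigned to the true class. A minimal sketch with made-up values:

import numpy as np

def cross_entropy(y, s):
  # E = -sum_k y_k * log(s_k); with one-hot y this is -log(s[true class])
  return -np.sum(y * np.log(s))

y = np.array([0.0, 1.0, 0.0])   # true class is class 1 (one-hot)
s = np.array([0.2, 0.7, 0.1])   # hypothetical model prediction
print(cross_entropy(y, s))      # -log(0.7) ~= 0.357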

For computing w_{opt}, we simply have to find the derivative of E(w) with respect to w and equate it to 0. But finding the derivative over the entire w at once is difficult and non-intuitive, so let's break it down and find the derivatives over each of the columns w_p separately. Note that z_p = w_p^Tx, so \partial z_p/\partial w_p = x.

\nabla_{w_p} E(w) &= -\sum_x \sum_k y_k \frac{1}{s_k} \frac{\partial s_k}{\partial w_p} \\ &=-\sum_x \sum_k y_k \frac{1}{s_k} \frac{\partial s_k}{\partial z_p} \frac{\partial z_p}{\partial w_p} \\ &=-\sum_x \sum_k y_k \frac{1}{s_k} s_k(\delta_{kp} - s_p) x \\ &= -\sum_x \sum_k y_k (\delta_{kp} - s_p) x

\sum_k y_k(\delta_{kp} - s_p) can be expanded as \sum_k y_k\delta_{kp} - s_p\sum_k y_k. Only the k = p term survives in the first sum, so \sum_k y_k\delta_{kp} = y_p, and since y is one-hot, \sum_k y_k = 1. Thus the whole term evaluates to y_p - s_p.

\nabla_{w_p} E(w) &= \sum_x (s_p - y_p) x \\ \nabla_{w} E(w) &= \sum_x (s - y) x
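So for a single sample, the gradient is just the outer product of the prediction error (s - y) with the input x. A minimal sketch with made-up shapes and values:

import numpy as np

x = np.random.randn(5)            # one data point with 5 features
y = np.array([0.0, 0.0, 1.0])     # its one-hot label (3 classes)
s = np.array([0.2, 0.5, 0.3])     # hypothetical model prediction for x

grad_w = np.outer(s - y, x)       # shape (3, 5), one row per class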

Now we have a problem here. Setting \nabla_w E(w) = 0 does not give a closed-form solution for w. However, what the gradient does tell us is, in the space of w, the direction to move (change w) so that the increase in E(w) is the largest. So if we move in the exact opposite direction (the negative of the gradient), we will get the maximum reduction in E(w). Since E(w) measures the difference between the model's predictions and the true labels, decreasing E(w) means our model is getting better. Enter Gradient Descent.

Gradient Descent Algorithm

Gradient descent is exactly that: gradient descent. If you want to minimise a function, keep moving in the negative direction of its gradient (thereby descending).

Gradient descent is an optimisation algorithm for cases where there is no analytic or easy solution for the parameters, but gradients of the model can be computed at each point. The algorithm simply says: if L(\theta) is some loss function which measures how good the model is with parameter \theta, then we can update \theta as follows to make the model better.

\theta_{new} = \theta - \alpha \nabla_\theta L

Now repeat the same with the new \theta and the model will keep getting better. There are some caveats, however. The behaviour of gradient descent depends heavily on the choice of the step size \alpha, and even then, convergence to the global optimum is not guaranteed.
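To see the effect of the step size, here is a toy sketch minimising f(\theta) = \theta^2 (whose gradient is 2\theta) with two arbitrary values of \alpha, one that converges and one that diverges.

def gradient_descent(theta, alpha, steps=20):
  for _ in range(steps):
    grad = 2 * theta            # gradient of f(theta) = theta^2
    theta = theta - alpha * grad
  return theta

print(gradient_descent(5.0, alpha=0.1))   # approaches the minimum at 0
print(gradient_descent(5.0, alpha=1.1))   # step too large: the iterates blow up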

In practice, however, gradient descent performs well. We do have some tricks to pick the (seemingly) best step size and some other ways to ensure the model improves. In our case, the update step is simply

w_{new} = w - \alpha \nabla_{w}E(w)

To compute the true gradient direction of the model, we would need to evaluate the model over every possible data point. This is impossible because (1) it requires an intractable amount of computation and (2) we don't have labels for all possible data (if we did, what would be the point of building a classifier?).

So we compute an approximate model gradient over the data that we have. But even this is a lot of work for an approximate gradient, and that is not the end of the story; we have to keep iterating. Since we are approximating anyway, why not approximate further: pick a sample at random, compute the gradient over it and update the model. This is Stochastic Gradient Descent.

Stochastic Gradient Descent is not that stable. Gradients of individual samples do not agree with each other, and hence the frequent updates simply result in a random walk in the parameter space. To make it more stable, we compute the gradient over a small set of samples, or a batch. This is batched stochastic gradient descent. Batched SGD updates the model more frequently than pure gradient descent and is more stable than vanilla SGD. In fact, batched SGD is so commonly used that it has now become the vanilla version: SGD now usually refers to batched SGD.

This is how you implement SGD.


W = initialise_weights()
for loop_index in range(total_iterations):
  x_b, y_b = next(get_batch)                    # fetch a mini-batch of samples and labels
  s_b      = get_model_predictions(x_b, W)      # forward pass: class probabilities for the batch
  g_b      = get_batch_gradient(x_b, y_b, s_b)  # gradient of the loss over the batch
  W        = update_weights(W, g_b)             # W = W - alpha * g_b
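The loop above assumes get_batch is an iterator that keeps yielding mini-batches. Here is a minimal sketch of such an iterator (the array names x_train and y_train are placeholders); it reshuffles the training set on every pass.

import numpy as np

def batch_iterator(x_train, y_train, batch_size):
  n = x_train.shape[0]
  while True:                                 # cycle over the data indefinitely
    idx = np.random.permutation(n)            # reshuffle every epoch
    for start in range(0, n - batch_size + 1, batch_size):
      batch = idx[start:start + batch_size]
      yield x_train[batch], y_train[batch]

# get_batch = batch_iterator(x_train, y_train, batch_size=128)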

Implementing Softmax classifier and its gradients

Implementing the forward prediction of the classifier is pretty straightforward. First we have to do a matrix-vector multiplication to compute the scores z = Wx and then point-wise exponentiate all the terms in z.

However, we have to sum all of these exponentiated terms to get the denominator in the softmax step. Since exponentiation creates huge numbers when the components of z are large, this can cause numerical overflow.

To get rid of the numerical errors we use the following trick.

\operatorname{softmax}(z)_k &= \frac{e^{z_k}}{\sum_i e^{z_i}} \\ &= \frac{e^{-M}e^{z_k}}{e^{-M}\sum_i e^{z_i}} \\ &= \frac{e^{z_k-M}}{\sum_i e^{z_i-M}}

If we choose M large enough such that all the terms in the powers are negative or 0, all the exponentiated terms will lie in (0, 1]. So we set M = \max_k z_k.

The following code sample shows how the model's prediction is implemented. The code has been vectorised so that it can predict for a whole batch of data points at once.


import numpy as np

def get_predictions(x,W):
  z = np.matmul(x,W.T)                    # linear scores for each class
  M = np.max(z,axis=-1,keepdims=True)
  e = np.exp(z-M)                         # normalisation trick so that largest of z is 0
  sigma = e.sum(axis=-1, keepdims=True)
  s = e/sigma                             # softmax probabilities
  return s
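A quick check of the above on random inputs (the shapes are made up) shows that each row is a valid probability distribution and that even large scores do not overflow.

x = np.random.randn(4, 3072) * 100   # 4 random "images" with large values
W = np.random.randn(10, 3072)        # hypothetical weights for 10 classes
s = get_predictions(x, W)
print(s.shape)                       # (4, 10)
print(s.sum(axis=-1))                # each row sums to 1, no inf/nan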


Unlike the forward pass, implementing the gradient is very simple. For a single sample it is only an outer product between the vectors (s - y) and x. But when we implement it for a batch of samples and their predictions, the outer products can be computed together as a single matrix multiplication. See the code sample below.


def get_batch_gradient(x_b,y_b,s_b):
  # the per-sample outer products for the whole batch as one matrix multiplication
  g_b = np.matmul((s_b-y_b).T,x_b)
  return g_b/x_b.shape[0]   # average the gradient over the batch
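Putting the two functions together, here is a small sketch (random data, made-up shapes) that checks the analytic batch gradient against a finite-difference estimate of the loss for one arbitrarily chosen weight.

def batch_loss(x_b, y_b, W):
  s_b = get_predictions(x_b, W)
  return -np.sum(y_b * np.log(s_b)) / x_b.shape[0]   # mean cross entropy

x_b = np.random.randn(8, 20)                   # 8 samples, 20 features
y_b = np.eye(5)[np.random.randint(0, 5, 8)]    # random one-hot labels, 5 classes
W   = np.random.randn(5, 20) * 0.01

g_analytic = get_batch_gradient(x_b, y_b, get_predictions(x_b, W))

eps = 1e-6
dW = np.zeros_like(W)
dW[2, 7] = eps                                 # perturb one arbitrary weight
g_numeric = (batch_loss(x_b, y_b, W + dW) - batch_loss(x_b, y_b, W - dW)) / (2 * eps)
print(g_analytic[2, 7], g_numeric)             # the two values should agree closely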

See the full implementation in the Code section below.

Model Performance on CIFAR-10

CIFAR-10 is an image dataset with 10 image classes. In this section, we test our simple softmax classifier's performance on it.

To tune the model for optimum performance, we first need to find the best hyper-parameters (batch size and learning rate). For this, we split the training set into two: a smaller set for validation and the rest to be used solely for training. The model is trained only on this new training set, and the validation set acts as a proxy test set while we search for the best hyper-parameters.

So we train the model for a relatively short duration (10,000 gradient updates) and observe its performance on the validation set for different choices of hyper-parameters. The following table lists model performance for different hyper-parameter combinations.

The batch size and learning rate that gave the best performance are then used to train the model for a longer duration (1,000,000 gradient updates). The following plot shows the validation loss as training progresses.

We get a final test accuracy of 38.05%.

Conclusion

The LDA model gave us 37.85% accuracy on the CIFAR-10 dataset, and the softmax classifier gives us about 38%. It appears to be a close tie between the two models, but one important distinction is that LDA explicitly modeled the data as Gaussian, while we made no such assumption when designing the softmax classifier.

Our simple linear classifier may appear useless when compared to bigger, more complex models (CNNs) that achieve near-perfect accuracy on CIFAR-10. But there is some value in learning these simple models first. They teach some very valuable lessons about data modelling, and they are also very good test beds for optimisation algorithms like the SGD we implemented for this post. Do try them out on some other problems.

Code

The code is here.