In my last post on machine learning I talked about how to implement simple linear regression in Python. This time I am going to implement logistic regression. This technique is used to classify input data into one of several classes.
First let's take a look at two class regression.
Let's have a set of input vectors $\{x_1, x_2, ... , x_N\}$ and a set of targets $\{t_1, t_2, t_3, ..., t_N\}$ as our training data. A logistic regression classifier defines a model $y(x,w) = h(w^Tx)$. The function $h(a)$ is a logistic sigmoid function or a "squashing" function that takes the output of the linear function $w^Tx$ and sqeezes it into a range of values from 0 to 1. Using the logistic sigmoid function allows us to interpret the output of the classifier as the probability that the input belongs to class 1 under certain assumptions.
So just like in linear regression we need to define cost function in terms of the parameters so that we have a measure of the performance of the model and a way to train the model. In the case of logistic regression there is a cost function  that is very effective and commonly used  called the
cross entropy cost function. This cost function is again selected based on a probabilistic interpretation of the classifier.
$$J = \sum_{n=1}^{N}t_n ln(y(x_n) + (1t_n)ln(1y(x_n))$$
The gradient of this cost function is surprisingly the same as the gradient of the square error cost function.
$$\frac{\partial J}{\partial W} = \sum_{n=1}^{N}(y(x_n)  t_n)x_n$$
The weights of the model can be adjusted using the gradient descent algorithm just like in linear regression. The classifier will generate a straight that will separate the two classes.
Multiclass logistic regression can be done using many logistic regression units if the goal is to perform multiple binary classifications. However, if you have a vector that needs to be assigned to one of K classes, the procedure is slightly different. Multiclass regression is performed using the model $y(x) = softmax(W^Tx + \vec{b})$ where x is the input vector, W is a matrix with K columns where each column is a parameter vector, $\vec{b}$ is a vector of biases and the softmax function is defined as follows.
$$softmax(a_j) = \frac{e^{a_j}}{\sum_{k=1}^{K}e^{a_k}}$$
Each of the targets to train the classifier will be an array of size K. If an input belongs to class k, the kth position of the array will be a one. The rest of the elements in the array will be zero. This is called 'onehot encoding'. It is easy to see from the definition of the softmax function that the output of the classifier will be an array of values that are between zero and one and that the sum of all the elements in the output array will be 1. This means that we can interpret the output array as a probability distribution that tells you the probabilities that the input vector belongs to each of the K classes.
Like in two class classification, the input data (with N vectors and dimension M) set can be arranged in a NxM array $X$ where each row is an input vector. The corresponding targets can be arranged in a matrix T that has N rows and K columns. If the input vector $x_n$ belongs to class K the element $T_{nk}$ will be one. The rest of the elements of the matrix will be zero.
The cost function is the same cross entropy function extended to work with the matrix T.
$$J = \sum_{n=1}^{N}\sum_{k=1}^{K}T_{nk}ln(y_k(x_n))$$
However, since we are using a softmax function instead of the logistic sigmoid function, the gradient of the cost function will be different. The derivation to obtain expression for the gradient gets very hairy but the expression for the gradient is surprisingly simple.
$$\frac{\partial J}{\partial W_{uv}} = \sum_{n=1}^{N}X^T_{vn}(t_{nu}  y_{nu})$$
Once we have this gradient, gradient descent can be performed on the parameters in the same way. I gathered all the vectorized implementations of the equations I've mentioned into a classifier class in Python for ease of usage.
To test the classifier I created a scatter of normally distributed points centred around different means on a 2D plane and applied the classifier. With a little bit of help from stackoverflow I was able to figure out how to color the regions of the plot according to the trained classifier.

This is a scatter plot of the unclassified data. The points were generated using multivariate normal distributions using different means. 

This is a plot of the error or the cost over time as the gradient descent algorithm runs. 

A plot of the decision regions learned by the classifier. If a region is shaded a particular color, that means that the classifier identifies all the points in that region as belonging to the same class. 