## Friday, 20 November 2015

### Machine Learning: Implementing Linear Regression and Gradient Descent in Python

It's been a while since I went through Andrew Ng's Machine Learning course on Coursera. The course is very good at introducing the basic concepts but it lacks the depth of knowledge that I need if I really want to understand how to implement and use some of the algorithms on my own. So I decided to work through a more difficult textbook on machine learning. One of the first things I decided to do was implement the machine learning algorithms from scratch using my language of choice: Python.

The most simple idea out there that can be called machine 'learning' is linear regression. I first learned about linear regression in high school math class. So when I saw that linear regression was the first topic in the machine learning course, I was a bit skeptical. I felt that linear regression was too simple to be called machine learning. But it turns out that linear regression is quite a powerful tool to make predictions from data. The term 'linear' in linear regression is a bit misleading because it makes you think about straight line graphs and gives you the impression that trying to fit straight lines to data can't really be that useful. The trick to levelling up the power of linear regression lies in how you choose your variables (or features)! So you can have features that are crazy functions of your initial variables.

To understand this think about how we can use a straight line to fit a set of data points that you think might be exponential in nature. Say yoususpect that output $y$ is related to the variable $x$ by $y = ke^{cx}$. You can rearrange this equation to $log(y) = cx + k_1$ by taking the natural logarithm of both sides of the equation. Now we can plot $log(y)$ vs $x$ on a graph and try to fit a straight line to it. Here the parameter that we 'learn' is $c$ and $k_1$. The variable $x$ is called a feature.

In a similar way you can add features that are functions of the existing features if you think that the relationship is not linear. In a sense we are choosing a 'basis' of functions which we can mix together in different amounts to (hopefully) accurately model the output $y$. The way a Fourier Series builds up a periodic sequence using sine waves of different frequencies is exactly the same as how linear regression works with non linear basis functions.

So I decided to make the system 'learn' the square wave in my first implementation of linear regression. First some math:

The output prediction $\hat{y}$ can is represented by the linear model below. Here 'n' is the number of feature vectors.

$$\hat{y} = \sum_{i=1}^{n} \theta_i x_i$$

To use gradient descent to train this model we need two things: a cost function and a gradient function.

The cost function is a function that takes the values of theta, and the features and the data samples and calculates a cost based on the error between the prediction that the model makes and the actual value. The most commonly used cost function is the mean square error cost function.

In this equation, m is the total number of training samples that we have and $x^{(i)}$ represents a feature vector that is ith training example. The cost function sums over all the training examples available to give one real number that represents the cost of the current values of the parameters.

$$J = \frac{1}{m} \sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 \\ \frac{\partial J}{\partial \theta_j} = \frac{2}{m} \sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})x^{(i)}_j$$

The gradient of a function is a vector that points in the direction in which the function changes the most. So if I want to find the place where the cost is minimum all I have to do is to start out at a point, find the gradient at that point and go in the opposite direction. I have to keep doing this until the gradient is zero (or very close to it). Think of it like a ball rolling down slopes until it comes to rest at a place where there is no slope. So in Python I have to make a loop and each time the loop is run, I update the parameters theta using the rule. Here alpha is the learning parameter.
$$\theta_j := \theta_j - \alpha \frac{2}{m} \sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})x^{(i)}_j$$

So I got a large set of points that were in the shape of a square wave and I used sine wave basis functions as features. After running gradient descent for a thousand iterations I got coeffecients that were really close to the actual Fourier Series expansion of a square wave!
 The learned function superimposed over the original square wave

 How the error decreases over time.

You can find my implementation on my github page.