Friday, 20 November 2015
Prime Spirals in Python
I watched this really fascinating video on prime spirals recently and decided to see if I could write my own code to generate them. It's not terribly efficient but it works! Code is on github! :D
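For anyone who wants the gist without opening the repo, here is a minimal sketch of the idea (my own illustration with made-up names, not necessarily how my GitHub code does it): walk outward from the centre of a grid in a spiral, numbering each cell, and mark the cells whose number is prime.

```python
import numpy as np
import matplotlib.pyplot as plt

def is_prime(n):
    """Trial division: slow, but fine for a small demo grid."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def ulam_spiral(size):
    """Boolean grid with True wherever the spiral index is prime."""
    grid = np.zeros((size, size), dtype=bool)
    x = y = size // 2          # start at the centre
    dx, dy = 1, 0              # first leg points right
    step, n = 1, 1
    while True:
        for _ in range(2):     # each leg length is used twice: 1,1,2,2,3,3,...
            for _ in range(step):
                if not (0 <= x < size and 0 <= y < size):
                    return grid            # walked off the grid: done
                grid[y, x] = is_prime(n)
                n += 1
                x, y = x + dx, y + dy
            dx, dy = -dy, dx   # turn 90 degrees
        step += 1

plt.imshow(ulam_spiral(201), cmap="binary")
plt.show()
```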
Machine Learning: Implementing Linear Regression and Gradient Descent in Python
It's been a while since I went through Andrew Ng's Machine Learning course on Coursera. The course is very good at introducing the basic concepts, but it lacks the depth I need to really understand how to implement and use some of the algorithms on my own. So I decided to work through a more difficult textbook on machine learning, and one of the first things I decided to do was to implement the machine learning algorithms from scratch in my language of choice: Python.
The simplest idea out there that can be called machine 'learning' is linear regression. I first learned about linear regression in high school math class, so when I saw that it was the first topic in the machine learning course, I was a bit skeptical: it felt too simple to be called machine learning. But it turns out that linear regression is quite a powerful tool for making predictions from data. The term 'linear' is a bit misleading, because it makes you think of straight-line graphs and gives you the impression that fitting straight lines to data can't really be that useful. The trick to levelling up the power of linear regression lies in how you choose your variables (or features): you can have features that are crazy functions of your initial variables.
To understand this, think about how we can use a straight line to fit a set of data points that you suspect is exponential in nature. Say you suspect that the output $y$ is related to the variable $x$ by $y = ke^{cx}$. Taking the natural logarithm of both sides gives $\log(y) = cx + \log(k)$, which we can write as $\log(y) = cx + k_1$ with $k_1 = \log(k)$. Now we can plot $\log(y)$ against $x$ and try to fit a straight line to it. Here the parameters that we 'learn' are $c$ and $k_1$, and the variable $x$ is called a feature.
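As a quick sketch of that trick (the variable names and numbers here are mine, purely for illustration), NumPy's `polyfit` can do the straight-line fit in log space:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.1, 5, 50)
y = 2.0 * np.exp(0.8 * x) * rng.normal(1.0, 0.05, x.size)  # noisy y = k*e^(c*x)

c, k1 = np.polyfit(x, np.log(y), deg=1)  # fit log(y) = c*x + k1
print(c, np.exp(k1))                     # should recover c ~ 0.8 and k ~ 2.0
```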
In a similar way, you can add features that are functions of the existing features if you think that the relationship is not linear. In a sense we are choosing a 'basis' of functions which we can mix together in different amounts to (hopefully) accurately model the output $y$. The way a Fourier series builds up a periodic function from sine waves of different frequencies is exactly how linear regression works with non-linear basis functions.
So I decided to make the system 'learn' the square wave in my first implementation of linear regression. First some math:
The output prediction $\hat{y}$ is represented by the linear model below, where $n$ is the number of features.
$$ \hat{y} = \sum_{i=1}^{n} \theta_i x_i$$
To use gradient descent to train this model we need two things: a cost function and a gradient function.
The cost function takes the parameters theta, the features, and the data samples, and calculates a cost based on the error between the predictions the model makes and the actual values. The most commonly used cost function is the mean squared error.
In the equations below, $m$ is the total number of training samples and $x^{(i)}$ is the feature vector of the $i$th training example. The cost function sums over all the available training examples to give a single real number that represents the cost of the current parameter values.
$$
J = \frac{1}{m} \sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 \\
\frac{\partial J}{\partial \theta_j} = \frac{2}{m} \sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})x^{(i)}_j
$$
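These two expressions translate almost line for line into NumPy. This is just a sketch with my own names, assuming a design matrix `X` of shape `(m, n)` whose rows are the feature vectors $x^{(i)}$:

```python
import numpy as np

def cost(theta, X, y):
    """Mean squared error J over all m training samples."""
    residual = X @ theta - y            # y_hat - y, one entry per sample
    return residual @ residual / len(y)

def gradient(theta, X, y):
    """Partial derivatives of J, one per parameter theta_j."""
    residual = X @ theta - y
    return 2.0 * X.T @ residual / len(y)
```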
The gradient of a function is a vector that points in the direction in which the function increases the fastest. So if I want to find where the cost is at a minimum, all I have to do is start at some point, find the gradient there, and move in the opposite direction, repeating until the gradient is zero (or very close to it). Think of it like a ball rolling down slopes until it comes to rest at a place where there is no slope. So in Python I make a loop, and each time the loop runs I update the parameters theta using the rule below, where alpha is the learning rate.
$$ \theta_j := \theta_j - \alpha \frac{2}{m} \sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})x^{(i)}_j$$
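A minimal version of that loop, reusing the `gradient` function sketched above (the zero starting point, alpha, and iteration count are arbitrary choices for illustration):

```python
def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Step against the gradient until the iteration budget runs out."""
    theta = np.zeros(X.shape[1])        # start with all parameters at zero
    for _ in range(iterations):
        theta -= alpha * gradient(theta, X, y)
    return theta
```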
So I generated a large set of points in the shape of a square wave and used sine waves as my basis functions (features). After running gradient descent for a thousand iterations I got coefficients that were really close to the actual Fourier series expansion of a square wave!
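Here is roughly what that experiment looks like using the sketches above (again my own illustrative setup, not the exact GitHub code). For a unit square wave, the Fourier series coefficient of $\sin(kt)$ is $4/(\pi k)$ for odd $k$ and $0$ for even $k$, so that is what the learned thetas should approach:

```python
t = np.linspace(0, 2 * np.pi, 500)
y = np.sign(np.sin(t))                     # the square wave to learn

# One column per basis function sin(k*t), k = 1..10.
X = np.column_stack([np.sin(k * t) for k in range(1, 11)])

theta = gradient_descent(X, y)
print(theta)   # odd harmonics come out near 4/(pi*k), even ones near zero
```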
[Figure: the learned function superimposed over the original square wave]
[Figure: how the error decreases over time]
You can find my implementation on my github page.
Sunday, 15 November 2015
The One Thing that my College Gets Wrong About Exams
I'll be honest. Despite its shortcomings, many of the courses that are offered actually do contain material that I find interesting. In fact, when you're learning something at the level of a college course, there are few subjects that aren't interesting. It's like Richard Feynman said: everything is interesting if you go deep enough into it. My biggest complaint about the system isn't the lack of depth in the courses, or for that matter the professors' depth of knowledge (although I have had my fair share of instructors with a less than optimal grasp of the course material). My biggest complaint is the focus of the examinations that are supposed to evaluate how well the students know the material. The exams - in addition to correlating poorly with the students' grasp of the material - actively discourage them from gaining a true understanding of it.
Typically, the exams I've written (particularly in courses that are highly mathematical) test only the students' ability to memorize two things: derivations and formulae. I'm not saying that people should be exempted from memorizing formulae; plenty of formulae are so important that they must be memorized. The problem is when you have to memorize formulae that look like this:
$L_F = -10 \log\left(\cos\theta \left\{\frac{1}{2} - \frac{1}{\pi} p(1-p^2)^{\frac{1}{2}} - \frac{1}{\pi}\arcsin(p) - q\left[\frac{1}{\pi}y(1-y^2)^{\frac{1}{2}} + \frac{1}{\pi}\arcsin(y) + \frac{1}{2}\right]\right\}\right)$
$p = \frac{\cos(\theta_c)(1-\cos(\theta))}{\sin(\theta_c)\sin(\theta)}$
$q = \frac{\cos^3(\theta_c)}{(\cos^2(\theta_c) - \sin^2(\theta_c))^{\frac{1}{2}}}$
$y = \frac{\cos^2(\theta_c)(1 - \cos\theta) - \sin^2(\theta)}{\sin(\theta_c) \cos(\theta_c) \sin(\theta)}$
For anyone interested, this formula tells you the coupling loss between two fiber-optic cables that are misaligned by an angle $\theta$. The book doesn't even attempt to derive it, because the authors know that the derivation on its own doesn't add to a person's understanding of fiber optics. It may be a nice exercise in algebraic manipulation, but that's pretty much it. In the real world, if this formula ever needs to be used, it's pretty much guaranteed that the engineer will just look it up. Memorizing it is completely pointless.
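To underline the point: in practice 'looking it up' usually means transcribing the formula into code once and never thinking about it again. A direct, unverified transcription of the equations above (my own function name; angles in radians, and I'm assuming the log is base-10 since the result is in dB):

```python
import numpy as np

def angular_coupling_loss_dB(theta, theta_c):
    """Direct transcription of the textbook formula; not validated."""
    p = np.cos(theta_c) * (1 - np.cos(theta)) / (np.sin(theta_c) * np.sin(theta))
    q = np.cos(theta_c) ** 3 / np.sqrt(np.cos(theta_c) ** 2 - np.sin(theta_c) ** 2)
    y = (np.cos(theta_c) ** 2 * (1 - np.cos(theta)) - np.sin(theta) ** 2) / (
        np.sin(theta_c) * np.cos(theta_c) * np.sin(theta))
    brace = (0.5
             - p * np.sqrt(1 - p ** 2) / np.pi
             - np.arcsin(p) / np.pi
             - q * (y * np.sqrt(1 - y ** 2) / np.pi + np.arcsin(y) / np.pi + 0.5))
    return -10 * np.log10(np.cos(theta) * brace)
```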
Why is this focus bad? Firstly, it does not actually test understanding, and it leads to students not getting the scores they deserve. There have been countless instances where I knew how to solve a problem but got no credit because I couldn't remember some stupidly long equation. Secondly, it forces examiners to limit themselves to really simple questions, because they know that remembering the formula or its contrived derivation is half the question. So the number of questions that test my understanding of the material is next to zero. Thirdly, it actively discourages students from trying to understand the material. Once people figure out that they can get good scores by memorization alone, they start focusing on memorization. Sometimes there is so much to memorize that there isn't enough time to try to understand even if I wanted to.
If I were in charge of setting the examinations, I would do the sensible thing and give every student a booklet of the formulae needed for the questions. This allows the questions to be far more interesting: they can test how well students use the information at their disposal to solve the kinds of problems that practicing engineers actually face. The brain is a processor with a limited amount of cache! It's not a hard disk! This simple change of focus will go a long way towards courses that are actually useful. It will also give the professors the freedom to ask tougher questions!
Now I know that this approach actually works! Before college I did my Cambridge IGCSEs and A-Levels. Both sets of exams were very focused on understanding rather than recall: papers either included all the required formulae as part of the questions or (as was the case for Further Mathematics at A-Level) came with a large booklet of formulae. This didn't reduce the difficulty of the exams at all - it's no use knowing a formula if you don't know how to use it. This might sound a bit strange, but I actually enjoyed writing my IGCSEs and A-Levels. The exams were designed well enough that I sometimes came out of the exam hall with a new understanding of what I had learned. I still feel that I understood and learned more in class during high school than I did in most of my college courses. Any understanding I have of my college course material is the result of independent reading and effort rather than the course instruction.
Now, I'm not the first person to talk about this. People have been complaining about the education system for a very, very long time, and many have even tried to change it by approaching the administration. Invariably though, I've seen that - at least in my college - the people in charge are very reluctant to bring about any of these changes. The biggest roadblock is probably the fact that a lot of the professors in my college disagree with what I have said above. They believe that memorization is very important, perhaps because that is how they were taught. Things are changing, but not nearly fast enough, and I don't think the changes I've suggested will be implemented any time soon. So why am I writing all this down? I just really needed to vent. XP