Recently, I embarked on the journey of learning machine learning. The CS229 course was recommended indirectly on the Lex Fridman podcast with Andrej Karpathy (Tesla Vision/OpenAI): Karpathy taught CS231n (a renowned course on vision/deep learning), and CS229 is a prerequisite to CS231n, so I began my journey of picking up the necessary skills to better understand deep learning.
I have condensed my notes of CS229 Lectures 1 and 2 below; for full context and detail, see the original notes here. I may also add useful context from non-course material in my explanations.
CS229 course overview
The course topics can be summarised as follows:
- Supervised Learning
- Deep Learning
- Generalisation and Regularisation
- Unsupervised Learning
- Reinforcement Learning and Control
Supervised Learning
Definition
Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. - paraphrased from Wikipedia
Example
Given the data below, can we predict an unseen house’s price?
| Square footage | # bedrooms | Price (£1000s) |
| --- | --- | --- |
| 300 | 4 | 450 |
| 200 | 2 | 300 |
| 100 | 1 | 120 |
Here, we can introduce some terminology:
- $x^{(i)}$, the input variables or features
- $y^{(i)}$, the output or target variable
- $(x^{(i)}, y^{(i)})$, a training example
- $\{(x^{(i)}, y^{(i)});\ i = 1, \dots, m\}$, a list of $m$ training examples, called a training set
Our goal is, given a training set, to learn a function $h$ so that $h(x)$ is a “good” predictor for the corresponding value of $y$.
- When the target variable $y$ is continuous (e.g. price in the housing example), this is called a regression problem
- When $y$ can take on only a small number of discrete values (e.g. “is it a house or an apartment?”), we call this a classification problem
Linear regression
Linear regression is one type of supervised machine-learning algorithm. It works by approximating $y$ as a linear function of $x$. Our approximation, or hypothesis, of $y$ can be defined by:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

The $\theta_i$’s are the parameters, also called weights. We aim to learn these values through the algorithms discussed from here on.

If we set $x_0 = 1$ (the intercept term), the above equation can be simplified to:

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$$

where $\theta$ and $x$ are both vectors (so the right-hand side is the dot product), and $n$ is the number of input variables (not counting $x_0$).
Given a training set, how can we learn the parameters $\theta$? One reasonable method is to make $h(x)$ close to $y$, at least for the training examples we have available.
Cost function
To get $h(x)$ as close to $y$ as possible, we need a way to measure closeness, such as the least-squares cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
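As a quick sketch, here is this cost function in NumPy on the housing table above (the column of ones for the intercept term is my own addition, to match the $\theta^T x$ form):

```python
import numpy as np

# Toy data from the housing table above (features: sq. footage, # bedrooms).
# A leading column of ones plays the role of the intercept term x_0 = 1.
X = np.array([[1.0, 300.0, 4.0],
              [1.0, 200.0, 2.0],
              [1.0, 100.0, 1.0]])
y = np.array([450.0, 300.0, 120.0])

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum((h(x) - y)^2)."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

# With all-zero parameters, the cost is simply 1/2 * sum(y^2).
print(cost(np.zeros(3), X, y))  # 153450.0
```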
We will show later on why least squares is a natural choice.
Gradient descent
We now want to choose $\theta$ to minimise $J(\theta)$. The algorithm for doing this involves choosing an initial guess of $\theta$, then repeatedly changing $\theta$ to make $J(\theta)$ smaller, until hopefully we converge to a value of $\theta$ that minimises $J(\theta)$.

This is captured by the gradient descent update rule:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

- $:=$ means “update to”
- $\alpha$ is the learning rate
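A minimal sketch of this update rule in NumPy (the toy data and hyperparameters are my own choices, not from the course):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, iters=5000):
    """Batch gradient descent for linear regression.

    Applies theta_j := theta_j - alpha * dJ/dtheta_j, where the gradient
    of the least-squares cost is dJ/dtheta = X^T (X theta - y).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)  # gradient of J(theta)
        theta -= alpha * grad
    return theta

# Fit y = 2x + 1 on toy data (first column is the intercept term x_0 = 1).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y)
print(theta)  # converges to approximately [1.0, 2.0]
```

If $\alpha$ is set too large the updates overshoot and diverge; too small and convergence is very slow — worth experimenting with.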
The normal equations
An alternative to gradient descent: a closed-form solution (able to be worked out in one step):

$$\theta = (X^T X)^{-1} X^T \vec{y}$$
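A sketch of the one-step solve in NumPy, using the standard normal-equation formula $\theta = (X^T X)^{-1} X^T \vec{y}$ from the full notes (toy data my own):

```python
import numpy as np

# Same toy problem: fit y = 2x + 1, this time in one step.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Solve (X^T X) theta = X^T y rather than inverting X^T X explicitly,
# which is more numerically stable.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1. 2.]
```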
Locally Weighted Regression (LWR)
A straight line isn’t always the line of best fit; we could instead use, say, a quadratic function. But do we want $x^2$, $\sqrt{x}$, or $\log(x)$? Feature selection will be discussed in a later lecture.
In ML we distinguish between parametric and non-parametric learning algorithms:
- Parametric learning algorithm: fits a fixed set of parameters to the data
- Non-parametric learning algorithm: the amount of data/parameters you need to keep grows (linearly) with the size of the data
Locally weighted regression differs from linear regression in that you look at the data around the point where you want to make a prediction, and fit a straight line to just those nearby points.
Modified cost function
Fit $\theta$ to minimise

$$\sum_{i} w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$$

We have simply added $w^{(i)}$, a “weighting” function:

$$w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)$$

- If $|x^{(i)} - x|$ is small, then $w^{(i)} \approx 1$
- If $|x^{(i)} - x|$ is large, then $w^{(i)} \approx 0$
Explained in plain English: if an example $x^{(i)}$ is far from where you want to make a prediction, its error term is multiplied by 0 (or something close to 0); if it’s close, it is multiplied by roughly 1. The net effect is that the sum effectively runs over only the values of $x^{(i)}$ close to where you want to make a prediction.
The function $w$, when plotted, is shaped like a Gaussian density (a bell curve), where the height of the curve is the weight assigned to each training point: nearby points are weighted high up the bell, and distant points get weights near 0.
How wide should you choose the bell curve to be? This width is the bandwidth parameter, $\tau$. The choice of this parameter affects over- or underfitting, so it’s good to explore it practically with code, trying a varying $\tau$.
Andrew recommends using LWR when the data is relatively low-dimensional (the number of features $n$ is quite small), you have a lot of data, and you don’t want to think about what features to use.
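A sketch of LWR in NumPy (the helper name `lwr_predict` and the noisy sine data are my own illustration, not from the course):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=1.0):
    """Locally weighted regression prediction at a single query point.

    Solves the weighted least-squares problem
        min_theta sum_i w_i * (y_i - theta^T x_i)^2
    with Gaussian weights w_i = exp(-(x_i - x_query)^2 / (2 tau^2)).
    """
    # Gaussian weights based on distance from the query point
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Noisy non-linear data: a single straight line fits poorly globally,
# but locally a straight line is a good approximation.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
X = np.column_stack([np.ones_like(x), x])  # intercept term x_0 = 1
y = np.sin(x) + 0.1 * rng.standard_normal(100)

pred = lwr_predict(np.array([1.0, 5.0]), X, y, tau=0.5)
print(pred)  # a local linear fit lands reasonably close to sin(5) ≈ -0.96
```

Re-running with a much larger `tau` (so all points get similar weight) effectively recovers plain linear regression and underfits the sine wave; a tiny `tau` fits the noise.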
Probabilistic interpretation of Linear Regression
Why least squares? Why not, say, the absolute error, or something else?
Assume a house’s price is a linear function of the living area and its number of bedrooms, plus an error term $\epsilon^{(i)}$ which captures unmodelled effects and random noise (such as the seller being in an unusually good or bad mood, or the house being near a school district, which makes the price go up or down):

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$

We assume the errors are Gaussian distributed and IID (independently and identically distributed).

If we perform maximum likelihood estimation (see the lecture notes for the full working), it turns out to give the exact same equation as the cost function $J(\theta)$, so this is one justification for the least-squares cost function finding the correct values of $\theta$.
Mathematically, maximizing the likelihood function helps us obtain parameter estimates that are consistent with the observed data, making it a key concept in statistical inference and estimation.
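A condensed version of that working, following the standard derivation and assuming $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ IID: the log-likelihood of $\theta$ is

```latex
\ell(\theta)
  = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right)
  = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    - \frac{1}{\sigma^2} \cdot \frac{1}{2}
      \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
```

The first term is constant in $\theta$, so maximising $\ell(\theta)$ is the same as minimising $\frac{1}{2}\sum_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2$ — exactly the least-squares cost $J(\theta)$.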
The reason we have looked at this is that it gives us a framework for approaching our first classification problem:
- Make assumptions about $p(y \mid x; \theta)$ — here, Gaussian and IID errors
- Work out the maximum likelihood estimate (in this case we derived an algorithm which turned out to be the least-squares algorithm)