Overview

In this chapter, we will focus on supervised learning, where, given a dataset whose labels are known in advance, the task is to predict those labels on a holdout dataset. There are two common types of supervised learning problems: regression and classification. Regression models deal with problems where a continuous value is expected as the output (e.g., predicting house prices), while classification models deal with problems where a discrete value is expected as the output (e.g., classifying emails as spam or non-spam). There are many other ways to distinguish the two problems, but I believe this is the simplest way to differentiate them. If at this moment it is not clear what differentiates these two problems, you will get a feel for it in the next couple of paragraphs, so don't panic just yet!

One of the algorithms used for training supervised learning models is linear regression, in particular, univariate linear regression. Wait a minute! We went from supervised learning, to linear regression, and suddenly to univariate linear regression. What's next? Multivariate linear regression? Exactly! You see, there are families of algorithms, and the deeper we dive, the more algorithms will be introduced.

The goal of linear regression is to find the relationship between the predictor variables (explanatory variables) and a continuous response variable (the outcome). Once this relationship is found, it is possible to predict an outcome given some new input. In the following sections, we will discuss how to represent a linear regression problem and some of the underlying operations needed to build an effective linear regression model.

Diving deep

To give you a basic intuition on the concept of linear regression, we will take a look at a very common and simple linear regression problem: predicting housing prices. In linear regression, the idea is that some independent variables $X$, usually called the explanatory variables, and a response variable $y$, typically referred to as the outcome (a numerical value), are used to fit a line through a series of data points. Notice that when there is more than one explanatory variable the problem is referred to as multivariate linear regression, as clarified above. We will cover multiple variables in the next chapter, but for now let's just focus on the univariate (one-variable) linear regression problem.

So let us say we have a scatter plot as shown below. By the way, I used a few lines of code to generate the image, but don't worry about coding right now -- we will get to that later.

In [2]:
% source code to generate plot
source ('helpers/chapter_1/displaySampleTable.m')

Intuitively, in order to follow the trend of the points in the diagram above, we would draw a diagonal line, as I did with my bare hands below. As you can see, I am not a great linear regression solver, and that's why we need computer algorithms to produce the most optimized line that represents the trend of the data -- that's where the linear regression algorithm comes in.

This is how I solved the linear regression problem, by drawing a straight red diagonal line across the data points:

[Figure: scatter plot with a hand-drawn red diagonal line through the data points]

The line drawn across the data points is also known as the best-fitting line: the line that minimizes the overall distance between the sample points and the fitted line, i.e., the line that optimizes a given error function (more on this in the upcoming sections). Think of this line as us trying to draw a line across the plot while keeping it as close as possible to all the data points.

In mathematics, the fitted line has a slope and an intercept, which represent the parameter values of a linear function. We will discuss the parameter values later in this chapter, but for now it is only important to understand the purpose of linear regression -- fitting through the data points the line that produces the least amount of error.

For now we won't discuss the specifics of the linear regression algorithm, but you can observe that the whole point of it is to draw a line through the sample points so that they lie as close as possible to the red line (the 'best-fitting line'). There are various methods used to compute and generate the best-fitting line, and we will discuss some of them in the upcoming sections.

In this particular figure, $x$ represents the total square footage of houses and $y$ corresponds to their sales price. The fitted line can help us to make an estimation/prediction of how much a house is worth based on its total square footage. For instance, let us say that we wanted to predict the price of a house which has a total square footage of 2600. According to the best-fitting line, the predicted price will be around \$175,000. The figure below shows a hand-drawn example of how this prediction was obtained.

[Figure: hand-drawn prediction of the sales price for a house of 2600 total square feet using the best-fitting line]

In machine learning, the best-fitting line is formally called the hypothesis -- more on this in the upcoming sections. For now just keep in mind that this is what linear regression is really about, and that this is the kind of problem we will be solving in this chapter.


Model Representation

To better understand machine learning algorithms such as linear regression, it is important to represent them using mathematical notation. Since linear regression is one of the simplest machine learning algorithms to understand, it is a reasonable place to start representing algorithms by their mathematical notation. Mathematical notation is important because it standardizes model representations, which is useful for writing machine learning research papers and for sharing machine-learning ideas [TODO: add more use cases here].

Before introducing the notation, take a quick look at the table provided below. The table represents a sample of a housing dataset (10 records). Together with the iris dataset [TODO: provide link], housing datasets are commonly used to demonstrate the inner workings of machine learning algorithms. In addition, the housing dataset has some interesting and desirable properties, which we will review in the upcoming sections of this book.

| Total Square Footage | House Sales Price |
| --- | --- |
| 2566 | 208500 |
| 2524 | 181500 |
| 2706 | 223500 |
| 2473 | 140000 |
| 3343 | 250000 |
| 2158 | 143000 |
| 3380 | 307000 |
| 3197 | 200000 |
| 2726 | 129900 |
| 2068 | 118000 |

If you haven't guessed yet, the table above contains the sample data points used to generate the plot in the previous section. No need to go back up; this is a good time to start getting a bit creative and use some programming to generate a scatter plot with very little code. To keep things minimal we have already stored the sample dataset in a matrix file. We will discuss how to generate these matrices later in the book. For now we will just use this generated file to plot our data.

In [3]:
%% Displays scatter plot for housing prices

% load stored matrix file
load('data/chapter_1/sampleTable');

X = [ones(10,1), sampleTable(:,1)];
y = sampleTable(:,2);

% plotting
plot(X(:,2), y, 'bx', 'markersize', 10, 'lineWidth', 5);
ylabel('Sales Price (y)', 'fontsize', 14);
xlabel('Total Square Footage (x)', 'fontsize', 14);

% changes the size of the plot
set(gca, 'linewidth', 4, 'fontsize', 16)

Using this table as our guide, we now introduce some general notation and terminology. Please note that the notation used here may not exactly match what you find in other machine learning literature; notation varies from author to author. In this book we will stick to widely used notation, so that you can easily transition to other machine-learning literature without any difficulty.

  • $m$ represents the number of training examples, i.e., the number of records/rows/instances/observations in the dataset. In the sampled dataset used to generate the table above, $m=10$.
  • $n$ represents the number of fields or attributes (features) in the dataset. In the sampled dataset, $n=1$.
  • $X$ is a matrix that contains the instances of all input variables. In the sampled dataset, $X$ refers to the column vector labeled Total Square Footage. Note that if more than one feature existed, you would instead refer to it as the feature matrix $X$. We will discuss this further into the book.
  • $y$ is a column vector that contains the labels; it is also referred to as the vector of target variables. In the sampled dataset above, $y$ refers to the column vector labeled House Sales Price.

With this list of notations, we can say that $(x^{(i)},y^{(i)})$ refers to the $i$-th training example. Very helpful, right? If we want to locate the $5$-th training example, we simply write $(x^{(5)},y^{(5)})$, i.e., $i = 5$. The importance of indexing becomes clearer in the upcoming sections where data exploration is covered.
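As a quick, minimal sketch of how these quantities map to code (assuming the stored sampleTable matrix holds the 10 rows shown in the table above, as used in the earlier plotting cell):

% load the stored sample matrix (same file used for the scatter plot)
load('data/chapter_1/sampleTable');

m = size(sampleTable, 1)      % number of training examples (rows); here m = 10
n = size(sampleTable, 2) - 1  % number of input features (the label column excluded); here n = 1

% the 5th training example, (x^(5), y^(5))
x_5 = sampleTable(5, 1)       % x^(5) = 3343 in the sample table
y_5 = sampleTable(5, 2)       % y^(5) = 250000 in the sample table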

| Total BSMT SF | GR Live Area | Overall Quality | Overall Condition | Sale Price |
| --- | --- | --- | --- | --- |
| 856 | 1710 | 7 | 5 | 208500 |
| 1262 | 1262 | 6 | 8 | 181500 |
| 920 | 1786 | 7 | 5 | 223500 |
| 756 | 1717 | 7 | 5 | 140000 |
| 1145 | 2198 | 8 | 5 | 250000 |
| 796 | 1362 | 5 | 5 | 143000 |
| 1686 | 1694 | 8 | 5 | 307000 |
| 1107 | 2090 | 7 | 6 | 200000 |
| 952 | 1774 | 7 | 5 | 129900 |
| 991 | 1077 | 5 | 6 | 118000 |

Since we are dealing with vectors and matrices, basic linear algebra is used to help represent the dataset. For instance, if we were to convert the table above into a matrix of features and their instances, we would obtain the following matrix $X$:

$$ X = \left\lbrack \begin{matrix} x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & x_4^{(1)} \cr x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & x_4^{(2)}\cr \vdots & \vdots & \vdots & \vdots \cr x_1^{(10)} & x_2^{(10)} & x_3^{(10)} & x_4^{(10)} \cr \end{matrix} \right\rbrack $$

As you can observe, the subscript of $x$ corresponds to the column number and the superscript corresponds to the row number.

The $y$ vector can be represented as follows:

$$ y = \left\lbrack \begin{matrix} y_1 \cr y_2 \cr \vdots \cr y_{10} \cr \end{matrix} \right\rbrack $$

Using a little bit of code, let us now look at some examples of operations that can be performed on the matrix $X$. But before that, we load the matrix $X$, which we had already stored in $.mat$ format. The $.mat$ format is used to store matrices on disk for later use. You can see this video for a short tutorial on how to store $.mat$ files. [TODO: Video on storing matrices]
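As a quick aside, here is a minimal sketch of how a matrix can be written to disk in Octave; the file name is illustrative, and by default Octave writes its own text format unless you ask for a MATLAB-compatible binary file:

A = [1 2; 3 4];                              % any matrix we want to keep
save('-mat-binary', 'my_matrix.mat', 'A');   % write A to a MATLAB-compatible .mat file
load('my_matrix.mat');                       % later, restore A into the workspace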

In [30]:
load ('data/chapter_1/matrixX');
In [31]:
matrixX % values for matrix X
matrixX =

    856   1710      7      5
   1262   1262      6      8
    920   1786      7      5
    756   1717      7      5
   1145   2198      8      5
    796   1362      5      5
   1686   1694      8      5
   1107   2090      7      6
    952   1774      7      5
    991   1077      5      6


Matrix Operations

In [32]:
% Operation: access a single element of a matrix
matrixX(2,2)
ans =  1262

The above command outputs the element $x_2^{(2)}$ in the matrix $X$.

In [33]:
% Operation: add elements from an entire row of a matrix
sum(matrixX(1,:))
ans =  2578

The above command outputs the sum of row $1$ in the matrix $X$. Notice that the $:$ symbol is used for extracting all elements of either a row or column. In this example it was used to extract elements for each column in row $1$.

In [34]:
% Operation: add elements from an entire column of a matrix
sum(matrixX(:,1))
ans =  10471

The above command outputs the sum of column $1$ in the matrix $X$. Notice that unlike in the previous example, here we used the $:$ symbol in the first (row) position, which means that we are adding the elements of every row in column $1$. At first it is confusing to know exactly where to put the $:$ symbol when performing operations on an entire column or row, and it gets even trickier when operating on more than one column or row. Let us look at some more advanced examples below.

In [35]:
% Operation: add corresponding elements of column 3 and column 4
[matrixX(:,3) + matrixX(:,4)]
ans =

   12
   14
   12
   12
   13
   10
   13
   13
   12
   11

In the example above, each value in column $3$ was added to the corresponding value in column $4$. In other words, here we performed what is sometimes called an element-wise operation. More on element-wise operations in the upcoming sections.
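As a quick preview (a minimal sketch reusing the same matrixX), element-wise multiplication uses the .* operator, whereas * performs ordinary matrix multiplication:

% element-wise product: multiplies matching entries of column 3 and column 4
matrixX(:,3) .* matrixX(:,4)

% ordinary matrix product: (1x10) times (10x1) gives the sum of those products (a scalar)
matrixX(:,3)' * matrixX(:,4)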

We will look at more advanced operations later in this chapter. [TODO: Do advanced video on this as well]


Data Exploration

Whenever we are dealing with huge datasets that contain hundreds or thousands of rows, it becomes challenging to spot patterns in our head. Fortunately, in data science and machine learning, there are many tools that can be used to help summarize and visualize the data. We will not focus on the different types of visualization techniques in this book, but we will utilize a few to help you better understand the machine learning concepts and the datasets we are dealing with.

Using the original dataset that we used to generate the table above, we will begin to experiment with some Octave code to do some data exploration. (If you are not familiar with Octave, don't worry; the code is very simple to understand, and you will get the hang of it in no time.) Besides, you can refer to the Python version of this book [TODO: PUT PYTHON REFERENCE HERE].

The first thing we need to do is read the full housing dataset, which can be downloaded here:

In [36]:
% Read the housing dataset
data = csvread('data/chapter_1/train.csv'); % read the csv file

After reading the dataset, the next step is to explore it. We need to see what information it contains, such as the data types. This should be done before conducting any transformations or aggregations on the dataset. This is where indexing becomes useful.

If you are not familiar with Octave commands for exploring datasets, you may find it useful to head over to this video for a crash course in data exploration using Octave. [TODO video introducing the exploration of datasets in Octave].

For the first task, we want to extract all instances of columns $39$ and $47$ and create a matrix $X$ from both vectors. Using what we previously learned, this should be fairly easy to accomplish. You may want to single out each column and then create a matrix by combining both vectors. This approach works fine and may take a couple of lines of code, but fortunately for us, Octave allows this to be done in one shot:

In [37]:
X = data(2:11,[39,47])
X =

    856   1710
   1262   1262
    920   1786
    756   1717
   1145   2198
    796   1362
   1686   1694
   1107   2090
    952   1774
    991   1077

This matrix represents the first (Total BSMT SF) and second (GR Live Area) columns in the table provided in the previous section. If we add the values of the first column to those of the second column, element by element, we obtain the Total Square Footage of the houses.

In [38]:
total_square_footage = [X(:,1) + X(:,2)]
total_square_footage =

   2566
   2524
   2706
   2473
   3343
   2158
   3380
   3197
   2726
   2068

If you look closely at the result of this vector, it corresponds to the values in column $1$ of the sample table provided at the beginning of this section.

Good news: we are almost ready to train our first machine learning algorithm. First, let us arrange things a bit, so we can clearly establish $X$ (the predictor variable) and $y$ (the target variable). To train a model, these are the first two pieces of information you need. Let us do that below:

In [39]:
X = total_square_footage
X =

   2566
   2524
   2706
   2473
   3343
   2158
   3380
   3197
   2726
   2068

Note that in this case, we had already come up with our predictor variable, which was stored in the variable total_square_footage. So all we need to do is reassign its value to $X$.

Now we want to obtain the target variable $y$. For convenience, the class labels are usually stored in the last column of the dataset. Thus, in order to obtain the values for the target variable $y$, we do the following:

In [40]:
y = data(2:11, end)
y =

   208500
   181500
   223500
   140000
   250000
   143000
   307000
   200000
   129900
   118000

Well, what do you know? We are almost ready to build and train our first linear regression model. But before that, let us take a closer look at our dataset. Many questions arise upon visualizing the data, such as: which machine learning algorithm is right for the problem? These types of questions can be answered with a deeper understanding of the structure of the data.


Plotting

There are many visualization tools to choose from, but since our dataset only has two fields, we can just plot our data on an $x$ and $y$ plane. Let us see what our sample dataset looks like when plotted.

In [41]:
source ('helpers/chapter_1/displaySampleTable.m')

For brevity, we have avoided some code here and instead created an additional script that we call with the source command. This is a neat function to keep things tidy. If you look in the directory provided as a parameter in the source function call, you will find the code that generates this plot. Feel free to play around with the code and change the settings for your plot. For now, it is sufficient to understand what the plot represents. In the upcoming sections we will give some additional guidance on how to generate these types of visualizations.

As you may have immediately observed, the $X$ values (representing Total Square Footage) are on the $x$-axis and the $y$ values (representing Sales Price) are on the $y$-axis -- this is a reasonable choice for now, and we will see why later.

As we discussed earlier in this chapter, in linear regression, the idea is that some variable $x$, usually called the predictor variable, and a response variable $y$, usually referred to as the outcome, are used to fit a line through some data points. This line, also known as the 'best-fitting line', minimizes the distance between the sample points and the fitted line. The fitted line has a slope and an intercept, which will represent our parameter values.

If you don't know much about linear regression, at this point you may be wondering why this best-fitting line is important. To make this easier to understand, let us consider our housing data example. Let us say we want to build a model that predicts the sales price of a house based on its size. From what we know already, some houses already have a recorded price for their size. For example, the first instance $(x^{(1)},y^{(1)})$, $i = 1$, corresponds to the first house, which has a size of $2566$ and a sales price of $208500$. This information was easy to gather because we already know the sales price for that specific house. But what if we wanted to know the price of a house based on its size, say $2500$ total square footage? As you can see, there is no house with that size in our training dataset. So how do we predict a price for it? What if we guessed a price based on the sizes and prices of all the other houses? Surely there is a pattern that can help us obtain this information. In simple and informal terms, the problem we have just defined is what linear regression aims to solve. The best-fitting line is the pattern we are looking for, and it will help us make a sales price prediction for house sizes that are not in the training dataset.


Best Fitting Line

We already understand the basic intuition of linear regression, but how exactly do we obtain the best-fitting line for the plot above? Fortunately for us, there are several ways you can do this. But before reviewing these methods, let's define some more notation for our problem.

In this problem, we are aiming to obtain what is called a hypothesis, given the predictor variable and the target variable. In our example, given the size of a house ($x$), the hypothesis $h_\theta(x)$ is used to make a prediction of the house sales price ($y$). In other words, $h_\theta(x)$ is a function that maps values from $x$ to $y$.

How do we represent the hypothesis, $h_\theta(x)$?

Well, we said that the hypothesis $h_\theta(x)$ is just a straight line, and if you are familiar with a bit of algebra, this line can be represented by an equation. The equation of a line is $y=mx + b$. Therefore, the hypothesis is nothing more than the same equation with a slight transformation. Instead of saying that the equation of the hypothesis $h_\theta(x)$ is equal to $mx + b$, we write $h_{\theta}(x)=\theta_0 + \theta_1x$. In this form, $m$ (the slope) becomes $\theta_1$ and $b$ (the y-intercept) is nothing more than $\theta_0$; we keep $x$, which represents our predictor variable. For many machine learning experts, this is automatic knowledge, but if you are just starting, we don't expect you to fully grasp this concept on your first encounter with it. What we encourage at this point is to understand the motivation behind clear notation, terminology, and definitions.

$h_{\theta}(x)=\theta_0 + \theta_1x$ is read as “$h$ subscript theta of $x$ is equal to theta subscript $0$ plus theta subscript $1$ times $x$”.

Let us look at a few examples where we already know the parameter values and we just plot the hypothesis on a 2D plane.

Example 1

In [55]:
close all;
X = [ones(6,1), (0:5)'];
theta = [1.5;0];
hypothesis = X * theta;

%% plotting
plot(X(:,2), hypothesis, 'g'); % plot against the feature column, not the whole matrix
set(gca, "linewidth", 4, "fontsize", 16) %% changes the size of the plot

legend('h_\theta(x) ', 'location','north')
xlim([0,5]);
ylim([0,5]);
xlabel('x');
ylabel('y');

Example 2

In [56]:
close all;
X = [ones(6,1), (0:5)'];
theta = [0;0.5];
hypothesis = X * theta;

%% plotting
plot(X(:,2),hypothesis,'g')
set(gca, "linewidth", 4, "fontsize", 16) %% changes the size of the plot
legend('h_\theta(x) ', 'location','north')
xlim([0,5]);
ylim([0,5]);
xlabel('x');
ylabel('y');

Example 3

In [57]:
close all;
X = [ones(6,1), (0:5)'];
theta = [1;0.5];
hypothesis = X * theta;

%% plotting
plot(X(:,2),hypothesis,'g');
set(gca, "linewidth", 4, "fontsize", 16) %% changes the size of the plot
legend('h_\theta(x) ', 'location','north')
xlim([0,5]);
ylim([0,5]);
xlabel('x');
ylabel('y');

In the following sections we will show you three approaches to computing the best-fitting line, i.e., the hypothesis $h_\theta(x)$.

Method 1 - Normal Equations

Normal Equations - Here we are interested in solving for the coefficients, $\theta_0$ and $\theta_1$. In other words, we are looking for the values of the parameters $\theta$. One way to compute these parameter values is by using the equation below:

$$\theta = (X^TX)^{-1}X^Ty$$

The above equation solves for the coefficient vector of size $2 \times 1$, which contains the two parameter values we are looking for, $(\theta_0, \theta_1)$. You can learn more about this here. Now let us see how that looks in code form.

In [45]:
close all;
X = data(2:11,[39,47]);
m = length(X);
X = [ones(m,1), total_square_footage];
theta = (X' * X) \ X' * y; %% or (pinv(X'*X))*X'*y
hypothesis = theta(1) + theta(2) * total_square_footage; % or X * theta

plot(X(:,2), hypothesis,'r-');
hold on;
plot(X(:,2), y,'bo');
legend('Best fitting line is y=1.0603e+02x-9.7638e+04', 'Data points', 'Location','northwest')
ylabel('Sales Price');
xlabel('Total Square Footage');
set(gca, "linewidth", 4, "fontsize", 14) %% changes the size of the plot
hold off;
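To connect this back to the question we asked earlier (what would a house of 2500 total square feet cost?), here is a minimal sketch that reuses the theta computed in the cell above to make a prediction; the exact number you get depends on the fitted coefficients:

% predict the sales price of a house with 2500 total square feet
new_size = 2500;
predicted_price = theta(1) + theta(2) * new_size   % i.e., h_theta(x) = theta_0 + theta_1 * x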

Method 2 - LU Decomposition

In this approach we use a similar concept but with slightly different notation. In fact, the following implementation is just an alternative way of carrying out the first method.

Let us say that we want to solve for the coefficient vector $\theta$. This can be done using a system of matrix equations of the form $Ax = b$. In this case $x$ represents our parameter values $\theta$, so we can rewrite the equation as $A\theta = b$. To compute $\theta$, we write $\theta = A\backslash b$.

$A\backslash b$ gives the same result as $(X^TX)^{-1}X^Ty$, but it is computed differently: the backslash operator solves the linear system directly (via a factorization such as LU) instead of explicitly forming a matrix inverse, which is generally the more elegant choice. We will see how this looks in the following implementation.

In [46]:
close all;
A = [sum(X(:,2).*X(:,2)), sum(X(:,2)); sum(X(:,2)), m];
b = [sum(X(:,2).*y); sum(y)];
theta = flip(A\b);
hypothesis = theta(1) + theta(2) * total_square_footage; % or X * theta

plot(X(:,2), hypothesis,'r-');
hold on;
plot(X(:,2), y,'bo');
legend('Best fitting line is y=1.0603e+02x-9.7638e+04', 'Data points', 'Location','northwest')
ylabel('Sales Price');
xlabel('Total Square Footage');
set(gca, "linewidth", 4, "fontsize", 14) %% changes the size of the plot
hold off;

Method 3

In the next section we look at yet another method for obtaining the best-fitting line: an algorithm called gradient descent, which is used for parameter learning. We will first introduce the notion of a cost function for a given linear regression problem, and then cover how gradient descent minimizes that cost function to obtain the optimized parameter values that give the best-fitting line through our data points.


Cost Function

Essentially, the cost function in linear regression is used to find parameter values, $\theta_0$ and $\theta_1$, such that the hypothesis $h_\theta(x)$ is close to the target variable $y$ for all the training examples $(x,y)$.

Another way to put it is that we want to minimize the sum of the squared differences between the hypothesis (the predictions) and the actual values of the target variable $y$.

Concretely, we are solving for the following:

$$\min_{\rm \theta_0,\theta_1}\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 $$ where $h_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}$.

The equation written above is simply trying to find the values of $\theta_0$ and $\theta_1$ such that $\frac {1}{2m}$ times the sum of the squared differences between the predictions on the training set and the actual house values in the training set is minimized. That is why it is called the cost function (also known as the squared error function, or mean squared error).

More formally, the cost function can be rewritten as follows:

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m(\hat y_i - y_i)^2 =\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2, $$

where $h_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}$.

The objective function then becomes:

$$\min_{\rm \theta_0,\theta_1}J(\theta_0, \theta_1)$$

Ultimately, we are looking for the parameters $\theta_0, \theta_1$ that minimize the function $J(\theta_0, \theta_1)$.

Note that in linear regression there are many other alternative cost functions. However, this one is good enough for the majority of linear regression problems.

Now that we have clearly defined the cost function, $J$, we need to understand what it does and why we want to use it.

Simplifying Cost Function

In order to better understand the cost function $J$, let us use a simplified version of the previous cost function. This simplification will also help with visualizing and providing examples of the inner workings of the cost function. Remember, we are only simplifying the cost function for demonstration purposes.

Let us assume that the hypothesis $h_\theta(x)$ only considers one of its parameters, $\theta_1$; therefore, $\theta_0$ is ignored. In other words, our new hypothesis becomes the following:

$$h_\theta(x) = \theta_1x$$

This also means that the cost function has changed as well; this is the simplified cost function:

$$J(\theta_1) =\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 $$

where $h_\theta(x^{(i)}) = \theta_1x^{(i)}$.

The simplified objective function then becomes:

$$\min_{\rm \theta_1}J(\theta_1)$$

To demonstrate in a very simple manner what the cost function does, let us look at a small example, and then we will apply what we have learned to the housing dataset.

Let's define some dummy data.

In [47]:
X = [0;1;2;3;4;5];
y = [0;1;2;3;4;5];
theta = [1];
hypothesis = X * theta
hypothesis =

   0
   1
   2
   3
   4
   5

Now let's plot it

In [48]:
close all;

%% plotting
plot(X,hypothesis,'r')
hold on;
legend('h_\theta(x) ', 'location','north')
xlim([0,5]);
ylim([0,5]);
xlabel('x');
ylabel('y');
plot(X,y,'bx');
hold off;

As you can see, $h_\theta(x)$ is just a straight diagonal line that passes through the origin. It passes through the origin because the simplified hypothesis has no intercept term, and with $\theta_1 = 1$ it happens to pass exactly through every data point.

The next thing we want to do is compute the cost function for the dummy dataset provided. Since the simplified cost function $J(\theta_1)$ is a function of the parameter $\theta_1$, all we do is plug in the values obtained for the hypothesis $h_\theta(x)$.

$$J(\theta_1) =\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 $$
$$J(1) =\frac {1}{2m}\sum_{i=1}^m(\theta_1x^{(i)} - y^{(i)})^2 $$
$$=\frac {1}{2(6)}[(0-0)^2+(1-1)^2+(2-2)^2+(3-3)^2+(4-4)^2+(5-5)^2] $$
$$=\frac {1}{12}(0^2 + 0^2 + 0^2+0^2+0^2+0^2) = 0$$

Keep in mind that $m=6$, therefore, when $\theta_1 = 1$, $J(\theta_1) =J(1)= 0$. Actually, in code form, $(h_\theta(x^{(i)}) - y^{(i)})$ can simply be calculated as follows:

In [49]:
hypothesis - y
ans =

   0
   0
   0
   0
   0
   0

We will show you how to code the entire cost function later in this chapter, but for now let us look at some more examples. This time let us use a different value for $\theta_1$, say $\theta_1 = 0.5$. Therefore, we now need to compute $J(\theta_1)$ for $\theta_1=0.5$.

$$J(0.5) =\frac {1}{2(6)}[(0-0)^2+(0.5-1)^2+(1-2)^2+(1.5-3)^2+(2-4)^2+(2.5-5)^2]$$
$$=\frac {1}{12}(0 + 0.25 + 1 + 2.25 + 4 + 6.25) = 13.75/12 \approx 1.15$$

When $m=6$ and $\theta_1 = 0.5$, $J(\theta_1) = J(0.5)\approx 1.15$. Let us calculate the new hypothesis using the same dummy dataset and the new value of $\theta_1$.

In [50]:
theta = [0.5];
hypothesis = X * theta
hypothesis =

   0.00000
   0.50000
   1.00000
   1.50000
   2.00000
   2.50000

Let's plot...

In [51]:
close all;

%% plotting
plot(X,hypothesis,'r')
hold on;
legend('h_\theta(x) ', 'location','north')
xlim([0,5]);
ylim([0,5]);
xlabel('x');
ylabel('y');
plot(X,y,'bx');
hold off;

Unlike in the first plot, where $\theta_1 = 1$, the hypothesis line for $\theta_1=0.5$ doesn't go straight through the points. Intuitively this says that our predictions are further from the data, which implies that the cost $J$ will be higher. That is indeed the case here, since $J(0.5) \approx 1.15$.
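As a quick sanity check (a minimal sketch using the hypothesis vector computed above for $\theta_1 = 0.5$ and the dummy $y$), the simplified cost can be computed directly in code and matches the hand calculation:

m = length(y);                               % m = 6
J = (1 / (2 * m)) * sum((hypothesis - y).^2) % gives 1.1458, i.e., roughly 1.15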

Let us look at one more example, say $\theta_1 = 1.5$. If you do the calculation, it turns out that when $\theta_1=1.5$, $J(1.5) \approx 1.15$ as well. And the plot looks like this:

In [52]:
theta = [1.5];
hypothesis = X * theta;

close all;

%% plotting
plot(X,hypothesis,'r')
hold on;
legend('h_\theta(x) ', 'location','north')
xlim([0,5]);
ylim([0,5]);
xlabel('x');
ylabel('y');
plot(X,y,'bx');
hold off;

At this point you may be wondering what we can do with the $J$ values we obtain each time we change the value of $\theta_1$. If we take the values we computed for $J$ and plot them against the parameter values $\theta_1$ we used, here is what it looks like:

In [58]:
thetas = [1;0.5;1.5];
j_values = [0;1.15;1.15];

close all;

%% plotting
plot(thetas,j_values,'bx');
set(gca, "linewidth", 4, "fontsize", 16) %% changes the size of the plot
hold on;
xlim([0,3]);
ylim([0,3]);
xlabel('\theta_1');
ylabel('J(\theta_1)');
hold off;

If we compute more values of $J(\theta_1)$ using different values of $\theta_1$, you will eventually be able to clearly see the shape of a parabola forming when plotting the points as we did above -- at least for the example we provided. As you can observe, $J(\theta_1)$ attains its minimum value when $\theta_1 = 1$, making this the value that solves the objective function. In fact, the point $\theta_1 = 1$ is our global minimum.
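Here is a minimal sketch (reusing the same dummy $X$ and $y$ from above; the variable names and range are just illustrative) that evaluates the simplified cost $J(\theta_1)$ over a range of $\theta_1$ values and plots the result, so you can see the parabola directly:

X = [0;1;2;3;4;5];
y = [0;1;2;3;4;5];
m = length(y);

theta_range = linspace(0, 2, 41);            % candidate values for theta_1
j_values = zeros(size(theta_range));

for k = 1:length(theta_range)
  hypothesis = X * theta_range(k);           % simplified hypothesis h_theta(x) = theta_1 * x
  j_values(k) = (1 / (2 * m)) * sum((hypothesis - y).^2);
end

close all;
plot(theta_range, j_values, 'b-');
xlabel('\theta_1');
ylabel('J(\theta_1)');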

Let us do something fun. Let us put two plots side by side, one where the hypotheses are drawn, and the other where the corresponding cost function values are plotted. It will give you a better look at what we have done so far. We encourage you to always practice these kinds of plots, since they are a great way to understand your machine learning models better.

Assuming we already have all the information we need for our side-by-side plots -- we actually computed it above -- we can just go ahead and hard-code them as follows:

In [62]:
close all;
%% hypothesis values for each corresponding theta_1 used
hypothesis_1 = [0;1;2;3;4;5]; %% hypothesis for theta_1 = 1
hypothesis_2 = [0;0.5;1;1.5;2;2.5]; %% hypothesis for theta_1 = 0.5
hypothesis_3 = [0;1.5;3;4.5;6;7.5]; %% hypothesis for theta_1 = 1.5
thetas = [1;0.5;1.5];
j_values = [0;1.15;1.15];

%% plotting
subplot (1,2,1) %% left plot
plot(X,y,'bx');
hold on;
plot(X, hypothesis_1, 'r');
plot(X, hypothesis_2, 'g');
plot(X, hypothesis_3, 'm');
xlim([0,5]);
ylim([0,5]);
xlabel('x');
ylabel('y');
hold off;

subplot (1,2,2) %% right plot
plot(thetas(1),j_values(1), 'rx');
hold on;
plot(thetas(2),j_values(2), 'gx');
plot(thetas(3),j_values(3), 'mx');

xlim([0,5]);
ylim([0,5]);
xlabel('\theta_1');
ylabel('J(\theta_1)');

With the plot above, you can now see all the hard work you have done so far. Each hypothesis produces a different $J$ value, and this can be seen clearly from the plot since each case has been color-coded.

We are beginning to see how important it is to plot our results, because it gives us a basic idea of how well our machine learning model is functioning. In later chapters, we will discuss some best practices and other types of visualizations you can utilize to make sure your code has no bugs.

Cost function Summary

$$ \text{HYPOTHESIS: } h_{\theta}(x) = \theta_0 + \theta_1x $$
$$ \text{PARAMETERS: } \theta_0, \theta_1 $$
$$ \text{COST FUNCTION: } J(\theta_0, \theta_1) =\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 $$
$$ \text{GOAL: } \min_{\theta_0, \theta_1}J(\theta_0, \theta_1) $$
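To make the summary concrete, here is a minimal sketch of how the cost function could be written as a reusable Octave function (the name computeCost is just an illustrative choice, not a built-in):

function J = computeCost(X, y, theta)
  % X     - m x 2 design matrix (first column of ones, second column the feature)
  % y     - m x 1 vector of target values
  % theta - 2 x 1 parameter vector [theta_0; theta_1]
  m = length(y);
  predictions = X * theta;                       % h_theta(x) for every training example
  J = (1 / (2 * m)) * sum((predictions - y).^2); % squared error cost
end

For example, with the dummy dataset from the previous section, computeCost([ones(6,1), (0:5)'], (0:5)', [0; 0.5]) returns about 1.1458, matching the hand-computed $J(0.5) \approx 1.15$.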

Gradient descent

Gradient descent is an algorithm for minimizing or reducing a cost function $J(\theta_0,\theta_1)$. This is done by changing ${\theta_0,\theta_1}$ until we end up at a minimum.

Gradient descent algorithm:

repeat until convergence {

$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J (\theta_0,\theta_1)$

for $j=0$ and $j=1$ }

where $\alpha$ is the learning rate (it controls how big a step we take toward the minimum).

If $\alpha$ is too small, gradient descent can be very slow. If $\alpha$ is too large, gradient descent can overshoot the minimum; in other words, it may fail to converge, or even diverge. It is important to note that gradient descent can converge to a local minimum even with a fixed learning rate: as we approach the minimum, the gradient becomes smaller, so the algorithm automatically takes smaller steps, and there is no need to decrease $\alpha$ over time.

Gradient Descent for Linear Regression

Here we combine what we have learned so far. We have our hypothesis for linear regression $h_\theta(x) = \theta_0 + \theta_1x$, our cost function $$ J(\theta_0,\theta_1) =\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 $$

and our gradient descent algorithm for minimizing the cost function:

repeat until convergence {

$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J (\theta_0, \theta_1)$

for $j = 0$ and $j = 1$

}

To break down the problem we simplify $\frac{\partial}{\partial \theta_j} J (\theta_0, \theta_1)$. It turns out that:

$$ \frac{\partial}{\partial \theta_j} J (\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j}\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 $$

Substituting the hypothesis gives:

$$ = \frac{\partial}{\partial \theta_j}\frac {1}{2m}\sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2 $$

Since we need the derivative for both $j=0$ and $j=1$, we compute the following:

for $j=0$,

$$ \frac{\partial}{\partial \theta_0}J (\theta_0, \theta_1) = \frac {1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)}) $$

and for $j=1$,

$$ \frac{\partial}{\partial \theta_1}J (\theta_0, \theta_1) = \frac {1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})\cdot x^{(i)} $$

Now that we know how to compute those derivatives, we can define the complete gradient descent algorithm for linear regression (note that $\theta_0$ and $\theta_1$ should be updated simultaneously at each step):

repeat until convergence {

$\theta_0 = \theta_0 - \alpha \frac {1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})$

$\theta_1 = \theta_1 - \alpha \frac {1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})\cdot x^{(i)}$

}

Another thing to note about gradient descent for linear regression is that, with a suitable learning rate, it will always converge to the global minimum. This happens because the squared error cost function for linear regression is a convex (bowl-shaped) function, so it has a single global minimum and no other local minima.

Batch gradient descent refers to using all of the training examples at each step of the gradient descent algorithm.

Another alternative to gradient descent for parameter learning is the normal equations method, which was introduced earlier in this chapter.
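To tie everything together, here is a minimal sketch of batch gradient descent for univariate linear regression, run on the dummy dataset from the cost function section; the learning rate and number of iterations are illustrative choices:

X = [0;1;2;3;4;5];
y = [0;1;2;3;4;5];
m = length(y);

theta = [0; 0];        % initial guess for [theta_0; theta_1]
alpha = 0.05;          % learning rate
num_iters = 1000;      % number of gradient descent steps

for iter = 1:num_iters
  hypothesis = theta(1) + theta(2) * X;               % h_theta(x) for every example
  grad_0 = (1 / m) * sum(hypothesis - y);             % partial derivative w.r.t. theta_0
  grad_1 = (1 / m) * sum((hypothesis - y) .* X);      % partial derivative w.r.t. theta_1
  theta = theta - alpha * [grad_0; grad_1];           % simultaneous update of both parameters
end

theta   % should approach [0; 1], matching the line y = x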

Do A Recap of the Chapter here

TODO: Do a summarized version of the entire notebook, make sure to highlight the main points.

Additional Exercises

  1. Here we only used a sample of the housing dataset. For practice, try to run the different code segments using the complete dataset. Solution here as well.

References