# Overview¶

In this chapter, we will focus on supervised learning, where given a dataset with labels that are known in advance, the task is to predict those labels on a holdout dataset. There are two common types of supervised learning algorithms: linear regression and classification. Linear regression models deal with problems where a continuous value is expected as the output (e.g., predicting house price). While classification models deal with problems where a discrete value is provided as the output (e.g., classifying emails into spam or non-spam). There are many other ways to distinguish both problems but I do believe this is the simplest way to differentiate them. If at this moment it is not clear what differentiaties both these problems, you will get a feel for it in the next couple paragraphs, so don't panic just yet!

One of the algorithms used for training supervised learning models is called Linear regression, in particular, univariate linear regression. Wait a minute! We went from supervised learning, to linear regression, and suddenly to univariate linear regression. What's next? Multivariate linear regression? Exactly! You see there are families of algorithms and the deeper we dive, newer algorithms will be introduced.

The goal of linear regression is to find the relationship between predictor variables (explanatory variables) and continuous response variable (outcomes). Once this relationship is found, then it is possible to predict an outcome given some new input. In the following sections, we will discuss how to represent a linear regression problem and some of the underlying operations needed to build an effective linear regression model.

# Diving deep¶

To give you a basic intuition on the concept of linear regression, we will take a look at a very common and simple linear regression problem: predicting housing prices. In linear regression, the idea is that some independent variables $X$, usually called the explanatory variables, and a response variable $y$, typically referred to as the outcome (a numerical value), are used to fit a line through a series of data points. Notice that where there is more than one explanatory variable this problem is referred to as a multivariate linear regression, as clarified above. We will cover more about multiple variables in the next chapter, but for now let's just focus on the univartiate (one variable) linear regression problem.

So let us say we have a scatter plot as shown below. By the way, I used a few lines of code to generate the image, but don't worry about coding for right now -- we will get to that later.

In [2]:
% source code to generate plot
source ('helpers/chapter_1/displaySampleTable.m')


Intuitively, in order to follow the trend of the points of the diagram above, we would draw a diagonal line as I did with my bare hands below. As you can see I am not a great linear regression solver, and that's why we need computer algorithms to produce the most optimized line that represents the trend of the data -- that's where the linear regression algorithms comes in.

This is how I solved the linear regression problem, by drawing a straight red diagonal line across the data points:

The line drawn across the data points is also known as the best-fitting line, which simply represents the overall minimized distance between the sample points and the fitted line, i.e., the line that optimizes a given error function (more on this in the upcoming sections). Think of this line as us trying to draw the line across while keeping it as close as possible to all the data points in the plot.

In mathematics, the fitted line has a slope and an intercept, which represent the parameter values of a linear function. We will discuss more about the parameter values later in this chapter but for now it is only important to understand what the purpose of linear regression is -- fitting the best line through the data points that contains the least amount of errors.

For now we won't discuss the specifics of the linear regression algorithm, but you can observe that the whole point of it is to draw a line through the sample points, where they lie as closest as possible to the red line ('best-fitting line'). There are various methods used to compute and generate the best-fitting line, and we will discuss some of them in the upcoming sections.

In this particular figure, $x$ represents the total square footage of houses and $y$ corresponds to their sales price. The fitted line can help us to make an estimation/prediction of how much a house is worth based on it's total square footage. For instance, let us say that we wanted to predict the price of a house which has a total square footage of 2600. According to the best-fitting line, the prediced price will be around $175000. The figure below shows a hand-drawn example of how this prediction was obtained. In machine learning, the best fitting line is formally called the hypothesis -- more of this in the upcoming sections. For now just keep in mind that this is what linear regression is really about, and that this is the kind of problem we will be solving in this chapter. # Model Representation¶ To better understand machine learning algorithms such as linear regression, it is important to represent them using mathematical notations. Since linear regression is one of the simplest machine learning algorithms to understand, it is reasonable to start representing algorithms by their mathematical notations here. Mathematical notation is important because it can be used to standardize model representations, which are useful for writing machine learning research papers, and sharing various machine-learning ideas [TODO: add more use cases here]. Before introducing the notations, take a quick look at the table provided below. The table represents a sample of a housing dataset (10 records). Together with the iris dataset [TODO: provide link], housing datasets are commonly used to better demonstrate the inner works of machine learning algorithms. In addition, the housing dataset has some interesting and desirable properties, which we will review in the upcoming sections of this book. Total Square Footage House Sales Price 2566 208500 2524 181500 2706 223500 2473 140000 3343 250000 2158 143000 3380 307000 3197 200000 2726 129900 2068 118000 If you haven't guessed yet, the table above contains the sample data points used to generate the plot in the previous section. No need to go back up, this is a good time to start getting a bit creative and use some programming to generate a scatter plot with very little code. To keep things minimal we have already stored the sample dataset into a matrix file. We will discuss how to generate these matrices later in the book. Fow now we will just use this generated file to plot our data. In [3]: %% Displays scatter plot for housing prices % load stored matrix file load('data/chapter_1/sampleTable'); X = [ones(10,1), sampleTable(:,1)]; y = sampleTable(:,2); % plotting plot(X(:,2), y, 'bx', 'markersize', 10, 'lineWidth', 5); ylabel('Sales Price (y)', 'fontsize', 14); xlabel('Total Square Footage (x)', 'fontsize', 14); % changes the size of the plot set(gca, 'linewidth', 4, 'fontsize', 16)  Using this table as our guide, we now introduce some general notations and terminologies. Please note that the notations used here may not correspond to those used in other machine learning literature. There is no particular reason why some authors may prefer to use some notations which are different from the rest. This completely depends on the author but in this book we will use standardized notations, so that you can easily transition to reading other machine-learning related literature without any difficulties. •$m$represents the number of training examples; or the number of records/rows/instances/observations in the dataset. In the sampled dataset used to generate the table above,$m=10$•$n$represents the number of fields or attributes in the dataset. In the sampled dataset,$n=1$•$X$is a matrix that contains instances for all input variables; In the sampled dataset,$X$refers to the column vector labeled as Total Square Footage. Note that if more than one feature existed, you would instead refer to it as the feature matrix$X$. We will discuss this further into the book. •$y$is a column vector that contains the class labels; it is also referred to as the vector of target variables. In the sampled dataset above,$y$refers to the column vector labeled as House Sales Price. With these list of notations, we can say that$(x^{(i)},y^{(i)})$refers to the$i$-th training example. Very helpful right? If we want to locate the$5$-th training example, we then say$(x^{(5)},y^{(5)}), i = 5$. The importance of indexing becomes more clear in the upcoming sections where data exploration is covered. Total BSMT SF GR Live Area Overall Quality Overall Condition Sale Price 856 1710 7 5 208500 1262 1262 6 8 181500 920 1786 7 5 223500 756 1717 7 5 140000 1145 2198 8 5 250000 796 1362 5 5 143000 1686 1694 8 5 307000 1107 2090 7 6 200000 952 1774 7 5 129900 991 1077 5 6 118000 Since we are dealing with vectors and matrices, basic linear algebra is used to help represent the dataset. For instance, if we were to convert the table above into a matrix of features and their instances, we would obtain the following matrix$X$: $$X = \left\lbrack \begin{matrix} x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & x_4^{(1)} \cr x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & x_4^{(2)}\cr \vdots & \vdots & \vdots & \vdots \cr x_1^{(10)} & x_2^{(10)} & x_3^{(10)} & x_4^{(10)} \cr \end{matrix} \right\rbrack$$ As you can observe, the subscript of$x$corresponds to the column number and the superscript corresponds to the row number. The$y$vector can be represented as follows: $$y = \left\lbrack \begin{matrix} y_1 \cr y_2 \cr \vdots \cr y_{10} \cr \end{matrix} \right\rbrack$$ Using some a little bit of code let us now look at some examples of operations that can be done on matrix$X$. But before that, we load the matrix$X$, which we had already stored in$.mat$format.$.mat$format is used to store matrices as text files for later use. You can see this video for a short tutorial on how to store$.mat$files. [TODO: Video on storing matrices] In [30]: load ('data/chapter_1/matrixX');  In [31]: matrixX % values for matrix X  matrixX = 856 1710 7 5 1262 1262 6 8 920 1786 7 5 756 1717 7 5 1145 2198 8 5 796 1362 5 5 1686 1694 8 5 1107 2090 7 6 952 1774 7 5 991 1077 5 6  ### Matrix Operations¶ In [32]: % Operation: add two elements of a matrix matrixX(2,2)  ans = 1262  The above command outputs the element$x_2^{(2)}$in the matrix$X$. In [33]: % Operation: add elements from an entire row of a matrix sum(matrixX(1,:))  ans = 2578  The above command outputs the sum of row$1$in the matrix$X$. Notice that the$:$symbol is used for extracting all elements of either a row or column. In this example it was used to extract elements for each column in row$1$. In [34]: % Operation: add elements from an entire column of a matrix sum(matrixX(:,1))  ans = 10471  The above command ouputs the sum of column$1$in the matrix$X$. Notice that unlike in the first example above, here we used the$:$symbol at the front, which means that we are adding all the elements for each row in column$1$. At first it is confusing to know exactly where we need to put the$:$symbol when we need to perform operations on an entire column and row. It gets even trickier when we are performing operations on more than 1 column or row. Let us look at some advanced examples below. In [35]: % Operation: add row elements in column 3 and column 4 [matrixX(:,3) + matrixX(:,4)]  ans = 12 14 12 12 13 10 13 13 12 11  In the example above, each value in column$3$was added to the corresponding value in column$4$. In other words, here we performed what is sometimes called element-wise operation. More on element-wise operations in the upcoming sections. We will look at more advanced operations later in this chapter. [TODO: Do advanced video on this as well] ### Data Exploration¶ Whenever we are dealing with huge datasets that contain hundreds or thousands of rows it becomes a bit challenging to visualize patterns in our head. Fortunately, in data science and machine learning, there are thousands of tools that can be used to help summarize and visualize the data. We will not focus on the different types of visualization techniques in this book, but we will utlilize a few to help you better understand the machine learning concepts and the datasets we are dealing with. Using the original dataset that we used to generate the table above, we will begin to experiment with some Octave code to do some data exploration (If you are not familiar with Octave, don't worry, the code is very simple to understand; you will get the hang of it in no time). Beside you can refer to the Python version of this book [TODO: PUT PYTHON REFERENCE HERE] The first thing we need to do is read the full housing dataset, which can be downloaded here: In [36]: % Read the housing dataset data = csvread('data/chapter_1/train.csv'); # read csv file  After reading the dataset, the next step is to explore it. We need to see what information it contains, such as the data types. This should be done before conducting any transormations or aggregations on the dataset. This is where the indexing becomes useful. If you are not familiar with Octave commands for exploring datasets, you may find it useful to head over to this video for a crash course in data exploration using Octave. [TODO video introducing the exploration of datasets in Octave]. In the first task, we want to extract all instances for column$39$and$47$and create a matrix$X$from both vectors. Using what we previously learned, this should be fairly easy to accomplish. You may want to single out each column and then create a matrix by combining both vectors. This approach works fine and may take a couple lines of code, but fortunately for us, Octave allows this to be done in one shot: In [37]: X = data(2:11,[39,47])  X = 856 1710 1262 1262 920 1786 756 1717 1145 2198 796 1362 1686 1694 1107 2090 952 1774 991 1077  This matrix represents the first (Total BSMT SF) and second column (GR Live Area) in the table provided in the previous section. If we add the values from the first column with the second column, and combine the results into one column, we will obtain the values for the Total Square Footage of the houses. In [38]: total_square_footage = [X(:,1) + X(:,2)]  total_square_footage = 2566 2524 2706 2473 3343 2158 3380 3197 2726 2068  If you look closely at the result of this vector, it corresponds to the values in column$1$of the sample table provided at the beginning of this section. Good news, we are almost ready to train our first machine learning algorithm. First, let us arrange things a bit, so we can clearly establish$X$(predictor variables), and$y$(target variable). To train a model, these are the first two pieces of information you need. Let us do that belows: In [39]: X = total_square_footage  X = 2566 2524 2706 2473 3343 2158 3380 3197 2726 2068  Note that in this case, we had already come up with our predictor variable, which was stored in the variable total_square_footage. So all we need to do is just reasigned its value to$X$. Now we want to obtain the target variable$y$. For convenience, the class labels are usually stored in the last column of the dataset. Thus, in order to obtain the values for the target variable$y$, we do the following: In [40]: y = data(2:11, end)  y = 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000  Well what do you know? We are almost ready to build and train our first linear regression model. But before that, let us take a close look at our dataset. There are many questions that will arise upon visulazing the data, such as: what machine learning algorithm is right for the problem? These type of questions can be answered upon a deeper understanding of the structure of the data. ### Plotting¶ There are many visualization tools to choose from, but since our dataset only has two fields, we can just plot our data on an$x$and$y$plane. Let us see what our sample dataset looks like when plotted. In [41]: source ('helpers/chapter_1/displaySampleTable.m')  For brevity, we have avoided some code here and instead created an additional script that we called with the source command. This is a neat function to keep things tidy. If you look in the directory provided as a parameter in the source function call, you will find the code that generates this plot. Feel free to play around with the code and change the settings for your plot. For now, it is just sufficient to understand what the plot represents. Later in the upcoming sections we will give some additional guidance on how to generate these type of visualizations. As you may have immediately observed, the$X$values (representing Total Square Footage) are on the$x$-axis and the$y$values (representing Sales Price) are on the$y$-axis -- this is a reasonable choice for now, and we will see why later. As we discussed earlier in this chapter, in linear regression, the idea is that some variable$x$, usually called the predictor variable, and a response variable$y$, usually referred to as the outcome, are used to fit a line through some data points. This line, also known as the 'best-fitting line', minimizes the distance between the sample points and the fitted line. The fitted line has a slope and an intercept, which will represent our parameter values. If you don't know much about linear regression, at this point you may be wondering why this best-fitting line is important. To make this easier to understand, let us consider our housing data example. Let us say we want to build a model that predicts how much the sales price of a house will be based on its size. From what we know already, some houses are already have a fixed price based on their size. For example, the first instance$(x^{(1)},y^{(1)}), i = 1$, which corresponds to the first house that has a size of$2566$and a sales price of$208500$. This information was easy to gather because we already know the sales prices for that specific house size. What if we wanted to know the price of a house based on the size, say$2500$total square footage. As you can see, there is no house with that size in our training dataset. So how do we predict a price for it? What if we guess a price for the house based on the size and price of all the other houses. Surely there is a pattern, that can help us to obtain this information. In simple and informal terms, the problem we have just defined is what linear regression aims to solve. That best-fitting line is the pattern we are looking for, and it will help to make a sales price prediction for other sizes of house that are not in the training dataset. ### Best Fitting Line¶ We already understand the basic intuition of linear regression, but how exactly do we obtain the best-fitting line for the plot above? Fortunately for us, there are several ways you can this. But before reviewing these methods, let's define some more notations for our problem In this problem, we are aiming to obtain what is called a hypothesis, given the the predictor variable and the target variable. In our example, given the size of a house ($x$), the hypothesis$h_\theta(x)$is used to make a prediction of the house sales price ($y$). In other words,$h_\theta(x)$is a function that maps values from$x$to$y$. How do we represent the hypothesis,$h_\theta(x)$? Well, we said that the hypothesis$h_\theta(x)$is just a straight line, and if you are familiar with a bit of calculus, this line can be represented by an equation. In calculus, the equation of a line is$y=mx + b$. Therefore, the hypothesis is nothing more than the same equation with a slight transormation. Instead of saying that the equation of the hypothesis$h_\theta(x)$is equal to$mx + b$, we simplify things by saying that$h_{\theta}(x)=\theta_0 + \theta_1x$. In such scenario,$m$(slope) represents$\theta_1$and$b$(y-intercept) is nothing more than$\theta_0$; we keep$x$, which represents our predictor variable. For many machine learning experts, this is automatic knowledge, but if you are just starting, we don't expect you to fully grasp this concept on your first encounter with it. What we encourage at this point is to understand the motivation behind clear notations, terminologies, and definitions.$h_{\theta}(x)=\theta_0 + \theta_1x$is read as “$h$subscript theta of$x$is equal to theta subscript$0$plus theta subscript$1$times$x$”. Let us look at a few examples where we already know the paramater values and we just plot the hypothesis on a 2D plane. #### Example 1¶ In [55]: close all; X = [ones(6,1), (0:5)']; theta = [1.5;0]; hypothesis = X * theta; %% plotting plot(X,hypothesis,'g'); set(gca, "linewidth", 4, "fontsize", 16) %% changes the size of the plot legend('h_\theta(x) ', 'location','north') xlim([0,5]); ylim([0,5]); xlabel('x'); ylabel('y');  #### Example 2¶ In [56]: close all; X = [ones(6,1), (0:5)']; theta = [0;0.5]; hypothesis = X * theta; %% plotting plot(X(:,2),hypothesis,'g') set(gca, "linewidth", 4, "fontsize", 16) %% changes the size of the plot legend('h_\theta(x) ', 'location','north') xlim([0,5]); ylim([0,5]); xlabel('x'); ylabel('y');  #### Example 3¶ In [57]: close all; X = [ones(6,1), (0:5)']; theta = [1;0.5]; hypothesis = X * theta; %% plotting plot(X(:,2),hypothesis,'g'); set(gca, "linewidth", 4, "fontsize", 16) %% changes the size of the plot legend('h_\theta(x) ', 'location','north') xlim([0,5]); ylim([0,5]); xlabel('x'); ylabel('y');  In the following sections we will show you three aproaches to computing a best fitting line or the hypothesis,$h_\theta(x)$. ### Method 1 - Normal Equations¶ Normal Equations - Here we are interested in solving for the coefficients,$\theta_0$and$\theta_1$. In other words, we are looking for the values of the parameters$\theta$. One way to compute these parameter values is by using the equation below: $$\theta = (X^tX)^{-1}X^ty$$ The above equation will solve for the coefficient vector of size$2 \times 1$, which will contain the two parameter values we are looking for, ($\theta_0, \theta_1$). You can learn more about this here. Now let us see how that looks in code form. In [45]: close all; X = data(2:11,[39,47]); m = length(X); X = [ones(m,1), total_square_footage]; theta = (X' * X) \ X' * y; %% or (pinv(X'*X))*X'*y hypothesis = theta(1) + theta(2) * total_square_footage; % or X * theta plot(X(:,2), hypothesis,'r-'); hold on; plot(X(:,2), y,'bo'); legend('Best fitting line is y=1.0603e+02x-9.7638e+04', 'Data points', 'Location','northwest') ylabel('Sales Price'); xlabel('Total Square Footage'); set(gca, "linewidth", 4, "fontsize", 14) %% changes the size of the plot hold off;  ### Method 2 - LU Decomposition¶ In this approach we use similar concept but with slightly different notations. In fact, the following implementation is just an alternative approach for the first method. Let us say that we want to solve the coefficients of a matrix$\theta$. This can be done using a set of matrix equations, such as$Ax = b$. In this case$x$represents our parameter values$\theta$. Therefore, we can transform the equation to$A\theta = b$. In order to compute for$\theta$, we say$\theta = A\backslash b$.$A\backslash b$outputs the same result as$(X^tX)^{-1}X^ty$, but it is computed differently. In fact the latter version is more elegant than the former, and we will see why in the following implementation. In [46]: close all; A = [sum(X(:,2).*X(:,2)), sum(X(:,2)); sum(X(:,2)), m]; b = [sum(X(:,2).*y); sum(y)]; theta = flip(A\b); hypothesis = theta(1) + theta(2) * total_square_footage; % or X * theta plot(X(:,2), hypothesis,'r-'); hold on; plot(X(:,2), y,'bo'); legend('Best fitting line is y=1.0603e+02x-9.7638e+04', 'Data points', 'Location','northwest') ylabel('Sales Price'); xlabel('Total Square Footage'); set(gca, "linewidth", 4, "fontsize", 14) %% changes the size of the plot hold off;  ### Method 3¶ In the next section we look at yet another method for obtaining a best fitting line using an algorithm called gradient descent used for parameter learning. We will introduce the notion of a cost function to optimize a given linear regression problem. Additionally, we will cover gradient descent for minimizing the cost function to obtain optimized parameter values to get best fitting line through our data points. ## Cost Function¶ Essentially, the cost function in linear regression aims to find parameter values,$\theta_0$and$\theta_1$, so that the hypothesis$h_\theta(x)$is close to the target variable$y$for all the training examples$(x,y)$. Another way to put is that we want to minimize the problem by summing the squared error difference of the hypothesis (predictions) and the actual values of the target variable$y$. Concretely, we are solving for the following: $$\min_{\rm \theta_0,\theta_1}\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2$$ where$h_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}$. The equation written above is simply trying to find the values of$\theta_0$and$\theta_1$, so that the average$\frac {1}{2m}$times the sum of squared errors between the predictions on the training set minus the actual values of the houses of the training set is minimized. That is why it is called the cost function (squared error function/mean squared error). More formally, the cost function can be rewritten as follows: $$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m(\hat y_i - y_i)^2 =\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2$$ , where$h_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}$. The objective function then becomes: $$\min_{\rm \theta_0,\theta_1}J(\theta_0, \theta_1)$$ Ultimately, we are looking for the parameters$\theta_0, \theta_1$which minimizes the function$J(\theta_0, \theta_1)$. Note that in linear regression there are many other alternative cost functions. However, this one is good enough for the majority of linear regression problems. Now that we have clearly defined the cost function,$J$, we need to understand what it does and why we want to use it. ### Simplifying Cost Function¶ In order to better understand the cost function$J$, let use a simplified version of the previous cost function. This simplification will also help with visualizing and provide examples to the inner workings of the cost function. Remember we are only simplifying the cost function for demonstration purposes. Let us assume that the hypthesis$h_\theta(x)$only considers one of its parameters,$\theta_1$; therefore,$\theta_0$is ignored. In other words, our new hypothesis becomes the following: $$h_\theta(x) = \theta_1x$$ This also means that the cost function has changed as well; this is the simplified cost function: $$J(\theta_1) =\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2$$ where$h_\theta(x^{(i)}) = \theta_1x^{(i)}$. The simplified objective function then becomes: $$\min_{\rm \theta_1}J(\theta_1)$$ To demonstrate in a very simple manner what the cost function will do, let us look at a very simple example, and then we will apply what we have learned to the housing dataset. Let's define some dummy data. In [47]: X = [0;1;2;3;4;5]; y = [0;1;2;3;4;5]; theta = [1]; hypothesis = X * theta  hypothesis = 0 1 2 3 4 5  Now let's plot it In [48]: close all; %% plotting plot(X,hypothesis,'r') hold on; legend('h_\theta(x) ', 'location','north') xlim([0,5]); ylim([0,5]); xlabel('x'); ylabel('y'); plot(X,y,'bx'); hold off;  As you can see, the$h_\theta(x)$is just a straight diagonal line that goes through the$0$origin point. This behavior is mainly because in the example$\theta_1 = 1$. The next thing we want to do now is compute the cost function for this dummy dataset provided. Since the simplified cost function$J(\theta_1)$is a function of the parameter$\theta_1$, all we do is plugin the values obtained for the hyptothesis$h_\theta(x)$. $$J(\theta_1) =\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2$$$$J(1) =\frac {1}{2m}\sum_{i=1}^m(\theta_1x^{(i)} - y^{(i)})^2$$$$=\frac {1}{2(6)}[(0-0)^2+(1-1)^2+(2-2)^2+(3-3)^2+(4-4)^2+(5-5)^2]$$$$=\frac {1}{12}(0^2 + 0^2 + 0^2+0^2+0^2+0^2) = 0$$ Keep in mind that$m=6$, therefore, when$\theta_1 = 1$,$J(\theta_1) =J(1)= 0$. Actually, in code form,$(h_\theta(x^{(i)}) - y^{(i)})$can simply be calculated as follows: In [49]: hypothesis - y  ans = 0 0 0 0 0 0  We will show you how to code the entire cost function later in this chapter but for now let us look at some more examples. This time let us use a different value for$\theta_1$, say$\theta_1 = 0.5$. Therefore, now we need to compute for$J(\theta_1)$,$\theta_1=0.5$. $$J(0.5) =\frac {1}{2(6)}[(0-0)^2+(0.5-1)^2+(1-2)^2+(1.5-3)^2+(2-4)^2+(2.5-5)^2]$$$$=\frac {1}{12}(0 + 0.25 + 1 + 2.25 + 4 + 6.25) = 13.75/12 \equiv 1.15$$ When$m=6$and$\theta_1 = 0.5$,$J(\theta_1) = J(0.5)\equiv 1.15$. Let us calculate the new hypothesis using the same dummy dataset and the new value of$\theta_1$. In [50]: theta = [0.5]; hypothesis = X * theta  hypothesis = 0.00000 0.50000 1.00000 1.50000 2.00000 2.50000  Let's plot... In [51]: close all; %% plotting plot(X,hypothesis,'r') hold on; legend('h_\theta(x) ', 'location','north') xlim([0,5]); ylim([0,5]); xlabel('x'); ylabel('y'); plot(X,y,'bx'); hold off;  Unlike in the first plot, where$\theta_1 = 1$, the hyptothesis line for$\theta_1=0.5$, doesn't go straight through the points. This intuitively says that our cost is higher, which implies that our cost$J$will produce higher margin of error. So is the case here, since$J(0.5) = 1.15$.. Let us look at one more example, say$\theta_1 = 1.5$. If you the calculation right it turns out that when$\theta_1=1.5$,$J(1.5) \equiv 1.15$. And the plot looks like this: In [52]: theta = [1.5]; hypothesis = X * theta; close all; %% plotting plot(X,hypothesis,'r') hold on; legend('h_\theta(x) ', 'location','north') xlim([0,5]); ylim([0,5]); xlabel('x'); ylabel('y'); plot(X,y,'bx'); hold off;  At this point in time you may be wondering what can we do with the$J$values we are obtaining each time we are changing the value of$\theta_1$. If we put the values we computed for$J$and plot it against the parameter values we used$\theta_1$, here is what it looks like: In [58]: thetas = [1;0.5;1.5]; j_values = [0;1.15;1.15]; close all; %% plotting plot(thetas,j_values,'bx'); set(gca, "linewidth", 4, "fontsize", 16) %% changes the size of the plot hold on; xlim([0,3]); ylim([0,3]); xlabel('\theta_1'); ylabel('J(\theta_1)'); hold off;  If we compute more values for$J(\theta_1)$by using different values for$\theta_1$, you will eventually be able to clearly see the shape of a parabola forming when plotting the points as we did above -- at least for the example we provided. As you can observe, the$J(\theta_1)$produces its minimum value when$\theta_1 = 1$, making this the desired value for the objective function. In fact,$\theta_1$is our global minimum. Let us do something fun. Let us put two plots, side by side, one where the hypothesis is drawn, and the other where the corresponding cost function values are plotted. It will give you a better look at what we have done so far. We encourage you to always practise some of these plottings since they are a great way to understand your machine learning models better. Assuming we already have all the information we need to plot our side by side plots --acutally we did it above--, we can just go ahead and hard code them as follows: In [62]: close all; %% hypothesis values for each correspoing theta_1 used hypothesis_1 = [0;1;2;3;4;5]; %% hypothesis for theta_1 = 1 hypothesis_2 = [0;0.5;1;1.5;2;2.5]; %% hypothesis for theta_1 = 0.5 hypothesis_3 = [0;1.5;3;4.5;6;7.5]; %% hypothesis for theta_1 = 1.5 thetas = [1;0.5;1.5]; j_values = [0;1.15;1.15]; %% plotting subplot (1,2,1) %% left plot plot(X,y,'bx'); hold on; plot(X, hypothesis_1, 'r'); plot(X, hypothesis_2, 'g'); plot(X, hypothesis_3, 'm'); xlim([0,5]); ylim([0,5]); xlabel('x'); ylabel('y'); hold off; subplot (1,2,2) %% right plot plot(thetas(1),j_values(1), 'rx'); hold on; plot(thetas(2),j_values(2), 'gx'); plot(thetas(3),j_values(3), 'mx'); xlim([0,5]); ylim([0,5]); xlabel('\theta_1'); ylabel('J(\theta_1)');  With the plot above, now you can see all the hard work you have done so far. Each hypothesis produces a different$J$value, and it can be seen clearly from the plot since the each case has been color-coded. We are beginning to see how important it is to plot our results because it gives us a basic idea of how well our machine learning model is functions. In later chapters, we will discuss some best practises and other different type of visualizations you can utilize to make sure your code has no bugs. ## Cost function Summary¶ $$HYPOTHESIS: h_{\theta}(x) = \theta_0 + \theta_1(x)$$$$PARAMETERS: \theta_0, \theta_1$$$$COST FUNCTION: J(\theta_0, \theta_1) =\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2$$$$GOAL: \min_{\rm \theta_0, \theta_1}J(\theta_0, \theta_1)$$ ## Gradient descent¶ Gradient descent is an algorithm for minimizing or reducing a cost function$J(\theta_0,\theta_1)$. This is done by changing${\theta_0,\theta_1}$until we end up at a minimum. #### Gradient descent algorithm:¶ repeat until convergence {$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J (\theta_0,\theta_1)$for$j=0$and$j=1$} where$\alpha$is the learning rate (how big each step to find global minimum). If$\alpha$is too small, gradient descent can be very slow. If$\alpha$is too large, gradient descent can overshoot the minimum. In other words, it may fail to converge, or even diverge. It is important to note that the gradient descent algorithm can coverge to local minimum even when the learning rate is fixed. It means that as we approach local minimum, the algorithm will automatically take smaller steps, therefore, there is no need to decrease$\alpha$over time. ## Gradient Descent for Linear Regression¶ Here we combine what we have learned so far. We have our hypothesis for linear regression$h_\theta(x) = \theta_0 + \theta_1x$, our cost function $$J(\theta_0,\theta_1) =\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2$$ and our gradient descent algorithm for minimizing the cost function: repeat until convergence {$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J (\theta_0, \theta_1)$for$j = 0$and$j = 1$} To break down the problem we simplify$\frac{\partial}{\partial \theta_j} J (\theta_0, \theta_1)$. It turns out that: $$\frac{\partial}{\partial \theta_j} J (\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j}.\frac {1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2$$ further simplification gives: $$= \frac{\partial}{\partial \theta_j}.\frac {1}{2m}\sum_{i=1}^m(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2$$ Since we are computing the derivative for when$j=0$and$j=1$, we compute for the following: for$j=0$, $$\theta_0 = \frac{\partial}{\partial \theta_0}J (\theta_0, \theta_1) = \frac {1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})$$ and for$j=1$, $$\theta_0 = \frac{\partial}{\partial \theta_1}J (\theta_0, \theta_1) = \frac {1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)}).x^{(i)}$$ Now that we know how to update those parameters we define the complete gradient descent algorithm for linear regression: repeat until convergence {$\theta_0 = \theta_0 - \alpha \frac {1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})\theta_1 = \theta_1 - \alpha \frac {1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)}).x^{(i)}\$

}

Another thing to note with gradient descent for linear regression is that it will always converge to global minimum. This happens because there are two parameters involved in the cost function, which always results in a convex function.

Batch Gradient Descent is when using all the training examples at each step of the gradient descent algorithm.

Another alternative to gradient descent for parameter learning is called Normal Equations, which was introduced in the earlier part of this book.

## Do A Recap of the Chapter here¶

TODO: Do a summarized version of the entire notebook, make sure to highlight the main points.