- Created By: Prasannjeet Singh ([email protected])
- Originally done as an assignment. The problem statements can be found here.
- Note that there are a total of 5 exercises solved in this project.
This file contains solutions 1 through 6 of the first exercise of Assignment 1.
Notes
- All the function implementations are done in external files, which are all included within this folder. More details about each implemented function can be found in their respective .m files.
- Console outputs are shown in black background and white text.
Plotting uses the function plotTrainingData(), which is defined in this folder. More about the function can be read in its function definition.
load Data/data1.mat;
plotTrainingData(X, y, 65);
snapnow;
close;
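For reference, a minimal sketch of what such a training-data plot boils down to, assuming X is an n-by-2 matrix and y holds the class labels (the actual plotTrainingData.m, including the meaning of its third argument, may differ):
load Data/data1.mat;
gscatter(X(:,1), X(:,2), y, 'br', 'ox');   % one colour/marker per class label
xlabel('x_1'); ylabel('x_2');
title('Training data by class');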
The function kNNclassify() has been implemented in this folder; more about it can be read in its function definition.
Classifying the point (-17, 14) below uses this kNNclassify() function.
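Before looking at the calls below, here is a minimal sketch of the core vote that a k-NN classifier of this kind performs. It assumes Euclidean distance and implicit expansion (MATLAB R2016b+), and is an illustration of the general technique rather than a copy of kNNclassify.m:
load Data/data1.mat;
k = 3;                                % hypothetical choice of k
p = [-17, 14];                        % the query point used below
d = sqrt(sum((X - p).^2, 2));         % Euclidean distance to every training point
[~, idx] = sort(d);                   % nearest training points first
predictedLabel = mode(y(idx(1:k)))    % majority vote among the k nearest labels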
load Data/data1.mat;
hFig = figure(3);
set(hFig, 'Position', [0 0 1500 500]);
for i = 1:3
    subplot(1,3,i);
    % k takes the odd values 1, 3, 5; no semicolon, so OutputMessage is printed
    [~,~,OutputMessage] = kNNclassify(i*2-1, X, y, [-17, 14])
end
snapnow;
close(hFig);
OutputMessage =
'For k =1, and point = (-17,14)
New Point Belongs to Group [1] (or RED) Category
The point has been marked in pink
The boundary has also been marked with the color red'
OutputMessage =
'For k =3, and point = (-17,14)
New Point Belongs to Group [1] (or RED) Category
The point has been marked in pink
The boundary has also been marked with the color red'
OutputMessage =
'For k =5, and point = (-17,14)
New Point Belongs to Group [0] (or BLUE) Category
The point has been marked in light blue
The boundary has also been marked with the color blue'
The boundary is drawn using the function kNNdrawBoundary(), which has been implemented in this folder. It uses the contour() function to draw the boundary; more about it can be read in the function definition, which is in this folder.
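For intuition, here is a hedged sketch of the general contour()-based approach: classify every point of a dense grid and draw the contour where the predicted label changes. It assumes binary 0/1 labels; the actual kNNdrawBoundary.m may differ:
load Data/data1.mat;
k = 3;
xv = linspace(min(X(:,1)), max(X(:,1)), 150);
yv = linspace(min(X(:,2)), max(X(:,2)), 150);
[xg, yg] = meshgrid(xv, yv);
grid = [xg(:), yg(:)];
labels = zeros(numel(xg), 1);
for j = 1:numel(xg)
    d = sum((X - grid(j,:)).^2, 2);   % squared distances suffice for ranking
    [~, idx] = sort(d);
    labels(j) = mode(y(idx(1:k)));    % classify every grid point
end
contour(xg, yg, reshape(labels, size(xg)), [0.5 0.5]);   % the 0/1 transition is the boundary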
load Data/data1.mat;
hFig = figure(1);
set(hFig, 'Position', [0 0 1500 500]);
for i = 1:3
    subplot(1,3,i);
    kNNdrawBoundary(i*2-1, X, y);   % boundaries for k = 1, 3, 5
end
snapnow;
close(hFig);
To calculate the training error, I check whether a particular value of k can correctly classify each already-existing point in the graph. This is done by the method trainingError(), which is implemented and defined in this folder. I have also plotted a graph that maps the training error for each value of k, generated by the function trainingErrorMatrix(), which is likewise implemented and defined in this folder.
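For concreteness, a minimal sketch of how such a training error can be computed, classifying each training point against the full training set (the actual trainingError.m may differ):
load Data/data1.mat;
k = 3;
nErr = 0;
for j = 1:size(X,1)
    d = sum((X - X(j,:)).^2, 2);                 % the point itself stays in the reference set
    [~, idx] = sort(d);
    nErr = nErr + (mode(y(idx(1:k))) ~= y(j));   % count misclassified training points
end
trainingErrorPercent = 100 * nErr / size(X,1)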
Let us see this graph for a broad understanding:
load Data/data1.mat;
[~,trainingErrorGraph] = trainingErrorMatrix(X,y);
snapnow;
close;
Just by looking at the image above, we can say that in most cases the training error increases as we increase the value of k. Note that k is always odd in the graph above. Let us also make a table with precise training-error values for a few values of k:
 k    Training Error (%)
 1         0.000
 3         8.333
 5        11.667
 7        15.000
 9        15.000
11        16.667
From the table above, notice that there is no training error at all when k = 1. However, k = 1 is not the best choice for generalising the model to unseen points: no matter how the points are scattered in the graph, the training error for k = 1 is always zero, because any particular point is always its own nearest neighbour (with distance 0). This tells us that k = 1 is a clear case of overfitting; while training results look best when overfitting, test results do not. Moreover, outliers then have a high chance of being wrongly classified.
Therefore, a relatively larger value of k has a higher probability of classifying test data well than k = 1. However, we cannot choose a very high value of k either, as the training error keeps growing as k increases. A value between 3 and 7 should therefore be optimal in the above scenario.
Q6 kNNclassify_taxi Implementation, Classification and Boundary with Different Values of k [1, 3, 5]
Drawing the boundary for the Taxi-Cab algorithm uses the same function as for the Euclidean algorithm, kNNdrawBoundary(). Using the optional fourth argument, we can draw the Taxi-Cab boundary. More details about the optional argument can be read in the function definition, which is present in this same folder.
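The only conceptual difference between the two variants is the distance function; a small illustration of the swap, assuming X and y from data1.mat (the actual implementation may differ):
load Data/data1.mat;
p = [-17, 14];
dEuclid = sqrt(sum((X - p).^2, 2));   % L2 (Euclidean) distance
dTaxi   = sum(abs(X - p), 2);         % L1 (taxi-cab / Manhattan) distance
% The two metrics can rank neighbours differently, so the k-NN vote may flip.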
Classification of the point (-17, 14) for various values of k
load Data/data1.mat;
hFig = figure(6);
set(hFig, 'Position', [0 0 1500 500]);
for i = 1:3
    subplot(1,3,i);
    % k = 1, 3, 5; no semicolon, so OutputMessage is printed
    [~,~,OutputMessage] = kNNclassify_taxi(i*2-1, X, y, [-17, 14])
end
snapnow;
close(hFig);
OutputMessage =
'For k =1, and point = (-17,14)
New Point Belongs to Group [1] (or RED) Category
The point has been marked in pink
The boundary has also been marked with the color red'
OutputMessage =
'For k =3, and point = (-17,14)
New Point Belongs to Group [0] (or BLUE) Category
The point has been marked in light blue
The boundary has also been marked with the color blue'
OutputMessage =
'For k =5, and point = (-17,14)
New Point Belongs to Group [1] (or RED) Category
The point has been marked in pink
The boundary has also been marked with the color red'
Let us now compare the classifications done by Taxi-Cab algorithm and Euclidean algorithm:
k    Euclidean Algorithm    Taxi-Cab Algorithm
1             1                      1
3             1                      0
5             0                      1
As we can see, the two algorithms agree only at k = 1 and produce different results for k = 3 and k = 5.
Drawing The Boundary for This Algorithm in Comparison with the Euclidean Algorithm
Note that the small dots scattered in the graph are the original points from the data1.mat file, plotted along with the boundary for reference.
load Data/data1.mat;
for i = 1:3
    kNNdrawBoundary(i*2-1, X, y, 1);   % optional fourth argument enables the taxi-cab boundary
    snapnow;
    hold off;
    close;
end
This file contains solutions 1 to 3 of Exercise 2 of Assignment 1.
This can simply be done using the built-in gscatter() function:
clear all;
load Data/data2.mat;
thisFig = figure(1);
gscatter(train_X(:,1),train_X(:,2), train_y, 'brgk', 'phd*');   % one colour/marker per class
snapnow;
close(thisFig);
This uses the function kNNdrawBoundary_multiclass(), which is defined in this folder. More details about the function can be read in its file.
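The k-NN vote itself needs no change for more than two classes, since mode() already picks the most frequent label. A minimal sketch with a hypothetical query point p (not a copy of the actual implementation):
load Data/data2.mat;
k = 3;
p = [0, 0];                              % hypothetical query point
d = sum((train_X - p).^2, 2);
[~, idx] = sort(d);
classLabel = mode(train_y(idx(1:k)))     % mode() extends the vote to any number of classes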
load Data/data2.mat;
thisFig = figure(2);
set(thisFig, 'Position', [0 0 2000 500]);
for i = 1:4
    subplot(1,4,i);
    kNNdrawBoundary_multiclass(i*2-1, train_X, train_y);   % boundaries for k = 1, 3, 5, 7
end
snapnow;
close(thisFig);
This uses the errorRateFinder() function, which is defined in this folder. More details about the function can be read in its file.
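Presumably the error rate is the percentage of misclassified test points; a hedged sketch of that computation (the actual errorRateFinder.m may differ, e.g. in its distance metric):
load Data/data2.mat;
k = 3;
nTest = size(test_X, 1);
pred = zeros(nTest, 1);
for j = 1:nTest
    d = sum((train_X - test_X(j,:)).^2, 2);
    [~, idx] = sort(d);
    pred(j) = mode(train_y(idx(1:k)));
end
errorRatePercent = 100 * mean(pred ~= test_y)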
Plotting Error Rate vs k for the given test data
% Initialization
load Data/data2.mat;
[rows, ~] = size(train_X);
% Running loop to find accuracy and error rates for each k
theMatrix = zeros(rows, 2);   % preallocate; every entry is overwritten in the loop below
for i = 1:rows
    [~,er] = errorRateFinder(i,train_X,train_y,test_X,test_y);   % error rate when k = i
    theMatrix(i,:) = [i, er];
end
% Now plotting the above generated matrix
hFig = figure(3);
set(hFig, 'Position', [0 0 1000 500]);
subplot('position',[0.05 0.1 0.95 0.85]);
graph = bar(theMatrix(:,1), theMatrix(:,2),'FaceColor',[0 .5 .5]);
title('Error Plotted Against k');
xlabel(sprintf('k-Values ranging from 1 to %d', rows));
ylabel('Error Rate (%)');
snapnow;
close(hFig);
As can be observed clearly, the error gradually increases as we increase the value of k. Moreover, after a certain value of k, the error jumps to a maximum and then stays constant: once k approaches the number of training points, every query point is simply assigned the overall majority class. Let us now find the value of k with the least error, and the error percentage for that k.
leastError = sortrows(theMatrix, 2);   % sort ascending by error rate
kValue = leastError(1,1)
ErrorValue = leastError(1,2)
kValue =
29
ErrorValue =
7.7922
Therefore, the value of k with the smallest error (7.8%) is 29.
This document contains solutions for questions 1 to 3 of Exercise 3.
The function kNNregression() has been implemented and is available in this folder.
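For reference, the core of k-NN regression is averaging the targets of the k nearest training points. A minimal sketch, assuming z is a row vector compatible with train_X and ignoring any feature normalization the actual kNNregression.m may perform:
load Data/data3.mat;
k = 12;
d = sum((train_X - z).^2, 2);            % squared distances from every training row to z
[~, idx] = sort(d);
prediction = mean(train_y(idx(1:k)))     % average of the k nearest targets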
- Using the given sample data z
- Using the value of k as 12
load Data/data3.mat;
kNNregression(12, train_X, train_y, z)
ans =
1.6156e+04
As can be observed, the answer matches with the one provided in the assignment.
As Mean Squared Error is not explicitly defined in the question, it is assumed to be the standard definition, taken from the Wikipedia article on mean squared error:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2$$

where $\hat{Y}_i$ is the predicted value and $Y_i$ the observed value. This matches the computation in the code below.
load Data/data3.mat;
[rows, ~] = size(test_X);
testResult = zeros(rows, 10);   % preallocate: one column of predictions per value of k
for k = 1:10
    for i = 1:rows
        testResult(i,k) = kNNregression(k, train_X, train_y, test_X(i,:));
    end
end
testResult = ((sum((testResult-(repmat(test_y, [1,10]))) .^ 2))/rows)';   % column-wise MSE: one value per k
testResult(:,2) = 1:10;
thisFig = figure(3);
graph = bar(testResult(:,2), testResult(:,1),'FaceColor',[0 .5 .5],'EdgeColor',[0 .9 .9],'LineWidth',1.5);
title('Mean Squared Error for Different Values of k');
xlabel('k Values');
ylabel('Mean Squared Error');
snapnow;
close(thisFig);
Now, let us find the value of k that minimizes the test error:
testResult = sortrows(testResult, 1);   % sort ascending by MSE
kValue = testResult(1,2)
errorValue = testResult(1,1)
kValue =
4
errorValue =
1.2405e+06
Therefore, as calculated, the least test error is observed with k = 4.
This document contains the solution to the VG exercise of Assignment 1.
Note that I have used two features from the sample data provided to us (data3.mat) to generate the heat map, as it may be more insightful to observe the heat map of real data rather than of randomly generated data.
The features used for generating heat-map are:
- Wheel-Base (Column 1 in data3.mat), and
- Engine-Size (Column 6 in data3.mat).
Both features mentioned above are normalized in the function implementation (a sketch of this scaling follows the code below).
% Clearing the workspace, loading the data, then selecting the data-sets
% as described above.
clear all;
load Data/data3.mat
heatX = train_X(:,[1,6]);
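A one-line sketch of the min-max scaling presumably applied inside the function (an assumption, not the actual code):
heatXn = (heatX - min(heatX)) ./ (max(heatX) - min(heatX));   % hypothetical scaling of each column to [0, 1]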
The grid-resolution value G can be specified below. The returned matrix is a two-dimensional square matrix with dimensions G*G; thus, a total of G*G test points are generated, k-NN regression is applied to each, and the results are returned in a square matrix. A higher value of G produces a less pixelated heat map but takes longer to execute. No loops are used. The data is always normalized to the range 0 to G, as G is the width and height of the graph. k was arbitrarily chosen as 12; both values can be changed to see varied results.
G = 500;
k = 12;
Note that the function below will accept any two features (from any data source) to create a heat matrix, as long as there are exactly two features distributed in two columns, supplied together with the corresponding training targets.
[heatMatrix] = heatMapGenerator(k, heatX, train_y, G);
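For reference, here is one hedged, loop-free way such a heat matrix could be built; it assumes pdist2() from the Statistics and Machine Learning Toolbox and is not the actual heatMapGenerator.m:
Xn = (heatX - min(heatX)) ./ (max(heatX) - min(heatX)) * G;    % features scaled to [0, G]
[gx, gy] = meshgrid(1:G, 1:G);
queries = [gx(:), gy(:)];                                      % all G*G grid points at once
D = pdist2(queries, Xn);                                       % every query-to-training distance in one call
[~, idx] = sort(D, 2);                                         % nearest training rows per query
sketchMatrix = reshape(mean(train_y(idx(:, 1:k)), 2), G, G);   % k-NN regression value per grid cell
Computing all distances in a single pdist2() call is what removes the loops, at the cost of memory: D has G*G rows, so memory use rises quickly with G.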
Following is the generated graph:
hFig1 = figure(4);
h1 = axes;
imagesc(heatMatrix), colorbar, colormap(flipud(hot));
set(h1, 'XAxisLocation', 'Top');
title('Heat Map for Car Prices');
xlabel('Wheel-Base');
ylabel('Engine-Size');
snapnow;
close(hFig1);
The heat map generated above is exactly as we expected. As can be observed in the sample data, the cost of a vehicle increases as wheel-base or engine size increases, and the same trend is visible in the image above. Note that the ranges of the X and Y axes are normalized according to the value of G we chose.
Additionally, since I chose G = 500, a lower value of G would have created a pixelated graph, as shown below. The trends, however, would not change.
heatMatrix2 = heatMatrix(1:ceil(G/100):G, 1:ceil(G/100):G);   % keep every 5th row and column (~100x100)
hFig2 = figure(5);
h1 = axes;
imagesc(heatMatrix2), colorbar, colormap(flipud(hot));
set(h1, 'XAxisLocation', 'Top');
title('Heat Map for Car Prices');
xlabel('Wheel-Base');
ylabel('Engine-Size');
snapnow;
close(hFig2);
Note that to generate a surface plot, the size of the matrix had to be proportionally reduced to limit the number of edges (the black lines in the graph that show depth), which otherwise made the graph uninterpretable. The trends shown, however, are exactly the same.
heatMatrix3 = heatMatrix(1:ceil(G/10):G, 1:ceil(G/10):G);   % keep every 50th row and column (~10x10)
hFig3 = figure(6);
h1 = axes;
surf(heatMatrix3), colorbar, colormap(flipud(hot));
set(h1, 'XAxisLocation', 'Top');
title('3-D Heat Map for Car Prices');
xlabel('Wheel-Base');
ylabel('Engine-Size');
snapnow;
close(hFig3);
% Rotated Graph
hFig4 = figure(7);
h1 = axes;
theGraph = surf(heatMatrix3); colorbar, colormap(flipud(hot));
set(h1, 'XAxisLocation', 'Top');
title('3-D Heat Map for Car Prices (Rotated)');
xlabel('Wheel-Base');
ylabel('Engine-Size');
direction = [0 0 1];
rotate(theGraph,direction,-75)
snapnow;
close(hFig4);
- The function used to create the 2-D heat map was imagesc(), and
- the function used to create the 3-D heat map was surf().
I chose these functions because they require no new library inclusions and are easy to use; surf() can also be easily rotated and snapped for different representations. Moreover, I changed the colormap to red tones to signify that a heat map is drawn. However, I prefer representing the heat map in two dimensions, as the colour scale already tells us which area of the graph has a higher price. Furthermore, using fewer data points to plot the heat map gave me, in a way, an idea of how screen resolutions work (e.g., in mobile phones), where displays with fewer dots per inch (dpi) create lower-quality images than high-dpi displays.