{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "\n", "**15-448: Machine Learning in a Nutshell**, *CMU-Qatar* Spring'20\n", "\n", "**Gianni A. Di Caro**, www.giannidicaro.com\n", "\n", "Disclaimer: This notebook was prepared for teaching purposes. It can include material from different web sources. I'll be happy to explicitly acknowledge a source if required. \n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "
\n",
" \n",
" - All examples in the neighborhood are weighted **uniformly** (each example counts one).
\n",
" \n",
"- **Distance-based rule(s):** the distance of the examples (in the neighborhood) from the query point is taken into account to define other criteria for assigning the class (e.g., the class whose points have the minimal average or median distance).
\n",
" \n",
" - All examples in the neighborhood are weighted by **distance** (each example counts based on its distance from the query point, more precisely by the inverse of the distance).
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do we measure *distances* in the feature space? 🤔 ... Euclidean distance is just ONE option out of many (food for future thoughts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implement and use k-NN using `scikit-learn`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`scikit-learn` provides the `sklearn.neighbors` module, which implements the k-NN classifier as `KNeighborsClassifier`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"import numpy as np\n",
"\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"matplotlib.rcParams['figure.dpi'] = 100  # set the figure resolution to 100 dpi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up the parameters of the learning classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step consists of setting up the **parameters of the k-NN classifier:**\n",
"- Number of neighbors\n",
"- Voting / weighting modality"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"neighbors_num = 3\n",
"weighting_opts = ['uniform', 'distance']\n",
"weights = weighting_opts[0]\n",
"classifier = KNeighborsClassifier(n_neighbors=neighbors_num, weights=weights)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fit the model with the training data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The second step consists of **fitting** the k-NN classifier with the labeled dataset."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" metric_params=None, n_jobs=1, n_neighbors=3, p=2,\n",
" weights='uniform')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.datasets import load_iris\n",
"iris = load_iris()\n",
"\n",
"classifier.fit(iris.data, iris.target)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have a **classifier** object that can be queried for new input examples."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do the parameters `metric='minkowski'` and `p=2` mean?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
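"They deal with the notion of *distance*: the Minkowski distance of order $p$ is $d(x, y) = \\left(\\sum_i |x_i - y_i|^p\\right)^{1/p}$, and with `p=2` it reduces to the familiar Euclidean distance."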
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Measure the empirical loss / accuracy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The third step is checking the predictions **on the training dataset itself**: loss on the training set $\\rightarrow$ empirical error."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"prediction = classifier.predict(iris.data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The resulting `prediction` array gives the predicted target for each training data point."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,\n",
" 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
" 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
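"**Mean classification error using the 0/1 Loss:**\n",
"\n",
"$$\\bar{E} = \\frac{1}{m}\\sum_{i=1}^{m} {\\bf 1}_{clf(x_i) \\ne t_i}$$\n",
"\n",
"where $m$ is the number of data points, $x_i$ is the feature vector of the $i$-th point, and $t_i$ is its target label. In other words: count and average all classification errors.\n",
"\n",
"$$\\text{Mean Accuracy} = 1 - \\bar{E}$$"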
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"mean_accuracy = np.mean(prediction == iris.target)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.96"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_accuracy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pretty good accuracy, as expected, since we are testing on the same dataset used for training!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, how can we achieve **100% accuracy on the training dataset using the k-NN classifier?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
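"Setting:\n",
"\n",
"`classifier = KNeighborsClassifier(n_neighbors=1)`\n",
"\n",
"With $k=1$, the nearest neighbor of each training point is the point itself (at distance zero), so each point is assigned its own label, provided that no two identical points carry different labels."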
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the decision boundaries of the learned classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Issue: We need to restrict the visualization to a **two-dimensional feature space.** \n",
"\n",
"Accordingly, we need to **learn using only two features**, and visualize how the classifier would classify new data points using these two features only. Remember, however, that the target labels were assigned to four-dimensional feature vectors, so some inconsistencies may appear in the visualization because we are using less information than the labels reflect."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['sepal length (cm)',\n",
" 'sepal width (cm)',\n",
" 'petal length (cm)',\n",
" 'petal width (cm)']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.feature_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's consider the first two features of the dataset, namely, `sepal length` and `sepal width`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean accuracy with features (0,1): 0.853\n"
]
}
],
"source": [
"# extract the first two features\n",
"feature_1 = 0\n",
"feature_2 = 1\n",
"iris_features_2 = iris.data[:, [feature_1,feature_2]]\n",
"\n",
"# learn the k-NN classifier using only the selected features\n",
"classifier.fit(iris_features_2, iris.target)\n",
"\n",
"# check the prediction performance on the whole dataset\n",
"# note that query data dimension must match training data dimension\n",
"prediction = classifier.predict(iris_features_2)\n",
"\n",
"# compute the mean accuracy over the training dataset\n",
"mean_accuracy = np.mean(prediction == iris.target)\n",
"\n",
"print('Mean accuracy with features ({},{}):'\n",
" ' {:.3f}'.format(feature_1, feature_2, mean_accuracy))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that, using only two features, the accuracy went down (from 0.96 to about 0.85)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To **visualize the decision regions,** we first need to define the range of values to consider for the two selected features. \n",
"\n",
"It is reasonable to take the minimum and maximum values of each feature, rounded down and up to the nearest integer."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[4., 8.],\n",
" [2., 5.],\n",
" [1., 7.],\n",
" [0., 3.]])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# feature_ranges contains the range of variability for each feature \n",
"# based on the values in the training dataset\n",
"feature_ranges = np.array([ (np.floor(np.min(iris.data[:, i])), \n",
" np.ceil(np.max(iris.data[:, i]))) \n",
" for i in range(iris.data.shape[1])] )\n",
"feature_ranges"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Meshgrids and colormeshes for visualization of regions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need a **meshgrid** of values for the selected features.\n",
"\n",
"Meshgrids are very useful objects for plotting and visualization, beyond ML. It is worth taking a detour to understand a bit more about what they are.\n",
"\n",
"Note: The examples and the details in this subsection are provided for students willing to learn more about complex data visualization techniques. However, you can safely skip this subsection and the following one, and jump directly to section 1.6."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Examples of use of meshgrids"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
" \n",
"- `get_predictions()` \n",
"- `plot_data_and_region_boundaries()` \n",
" \n",
"These functions, defined below, are just clean wrappers for the previous code."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"def learn_kNN_classifier(features_data, targets, neighbors, voting):\n",
"    '''Set up a k-NN classifier and fit it to the given training data.\n",
"    Return: learned classifier.'''\n",
"\n",
"    classifier = KNeighborsClassifier(n_neighbors=neighbors,\n",
"                                      weights=voting)\n",
"    classifier.fit(features_data, targets)\n",
"    return classifier\n",
"\n",
"\n",
"def get_predictions(test_data, targets, classifier):\n",
"    '''Input: trained classifier, test set for feature data paired with target values.\n",
"    Return: prediction labels and the mean accuracy over the test set.'''\n",
"\n",
"    prediction = classifier.predict(test_data)\n",
"\n",
"    # fraction of test points whose predicted label matches the target\n",
"    mean_accuracy = np.mean(prediction == targets)\n",
"\n",
"    return (prediction, mean_accuracy)\n",
"\n",
"\n",
"def plot_data_and_region_boundaries(classifier, training_x, training_y, targets,\n",
"                                    num_of_pts=1000, xlabel='x', ylabel='y', title='',\n",
"                                    colormaps=[None, None]):\n",
"    '''Get a trained classifier and training data for two features.\n",
"    Plots the training data and the decision boundaries over the region\n",
"    defined by the min and max values of the given input features.'''\n",
"\n",
"    # grid of query points covering the (rounded) range of the two features\n",
"    xx, yy = np.meshgrid(np.linspace(np.floor(np.min(training_x)),\n",
"                                     np.ceil(np.max(training_x)),\n",
"                                     num=num_of_pts),\n",
"                         np.linspace(np.floor(np.min(training_y)),\n",
"                                     np.ceil(np.max(training_y)),\n",
"                                     num=num_of_pts))\n",
"\n",
"    # classify every grid point, then reshape the labels back to the grid shape\n",
"    predictions_z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])\n",
"    predictions_z = predictions_z.reshape(xx.shape)\n",
"\n",
"    from matplotlib.colors import ListedColormap\n",
"    if colormaps[0] is None:\n",
"        cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])\n",
"        cmap_bold = ListedColormap(['darkorange', 'c', 'blue'])\n",
"    else:\n",
"        cmap_light = colormaps[0]\n",
"        cmap_bold = colormaps[1]\n",
"\n",
"    plt.figure()\n",
"\n",
"    plt.xlabel(xlabel)\n",
"    plt.ylabel(ylabel)\n",
"    plt.title(title)\n",
"\n",
"    # decision regions in light colors, training points in bold colors\n",
"    plt.pcolormesh(xx, yy, predictions_z, cmap=cmap_light)\n",
"\n",
"    scatter = plt.scatter(training_x, training_y,\n",
"                          s=20, c=targets, cmap=cmap_bold, edgecolor='black')\n",
"\n",
"    plt.legend(*scatter.legend_elements(), loc=\"upper right\", title=\"Classes\")\n",
"\n",
"    plt.xlim(xx.min(), xx.max())\n",
"    plt.ylim(yy.min(), yy.max())\n",
"\n",
"    plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check that everything is fine and simple to use."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean accuracy using (sepal length (cm), sepal width (cm)): 0.927\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"