15-488 - Spring 2020

Machine Learning
in a Nutshell




Key Information

Lectures: UT 4:30 - 5:50pm - Room 2052

Labs/Recitations: W 4:30pm - 5:50pm, Room 2062

Units: 9.0

Grading: 35% In-class assessments (Quizzes, Labs), 35% Homework, 30% Project (Two Tasks)

Prerequisites: 15-112 or 15-110 passed with a C or a higher letter grade


Overview

This course is about the application of machine learning (ML) concepts and models to solve challenging real-world problems.
The emphasis of the course is on the methodological and practical aspects of designing, implementing, and using ML solutions.
Course topics develop around the notion of the ML process pipeline, which identifies the multi-staged process of building and deploying an ML solution. An ML pipeline includes:

  • definition of the problem, objectives, and performance metrics;
  • collection and management of relevant operational data;
  • data wrangling (transforming, cleaning, filtering, scaling);
  • feature engineering on the available data (feature selection, feature extraction, feature processing);
  • selection of appropriate ML models based on problem requirements and available data;
  • implementation, application, testing, and evaluation of the selected model(s);
  • deployment of the final ML model.

The process proceeds both forward and backward, iterating over the stages until a satisfactory solution model is built.
The workflow of an ML pipeline is illustrated in the figure below (source: Practical ML with Python).

The course tackles all the stages of the ML pipeline, presenting conceptual insights and providing the algorithmic and software tools needed to deal effectively with the challenges of each stage.
The Python ecosystem for data science and ML (pandas, numpy, matplotlib, scikit-learn, keras, Jupyter notebooks) is introduced and used to retrieve, store, manipulate, visualize, and perform exploratory analysis of the data.
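
As a first taste of this toolkit, a minimal exploratory sketch might look as follows (the file name and its columns are placeholders, not one of the course datasets):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("data.csv")   # load a CSV file into a DataFrame (placeholder file name)
    print(df.head())               # inspect the first rows
    print(df.describe())           # summary statistics of the numeric columns
    print(df.isna().sum())         # count missing values per column

    df.hist(figsize=(10, 8))       # quick histograms of all numeric features
    plt.show()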

Course workflow:

The first part of the course addresses the data part of the pipeline, from data mining and collection, to data filtering and processing, to feature engineering for different types of data (numeric, categorical, textual, image, temporal).
Next, unsupervised learning (UL) techniques are introduced to further operate on the data by learning effective representations and performing dimensionality reduction and data compression. UL techniques include: Clustering models, Principal Component Analysis (PCA), Autoencoders.
Moving from data to techniques for classification and regression, a number of supervised ML models are presented, including:

  • Decision Trees,
  • k-Nearest Neighbors,
  • Naive Bayes,
  • Logistic Regression,
  • Support Vector Machines (SVMs),
  • Least Squares Linear Regression,
  • Regularization,
  • Feature maps,
  • Kernelization,
  • Deep / Convolutional Neural Networks.

Each model is introduced by conceptualizing its main underlying ideas and by providing the algorithmic and software tools necessary to experiment with it on different datasets.
A discussion of generalization, bias and variance, and model evaluation and selection using cross-validation techniques completes the ML pipeline.
The different techniques are tested and evaluated in problem scenarios from different domains and based on different data types. Selected problem domains include: natural language processing, machine vision, financial forecasting, logistics, production planning, diagnosis and prediction for bio-medical data.
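
As an illustration of the model evaluation and selection workflow described above, the sketch below compares two of the listed classifiers under 5-fold cross-validation on a small built-in scikit-learn dataset (an illustrative sketch, not part of the course material):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    models = {
        "k-NN": KNeighborsClassifier(n_neighbors=5),
        "logistic regression": LogisticRegression(max_iter=1000),
    }
    for name, model in models.items():
        pipe = make_pipeline(StandardScaler(), model)   # scaling + model as a single pipeline
        scores = cross_val_score(pipe, X, y, cv=5)      # 5-fold cross-validation accuracy
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")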

Learning Objectives

Students who successfully complete the course will have acquired a general knowledge of the main concepts and techniques of data science and ML, and will be able to apply ML in different fields of application.
The course will provide the students with a toolkit of different skills needed to effectively go through the entire ML pipeline. The students will acquire conceptual and practical knowledge about:

  • collecting, handling, exploring, and wrangling data in different formats and originating from different sources;
  • selecting, extracting, and engineering data features using both manual and learning techniques;
  • identifying the most appropriate ML techniques for the problem and the data at hand;
  • implementing and using a set of core ML models;
  • testing and evaluating ML models;
  • using the Python ecosystem for ML and data science;
  • applying ML to problems from a range of different application domains.

Course Layout

  • The course is based on two lectures per week, where the different problems, solution models, and algorithms are formally introduced. The introduction of a new concept is always accompanied by the presentation of practical use cases.

  • Each week, a third class is used as a laboratory or for recitation. In laboratory classes, students answer graded assignments that require both hands-on programming and conceptual understanding of course subjects. Recitation classes revise the concepts introduced in the lectures and illustrate the use of the different software tools.

Prerequisites

Having passed either 15-112 or 15-110 with a C (minimum).

The basic notions of linear algebra, calculus, and probability theory that are necessary for understanding the formal concepts will be explained assuming little or no previous knowledge.

Assignments and Grading

  • Laboratory assessments: Students take bi-weekly laboratory classes where they answer questions involving both hands-on programming and conceptual aspects.
  • Homework: Outside of the classroom, students practice with bi-weekly homework consisting of programming tasks integrated into presentation notebooks. In the homework, students implement and experiment with the different algorithmic solutions, work with different types of data, answer conceptual questions, and learn how to present material and results combining text, data, images, and code.
  • Project: Students complete a project that addresses the full ML pipeline and iteration cycle. Project work is staged in three sub-projects and is reported as a notebook. The first sub-project deals with data, the second adds classification techniques and model evaluation and selection, and the third adds regression models. The project is done in small groups and the results are presented at the end of the course.

Grade: 35% Laboratory Assessments, 35% Homework, 30% Project

Readings

In addition to the lecture handouts and Python notebooks (which will be made available after each lecture), the instructor will provide additional material during the course to cover specific topics.

A number of (optional) textbooks can be consulted to ease the understanding of the different topics (the relevant chapters will be pointed out by the instructor); these include, but are not restricted to:

  • Machine Learning, Tom Mitchell (in the library)
  • Machine Learning: The Art and Science of Algorithms that Make Sense of Data, P. Flach
  • A Course in Machine Learning, Hal Daumé III (available online)

Schedule (possibly subject to change)



Date Topics Handouts References
1/12 General concepts, ML pipeline: Machine learning for data-driven decision making, extracting information from data, finding structures. Machine learning pipeline: from data sources to final model learning and deployment. pdf
1/14 General ML scheme, Learning with a teacher (Supervised Learning): Overview of the general ML scheme (from data to features, ML task, ML problem); Supervised Learning as learning with a teacher / supervisor; data labels and error quantification; preparing a labeled dataset: issues, processes; definition of classification and regression tasks; practical examples. pdf
1/15 ML Tasks: Supervised Learning (Classification, Regression); feature spaces; geometric view; Unsupervised Learning (Finding patterns and relations, clustering, compression, dimensionality reduction); Reinforcement Learning (Sequential decision-making). pdf

1/19 Model hypothesis, Loss functions: A complete example of SL task flow; design choice: hypothesis class, parametric model functions; design choice: how to evaluate a model, loss functions; examples and properties of basic loss functions for classification and for regression. pdf
1/21 Optimization problems, Generalization, ML Workflow: Optimization problem for SL; empirical error; model complexity and overfitting; expected generalization (out-of-sample) error; validation sets and estimation of the generalization error; canonical SL problem; SL workflow; building a model in the ML pipeline; promoting generalization. pdf
1/22 Test 1 + Laboratory: First written Test; Discussion of test solutions; Short introduction to the use of Jupyter notebooks. Handout
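
As a toy illustration of the 1/19-1/21 material, the snippet below computes two basic loss functions on made-up data (not taken from any course dataset):

    import numpy as np

    y_true = np.array([1.0, 2.0, 3.0])        # regression targets
    y_pred = np.array([1.1, 1.8, 3.5])        # model predictions
    mse = np.mean((y_true - y_pred) ** 2)     # empirical error under the quadratic loss
    print("mean squared error:", mse)

    labels = np.array([0, 1, 1, 0])           # true class labels
    predicted = np.array([0, 1, 0, 0])        # predicted class labels
    zero_one = np.mean(labels != predicted)   # 0/1 loss = fraction of misclassified points
    print("0/1 loss:", zero_one)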

1/26 Read and display data: Introduction to the basic concepts behind numpy arrays and pandas' data frames; basic pandas methods for reading and inspecting data; CSV data formats; first examples of data display using matplotlib. Notebook Cars dataset
1/28 From data to models, a complete example: A complete, step-by-step example of how to proceed in practice from the data available in a dataset file to the definition and validation of sound regression models: data ingestion, data preparation, EDA, model hypotheses, loss function, model testing, looping over the models, model selection. Notebook Web server traffic dataset
1/29 Lab Test 1: Regression pipeline Notebook Top 50 songs on Spotify'19 dataset

2/2 Classification tasks: Example of classification task using the Iris dataset; introduction to scikit-learn; data visualization; analysis by visual inspection; construction of a threshold-based classifier; how to select the thresholds for generalization?; ML classification using k-Nearest Neighbors; general concepts behind kNN; effect of k on the classification boundaries. Notebook
2/4 Classification, Validation, Model Selection: Summary of k-NN properties; scikit-learn and k-NN; visualization techniques using meshgrids; model validation and expected generalization error; hold-out and cross-validation methods for validation and for model selection; general operational scheme for using a dataset (training, validation, testing, model optimization and selection). Notebook Breast cancer dataset
2/5 Lab Test 2: Classification pipeline Notebook Wheat seeds dataset
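
A minimal sketch of the k-NN classification workflow from the 2/2-2/4 classes, using a hold-out split on the Iris dataset (illustrative only; the values of k are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    for k in (1, 5, 15):                                  # effect of k on generalization
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(f"k={k}: held-out accuracy = {knn.score(X_test, y_test):.3f}")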

2/9 Data cleaning, Missing values: Introduction to data wrangling; data cleaning: dealing with missing data; types of missingness: MCAR, MAR, MNAR; general strategies, pros and cons; discard data: listwise, pairwise, dropping; imputation techniques: statistical estimates, common point, frequent category, category adding, arbitrary value, adding variable, random sampling, multiple imputation, last and next observation in time series, interpolation in time series, predictive models (k-NN for regression). Notebook
2/11 Sports day
2/12 Data cleaning, Deal with Outliers, Scale features: Concept of outlier in data; reasons for an outlier; types of outliers (global, contextual, collective); detection of outliers; removing or keeping?; parametric approaches: Gaussians and z-score, univariate vs. multivariate; non-parametric approaches: quantiles, IQR, box plots; need for scaling / normalizing features; standardization (z-transform); scaling to a range; robust scaling; normalization; sklearn methods; example of use of seaborn. Notebook
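
A small sketch of median imputation and feature scaling along the lines of the 2/9 and 2/12 classes (column names and values are made up):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import RobustScaler, StandardScaler

    df = pd.DataFrame({"glucose": [90, 105, np.nan, 150, 85],
                       "bmi":     [22.0, 31.5, 27.2, np.nan, 24.8]})

    df_imputed = df.fillna(df.median())                      # simple statistical imputation (median)
    z_scaled = StandardScaler().fit_transform(df_imputed)    # standardization (z-transform)
    r_scaled = RobustScaler().fit_transform(df_imputed)      # robust scaling (median / IQR)
    print(z_scaled)
    print(r_scaled)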

2/16 Case study, Data Wrangling: Case study for data wrangling; data inspection and visualization with pandas; scaling and normalizing issues; standard, range, normalized, robust scaling; use of sklearn transformer objects; missing entries: check and impute using pandas; check and manage outliers; correlations among data; correlation coefficients; correlation matrix and heatmap visualization. Notebook Diabetes, original dataset
Diabetes, corrupted dataset
2/18 Feature engineering 1: Notion and importance of feature engineering: from raw data to features that better represent the underlying problem to the predictive ML models; different feature data types; pandas methods to deal with data types; from categorical data to numeric values; numeric data types: values, counts, frequencies, percentages; binarization and rounding; feature transformation and creation; feature transformation to account for interactions; polynomial feature transformation; properties and issues of using polynomial feature maps. Notebook Pokemon games dataset
Song views dataset
Item popularity dataset
2/19 Feature engineering 2: Transformation and use of features using polynomial maps; examples in regression and classification tasks; linearity of the models in prediction and in training; sklearn methods for manipulating polynomial features; feature selection and correlation; weighting feature importance; use of pipelines to automate and store processes. Notebook
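
A compact sketch of a polynomial feature map feeding a linear regressor, in the spirit of the 2/18-2/19 classes (synthetic data, illustrative only):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 50).reshape(-1, 1)
    y = 0.5 * x.ravel() ** 3 - x.ravel() + rng.normal(scale=1.0, size=50)     # cubic signal + noise

    model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())   # degree-3 feature map + OLS
    model.fit(x, y)
    print("R^2 on the training data:", model.score(x, y))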

2/23 Feature engineering 3, Image data: Properties of image data; RGB encoding; skimage methods for handling and processing images; raw pixel intensities as features; grayscale transformation; feature extraction by binning; properties and issues of using histograms and pie charts; features extracted by aggregation statistics; features from edge detection; Canny edge detector; Gaussian filters; Sobel filter as edge operator. Notebook Cat image
Dog image
Panda image
Sea image
Coder survey dataset
2/25 Lecture canceled
2/26 Feature engineering 4, Image data: Gaussian filters; Edge operators: Sobel filter; image gradients, function gradients; Histogram of Oriented Gradients (HOG) as image feature descriptor; step-by-step process for computing the HOG; skimage methods; use of HOG in image classification and object detection; localized feature extraction: SIFT, SURF, ORB; main ideas behind SIFT; example of use with OpenCV; example of use of ORB with skimage. Notebook Sliding window image
Colosseum image
Sitting dog image
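
A minimal sketch of HOG feature extraction with skimage, as discussed in the 2/23 and 2/26 classes (it uses a built-in sample image rather than a course dataset):

    from skimage import color, data
    from skimage.feature import hog

    image = color.rgb2gray(data.astronaut())      # grayscale transformation of an RGB sample image
    features, hog_image = hog(image,
                              orientations=9,
                              pixels_per_cell=(8, 8),
                              cells_per_block=(2, 2),
                              visualize=True)
    print("length of the HOG feature vector:", features.shape[0])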

3/1 Spring break
3/3 Spring break
3/4 Spring break

3/8 Unsupervised learning for representation learning and finding structure in data, Clustering 1: Characteristics of UL tasks; concepts about the use of UL for dimensionality reduction, finding hidden structure and grouping, and learning generative models; clustering models: partitional, hierarchical, hard vs. soft clustering; similarity measures; partitional clustering and the K-Means problem; naive K-Means algorithm; phases of the algorithm; linear cluster boundary regions and Voronoi tessellation of the feature space; K-Means at work; use of sklearn for K-Means clustering; impact of the K-Means objective function on the resulting clustering. Slides
Notebook
Mall customers dataset
3/10 Class postponed
3/11 Class postponed

COVID-19 adapted schedule

3/15 Clustering 2, Properties of K-Means: K-Means examples; generating synthetic datasets; PCA for feature selection; step-by-step K-Means; K-Means as an instance of Expectation-Maximization; assumptions and limitations of K-Means; ideal data for K-Means: balanced, spherical clusters of data generated by Gaussian populations with equal variances; number of clusters: effects of wrong choices, how to select k (elbow method); local minima in the distortion function; minimization of the Euclidean distance and linear cluster boundaries; Voronoi tessellation of the feature space; effects of imbalanced cluster data; effects of unequal variances and presence of covariances; computational issues. Notebook
3/17 Clustering 3, Clustering and image data: Signal compression and Vector Quantization; relationship between K-Means and VQ; use of K-Means / VQ for image compression in the color space; compression of a dataset of images; cluster centers as image prototypes; data visualization in the RGB space; histogram information; compression ratio; clustering for the NIST image dataset; use of clustering for image segmentation; phases of image segmentation; examples. Notebook
3/18 Clustering, discussion of homework
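
A short illustrative sketch of K-Means clustering and of the distortion used by the elbow method, along the lines of the 3/8-3/17 classes (synthetic blobs, arbitrary parameters):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

    for k in (2, 4, 6):                                            # effect of the number of clusters
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(f"k={k}: distortion (inertia) = {km.inertia_:.1f}")  # quantity used by the elbow method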

3/22 Regression with linear models, OLS: Review of regression and linear regression concepts; meaning of linearity and role of the feature weights; example of transformation from a non-linear to a linear model; linear regression and Ordinary Least Squares; analytic solution of OLS using partial derivatives and a linear system of equations; matrix form; solution using numpy matrix manipulation vs. sklearn methods; issues with matrix inversion; linear dependencies among feature vectors and inversion issues; numeric examples; optional: rank of a matrix and related concepts of linear algebra. Notebook
3/24 Regression with linear models, feature transformation, model selection: Linear regression with non-linear feature transformations; example with a polynomial function corrupted by Gaussian noise; analysis of the results, weights of the features; feature selection and search for the best model complexity; removing unnecessary features; cross-validated parameter search; explicit vs. implicit model selection; notion of regularization. Notebook
3/25 Regression with regularized loss functions 1: Implicit feature control and selection using regularized loss functions; bias-variance of a model; effect of large weights on variance / overfitting; regularized loss function; use of L-norms; Ridge regression; Lasso regression. Notebook
3/28 Regression with regularized loss functions 2: Ridge regression and Lasso regression at work; comparative analysis over a number of test scenarios with linear and polynomial features; Elastic Net regression; a real-world regression scenario from data wrangling to model selection. Notebook
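
A compact sketch contrasting the analytic OLS solution (normal equations in numpy) with Ridge and Lasso regression from scikit-learn, in the spirit of the 3/22-3/28 classes (synthetic data, illustrative only):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    w_true = np.array([2.0, -1.0, 0.0])                    # the third feature is irrelevant
    y = X @ w_true + rng.normal(scale=0.1, size=100)

    Xb = np.hstack([np.ones((100, 1)), X])                 # add a bias column
    w_ols = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)           # analytic OLS solution (normal equations)
    print("OLS weights (bias first):", w_ols)

    print("Ridge weights:", Ridge(alpha=1.0).fit(X, y).coef_)   # L2 regularization shrinks the weights
    print("Lasso weights:", Lasso(alpha=0.1).fit(X, y).coef_)   # L1 tends to zero-out useless features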

3/29 Decision-trees 1: Supervised Learning and Query Answering Machines; learning and posing/answering questions; defining sequences of questions; decision trees; divide-and-conquer concept; properties and structure of a decision tree; function representation by DTs; examples of boolean functions; intractability of exhaustive search; NP-hardness of finding the optimal decision tree; axis-parallel decision boundaries; overfitting. PDF
3/31 Decision-trees 2: Construction of a DT; decision stumps; recursive procedure; effects and goals of attribute splitting; purity and uncertainty; greedy, top-down heuristics; ID3; selection of best attributes based on different criteria; entropy of a random variable; entropy as a measure of purity of a labeled set; information gain; numeric examples. PDF
4/1 Decision-trees 3: Computing information gains; properties of ID3; overfitting issues; pruning approaches, C4.5; dealing with continuous attributes; thresholding with binary branching; axis-parallel boundaries; regression with decision trees; purity of a set: discrete vs. continuous labels; use of variance / standard deviation as measure of purity; (extras) other measures of dispersion / purity in labeled sets (e.g., Gini index); criteria to decide when to stop splitting a partition; examples of regression trees vs. max depth constraint; practice problems. PDF

4/5 Ensemble models (Bagging, Boosting): General ideas behind combining models; voting/averaging vs. stacking models; bagging and boosting as forms of combining different experts; bagging: construction of the datasets by bootstrapping, properties of the base model, variance reduction goals, aggregation by averaging; random forests as bagging with randomization of the features of each model; boosting: sequential generation of the weighted datasets, base model as a weak learner, goals of combining multiple weak learners, how to compute voting weights. PDF
4/7 Practice of Decision Trees: Scikit-learn methods for DTs for classification and regression; examples of use with different datasets; exploring the effects of parameters on generalization; analysis of the results; visualization of trees and of decision boundaries. Notebook
4/8 Practice of Bagging, Random Forest, Boosting: Scikit-learn methods for bagging, random forests, boosting; examples of use with different datasets and different feature selection; analysis of performance; inspection of trained models; visualization of decision boundaries; feature importance; cross-validated search for best parameters / models. Notebook
4/11 Linear Models, Support Vector Machines 1: General form and properties of linear models for classification and for regression; formulation of a linear model, bias, scalar product; basic geometrical properties; linear decision boundaries; from a linear function to discrete labels for classification; feature transformations and linear models; score and behavior of a linear classifier; notion of classifier margin; SVMs as max-margin classifiers. PDF
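
A minimal sketch comparing a single decision tree with a random forest under cross-validation, in the spirit of the 4/7-4/8 practice classes (illustrative only; parameters are arbitrary):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    forest = RandomForestClassifier(n_estimators=100, random_state=0)

    print("single tree  :", cross_val_score(tree, X, y, cv=5).mean())
    print("random forest:", cross_val_score(forest, X, y, cv=5).mean())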

4/12 Support Vector Machines 2: Functional and geometric margin; classifier margin; SVM optimization problem; SVM hard-margin formulation and properties; support vectors; formulation of primal and dual problems; solution of the dual; soft-margin SVM for non-linearly separable data; slack variables and penalty factor; support vectors in soft-margin SVM; Hinge loss and regularization; Support Vector Regression (SVR); general concepts and basic formulation; loss function; example of application. PDF
4/14 Support Vector Machines 3, Kernelization: Application of Support Vector Classification (SVC) to different datasets using methods from scikit-learn; functions for problem generation, visualization, and analysis; hard-margin and soft-margin SVM at work; inspection of support vectors; SVC vs. other classifiers; performance comparison and decision boundaries; ideas behind kernelization; kernelization at work in classification tasks; effect of using different kernels on the resulting non-linear decision boundaries; Support Vector Regression (SVR) at work: linear vs. non-linear kernels; use of scikit-learn methods for testing and analysis of the results; effect of different parameter settings. Notebook
4/15 Lab Test 3: Linear models, DT, Ensemble Models, SVMs, Kernelization Notebook Pulsar stars dataset
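
A short illustrative sketch of soft-margin SVMs with linear vs. RBF kernels on non-linearly separable synthetic data, along the lines of the 4/12-4/14 classes:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel, C=1.0)                 # C controls the soft-margin penalty
        score = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{kernel} kernel: cross-validated accuracy = {score:.3f}")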

4/19 Neural networks models 1: From linear classifiers to the Perceptron model; abstraction of neuron processing, threshold models; Perceptron algorithm: iterative adjustments of weights, use of gradients to minimize quadratic loss; from single perceptrons to multi-layer perceptrons (MLP); neural networks as non-linear supervised models for function approximation; hidden and visible layers; feed-forward multi-layer architectures; activation functions: linear, step, sigmoid; hidden layer as feature extractor / feature map; NN and automatic feature learning through the hidden layers. PDF Perceptron Notebook
4/21 Neural networks models 2, Course Wrap-up: Review of the concepts behind neural networks: non-linear parametric function approximation, perceptron units with non-linear activations, feed-forward architectures, visible and hidden layers, hidden layers and feature maps; multi-layer perceptrons (MLP) vs. convolutional neural networks (CNN); representation and visualization of the non-linear function encoded by a network; loss function and NN optimization problem; basic concepts behind gradient descent; full, stochastic, and batch gradient descent; idea of backpropagation for computing the gradients; choice of the activation function and optimization problem; design choices; overfitting issues; general theoretical properties; use of keras for implementing and testing neural networks: creating a sequential layout, compiling a model, testing and visualizing learning evolution, inspecting/visualizing the activations from the layers and relating them to feature extraction; examples with numeric and image data, both with MLPs and CNNs. PDF Neural Networks Notebook
4/22 Student Project Presentations
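
A minimal sketch of a small feed-forward network (MLP) in Keras, along the lines of the 4/19-4/21 classes (synthetic data and an arbitrary architecture, illustrative only):

    import numpy as np
    from tensorflow import keras

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)        # synthetic binary labels

    model = keras.Sequential([
        keras.layers.Dense(32, activation="relu", input_shape=(10,)),  # hidden layer = learned feature map
        keras.layers.Dense(1, activation="sigmoid"),                   # output unit for binary classification
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=10, batch_size=32, verbose=0)               # mini-batch gradient descent
    print("training accuracy:", model.evaluate(X, y, verbose=0)[1])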

Homework Assignments

Topic Files Due Dates
Homework 1: A Classification task: from data ingestion to model selection and testing Handout Feb 19
Homework 2: Data cleaning and Feature selection: Practicing with the Python ecosystem for ML and Data Science Handout March 8
Homework 3: Supervised and Unsupervised Image Classification Handout March 29


Project

A Query Answering Machine (QuAM)

Deliverables Files Due Dates
D1 [Report]: Initial Proposal and Dataset Handout March 23
D2 [Dataset and Notebook]: Final Dataset Handout April 04
D3 [Software and Notebook]: Query Answering Machine (QuAM) Handout April 19, April 23
D4 [Presentation]: Final report Handout April 21-22


Policies for Assignments

  • Homework is due on autolab by the posted deadline. Assignments submitted past the deadline will incur the use of late days.

  • You have 6 late days in total, but cannot use more than 2 late days per homework. No credit will be given for homework submitted more than 2 days after the due date. After your 6 late days have been used, you will receive a 20% deduction for each additional day late.

  • You can discuss the exercises with your classmates, but you should write up your own solutions. If you find a solution in any source other than the material provided on the course website or the textbook, you must mention the source.

  • In general, for all types of assignments and tests, CMU's directives for academic integrity apply and must be duly followed.

Office Hours

Name Email Hours Location
Gianni Di Caro gdicaro@cmu.edu Thursdays 4:30-5:30pm + pass by my office at any time ... M 1007
Aliaa Essameldin aeahmed@andrew.cmu.edu TBD M 1004