15-288 - Spring 2024

Machine Learning in a Nutshell




Key Information

Lectures: UT 11:30am - 12:45pm, Room 1190

Labs/Recitations: R 11:30am - 12:45pm, Room 1190

Units: 9.0

Grading: 30% Laboratory Assessments, 28% Homework, 32% Project (Two Tasks), 10% Final Exam

Prerequisites: 15-112 passed with a C or a higher letter grade

Zhijie Xu, Devang Acharya


Overview

This course is about the application of machine learning (ML) concepts and models to solve challenging real-world problems.
The emphasis of the course is on the methodological and practical aspects of designing, implementing, and using ML solutions.
Course topics are organized around the notion of the ML process pipeline, which identifies the multi-stage process of building and deploying an ML solution. An ML pipeline includes:

  • definition of the problem, objectives, and performance metrics;
  • collection and management of relevant operational data;
  • data wrangling (transforming, cleaning, filtering, scaling);
  • feature engineering on the available data (feature selection, feature extraction, feature processing);
  • selection of appropriate ML models based on problem requirements and available data;
  • implementation, application, testing, and evaluation of the selected model(s);
  • deployment of the final ML model.

The process proceeds both forward and backward, iterating each stage until a satisfactory solution model is built.
The workflow of an ML pipeline is illustrated in the figure below (source: Practical ML with Python).

The course tackles all the stages of the ML pipeline, presenting conceptual insights and providing the algorithmic and software tools needed to address the challenges of each stage.
The Python ecosystem for data science and ML (pandas, NumPy, Matplotlib, scikit-learn, Keras, Jupyter notebooks) is introduced and used to retrieve, store, manipulate, visualize, and perform exploratory analysis of the data.
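The pipeline stages listed above can be sketched in miniature. The example below is only an illustration, not course material: the toy data, the helper names (`clean`, `fit_scaler`, `scale`, `fit_line`, `mse`), and the train/test split are all assumptions, and it uses only the Python standard library so that it runs without any of the packages named above. In practice, pandas and scikit-learn would normally replace the hand-written helpers.

```python
from statistics import mean, pstdev

# Hypothetical toy data: (feature, target) pairs; None marks a missing value.
raw = [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.2), (4.0, 8.1),
       (5.0, 9.8), (6.0, 12.1), (7.0, None), (8.0, 16.0)]

def clean(rows):
    """Data wrangling: drop rows with a missing feature or target."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def fit_scaler(xs):
    """Feature processing: learn standardization parameters on training data."""
    return mean(xs), pstdev(xs)

def scale(xs, mu, sigma):
    """Apply the z-transform with previously learned parameters."""
    return [(x - mu) / sigma for x in xs]

def fit_line(xs, ys):
    """Model training: one-feature ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mse(ys, preds):
    """Evaluation: mean squared error."""
    return mean((y - p) ** 2 for y, p in zip(ys, preds))

# Run the stages in order: wrangle, split, scale, train, evaluate.
rows = clean(raw)
train, test = rows[:5], rows[5:]
mu, sigma = fit_scaler([x for x, _ in train])
slope, intercept = fit_line(scale([x for x, _ in train], mu, sigma),
                            [y for _, y in train])
test_preds = [slope * x + intercept
              for x in scale([x for x, _ in test], mu, sigma)]
test_mse = mse([y for _, y in test], test_preds)
print(f"held-out MSE: {test_mse:.4f}")
```

Note that the scaling parameters are learned on the training rows only and then reused on the held-out rows, so that no information from the test data leaks into training.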

Course workflow:

The first part of the course addresses the data part of the pipeline, from data mining and collection, to data filtering and processing, to feature engineering for different types of data (numeric, categorical, textual, image, temporal).
Next, unsupervised learning (UL) techniques are introduced to further operate on the data by learning effective representations and performing dimensionality reduction and data compression. UL techniques will include clustering models, Principal Component Analysis (PCA), and autoencoders.
Moving from pure data to techniques for classification and regression, a number of supervised ML models are presented, including:

  • Decision Trees,
  • k-Nearest Neighbors,
  • Naive Bayes,
  • Logistic Regression,
  • Support Vector Machines (SVMs),
  • Bagging, Random Forests, Boosting,
  • Least Squares Linear Regression,
  • Regularization,
  • Feature maps,
  • Kernelization,
  • Convolutional Neural Networks & Deep Networks.

Towards the end of the course we give an overview of recommender systems, which are a notable example of applications that may require the integration of many of the concepts and techniques introduced in the course.

Finally, we consider Transformer architectures in deep learning and their use for building advanced language models like ChatGPT. In particular, we make use of the OpenAI API to interact with ChatGPT and use it to build and fine-tune an NLP model.

The different models are introduced by a conceptualization of the main underlying ideas and by providing the algorithmic and software tools necessary to experiment with the model on different datasets.
The ML pipeline is completed by a discussion of generalization, bias and variance, and model evaluation and selection using cross-validation techniques.
The different techniques are tested and evaluated in problem scenarios from different domains and based on different data types. Selected problem domains include: natural language processing, machine vision, financial forecasting, logistics, production planning, diagnosis and prediction for bio-medical data.
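As a rough illustration of model selection by cross-validation, the sketch below picks the number of neighbors for a k-NN classifier using 4-fold cross-validation. It is a simplified example under stated assumptions, not the course's implementation: the dataset is a made-up one-dimensional toy set, the function names `knn_predict` and `k_fold_accuracy` are hypothetical, and folds are assigned by index stride without shuffling.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (1-D features)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def k_fold_accuracy(data, k, n_folds):
    """Average validation accuracy of k-NN over n_folds folds."""
    scores = []
    for fold in range(n_folds):
        # Every n_folds-th point (starting at index `fold`) is held out.
        val = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in val)
        scores.append(correct / len(val))
    return sum(scores) / n_folds

# Hypothetical toy dataset: class 0 at low values, class 1 at high values,
# plus one noisy label at x = 2.5.
data = [(0.0, 0), (0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1),
        (5.0, 1), (5.5, 1), (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]

# Model selection: evaluate each candidate k and keep the best-scoring one.
results = {k: k_fold_accuracy(data, k, n_folds=4) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, "selected k =", best_k)
```

Each point is used for validation exactly once and for training in the remaining folds, so the averaged accuracy serves as an estimate of the generalization performance for each candidate value of k.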

Learning Objectives

Students who successfully complete the course will have acquired a general knowledge of the main concepts and techniques of data science and ML, and will be able to apply ML in different fields of application.
The course will provide the students with a toolkit of different skills needed to effectively go through the entire ML pipeline. The students will acquire conceptual and practical knowledge about:

  • collecting, handling, exploring, and wrangling data in different formats and originating from different sources;
  • selecting, extracting, and engineering data features using both manual and learning techniques;
  • identifying the most appropriate ML techniques for the problem and the data at hand;
  • implementing and using a set of core ML models;
  • testing and evaluating ML models;
  • using the Python ecosystem for ML and data science;
  • using the OpenAI API and exploiting the capabilities of ChatGPT;
  • applying ML to problems from a range of different application domains.

Course Layout

  • The course is based on two lectures per week in which the different problems, solution models, and algorithms are formally introduced. The introduction of a new concept is always accompanied by the presentation of practical use cases.

  • Each week, a third class is usually used as a laboratory or a recitation. In laboratory classes, students complete graded assignments that require both hands-on programming and conceptual understanding of course subjects. Recitation classes aim to review the concepts introduced in the lectures and illustrate the use of the different software tools.

Prerequisites

15-112, passed with a grade of C or higher.

The basic notions of linear algebra, calculus, and probability theory that are necessary for understanding the formal concepts will be explained assuming little or no previous knowledge.

Assignments and Grading

Assignments:

  • Laboratory Assessments: Students take laboratory classes (LabTests) in which they answer questions involving both hands-on programming and conceptual aspects. LabTests are typically held weekly, with a few exceptions, and require preparing a Jupyter notebook that integrates code, plots, and discussion.
  • Homework: Outside of the classroom, students practice with multiple homework assignments consisting of programming tasks integrated in Jupyter notebooks. In the homework, students implement and experiment with the different algorithmic solutions, are confronted with different types of data, answer conceptual questions, and learn how to present material and results combining text, data, images, and code.
  • Project: Students have to complete a project that addresses the full ML pipeline and iteration cycle. Project work is staged in two tasks and comprises four deliverables. The final report is in the form of a Jupyter notebook implementing a Query Answering Machine for an application domain selected by the student. The project is done in small groups and the results are presented and discussed at the end of the course.
  • Final Exam: At the end of the course, there will be a written exam covering the concepts addressed during the course.

Grading scheme:

  • 30% Laboratory Assessments
  • 28% Homework
  • 32% Project
  • 10% Final Exam

Readings

In addition to the lecture handouts and Python notebooks (which will be made available after each lecture), the instructor will provide additional material during the course to cover specific topics.

A number of (optional) textbooks can be consulted to ease the understanding of the different topics (the relevant chapters will be pointed out by the instructor). These include, but are not limited to:

  • Machine Learning, Tom Mitchell (in the library)
  • Machine Learning: The Art and Science of Algorithms that Make Sense of Data, P. Flach
  • A Course in Machine Learning, Hal Daumé, available online

Schedule



Date Topics Handouts References
1/7 General concepts, ML pipeline: Machine learning for data-driven decision making, extracting information from data, finding structures; basic ML concepts and applications; ML pipeline: from data sources to final model learning and deployment; course information and organization pdf
1/9 ML tasks and application problems: Taxonomy of ML tasks and problems; Supervised Learning (Classification, Regression); feature spaces; geometric view; workflow of SL; Unsupervised Learning (Finding patterns and relations, clustering, compression, dimensionality reduction); Reinforcement Learning (Sequential decision-making); Advantages and issues learning with a teacher / supervisor; data labels and error quantification; preparing a labeled dataset. pdf
1/11 SL task flow, model hypotheses, canonical SL problem: A complete example of SL task flow: problem definition, design choices, features, hypothesis class, loss function; empirical error; canonical supervised ML problem. pdf

1/14 Generalization error, Empirical risk minimization, Loss functions: Empirical and generalization (out-of-sample) errors; expected generalization error and its approximation; training, validation, and test sets; how to support generalization; loss functions for classification and regression; building an SL model in the ML pipeline, design choices. pdf
1/16 Workflow for regression, Generalization and Overfitting, Model selection and Cross-Validation (CV): Workflow for a regression example; overfitting vs. generalization; model selection and validation sets to minimize expected generalization errors; hold-out method; cross-validation methods: k-fold CV, leave-one-out CV, random subsampling; design issues in CV; model selection; model selection using CV. pdf
1/18 LabTest 1: Fundamentals of ML, NumPy, Jupyter notebooks and Markdown

1/21 From data to models, a complete regression example: A complete, step-by-step example of how to proceed in practice from the data available in a dataset file to the definition and validation of sound regression models: data ingestion, data preparation, EDA, model hypotheses, loss function, model testing, looping over the models, model selection. Notebook
1/23 Regression with linear models, OLS, CV at work: Regression and linear regression concepts and models; linearity of models in prediction and in training; role of feature weights; Ordinary Least Squares (OLS); analytic solution of OLS using linearity in coefficients, partial derivatives and linear system of equations; matrix form; solution using numpy matrix manipulation vs. sklearn methods; issues with matrix inversion; numeric examples with generation of instances; use of cross-validation and sklearn methods for model validation. Notebook
1/25 LabTest 2: ML pipeline and python tools for a regression task

1/28 Classification tasks, k-NN: Example of classification task using the Iris dataset; use of scikit-learn datasets; data visualization; analysis by visual inspection; ML classification using k-Nearest Neighbors; general concepts behind k-NN; effect of k on the classification boundaries; plotting decision boundaries using matplotlib. Notebook
1/30 k-NN, model selection, measures of performance: Model validation and expected generalization error; hold-out and cross-validation methods for validation and for model selection; general operational scheme for using a dataset (training, validation, testing, model optimization and selection); scikit-learn methods and iterators for model selection, dataset splitting, CV-based grid search for parameter setting; recall, precision, F1, confusion matrix. Notebook
2/1 Break, no classes

2/4 Data cleaning, Missing values: Introduction to data wrangling; data cleaning: dealing with missing data; types of missingness: MCAR, MAR, MNAR; general strategies, pros and cons; discard data: listwise, pairwise, dropping; imputation techniques: statistical estimates, common point, frequent category, category adding, arbitrary value, adding variable, random sampling, multiple imputation, last and next observation in time series, interpolation in time series, predictive models; pandas methods for dealing with missing entries. Notebook
2/6 Data cleaning, Dealing with Outliers: Concept of outlier in data; reasons for an outlier; types of outliers (global, contextual, collective); detection of outliers; removing or keeping? parametric approaches: Gaussians and z-score, univariate vs. multivariate; non-parametric approaches: quantiles, IQR, box plots; sklearn methods. Notebook
2/8 LabTest 3: A classification task with k-NN

2/11 Feature scaling: Need for scaling / normalizing features; standardization (z-transform); scaling to a range; robust scaling; normalization; sklearn methods; examples of use. Notebook
2/13 National Sports Day, no classes
2/15 Feature Engineering 1, Feature selection, Feature transformations: Correlations among data and feature selection; correlation coefficients; correlation matrix and heatmap visualization; transformation and use of features using polynomial maps; examples in regression and classification tasks; linearity of the models in prediction and in training; sklearn methods for manipulating polynomial features; weighting feature importance; use of pipelines to automate and store processes. Notebook

2/18 Feature Engineering 2, Type of data features: Notion and importance of feature engineering: from raw data to features that better represent the underlying problem to the predictive ML models; different feature data types; pandas methods to deal with data types; from categorical data to numeric values; numeric data types: values, counts, frequencies, percentages; binarization and rounding. Notebook
2/20 Feature Engineering 3, Time series data: Examples and properties of time series; time series and supervised learning: the need for defining good features; date time features, pandas methods; time lagged features, notion of autocorrelation, pandas methods; features based on rolling window statistics; features based on expanding window statistics; linear regression on engineered features; cross-validation for time series, sklearn methods for splitting and cross-validating the dataset; components of a time series: level, trend, seasonality, random residuals; importance of stationarity; subtraction of components to get stationarity; techniques to check stationarity; STL method from statsmodels; statistical tests. Notebook Daily min temperatures
Train passengers
Airline passengers
2/22 Homework in class

2/25 Spring break
2/27 Spring break
2/29 Spring break

3/3 Feature engineering 4, Image data 1: Image data and task-relevant features; properties of image data; RGB encoding; skimage methods for handling and processing images; raw pixel intensities as features; grayscale transformation; feature extraction by binning, properties and issues using histograms and pie charts; features extracted by aggregation statistics. Notebook Cat image
Dog image
Panda image
Sea image
Coder survey dataset
3/5 Feature engineering 5, Image data 2: Features from edge detection; Canny edge detector; Gaussian filters; Sobel filter as edge operator; convolution operator; examples of use of convolution filters for image processing and feature extraction: blurring, smoothing, embossing, sharpening, edge extraction. Notebook
3/7 Feature engineering 6, Image data 3: Filter algebra; non-linear filters; image gradients, function gradients; Histogram of Oriented Gradients (HOG) as image feature descriptor; step-by-step process for computing the HOG; skimage methods; use of HOG in image classification and object detection; localized feature extraction: SIFT, SURF, ORB; main ideas behind SIFT; example of use with OpenCV; example of use of ORB with skimage. Notebook Sliding window image
Colosseum image
Sitting dog image

3/10 Decision-trees 1: Supervised Learning and Query Answering Machines; learning and posing/answering questions; defining sequences of questions; decision trees; divide-and-conquer concept; properties and structure of a decision tree; function representation by DT; example of boolean functions; intractability of exhaustive search; NP-hardness of finding the optimal decision tree; axis-parallel decision boundaries; overfitting; construction of a DT; decision stumps; recursive procedure; effects and goals of attribute splitting; purity and uncertainty; greedy, top-down heuristics; ID3; selection of best attributes based on different criteria; entropy of a random variable; entropy as a measure of purity of a labeled set; information gain; numeric examples. pdf
3/12 Decision-trees 2: Properties of ID3; overfitting issues; pruning approaches, C4.5; dealing with continuous attributes; thresholding with binary branching; axis-parallel boundaries; regression with decision trees; purity of a set: discrete vs. continuous labels; use of variance / standard deviation as measure of purity; (extras) other measures of dispersion / purity in labeled sets (e.g., Gini index); criteria to decide to stop splitting a partition; examples of regression trees vs. max depth constraint; sklearn code. pdf
3/14 LabTest 4: Decision Trees

3/17 Ensemble models (Bagging, Random Forest, Boosting): General ideas behind combining models; voting/averaging vs. stacking models; bagging and boosting as forms of combining different experts; bagging: construction of the datasets by bootstrapping, properties of the base model, variance reduction goals, aggregation by averaging; random forests as bagging with randomization of the features of each model; boosting: sequential generation of the weighted datasets, base model as a weak learner, goals of combining multiple weak learners, how to compute voting weights. pdf
3/19 Linear Models for Classification, Definitions, Properties, Training: General form and properties of linear models for classification and for regression; formulation of a linear model, bias, scalar product; basic geometrical properties; linear decision boundaries; from a linear function to discrete labels for classification; feature transformations and linear models; score and behavior of a linear classifier; functional and geometric margin; loss functions for classification; optimization problem and its challenges pdf
3/21 Gradient methods for model training: Solving optimization problems for linear models using different loss functions; gradients and system of equations; difficulty of optimization problems; properties of gradients; gradient descent/ascent as numeric iterative approach for finding minima/maxima; properties of GD; role of step size; batch mode GD; sum functions and stochastic/incremental GD; examples of use and sklearn code (in Notebook) pdf Notebook

3/24 Probabilistic classifiers, Logistic Regression (LR): Probabilistic view of ML and density estimation; discriminative vs. generative modeling; estimation of probability distributions: parametric vs. non-parametric; MLE estimation of parameters; Bernoulli distribution; LR as a discriminative probabilistic model for linear classification; mathematical formulation: logistic regression function and Bernoulli distribution; M(Conditional)LE of weight values for LR: formulation of the optimization problem; form of the optimization problem for M(C)AP; concavity of the optimization function; use of gradient ascent; calculation of the gradients. pdf
3/26 Support Vector Machines (SVM), Kernelization: Notion of classifier margin; SVMs as max-margin classifiers. SVM optimization problem; SVM hard-margin formulation and properties; support vectors; formulation of primal and dual problems; solution of dual; soft-margin SVM for non linearly separable data; slack variables and penalty factor; support vectors in soft-margin SVM; Hinge loss and regularization; Support Vector Regression (SVR): general concepts and basic formulation, loss function, example of application; Kernel functions and inner products; kernelization of SVM problem formulation; kernel trick; examples of kernel functions; properties of kernels; examples of applications; kernelization of Logistic regression. pdf
3/28 LabTest 5: Ensemble methods, Gradient methods, Logistic Regression, SVMs

3/31 Unsupervised Learning, Clustering, K-Means: Characteristics of UL tasks; general concepts about the use of UL for dimensionality reduction, finding hidden structure and grouping, learning generative models; clustering models: partitional, hierarchical, hard vs. soft clustering; similarity measures; partitional clustering and k-means problem; naive k-means algorithm; phases of the algorithm; linear cluster boundary regions and Voronoi tessellation of feature space; k-means at work step-by-step; k-means as an instance of Expectation-Maximization; assumptions and limitations of k-Means; ideal data for k-Means: balanced, spherical cluster of data generated by Gaussian populations with equal variances; effects of data deviating from ideal; number of clusters: effects of wrong choices, how to select k (elbow method); kernel k-means; comparison with other clustering algorithms. pdf
Notebook
Clustering for Image Compression, Classification, Segmentation: Notebook
4/2 Neural Networks 1, Perceptron, Multi-Layer Perceptrons (MLPs): From linear classifiers to the Perceptron model; abstraction of neuron processing, threshold models; Perceptron algorithm: iterative adjustments of weights, use of gradients to minimize quadratic loss; from single perceptrons to multi-layer perceptrons (MLP); neural networks as non-linear parametric models for function approximation; perceptron units with non-linear activations; hidden and visible layers; feed-forward (FF) multi-layer architectures; activation functions; hidden layer as feature extractor / feature map; NN and automatic feature learning through the hidden layers; loss function and NN optimization problem; stochastic gradient descent approaches; idea of backpropagation for computing the gradients; choice of the activation function and optimization problem. pdf
4/4 Neural Networks 2, Design and Training, Convolutional Neural Networks (CNNs): Design choices; overfitting issues; general theoretical properties; use of Keras for implementing and testing neural networks: creating a sequential layout, compile a model, test and visualize learning evolution, inspect/visualize the activations from the layers and relate to feature extractions; epochs; softmax; MLP examples with numeric and image data; MLPs vs. Convolutional Neural Networks (CNNs); issues with fully connected networks; core reasons behind the success of CNNs; recap of convolution operator and SIFT feature extraction in images; role and rationale behind convolutional and pooling layers; locality of processing, hierarchical dimensionality reduction; number of trainable parameters; typical CNN architectures: sequence of (convolution filters, activations, pooling); constructing convolutional layers; role of stride; feature maps and filter banks; constructing pooling layers; max pooling; soft-max output layer; examples; visualization of features extracted at the different layers; notes about optimization in CNNs and transfer learning; keras for CNNs. pdf Notebook MLPs
Notebook CNNs
Pima Indians dataset

4/7 Eid al-Fitr, no classes
4/9 Eid al-Fitr, no classes
4/11 Eid al-Fitr, no classes

4/14 Break, no classes
4/16 Regularization techniques: Control of model complexity, minimize risk of overfitting: explicit feature selection; noise and limitations during training; use of regularized loss functions (implicit control of weight magnitude); bias-variance of a model; effect of large weights on variance / overfitting; loss function with additive regularization terms; use of L-norms; Ridge regression; Lasso regression; Ridge regression and Lasso regression at work; comparative analysis over a number of test scenarios with linear and polynomial features; Elastic Net regression; real-world regression scenario from data wrangling to model selection; study of the effect of the λ parameter for training and testing errors. Notebook Mart sales dataset
4/18 LabTest 6: Clustering, Neural Networks, Regularization

4/21 Overview of Recurrent networks, Transformers, NLP: Recurrent networks (RNs, LSTMs) and their use in NLP; from recurrent networks to sequence-to-sequence models using encoder-decoder architectures; attention mechanisms; transformers; BERT and GPT; typical NLP tasks and use of Hugging Face models and tools for tackling them; examples with notebooks from a reference book. Reference book on NLP with Hugging Face tools Github repository for the book, with notebooks and material
4/23 Course review
4/25 Final Exam

Homework Assignments

(Schedule and topics are subject to changes)
Topic Files Due Dates
Homework 1: A Classification task: from data ingestion to model selection and testing
Homework 2: Data cleaning and model selection for regression tasks
Homework 3: Feature engineering and model selection for supervised image classification
Homework 4: Unsupervised image classification, Time series analysis


LabTests

(Schedule and topics are subject to changes)
Topic Files Due Dates
LabTest 1: Fundamentals of ML, NumPy, Jupyter notebooks and Markdown
LabTest 2: ML pipeline and python tools for a regression task
LabTest 3: Classification pipeline using k-NN
LabTest 4: Data cleaning: missing data and outliers
LabTest 5: Scaling, Feature transformations, Feature engineering of numeric data
LabTest 6: Decision Trees
LabTest 7: Ensemble methods, Gradient methods, Logistic Regression, SVMs
LabTest 8: Clustering, Neural Networks, Regularization
LabTest 9: Recommender systems and ChatGPT


Project

A Query Answering Machine (QuAM)

Deliverables Files Due Dates
D1 [Report]: Initial Proposal and Dataset
D2 [Dataset and Notebook]: Final Dataset
D3 [Software and Notebook]: Query Answering Machine (QuAM)
D4 [Presentation]: Final report


Policies for Assignments

  • Homework is due on Gradescope by the posted deadline. Assignments submitted past the deadline will incur the use of late days.

  • You have 3 late days in total, but cannot use more than 1 late day per homework. No credit will be given for homework submitted more than 2 days after the due date. After your 3 late days have been used you will receive 30% off for each additional day late.

  • For homework, you can discuss the exercises with your classmates, but you should write up your own solutions. If you find a solution in any source other than the material provided on the course website or the textbook, you must mention the source.

  • In general, for all types of assignments and tests, CMU's directives for academic integrity apply and must be duly followed.

Office Hours

Name Email Hours Location
Gianni Di Caro gdicaro@cmu.edu By appointment, drop by M 1007
Zhijie Xu zhijiex@andrew.cmu.edu TBD
Devang Acharya devanga@andrew.cmu.edu TBD