 # Introduction to Machine Learning Key Information

Mondays, Wednesdays, Thursdays, 1:30pm - 2:50pm, Room 2051

12.0

15% Final, 10% Midterm, 40% Homework (four assignments), 20% Project (Two Tasks), 15% Multiple-choice Quizzes

(15122) and (21127 or 21128 or 15151) and (21325 or 36217 or 36218 or 36225 or 15359). In general, a solid background in CS, calculus, and probability theory is needed to deal successfully with course challenges

# Overview

Machine learning is a subfield of computer science with the goal of exploring, studying, and developing learning systems, methods, and algorithms that can improve their performance with learning from data. The course is designed to give undergraduate students a one-semester-long introduction to the main principles, algorithms, and applications of machine learning.

After completing the course, students will be able to:

• select and apply an appropriated supervised learning algorithm for classification problems (e.g., naive Bayes, perceptron, support vector machine, logistic regression);

• select and apply an appropriate supervised learning algorithm for regression problems (e.g., linear regression, ridge regression);
• recognize different types of unsupervised learning problems, and select and apply appropriate algorithms (e.g., density estimation, clustering, linear and nonlinear dimensionality reduction);

• work with probabilities (Bayes rule, conditioning, expectations, independence), linear algebra (vector and matrix operations, eigenvectors, SVD), and calculus (gradients, Jacobians) to derive machine learning methods such as linear regression, naive Bayes, and principal components analysis;

• understand machine learning principles such as model selection, overfitting, and underfitting, and techniques such as cross-validation and regularization;

• implement machine learning algorithms such as logistic regression via stochastic gradient descent, linear regression, perceptron, SVMs, boosting, k-means clustering;

• run appropriate supervised and unsupervised learning algorithms on real and synthetic data sets and interpret the results.

The course is organized as follows:

• The course will run in parallel with the 10-315 course in the main campus, taught by Prof. Aarti Singh. We will share the same homework, exams, project topics, and grading criteria.

• The course will be based on lectures, that will be given on Mondays, Wednesdays, and Thursdays. Whenever possible, Thursdays, will be used as recitation classes, to revise and/or expand concepts introduced in the lectures, work out example cases, and cover aspects that may be not in the students' background.
Students are expected to attend all classes and to actively participate with questions.

• For each one of the different topics, the course will present relevant techniques, discuss formal results, and show the application to problems of practical and theoretical interest.

• Homework will include both questions to be answered and short programming assignments. Written questions will involve working through different algorithms, deriving and proving mathematical results, and critically analyzing the material presented in class. Programming assignments will mainly involve writing code to implement and test algorithms in relevant scenarios.

• Quizzes, in the form of multiple-choice questions, will be used to check student progress and to promote revising lecture material with continuity.

• In the second part of the course, a Project will be assigned. It will be in the form of a leadeboard game that will have to be completed in two stages (Task T1 and Task T2) that will be presented in the class by each student during the course.

## Prerequisites

Having successfully passed the following courses is necessary: (15122) and (21127 or 21128 or 15151) and (21325 or 36217 or 36218 or 36225 or 15359).

In general, familiarity with python programming, and a solid background in general CS, calculus, and probability theory are needed to deal successfully with course challenges. Some basic concepts in CS, calculus, and probability will be briefly revised (but not re-explained from scratch).

Talk to the teacher if you are unsure whether your background is suitable or not for the course.

Course grading will assigned based on the following weighting: 40% Homework, 15% Final exam, 10% Midterm exam, 20% Project, 15% Multiple-choice Quizzes. There will be about four homework assignments. The final exam will include questions about all the topics considered in the course, with an emphasis on the topics introduced after the midterm exam. Quizzes will consist in multiple-choice questions aimed to keep up with the topics of the course in-between homeworks. The project will be in the form of a leadeboard game that will have to be completed in two stages, Tasks T1 and T2 that will be presented in the class by each student.

Grades will be curved only for classes featuring more than 20 participants.

## Textbooks

In addition to the lecture handouts (that will be made available after each lecture), during the course additional material will be provided by the instructor to cover specific parts of the course.

A number of (optional) textbooks can be consulted to ease the understanding of the different topics (the relevant chapters will be pointed out by the teacher):

• Machine Learning, Tom Mitchell (in the library)
• Pattern Recognition and Machine Learning, Christopher Bishop, available online
• Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, available online
• A Course in Machine Learning, Hal Daume', available online
• Pattern Classification Richard Duda, Peter Hart, David Stork, 2nd ed., partially online (in the library)

# Schedule (Possibly subject to changes)

Dates Topics Slides Useful References
HW
8/26 Introduction: Basic concepts; taxonomy of learning problems; workflow of ML approaches; interpretation views of ML problems; course road map; logistics; practical recommendations. pdf
8/28 Generalization, ML design, Canonical SL problem, Probability review: Example of generalization issues; overfitting; notion of regularization; design of an SL system; hypothesis and inductive bias; loss functions; introduction to learning theory; empirical vs. generalization errors; canonical SL problem: inductive bias, loss function and empirical risk minimization, parameter optimization; review summary of discrete probability theory. pdf QNA1 out
8/29 Bayes Optimal Classifier, Decision Boundaries: probabilistic view, elements of decision theory, first review of probability concepts, Bayes rule, decision boundaries and classification errors, generative and discriminative modeling, introduction to probability distribution estimation pdf

9/2 Estimating Probabilities: probability estimation and Bayes classifiers; importance and challenges estimating probabilities from data; frequentist vs. Bayesian approach to modeling probabilities; definition and properties of MLE, MAP, and Full Bayes approaches for parameter estimation; priors and conjugate probability distributions; examples with Bernoulli data and Beta priors. pdf
9/4 Prediction and Classification with Estimated Probabilities: overview of conjugate priors for continous and discrete distributions; making predictions using different criteria; classification using estimated probabilities and MLE/MAP/Bayes; Gaussians and decision boundaries; complexity challenges and dependencies; introduction to Naive Bayes. pdf
QNA1 due, HW1 out
9/5 Naive Bayes Classifier: review of Bayes classification using estimated probabilities and MLE/MAP/Bayes; review of multivariate Gaussians, notions related to the covariance matrices; quadratic and linear decision boundaries when using Gaussians; complexity challenges and feature dependencies; Naive Bayes models; discrete and continous features, MLE and MAP approaches; simplification rules for MAP estimation using smoothing parameters; case study for discrete features: text classification; case study for continous features: image classification. pdf

9/9 From Generative to Discriminative Classifiers, Linear Classifiers, Logistic Regression: Discriminative models for classification; linear models; discriminant functions; properties of linear models; loss functions; linear classification using linear discriminant functions and log-logistic loss; probabilistic discriminative models; logistic regression as linear probabilistic classifier; decision boundaries; MLE and MAP solutions; optimization problem. pdf
9/11 Logistic Regression, Gradient-based methods: Solution of the optimization problems for Logistic Regression; generalities on solving optimization problems; concave and convex functions; local and global optima; gradients; gradient descent / ascent; exact gradient step-size and zig-zig behavior; effect of step size / learning rate; well- and ill-conditioned functions; gradient ascent for LR-MLE; gradient ascent for LR-MAP. pdf

Notebook
9/12 Logistic regression, comparison among classifiers: More on gradient ascent for logistic regression, practical scenarios; MCAP case for gradient ascent with Gaussian priors; logistic regression with more than two classes, softmax function; decision boundaries for different classifiers; linear vs. non-linear boundaries; number of parameters in LR vs. Naive Bayes; asymptotic results for LR vs. NB; overall comparison between LR and NB. pdf

9/16 Nonlinear Classifiers, Neural Networks: Linear units and perceptron; perceptron algorithm and properties; from perceptrons to artificial neural networks, biological analogy; structure of a unit; multi-layerd feed-forward architectures (MLP); recurrent network models; sigmoid units; other activation functions; hidden layers and hierarchical feature learning and propagation; matrices and network parameters; theoretical properties of MLPs; design choices, concepts about overfitting and complexity. pdf
9/18 Multi-Layer Perceptrons 1: NN as composite functions; functional form of a NN; visualization of the output surface; loss minimization problem in the weight space; non-convex optimization landscape for the error surface; stochastic and batch gradient descent; backpropagation and chain rule; backpropagation for a logistic unit; backpropagation for a network of logistic units. pdf

Notebook: MLP viz
HW1 due, QNA2 out
9/19 Multi-Layer Perceptrons 2: more on backpropagation; forward and backward passes in the general case; properties and issues of backpropagation; design choices (momentum, learning rate, epochs); weight inizialization; overfitting and generalization issues; approaches for to regularization; overview of model selection and validation approaches in ML. pdf

Notebook: MLP for MNIST

9/23 Deep Neural Networks: Limits of fully connected MLPs; Issues with the choice of the activation function, sigmoid/tanh units and vanishing gradients; limitations of fully connected MLPs; general ideas about exploting structure and locality in input data; receptive fields; convolutional filters and feature maps; weight sharing; invariant properties of features; pooling layers for subsampling; incremental and hierachical feature extraction by convolutional and pooling layers; output layer and softmax function; applications; general ideas about autoencoders and generative networks. pdf
9/25 Support Vector Machines (SVM) I: Linear models for classification (again); score, functional margin, geometric margin; max-margin classifiers; linearly separable case and hard-margin SVM optimization problem; support vectors and relationship with the weight vector; non-linearly separable case and use of slack variables for elastic problem formulation; margin and non margin support vectors; penalty / tradeoff parameter; SVMs and hinge loss. pdf
9/26 Support Vector Machines (SVM) II: Review of soft-margin SVM; SVM and hinge loss; solution of the SVM optimization problem; relaxations and Lagrangian formulation; Lagrangian function and dual problem; solution of the dual; Lagrange multipliers; functional relations between multipliers and SVM parameters; solving the non-linearly separable case. pdf QNA2 due, HW2 out

9/30 Kernel Methods I: Dual SVM problem formulation for the non-linearly separable case (soft-margin); formulation and dot products; non-linear mapping in higher dimensional feature spaces for non-linear problems; kernel functions and implicit feaure map definition; Mercer's conditions for kernels; kernel matrix; kernel trick, kernelizing algorithms; SVM kernelization. pdf
10/2 Kernel Methods II: Kernels and similarity measures; Hilbert spaces and inner products; kernelization and modularity; examples of kernel functions; number of dimensions in feature space; RBF kernel and infinite dimensionality; kernel trick in SVMs; training vs. classification time; overfitting issues; examples of kernelize SVMs at work. pdf
10/3 Kernel Methods 3: Computational and complexity issues using kernels with SVMs; SVMs vs. Logistic regression pdf

10/7 Decision Trees, SL based on the Divide-and-Conquer Model: Learning by asking questions; structure of decision trees; expressivness of DTs; hypothesis space and NP-hardness for finding simplest consistent hypothesis; recursive dataset decomposition, divide-and-conquer; axes-parallel decision boundaries; greedy top-down heuristics, ID3, C4.5; entropy and information gain; purity of a labeled set; maximal gain for choosing the next attribute; discrete vs. continous features; overfitting issues and countermeasures. pdf
• Mitchell Chapters 1, 2, 6.1 - 6.3
• Bishop Chapters 1, 2
10/9 Ensemble methods, Boosting: Overview of ensemble methods; bagging; boosting; turn a wek classifier into a strong classifier; sequential weighting of training points; AdaBoost; combining weak classifers outputs using voting weights; decision stumps as weak classifiers; analysis and propertied of AdaBoost; robustness to overfitting. pdf HW2 due
10/10 Recitation on Decision Trees (it was on Wednesday)
10/14 Fall break
10/16 Fall break QNA3 out 😕
10/17 Fall break
10/21 Midterm Exam
10/23 Regression Models, Linear Regression 1: Regression problems; empirical predictor; hypothesis class and loss functions; linear regression with squared losses: Ordinary Least Squares (OLS); problem formulation and solution approaches; controlling model complexity (and avoiding singularities) using regularization; effects of different regularization approaches; Ridge regression; Lasso regression; comparison among Lp-norm regularizations. pdf QNA3 due, HW3 out
10/24 Linear Regression 2: Review of linear regression solutions; issues related to solving OLS: matrix inversion, computations, numerical instabilities; normal equations vs. SGD vs. algebraic methods; Ridge regression as constrained optimization, shrinking of weights; Lasso regression as constrained optimization, shrinking and zeroing of weights. pdf (updated)
• Bishop 3.1.4
10/24 k-NN classifier, Model selection: k-NN as a non-parametric distance-based method; k-NN classifier vs. Optimal classifier; 1-NN decision boundary and Voronoi tassellation of feature space; small vs. large K; non-parametric vs. parametric methods; overfitting issues; training error vs. testing error; training error vs. validation error; hold-out method; cross-validation methods: k-fold CV, leave-one-out CV, random subsampling; design issues in CV; model selection; model selection using CV. pdf
• Murphy Chapter 1.4.2
• Daume' Chapter 3
• Mitchell Chapter 7

10/28 Linear Regression 3: Statistical models of linear regression; discriminative modeling of the conditional distribution of the outputs; white Gaussian noise to explain variations; maximization of log-likelihood vs. solution of OLS; M(C)LE as unregularized least squares; use of priors on parameters; M(C)AP estimate as regularized LS; Gaussian prior and Ridge regression; Laplace prior and Lasso regression. pdf
• Bishop Chapter 3
Project T2 out
10/30 Nonlinear Regression Models, Kernel Regression 1: Approaches to nonlinear regression; Taylor series, linear approximation, basis functions; solution with generic feature functions; polynomial regression; properties of other basis functions; examples, model complexity and selection of hyperparameters; bias vs. variance; model selection and evalution; use of kernels, example in ridge regression. pdf
10/31 Nonlinear Regression Models, Kernel Regression 2: Dual form of ridge regression; inner products; complexity of primal and dual forms; derivation and solution of the dual problem; dual variables as Lagrange multipliers; learned weights as weighted average of training examples; dual solution and feature maps; kernelization of the dual; kernels for learning and predictions; complexity of solutions using explicit and kernelized feature maps; concepts about Support Vector Regression (optional). pdf

11/4 Student Presentations for Task 1 of Project

11/6 Nonlinear Regression Models, Kernel Regression 3: Smmothing kernels as a result of closed-form LS solutions; examples with Gaussian and other basis functions; localization and weighted averages of observations; bias-variance tradeoff; kernel regression as non-parametric regression; Nadaraya-Watson estimator; examples of kernels; role and impact of bandwidth; k-NN as another non-parametric estimator; kernel regression and least squares formulations. pdf
• Bishop Chapter 3.3, 3.2, 2.5, 6.2, 6.3.1
HW3 due, HW4 out
11/7 Unsupervised Learning, Principal Component Analysis (PCA): Large feature spaces; curse of dimensionality; dimensionality reduction by feature selection and and latent features; simple general model for dimensionality reduction / compression; subspace spanned by a vector basis; key ideas behind principal component analysis (PCA); representation of data; variance captured by projections; definition of mininum variance directions; PCA algorithm; examples; limitations of PCA, Kernel PCA. pdf

11/11 Unsupervised learning, Clustering Methods: Overview of unsupervised learning tasks; characterization of clustering tasks; types of clustering; (flat) K-means clustering problem; role of centroids, cluster assignments, Voronoi diagrams; (naive) K-means algorithm, examples; computational complexity; convergence and local minima; K-means loss function; alternating optimization (expectation-maximization); limitations of K-means; soft clustering; kernel K-means. pdf
11/13 Latent variable models, Mixture Models, Expectation-Maximization (EM): Latent variables and associated relevant problems; MLE with complete data; from complete data to latent data; Gaussian Mixture Models (GMMs); MLE for latent data in GMMs; formulation and alternating optimization; proof for the convergence of alternating optimization; general form of EM. pdf
• Bishop Chapters 9.2-9.4
11/14 Gaussian Mixture Models, Clustering, and EM: Derivation of general form of EM algorithm; EM for GMMs; general properties and limitations of EM; study of special cases for GMMs, relations to clustering (K-means). pdf

Slides on GMMs (Barnabas Poczos, 10-401)
• Bishop Chapters 9.2-9.4

11/18 Nonparametric Density Estimation: Density estimation problem; parametric vs. non-parametric approaches; histogram density estimation; role of bin width; bias-variance tradeoff; general form of the local approximator; fixing the width: kernel methods; Parzen windows; smooth kernels; fixing th enumber of points: k-NN methods; comparison between the approaches; role of the bandwidth and bias-variance tradeoff. pdf
• Bishop Chapter 2.5
11/20 Class canceled HW4 due, QNA4 out
11/21 Review of Clustering and EM

11/25 Learning Theory I: Needs for bounds on generalization errors; PAC model bounds; sample complexity; consistent but bad hypotheses; derivation of PAC Haussler bound; use of a PAC bound; limitation of Haussler's bound; Hoeffding's bound for a hypothesis which is not consistent; PAC bound and Bias-Variance tradeoff; computing the sample complexity; sample complexity for the case of decision trees; DT of fixed width vs. number of leaves; sample complexity and number of points that allow consistent classification. pdf
11/27 Learning Theory II: PAC bounds on continuous hypothesis spaces; set shattering; VC dimension; VC dimension for linear models, decision stumps, axis-aligned rectangles, circles, ellipsis; generalization error bound and VC dimension; tightness of the bound; bias-variance and VC-dimension; limitations of the VC dimensions. pdf QNA4 due
11/28 Course wrap-up, Q& A:

12/2 Student Project Presentations II
12/4 Final Exam

# Homework Assignments

Topic Files Due Dates
Homework 1: Classification, Naive Bayes and Logistic Regression - Sep 18
Homework 2: Neural Networks and Support Vector Machines - Oct 9
Homework 3: Regression - Nov 6
Homework 4: Unsupervised Learning - Nov 20

## Homework Policies

• Homework is due on autolab by the posted deadline. Assignments submitted past the deadline will incur the use of late days.

• You have 6 late days in total, but cannot use more than 2 late days per homework. No credit will be given for homework submitted more than 2 days after the due date. After your 6 late days have been used you will receive 20% off for each additional day late.

• You can discuss the exercises with your classmates, but you should write up your own solutions. If you find a solution in any source other than the material provided on the course website or the textbook, you must mention the source. You can work on the programming questions in pairs, but theoretical questions are always submitted individually. Make sure that you include a README file with your andrew id and your collaborator's andrew id.

• In general, for both assignments and exams, CMU's directives for academic integrity apply and must be duly followed.

## Exam dates and policies

The class includes both a midterm and a final exam. Both the exams will include theory and pseudo-programming questions. During exams students are only allowed to consult 1-page cheatsheet (written in any desired format). No other material is allowed, including textbooks, computers/smartphones, or copies of lecture handouts.

The midterm exam is set for October 21.

The final exam is set for December 4.

## Office Hours

Name Email Hours Location
Gianni Di Caro gdicaro@cmu.edu Thursdays 4:30-5:30pm + pass by my office at any time ... M 1007
Aliaa Essameldin aeahmed@andrew.cmu.edu TBD M 1004