Machine learning is a subfield of computer science concerned with exploring, studying, and developing learning systems, methods, and algorithms that can improve their performance by learning from data. The course is designed to give undergraduate students a one-semester introduction to the main principles, algorithms, and applications of machine learning.
After completing the course, students will be able to:
Successful completion of the following courses is required: 15-122, and (21-127 or 21-128 or 15-151), and (21-325 or 36-217 or 36-218 or 36-225 or 15-359).
In general, familiarity with Python programming and a solid background in general CS, calculus, and probability theory are needed to deal successfully with the course challenges. Some basic concepts in CS, calculus, and probability will be briefly reviewed (but not re-explained from scratch).
Talk to the instructor if you are unsure whether your background is suitable for the course.
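For concreteness, here is a minimal Python sketch of the prerequisite rule above as a boolean check; it is illustrative only, and the function name and example input are made up.

```python
# Illustrative only: the prerequisite rule above expressed as a boolean check.
# Course numbers are the ones listed; the function name is hypothetical.

def meets_prerequisites(passed: set[str]) -> bool:
    """True if the set of passed course numbers satisfies the prerequisite rule."""
    return (
        "15-122" in passed
        and bool({"21-127", "21-128", "15-151"} & passed)
        and bool({"21-325", "36-217", "36-218", "36-225", "15-359"} & passed)
    )

# Example: a student who has passed 15-122, 21-127, and 36-225
print(meets_prerequisites({"15-122", "21-127", "36-225"}))  # True
```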
Course grades will be assigned based on the following weighting: 40% homework, 15% final exam, 10% midterm exam, 20% project, 15% multiple-choice quizzes. There will be four homework assignments. The final exam will include questions about all the topics covered in the course, with an emphasis on the topics introduced after the midterm exam. Quizzes will consist of multiple-choice questions aimed at keeping up with the course topics in between homeworks. The project will take the form of a leaderboard game to be completed in two stages, Tasks T1 and T2, which will be presented in class by each student.
Grades will be curved only for classes featuring more than 20 participants.
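As a concrete illustration of the weighting above, here is a minimal Python sketch; the component names and example scores are hypothetical, and only the weights come from the breakdown stated above.

```python
# Illustrative only: computes a weighted course grade from the weights stated
# above (40% homework, 15% final, 10% midterm, 20% project, 15% quizzes).
# Scores are assumed to be on a 0-100 scale; names are made up.

WEIGHTS = {
    "homework": 0.40,
    "final_exam": 0.15,
    "midterm_exam": 0.10,
    "project": 0.20,
    "quizzes": 0.15,
}

def course_grade(scores: dict[str, float]) -> float:
    """Weighted average of the five grading components."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Example with hypothetical component scores
print(course_grade({"homework": 88, "final_exam": 75,
                    "midterm_exam": 80, "project": 92, "quizzes": 85}))
```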
In addition to the lecture handouts (which will be made available after each lecture), the instructor will provide additional material during the course to cover specific parts of the course.
A number of (optional) textbooks can be consulted to ease the understanding of the different topics (the relevant chapters will be pointed out by the instructor):
Dates | Topics | Slides | Useful References | HW / QNA
---|---|---|---|---
8/26 | Introduction: basic concepts; taxonomy of learning problems; workflow of ML approaches; interpretation views of ML problems; course road map; logistics; practical recommendations. | | |
8/28 | Generalization, ML design, Canonical SL problem, Probability review: example of generalization issues; overfitting; notion of regularization; design of an SL system; hypothesis and inductive bias; loss functions; introduction to learning theory; empirical vs. generalization errors; canonical SL problem: inductive bias, loss function and empirical risk minimization, parameter optimization; review summary of discrete probability theory. | | | QNA1 out
8/29 | Bayes Optimal Classifier, Decision Boundaries: probabilistic view; elements of decision theory; first review of probability concepts; Bayes rule; decision boundaries and classification errors; generative and discriminative modeling; introduction to probability distribution estimation. | | |
9/4 | Prediction and Classification with Estimated Probabilities: overview of conjugate priors for continuous and discrete distributions; making predictions using different criteria; classification using estimated probabilities and MLE/MAP/Bayes; Gaussians and decision boundaries; complexity challenges and dependencies; introduction to Naive Bayes. | | | QNA1 due, HW1 out
9/5 | Naive Bayes Classifier: review of Bayes classification using estimated probabilities and MLE/MAP/Bayes; review of multivariate Gaussians, notions related to the covariance matrices; quadratic and linear decision boundaries when using Gaussians; complexity challenges and feature dependencies; Naive Bayes models; discrete and continuous features, MLE and MAP approaches; simplification rules for MAP estimation using smoothing parameters; case study for discrete features: text classification; case study for continuous features: image classification. | | |
9/11 | Logistic Regression, Gradient-based methods: solution of the optimization problems for Logistic Regression; generalities on solving optimization problems; concave and convex functions; local and global optima; gradients; gradient descent/ascent; exact gradient step size and zig-zag behavior; effect of step size / learning rate; well- and ill-conditioned functions; gradient ascent for LR-MLE; gradient ascent for LR-MAP. | pdf, Notebook | |
9/12 | Logistic Regression, comparison among classifiers: more on gradient ascent for logistic regression, practical scenarios; MCAP case for gradient ascent with Gaussian priors; logistic regression with more than two classes, softmax function; decision boundaries for different classifiers; linear vs. non-linear boundaries; number of parameters in LR vs. Naive Bayes; asymptotic results for LR vs. NB; overall comparison between LR and NB. | | |
9/18 | Multi-Layer Perceptrons 1: NNs as composite functions; functional form of an NN; visualization of the output surface; loss minimization problem in the weight space; non-convex optimization landscape for the error surface; stochastic and batch gradient descent; backpropagation and the chain rule; backpropagation for a logistic unit; backpropagation for a network of logistic units. | pdf, Notebook: MLP viz | | HW1 due, QNA2 out
9/19 | Multi-Layer Perceptrons 2: more on backpropagation; forward and backward passes in the general case; properties and issues of backpropagation; design choices (momentum, learning rate, epochs); weight initialization; overfitting and generalization issues; approaches to regularization; overview of model selection and validation approaches in ML. | pdf, Notebook: MLP for MNIST | |
9/25 | Support Vector Machines (SVM) I: linear models for classification (again); score, functional margin, geometric margin; max-margin classifiers; linearly separable case and hard-margin SVM optimization problem; support vectors and relationship with the weight vector; non-linearly separable case and use of slack variables for elastic problem formulation; margin and non-margin support vectors; penalty / tradeoff parameter; SVMs and hinge loss. | | |
9/26 | Support Vector Machines (SVM) II: review of soft-margin SVM; SVM and hinge loss; solution of the SVM optimization problem; relaxations and Lagrangian formulation; Lagrangian function and dual problem; solution of the dual; Lagrange multipliers; functional relations between multipliers and SVM parameters; solving the non-linearly separable case. | | | QNA2 due, HW2 out
10/2 | Kernel Methods II: kernels and similarity measures; Hilbert spaces and inner products; kernelization and modularity; examples of kernel functions; number of dimensions in feature space; RBF kernel and infinite dimensionality; kernel trick in SVMs; training vs. classification time; overfitting issues; examples of kernelized SVMs at work. | | |
10/3 | Kernel Methods III: computational and complexity issues using kernels with SVMs; SVMs vs. Logistic Regression. | | |
10/9 | Ensemble Methods, Boosting: overview of ensemble methods; bagging; boosting; turning a weak classifier into a strong classifier; sequential weighting of training points; AdaBoost; combining weak classifiers' outputs using voting weights; decision stumps as weak classifiers; analysis and properties of AdaBoost; robustness to overfitting. | | | HW2 due
10/10 | Recitation on Decision Trees (held on Wednesday) | | |
10/14 | Fall break | | |
10/16 | Fall break | | | QNA3 out 😕
10/17 | Fall break | | |
10/21 | Midterm Exam | | |
10/23 | Regression Models, Linear Regression 1: regression problems; empirical predictor; hypothesis class and loss functions; linear regression with squared losses: Ordinary Least Squares (OLS); problem formulation and solution approaches; controlling model complexity (and avoiding singularities) using regularization; effects of different regularization approaches; Ridge regression; Lasso regression; comparison among Lp-norm regularizations. | | | QNA3 due, HW3 out
10/24 | Linear Regression 2: review of linear regression solutions; issues related to solving OLS: matrix inversion, computations, numerical instabilities; normal equations vs. SGD vs. algebraic methods; Ridge regression as constrained optimization, shrinking of weights; Lasso regression as constrained optimization, shrinking and zeroing of weights. | pdf (updated) | |
10/24 | k-NN Classifier, Model Selection: k-NN as a non-parametric distance-based method; k-NN classifier vs. optimal classifier; 1-NN decision boundary and Voronoi tessellation of feature space; small vs. large k; non-parametric vs. parametric methods; overfitting issues; training error vs. testing error; training error vs. validation error; hold-out method; cross-validation methods: k-fold CV, leave-one-out CV, random subsampling; design issues in CV; model selection; model selection using CV. | | |
10/30 | Nonlinear Regression Models, Kernel Regression 1: approaches to nonlinear regression; Taylor series, linear approximation, basis functions; solution with generic feature functions; polynomial regression; properties of other basis functions; examples, model complexity and selection of hyperparameters; bias vs. variance; model selection and evaluation; use of kernels, example in ridge regression. | | |
10/31 | Nonlinear Regression Models, Kernel Regression 2: dual form of ridge regression; inner products; complexity of primal and dual forms; derivation and solution of the dual problem; dual variables as Lagrange multipliers; learned weights as weighted average of training examples; dual solution and feature maps; kernelization of the dual; kernels for learning and predictions; complexity of solutions using explicit and kernelized feature maps; concepts about Support Vector Regression (optional). | | |
11/6 | Nonlinear Regression Models, Kernel Regression 3: smoothing kernels as a result of closed-form LS solutions; examples with Gaussian and other basis functions; localization and weighted averages of observations; bias-variance tradeoff; kernel regression as non-parametric regression; Nadaraya-Watson estimator; examples of kernels; role and impact of bandwidth; k-NN as another non-parametric estimator; kernel regression and least squares formulations. | | | HW3 due, HW4 out
11/7 | Unsupervised Learning, Principal Component Analysis (PCA): large feature spaces; curse of dimensionality; dimensionality reduction by feature selection and by latent features; simple general model for dimensionality reduction / compression; subspace spanned by a vector basis; key ideas behind principal component analysis (PCA); representation of data; variance captured by projections; definition of minimum variance directions; PCA algorithm; examples; limitations of PCA; Kernel PCA. | | |
11/13 | Latent Variable Models, Mixture Models, Expectation-Maximization (EM): latent variables and associated relevant problems; MLE with complete data; from complete data to latent data; Gaussian Mixture Models (GMMs); MLE for latent data in GMMs; formulation and alternating optimization; proof of convergence of alternating optimization; general form of EM. | | |
11/14 | Gaussian Mixture Models, Clustering, and EM: derivation of the general form of the EM algorithm; EM for GMMs; general properties and limitations of EM; study of special cases for GMMs; relations to clustering (k-means). | pdf; Slides on GMMs (Barnabas Poczos, 10-401) | |
11/20 | Class canceled | | | HW4 due, QNA4 out
11/21 | Review of Clustering and EM | | |
11/27 | Learning Theory II: PAC bounds on continuous hypothesis spaces; set shattering; VC dimension; VC dimension for linear models, decision stumps, axis-aligned rectangles, circles, ellipses; generalization error bound and VC dimension; tightness of the bound; bias-variance and VC dimension; limitations of the VC dimension. | | | QNA4 due
11/28 | Course wrap-up, Q&A | | |
12/4 | Final Exam | | |
Topic | Files | Due Date
---|---|---
Homework 1: Classification, Naive Bayes and Logistic Regression | - | Sep 18 |
Homework 2: Neural Networks and Support Vector Machines | - | Oct 9 |
Homework 3: Regression | - | Nov 6 |
Homework 4: Unsupervised Learning | - | Nov 20 |
Homework is due on Autolab by the posted deadline. Assignments submitted past the deadline will incur the use of late days.
You have 6 late days in total, but you cannot use more than 2 late days per homework. No credit will be given for homework submitted more than 2 days after the due date. Once your 6 late days have been used, you will receive a 20% deduction for each additional day late.
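The following minimal Python sketch shows one reading of the late-day rules above; it is illustrative only, not an official grading script, and the function name and the exact interaction between the per-homework cap and the 20% deduction are assumptions.

```python
# Illustrative sketch of the late-day policy described above (one possible
# reading of the rules, not an official grading script).

TOTAL_LATE_DAYS = 6      # budget for the whole semester
MAX_LATE_PER_HW = 2      # hard cap per homework
PENALTY_PER_DAY = 0.20   # applied once the late-day budget is exhausted

def apply_late_policy(score: float, days_late: int, late_days_left: int):
    """Return (adjusted_score, remaining_late_days) for one homework."""
    if days_late <= 0:
        return score, late_days_left
    if days_late > MAX_LATE_PER_HW:
        return 0.0, late_days_left             # no credit beyond 2 days late
    covered = min(days_late, late_days_left)   # days absorbed by the budget
    penalized = days_late - covered            # days charged 20% each
    return score * (1 - PENALTY_PER_DAY * penalized), late_days_left - covered

# Example: 1 day late with no late days remaining -> 20% deduction
print(apply_late_policy(90.0, 1, 0))  # (72.0, 0)
```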
You can discuss the exercises with your classmates, but you must write up your own solutions. If you find a solution in any source other than the material provided on the course website or the textbook, you must cite the source. You can work on the programming questions in pairs, but theoretical questions must always be submitted individually. Make sure you include a README file with your Andrew ID and your collaborator's Andrew ID.
In general, for both assignments and exams, CMU's academic integrity policies apply and must be followed.
The class includes both a midterm and a final exam. Both exams will include theory and pseudo-programming questions. During exams, students are only allowed to consult a one-page cheat sheet (written in any desired format). No other material is allowed, including textbooks, computers/smartphones, or copies of lecture handouts.
The midterm exam is set for October 21.
The final exam is set for December 4.
Name | Email | Hours | Location
---|---|---|---
Gianni Di Caro | gdicaro@cmu.edu | Thursdays 4:30-5:30pm + pass by my office at any time ... | M 1007 |
Aliaa Essameldin | aeahmed@andrew.cmu.edu | TBD | M 1004 |