This course is about the application of machine learning (ML) concepts and models to
solve challenging real-world problems.
The emphasis of the course is on the
methodological and practical aspects of designing, implementing, and using ML
solutions.
Course topics develop around the notion of the ML process pipeline, which identifies
the multi-stage process of building and deploying an ML solution. An ML pipeline
spans data collection and preparation, feature engineering, model building and
selection, evaluation, and deployment.
The process proceeds both forward and backward, iterating over the stages until a
satisfactory model is built.
The workflow of an ML pipeline is illustrated
in the figure below (source: Practical ML with Python).
The course tackles all the stages of the ML pipeline, presenting conceptual insights and
providing algorithmic and software tools to select and implement effective ways of
proceeding and dealing with the challenges of the different stages.
The Python ecosystem for data science and ML (pandas, numpy, matplotlib,
scikit-learn, keras, Jupyter notebooks) is introduced and used to retrieve, store, manipulate,
visualize, and perform exploratory analysis of the data.
Course workflow:
The first part of the course addresses the data part of the pipeline, from data
mining and collection, to data filtering and processing, to feature engineering for
different types of data (numeric, categorical, textual, image, temporal).
Next, unsupervised learning (UL) techniques are introduced to further operate on
the data by learning effective representations and performing dimensionality reduction and
data compression. UL techniques include clustering models, Principal Component Analysis
(PCA), and autoencoders.
Moving from data to techniques for classification and regression, a number of
supervised ML models are presented, including k-nearest neighbors, linear models,
decision trees, support vector machines, and neural networks.
Each model is introduced by conceptualizing its main underlying ideas and by
providing the algorithmic and software tools necessary to experiment with it on
different datasets.
A discussion of generalization, bias and variance, and model evaluation
and selection using cross-validation techniques completes the ML pipeline.
The different techniques are tested and evaluated in problem scenarios from different
domains and based on different data types. Selected problem domains include: natural
language processing, machine vision, financial forecasting, logistics, production
planning, diagnosis and prediction for biomedical data.
Learning Objectives
Students who successfully complete the course will have acquired a general knowledge
of the main concepts and techniques of data science and ML, and will be able
to apply ML in different fields of application.
The course will provide the students with a toolkit of different skills needed to
effectively go through the entire ML pipeline. The students will acquire conceptual and
practical knowledge about:
Course Layout
Prerequisite: having passed 15-112 with a C (minimum).
The basic notions of linear algebra, calculus, and probability theory that are necessary to understand the formal concepts will be explained, assuming little or no previous knowledge.
Grade: 35% Laboratory Assessments, 35% Homework, 30% Project
In addition to the lecture handouts and Python notebooks (made available after each lecture), the instructor will provide additional material during the course to cover specific parts.
A number of (optional) textbooks can be consulted to ease the understanding of the different topics (the relevant chapters will be pointed out by the teacher); these include, but are not restricted to:
Date  Topics  Handouts  References  

1/18  General concepts, ML pipeline: Machine learning for data-driven decision making, extracting information from data, finding structures; basic ML concepts and applications; ML pipeline: from data sources to final model learning and deployment; course information and organization  
1/20  ML tasks and application problems: Taxonomy of ML tasks and problems; Supervised Learning (Classification, Regression); feature spaces; geometric view; workflow of SL; Unsupervised Learning (Finding patterns and relations, clustering, compression, dimensionality reduction); Reinforcement Learning (Sequential decision-making); advantages and issues of learning with a teacher / supervisor; data labels and error quantification; preparing a labeled dataset; practical examples.  
1/21  Introduction to Python's ecosystem for ML: Introduction to the use of
Jupyter Notebooks; format of and typical operations on ML datasets; CSV format
for tabular data; basic concepts behind numpy arrays for vector/matrix
operations; pandas data frames; basic pandas methods for reading and inspecting
data; first examples of data display using matplotlib.
LabTest 1: Basic concepts of ML, use of Jupyter Notebooks and Python tools (15 minutes) 
Notebook  


1/27  Empirical and Generalization errors, Canonical SL problem, ML workflow: Optimization problem for SL; empirical error; model complexity and overfitting; examples using regression; expected generalization (out-of-sample) error; validation sets and estimation of the generalization error; canonical SL problem; SL workflow; building a model in the ML pipeline; promoting generalization.  
1/28  LabTest 2: Core ML concepts, Practice with python's tools  


2/3  Regression with linear models, OLS: Regression and linear regression
concepts and models; linearity of models in prediction and in training; role of
feature weights; Ordinary Least Squares (OLS); analytic solution of OLS using
linearity in coefficients, partial derivatives and linear system of equations;
matrix form; solution using sklearn methods; issues with matrix inversion;
numeric examples with generation of instances. OPTIONAL: rank of a matrix and related concepts of linear algebra, using numpy matrix manipulation methods for solving OLS. 
Notebook  Notebook with optional material on matrix rank and singularity issues  
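As a rough illustration of the OLS lecture above, the following sketch (on hypothetical synthetic data, not taken from the course notebooks) compares the analytic normal-equations solution with sklearn's LinearRegression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=50)

# Analytic OLS via the normal equations: w = (X^T X)^{-1} X^T y
Xb = np.hstack([np.ones((50, 1)), X])     # prepend a bias column
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)  # solve, rather than invert

# Same fit with sklearn
model = LinearRegression().fit(X, y)
```

Using np.linalg.solve avoids explicitly inverting XᵀX, which relates to the matrix-inversion issues mentioned in the lecture.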
2/4  LabTest 3: Regression pipeline  


2/10  kNN for classification tasks: Use of kNN for classification tasks; effect of k on empirical error; plotting decision regions with meshgrids; use of scikit-learn methods (.fit(), .predict()); measuring performance; finding the best k?  Notebook 
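A minimal sketch of the fit/predict workflow covered in this lecture, using hypothetical blob data rather than the course dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two well-separated blobs
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit/predict workflow with scikit-learn
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
accuracy = knn.score(X_te, y_te)
```

Scanning n_neighbors over a held-out set (or with cross-validation) is one way to approach the "finding the best k?" question.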


2/11  LabTest 4: Classification pipeline using kNN  


2/17  Data cleaning, Missing values: Introduction to data wrangling; data cleaning: dealing with missing data; types of missingness: MCAR, MAR, MNAR; general strategies, pros and cons; discarding data: listwise, pairwise, dropping; imputation techniques: statistical estimates, common point, frequent category, category adding, arbitrary value, adding variable, random sampling, multiple imputation, last and next observation in time series, interpolation in time series, predictive models; sklearn methods for data removal and imputation.  Notebook  Diabetes, corrupted dataset  
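One of the imputation techniques listed above (imputation by a statistical estimate, here the mean) can be sketched with sklearn's SimpleImputer; the column is hypothetical, not the Diabetes dataset:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with missing entries encoded as NaN
X = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan]])

# Mean imputation: replace each NaN with the column mean
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```

Other strategies ("median", "most_frequent", "constant") map onto several of the techniques named in the row.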
2/18  Dealing with Outliers: Concept of outlier in data; reasons for an outlier; types of outliers (global, contextual, collective); detection of outliers; removing or keeping? parametric vs. nonparametric statistics; parametric approaches: Gaussians, 3σ rule of thumb, z-scores, univariate vs. multivariate; nonparametric approaches: median and quantiles, IQR, box plots, 1.5·IQR rule of thumb.  Notebook  
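The nonparametric 1.5·IQR rule of thumb from this lecture, sketched on a hypothetical sample with one clear global outlier:

```python
import numpy as np

# Hypothetical sample with one clear global outlier
x = np.array([10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 50.0])

# Nonparametric detection: the 1.5·IQR rule of thumb
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
```

The same fences are what box plots draw as whiskers, which connects this rule to the visualization methods in the row.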


2/24  Feature engineering 1: Correlations, polynomial transformations: correlations among data; use of correlation analysis for selecting/removing features; correlation coefficients; correlation matrix and heatmap visualization; case studies, application to regression problems; beyond linear features: interaction among features; why and how to define feature interaction; linear (regression) models using nonlinear features; polynomial feature transformations: advantages and limits; sklearn methods for polynomial transformations; first use of pipeline methods for automating transformation and fitting processes; general concepts on the utility of transforming the linear features into high(er)-dimensional feature spaces.  Notebook  
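A small sketch of polynomial feature transformations combined with a pipeline, as described above; the quadratic data is synthetic and hypothetical:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical quadratic data that a plain linear model cannot fit well
rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = X[:, 0] ** 2 + rng.normal(0, 0.05, size=60)

# Pipeline: expand features to degree-2 polynomials, then fit OLS
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
r2 = model.score(X, y)
```

The model is still linear in its coefficients; only the feature space is nonlinear, which is the key point of the lecture.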
2/25  LabTest 5: Numeric feature engineering, feature transformations, correlations  


3/1  Spring break  
3/3  Spring break  
3/4  Spring break  


3/10  Feature engineering 3, Image data: Properties of image data; RGB encoding; skimage methods for handling and processing images; raw pixel intensities as features; grayscale transformation; feature extraction by binning; properties and issues using histograms in RGB and grayscale domains; pie charts; features extracted by aggregation statistics.  Notebook  Cat image Dog image Panda image Sea image Another cat image Coder survey dataset 
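Two of the ideas above (grayscale transformation and histogram binning) sketched with plain numpy on a hypothetical random image; the ITU-R 601 luminance weights used here are one common convention, not necessarily the one used in the course notebook:

```python
import numpy as np

# Hypothetical 4x4 RGB image with channel values in [0, 1]
rng = np.random.default_rng(0)
img = rng.uniform(size=(4, 4, 3))

# Grayscale via ITU-R 601 luminance weights (one common convention)
gray = img @ np.array([0.299, 0.587, 0.114])

# Feature vector: an 8-bin intensity histogram, normalized to sum to 1
hist, _ = np.histogram(gray, bins=8, range=(0, 1))
features = hist / hist.sum()
```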

3/11  Feature engineering 4, Image data: Edge extraction with Canny's algorithm; Gaussian filters; edge operators: Sobel and Roberts filters; gradients and derivatives, image gradients, function gradients; Histogram of Oriented Gradients (HOG) as image feature descriptor; step-by-step process for computing the HOG; skimage methods; use of HOG in image classification and object detection.  Notebook  Sliding window image Sitting dog image 



3/17  Feature engineering 6, Image data, PCA (Unsupervised learning 1): Unsupervised learning tasks; high dimensionality of data and related issues; need for dimensionality reduction; feature extraction by identifying latent features; dimensionality reduction and compression; Principal Component Analysis (PCA); PCA for learning representations; PCA for dimensionality reduction; key ideas: linearity, directions of maximal variance; mathematical details (optional); PCA at work; limitations; application to image data using sklearn; use of PCA to extract image features in classification.  PDF Notebook 
Andrew Ng notes on Principal Component Analysis  
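The key PCA ideas above (linearity, directions of maximal variance) sketched on hypothetical 3-D data that actually lies close to a 2-D plane:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 3-D data living close to a 2-D plane plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
X = latent @ mixing.T + rng.normal(scale=0.01, size=(200, 3))

# Keep the two directions of maximal variance
pca = PCA(n_components=2)
Z = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
```

Nearly all the variance is captured by two components, which is the dimensionality-reduction argument in miniature.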
3/18  Clustering 1 (Unsupervised learning 2): Clustering ideas and models: partitional, hierarchical, hard vs. soft clustering; similarity measures; partitional clustering and the KMeans problem; naive KMeans algorithm; phases of the algorithm; linear cluster boundary regions and Voronoi tessellation of feature space; KMeans at work; impact of initial cluster centers; convergence and complexity properties; use of sklearn for KMeans clustering, with application examples.  PDF Notebook 
Mall customers  Daumé, Chapter 15.2 
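A sketch of sklearn KMeans on hypothetical blob data (not the Mall customers dataset); the n_init restarts mitigate the sensitivity to initial cluster centers mentioned above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with three well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# KMeans with k=3; multiple restarts guard against bad initial centers
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
centers = km.cluster_centers_
```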



3/24  Clustering 3, KMeans for image data, classification metrics: Signal compression and Vector Quantization (VQ); relationship between KMeans and VQ; use of KMeans / VQ for image compression in the color space; cluster centers as vector prototypes; data visualization in the RGB space; histogram information; compression ratio; prototype / unsupervised classification of a dataset; NIST image dataset; comparison with supervised classification; performance measures for classification: confusion matrix, accuracy, recall, precision, F-measure; clustering for image segmentation; use of image segmentation; phases of image segmentation; examples.  Notebook  
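The classification performance measures listed above, sketched on hypothetical true/predicted labels for a binary task:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hypothetical true and predicted labels
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

With one false positive and one false negative here, accuracy, precision, recall, and F-measure all coincide; on imbalanced data they diverge, which is why all four matter.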
3/25  Feature Engineering, Time series data: Examples and properties of time series; time series and supervised learning: the need for defining good features; date-time features, pandas methods; time-lagged features, notion of autocorrelation, pandas methods; features based on rolling window statistics; features based on expanding window statistics; linear regression on engineered features; cross-validation for time series, sklearn methods for splitting and cross-validating the dataset; components of a time series: level, trend, seasonality, random residuals; importance of stationarity; subtraction of components to get stationarity; techniques to check stationarity; STL method from statsmodels; statistical tests.  Notebook 
Daily min temperatures Train passengers Airline passengers 
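The lag, rolling-window, and date-time features described above can be sketched with pandas on a hypothetical daily series (not one of the linked datasets):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series
s = pd.Series(np.arange(10.0),
              index=pd.date_range("2022-01-01", periods=10, freq="D"))

# Supervised-learning features: lags, rolling statistics, calendar fields
df = pd.DataFrame({
    "y": s,
    "lag_1": s.shift(1),                       # value one day earlier
    "roll_mean_3": s.rolling(window=3).mean()  # 3-day rolling average
})
df["dayofweek"] = df.index.dayofweek
df = df.dropna()                               # drop rows with undefined lags
```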



3/31  Decision trees 2: Construction of a DT; decision stumps; recursive procedure; effects and goals of attribute splitting; purity and uncertainty; greedy, top-down heuristics; ID3; selection of best attributes based on different criteria; entropy of a random variable; entropy as a measure of purity of a labeled set; information gain; numeric examples.  
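Entropy as a purity measure, and the information gain of a split, sketched on a hypothetical labeled set and a hypothetical binary partition of it:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a labeled set."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical labeled set, split on a binary attribute
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])

# Information gain = parent entropy - weighted child entropies
gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
```

ID3's greedy step picks the attribute whose split maximizes this gain.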
4/1  Decision trees 3: Computing information gains; properties of ID3; overfitting issues; pruning approaches, C4.5; dealing with continuous attributes; thresholding with binary branching; axis-parallel boundaries; regression with decision trees; purity of a set: discrete vs. continuous labels; use of variance / standard deviation as measure of purity; prediction in regression scenarios; (extras) other measures of dispersion / purity in labeled sets (e.g., Gini index); criteria to decide when to stop splitting a partition; examples of regression trees vs. max depth constraint; practice problems; notebook with sklearn methods for decision trees for classification and regression, visualization of the tree and of the decision regions, cross-validated model selection based on multiple input parameters.  PDF Notebook 
Pima dataset  


4/7  Linear models, Support Vector Machines (SVM) 1: General form and properties of linear models for classification and for regression; formulation of a linear model, bias, scalar product; basic geometrical properties; linear decision boundaries; from a linear function to discrete labels for classification; feature transformations and linear models; score and behavior of a linear classifier; notion of classifier margin; SVMs as max-margin classifiers; functional and geometric margin; classifier margin; SVM optimization problem; SVM hard-margin formulation and properties; support vectors.  
4/8  Support Vector Machines 2, Kernelization: Support vectors; formulation of primal and dual problems and their relation; soft-margin SVM for non-linearly separable data; slack variables and penalty factor; support vectors in soft-margin SVM; Support Vector Regression (SVR); general concepts and basic formulation; loss function; ideas behind kernelization; kernelization at work in classification tasks; effect of using different kernels on the resulting nonlinear decision boundaries; methods from scikit-learn; functions for problem generation, visualization, and analysis; hard-margin and soft-margin SVM at work; inspection of support vectors; SVC vs. other classifiers; performance comparison and decision boundaries; SVR at work: linear vs. nonlinear kernels; effect of parameters.  Notebook  
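A sketch of the kernelization effect discussed above: on hypothetical concentric-circle data, a linear kernel cannot separate the classes while an RBF kernel can:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Hypothetical non-linearly-separable data: concentric circles
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

# Linear vs. RBF kernel on the same training data
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_svc = SVC(kernel="rbf", C=1.0).fit(X, y)
rbf_acc = rbf_svc.score(X, y)
n_support = rbf_svc.support_vectors_.shape[0]
```

Inspecting support_vectors_ after fitting connects to the support-vector inspection mentioned in the row.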


4/12  Neural networks 1; LabTest 6, Question Mix: From linear classifiers to the Perceptron model; abstraction of neuron processing, threshold models; Perceptron algorithm: iterative adjustments of weights, use of gradients to minimize quadratic loss; from single perceptrons to multilayer perceptrons (MLP); neural networks as nonlinear parametric models for function approximation; perceptron units with nonlinear activations; hidden and visible layers; feedforward (FF) multilayer architectures; activation functions; hidden layer as feature extractor / feature map; NN and automatic feature learning through the hidden layers.  
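The Perceptron's iterative weight adjustment, sketched in numpy on hypothetical linearly separable data (the margin filter below is an illustration device to guarantee separability, not part of the algorithm):

```python
import numpy as np

# Hypothetical linearly separable data with a safety margin, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 0.5]     # keep points away from the boundary
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias input

# Perceptron rule: on each mistake, nudge the weights toward the example
w = np.zeros(3)
for _ in range(50):                        # fixed number of passes
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:             # misclassified or on the boundary
            w += yi * xi                   # weight update
acc = np.mean(np.sign(Xb @ w) == y)
```

For separable data the rule provably converges; the MLP lectures replace this mistake-driven update with gradient descent on a loss.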
4/14  Neural networks 2: Representation and visualization of the nonlinear function encoded by a network; loss function and NN optimization problem; basic concepts behind gradient descent; full, stochastic, and batch gradient descent; idea of backpropagation for computing the gradients; choice of the activation function and optimization problem; design choices; overfitting issues; general theoretical properties.  
4/15  Neural networks 3: Use of keras for implementing and testing neural networks: creating a sequential layout, compiling a model, testing and visualizing learning evolution, inspecting/visualizing the activations from the layers and relating them to feature extraction; MLP examples with numeric and image data  Notebook  


4/21  LabTest 7: Neural networks  
4/22  Regularization techniques: Explicit feature control and selection for minimizing risk of overfitting; implicit feature control and selection using regularized loss functions; bias-variance of a model; effect of large weights on variance / overfitting; regularized loss function; use of L-norms; Ridge regression; Lasso regression; Ridge and Lasso regression at work; comparative analysis over a number of test scenarios with linear and polynomial features; Elastic Net regression; a real-world regression scenario from data wrangling to model selection.  Notebook  Mart sales dataset  
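Ridge (L2) vs. Lasso (L1) regularization sketched on hypothetical data with two informative features and eight irrelevant ones (not the Mart sales dataset):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical data: 2 informative features plus 8 irrelevant ones
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# L2 shrinks weights; L1 drives irrelevant weights to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
```

The exact zeros produced by Lasso are the "implicit feature selection" effect named in the lecture; Ridge only shrinks the weights.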

Topic  Files  Due Dates 

Homework 1: A Classification task: from data ingestion to model selection and testing  
Homework 2: Data cleaning and model selection for regression tasks  
Homework 3: Feature engineering and model selection for supervised image classification  
Homework 4: Unsupervised image classification, Time series analysis 
Topic  Files  Due Dates 

LabTest 1: Basic concepts of ML, use of Jupyter Notebooks and of Python tools  
LabTest 2: Core ML concepts, Practice with python's tools  
LabTest 3: Regression pipeline  
LabTest 4: Classification pipeline using kNN  
LabTest 5: Numeric feature engineering, feature transformations, correlations  
LabTest 6: Question Mix on decision trees, ensemble methods, support vector machines  
LabTest 7: Neural networks 
Deliverables  Files  Due Dates 

D1 [Report]: Initial Proposal and Dataset  
D2 [Dataset and Notebook]: Final Dataset  
D3 [Software and Notebook]: Query Answering Machine (QuAM)  
D4 [Presentation]: Final report  
Homework is due on Gradescope by the posted deadline. Assignments submitted past the deadline will incur the use of late days.
You have 6 late days in total, but cannot use more than 2 late days per homework. No credit will be given for homework submitted more than 2 days after the due date. After your 6 late days have been used you will receive 20% off for each additional day late.
You can discuss the exercises with your classmates, but you should write up your own solutions. If you find a solution in any source other than the material provided on the course website or the textbook, you must mention the source.
In general, for all types of assignments and tests, CMU's directives for academic integrity apply and must be duly followed.
Name  Email  Hours  Location  

Gianni Di Caro  gdicaro@cmu.edu  By appointment / Zoom  M 1007 
Eduardo FeoFlushing  efeoflus@andrew.cmu.edu  TBD  M 1009 
Mohammad Shahmeer Ahmad 