This course is about the application of machine learning (ML) concepts and models to
solve challenging real-world problems.
The emphasis of the course is on the
methodological and practical aspects of designing, implementing, and using ML
solutions.
The course topics are organized around the notion of an ML process pipeline, which describes
the multi-stage process of building and deploying an ML solution. An ML pipeline
includes:
The process proceeds both forward and backward, iterating over each stage until a
satisfactory solution model is built.
The workflow of an ML pipeline is illustrated
in the figure below (source: Practical ML with Python).
The course tackles all the stages of the ML pipeline, presenting conceptual insights and
providing algorithmic and software tools to select and implement effective ways of
proceeding and dealing with the challenges of the different stages.
The Python ecosystem for data science and ML (pandas, NumPy, Matplotlib,
scikit-learn, Keras, Jupyter notebooks) is introduced and used to retrieve, store, manipulate,
visualize, and perform exploratory analysis of the data.
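As a rough illustration of the kind of data manipulation and exploratory analysis this ecosystem supports, the following sketch builds a small synthetic table with pandas and NumPy and summarizes it. The column names and values are hypothetical and are not taken from any course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "duration_s": rng.normal(loc=200.0, scale=30.0, size=100),
    "genre": rng.choice(["pop", "rock", "jazz"], size=100),
})

# Manipulate: derive a new column from an existing one.
df["duration_min"] = df["duration_s"] / 60.0

# Explore: summary statistics and per-category means.
summary = df["duration_min"].describe()
by_genre = df.groupby("genre")["duration_min"].mean()

print(summary)
print(by_genre)
```

In a notebook, a histogram of the same column could be drawn with Matplotlib; it is left out here to keep the sketch free of display dependencies.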
Course workflow:
The first part of the course addresses the data part of the pipeline, from data
mining and collection, to data filtering and processing, to feature engineering for
different types of data (numeric, categorical, textual, image, temporal).
Next, unsupervised learning (UL) techniques are introduced to further operate on
the data by learning effective representations and performing dimensionality reduction and
data compression. UL techniques include clustering models, Principal Component Analysis
(PCA), and autoencoders.
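A minimal sketch of two of these techniques, assuming scikit-learn is available: PCA reduces synthetic 5-dimensional data to 2 dimensions, and K-Means then clusters the reduced points. The data, the number of components, and the number of clusters are arbitrary choices made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data (an assumption for this sketch): 200 points in 5-D that
# mostly vary along 2 latent directions, plus a small amount of noise.
rng = np.random.default_rng(seed=0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# Dimensionality reduction: project the data onto the top 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()

# Clustering in the reduced space; 3 clusters is an arbitrary choice here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)

print(f"variance explained by 2 components: {explained:.4f}")
print("cluster sizes:", np.bincount(labels))
```

Because the synthetic data was generated from two latent directions, two components should capture almost all of the variance; on real data the number of components is itself something to tune.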
Moving from data to techniques for classification and regression, a number of
supervised ML models are presented, including:
The different models are introduced by a conceptualization of the main
underlying ideas and by providing the algorithmic and software tools necessary to
experiment with the model on different datasets.
A discussion of generalization, bias and variance, and model evaluation
and selection using cross-validation techniques completes the ML pipeline.
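Model selection with cross-validation can be sketched as follows, assuming scikit-learn. The example compares a few values of k for a k-nearest-neighbors classifier on the breast cancer dataset bundled with scikit-learn; the candidate values and the 5-fold setting are illustrative assumptions, not the course's prescribed procedure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Candidate model complexities (arbitrary values chosen for illustration).
candidate_k = [1, 5, 15]
mean_scores = {}
for k in candidate_k:
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on 5 folds
    mean_scores[k] = scores.mean()

# Select the k with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("selected k:", best_k)
```

In a full workflow, a separate test set would be held out before this step, so that the performance of the selected model is estimated on data that played no role in the selection.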
The different techniques are tested and evaluated in problem scenarios from different
domains and based on different data types. Selected problem domains include: natural
language processing, machine vision, financial forecasting, logistics, production
planning, diagnosis and prediction for biomedical data.
Learning Objectives
Students who successfully complete the course will have acquired a general knowledge
of the main concepts and techniques of data science and ML, and will be able
to apply ML in different fields of application.
The course will provide the students with a toolkit of different skills needed to
effectively go through the entire ML pipeline. The students will acquire conceptual and
practical knowledge about:
Course Layout
Prerequisites: having passed either 15-112 or 15-110 with a minimum grade of C.
The basic notions of linear algebra, calculus, and probability theory that are necessary for understanding the formal concepts will be explained assuming little or no previous knowledge.
Grade: 35% Laboratory Assessments, 35% Homework, 30% Project
In addition to the lecture handouts and Python notebooks (which will be made available after each lecture), the instructor will provide additional material during the course to cover specific topics.
A number of (optional) textbooks can be consulted to ease the understanding of the different topics (the relevant chapters will be pointed out by the instructor). These include, but are not restricted to:
Date  Topics  Handouts  References  

1/12  General concepts, ML pipeline: Machine learning for data-driven decision making, extracting information from data, finding structures. Machine learning pipeline: from data sources to final model learning and deployment.  
1/14  General ML scheme, Learning with a teacher (Supervised Learning): Overview of the general ML scheme (from data to features, ML task, ML problem); Supervised Learning as learning with a teacher / supervisor; data labels and error quantification; preparing a labeled dataset: issues, processes; definition of classification and regression tasks; practical examples.  
1/15  ML Tasks: Supervised Learning (Classification, Regression); feature spaces; geometric view; Unsupervised Learning (Finding patterns and relations, clustering, compression, dimensionality reduction); Reinforcement Learning (Sequential decision-making).  


1/21  Optimization problems, Generalization, ML Workflow: Optimization problem for SL; empirical error; model complexity and overfitting; expected generalization (out-of-sample) error; validation sets and estimation of the generalization error; canonical SL problem; SL workflow; building a model in the ML pipeline; promoting generalization.  
1/22  Test 1 + Laboratory: First written Test; Discussion of test solutions; Short introduction to the use of Jupyter notebooks.  Handout  


1/28  From data to models, a complete example: A complete, step-by-step example of how to proceed in practice from the data available in a dataset file to the definition and validation of sound regression models: data ingestion, data preparation, EDA, model hypotheses, loss function, model testing, looping over the models, model selection.  Notebook  Web server traffic dataset  
1/29  Lab Test 1: Regression pipeline  Notebook  Top 50 songs on Spotify'19 dataset  


2/4  Classification, Validation, Model Selection: Summary of kNN properties; scikit-learn and kNN; visualization techniques using meshgrids; model validation and expected generalization error; holdout and cross-validation methods for validation and for model selection; general operational scheme for using a dataset (training, validation, testing, model optimization and selection).  Notebook  Breast cancer dataset  
2/5  Lab Test 2: Classification pipeline  Notebook  Wheat seeds dataset  


2/11  Sports day  
2/12  Data cleaning, Dealing with outliers, Scaling features: Concept of outlier in data; reasons for an outlier; types of outliers (global, contextual, collective); detection of outliers; removing or keeping?; parametric approaches: Gaussians and z-score, univariate vs. multivariate; nonparametric approaches: quantiles, IQR, box plots; need for scaling / normalizing features; standardization (z-transform); scaling to a range; robust scaling; normalization; sklearn methods; example of use of seaborn.  Notebook  


2/18  Feature engineering 1: Notion and importance of feature engineering: from raw data to features that better represent the underlying problem to the predictive ML models; different feature data types; pandas methods to deal with data types; from categorical data to numeric values; numeric data types: values, counts, frequencies, percentages; binarization and rounding; feature transformation and creation; feature transformation to account for interaction; polynomial feature transformation; properties and issues of using polynomial feature maps.  Notebook  Pokemon games dataset, Song views dataset, Item popularity dataset 

2/19  Feature engineering 2: Transformation and use of features using polynomial maps; examples in regression and classification tasks; linearity of the models in prediction and in training; sklearn methods for manipulating polynomial features; feature selection and correlation; weighting feature importance; use of pipelines to automate and store processes.  Notebook  


2/25  Lecture canceled  
2/26  Feature engineering 4, Image data: Gaussian filters; Edge operators: Sobel filter; image gradients, function gradients; Histogram of Oriented Gradients (HOG) as image feature descriptor; step-by-step process for computing the HOG; skimage methods; use of HOG in image classification and object detection; localized feature extraction: SIFT, SURF, ORB; main ideas behind SIFT; example of use with OpenCV; example of use of ORB with skimage.  Notebook  Sliding window image, Colosseum image, Sitting dog image 



3/1  Spring break  
3/3  Spring break  
3/4  Spring break  


3/10  Class postponed  
3/11  Class postponed  


COVID-19 adapted schedule  


3/17  Clustering 3, Clustering and image data: Signal compression and Vector Quantization; relationship between K-Means and VQ; use of K-Means / VQ for image compression in the color space; compression of a dataset of images; cluster centers as image prototypes; data visualization in the RGB space; histogram information; compression ratio; clustering for the NIST image dataset; use of clustering for image segmentation; use of image segmentation; phases of image segmentation; examples.  Notebook  
3/18  Clustering, discussion of homework  


3/24  Regression with linear models, feature transformation, model selection: Linear regression with nonlinear feature transformations; example with a polynomial function corrupted by Gaussian noise; analysis of the results, weights of the features; feature selection and search for the best model complexity; removing unnecessary features; cross-validated parameter search; explicit vs. implicit model selection; notion of regularization.  Notebook  


3/31  Decision trees 2: Construction of a DT; decision stumps; recursive procedure; effects and goals of attribute splitting; purity and uncertainty; greedy, top-down heuristics; ID3; selection of best attributes based on different criteria; entropy of a random variable; entropy as a measure of purity of a labeled set; information gain; numeric examples.  
4/1  Decision trees 3: Computing information gains; properties of ID3; overfitting issues; pruning approaches, C4.5; dealing with continuous attributes; thresholding with binary branching; axis-parallel boundaries; regression with decision trees; purity of a set: discrete vs. continuous labels; use of variance / standard deviation as measure of purity; (extras) other measures of dispersion / purity in labeled sets (e.g., Gini index); criteria to decide to stop splitting a partition; examples of regression trees vs. max depth constraint; practice problems.  


4/7  Practice of Decision Trees: Scikit-learn methods for DTs for classification and regression; examples of use with different datasets; exploring the effects of parameters on generalization; analysis of the results; visualization of trees and of decision boundaries.  Notebook  
4/8  Practice of Bagging, Random Forest, Boosting: Scikit-learn methods for bagging, random forests, boosting; examples of use with different datasets and different feature selection; analysis of performance; inspection of trained models; visualization of decision boundaries; feature importance; cross-validated search for best parameters / models.  Notebook  
4/11  Linear Models, Support Vector Machines 1: General form and properties of linear models for classification and for regression; formulation of a linear model, bias, scalar product; basic geometrical properties; linear decision boundaries; from a linear function to discrete labels for classification; feature transformations and linear models; score and behavior of a linear classifier; notion of classifier margin; SVMs as max-margin classifiers.  


4/14  Support Vector Machines 3, Kernelization: Application of Support Vector Classification (SVC) to different datasets using methods from scikit-learn; functions for problem generation, visualization, and analysis; hard-margin and soft-margin SVM at work; inspection of support vectors; SVC vs. other classifiers; performance comparison and decision boundaries; ideas behind kernelization; kernelization at work in classification tasks; effect of using different kernels on the resulting nonlinear decision boundaries; Support Vector Regression (SVR) at work: linear vs. nonlinear kernels; use of scikit-learn methods for testing and analysis of the results; effect of different parameter settings.  Notebook  
4/15  Lab Test 3: Linear models, DT, Ensemble Models, SVMs, Kernelization  Notebook  Pulsar stars dataset  


4/21  Neural network models 2, Course Wrap-up: Review of the concepts behind neural networks: nonlinear parametric function approximation, perceptron units with nonlinear activations, feedforward architectures, visible and hidden layers, hidden layers and feature maps; multilayer perceptrons (MLP) vs. convolutional neural networks (CNN); representation and visualization of the nonlinear function encoded by a network; loss function and NN optimization problem; basic concepts behind gradient descent; full, stochastic, and batch gradient descent; idea of backpropagation for computing the gradients; choice of the activation function and optimization problem; design choices; overfitting issues; general theoretical properties; use of Keras for implementing and testing neural networks: creating a sequential layout, compiling a model, testing and visualizing learning evolution, inspecting/visualizing the activations from the layers and relating them to feature extraction; examples with numeric and image data, both with MLPs and CNNs.  Neural Networks Notebook  
4/22  Student Project Presentations  

Topic  Files  Due Dates 

Homework 1: A Classification task: from data ingestion to model selection and testing  Handout  Feb 19 
Homework 2: Data cleaning and Feature selection: Practicing with the Python ecosystem for ML and Data Science  Handout  March 8 
Homework 3: Supervised and Unsupervised Image Classification  Handout  March 29 
Deliverables  Files  Due Dates 

D1 [Report]: Initial Proposal and Dataset  Handout  March 23 
D2 [Dataset and Notebook]: Final Dataset  Handout  April 04 
D3 [Software and Notebook]: Query Answering Machine (QuAM)  Handout  April 19, April 23 
D4 [Presentation]: Final report  Handout  April 21-22 
Homework is due on Autolab by the posted deadline. Assignments submitted past the deadline will use up late days.
You have 6 late days in total, but cannot use more than 2 late days per homework. No credit will be given for homework submitted more than 2 days after the due date. After your 6 late days have been used you will receive 20% off for each additional day late.
You can discuss the exercises with your classmates, but you should write up your own solutions. If you find a solution in any source other than the material provided on the course website or the textbook, you must mention the source.
In general, for all types of assignments and tests, CMU's directives for academic integrity apply and must be duly followed.
Name  Email  Hours  Location  

Gianni Di Caro  gdicaro@cmu.edu  Thursdays 4:30-5:30pm + pass by my office at any time ...  M 1007 
Aliaa Essameldin  aeahmed@andrew.cmu.edu  TBD  M 1004 