This course is about the application of machine learning (ML) concepts and models to
solve challenging real-world problems.
The emphasis of the course is on the
methodological and practical aspects of designing, implementing, and using ML
solutions.
Course topics develop around the notion of the ML process pipeline, which identifies
the multi-stage process of building and deploying an ML solution. An ML pipeline
includes stages for data collection and preparation, feature engineering, model
training, evaluation, and deployment.
The process proceeds both forward and backward, iterating over the stages until a
satisfactory model is built.
The workflow of an ML pipeline is illustrated
in the figure below (source: Practical ML with Python).
The course tackles all the stages of the ML pipeline, presenting conceptual insights and
providing algorithmic and software tools to select and implement effective ways of
proceeding and dealing with the challenges of the different stages.
The Python ecosystem for data science and ML (pandas, NumPy, matplotlib,
scikit-learn, Keras, Jupyter notebooks) is introduced and used to retrieve, store,
manipulate, visualize, and perform exploratory analysis of the data.
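As a small illustration of this toolset, the sketch below builds a toy DataFrame in place of a real data source (the column names and values are invented for the example; a real dataset would be loaded with `pd.read_csv`):

```python
import numpy as np
import pandas as pd

# Toy dataset standing in for a real data source (hypothetical columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=100),
    "income": rng.normal(50_000, 12_000, size=100),
    "segment": rng.choice(["A", "B", "C"], size=100),
})

# Typical first steps of exploratory analysis
print(df.head())                      # inspect the first rows
print(df.describe())                  # summary statistics of numeric columns
print(df["segment"].value_counts())   # distribution of a categorical feature
```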
Course workflow:
The first part of the course addresses the data part of the pipeline, from data
mining and collection, to data filtering and processing, to feature engineering for
different types of data (numeric, categorical, textual, image, temporal).
Next, unsupervised learning (UL) techniques are introduced to further operate on
the data by learning effective representations, performing dimensionality reduction,
and compressing the data. UL techniques include clustering models, Principal Component
Analysis (PCA), and autoencoders.
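As a hedged sketch of two of these UL techniques working together (the Iris dataset and the choice of 2 components / 3 clusters are assumptions made for the example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_iris().data                      # 150 samples, 4 features

# Dimensionality reduction: project onto the top 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Clustering in the reduced space
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X_2d)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("cluster sizes:", np.bincount(labels))
```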
Moving from pure data to techniques for classification and regression, a number of
supervised ML models are presented.
Towards the end of the course we give an overview of recommender systems, a notable example of an application that may require integrating many of the concepts and techniques introduced in the course.
Finally, we consider Transformer architectures in deep learning and their use for building advanced language models such as ChatGPT. In particular, we make use of the OpenAI API to interact with ChatGPT and to build and fine-tune an NLP model.
Each model is introduced by conceptualizing its main underlying ideas and by
providing the algorithmic and software tools necessary to experiment with it
on different datasets.
A discussion of generalization, bias and variance, and model evaluation and
selection using cross-validation techniques completes the ML pipeline.
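To make the evaluation step concrete, here is a minimal cross-validation sketch (the breast-cancer dataset bundled with scikit-learn and the choice of logistic regression are stand-ins for the example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the validation set
model = LogisticRegression(max_iter=5000)
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```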
The different techniques are tested and evaluated in problem scenarios from different
domains and based on different data types. Selected problem domains include: natural
language processing, machine vision, financial forecasting, logistics, production
planning, diagnosis and prediction for biomedical data.
Learning Objectives
Students who successfully complete the course will have acquired a general knowledge
of the main concepts and techniques of data science and ML, and will be able
to apply ML in different fields.
The course will provide the students with a toolkit of different skills needed to
effectively go through the entire ML pipeline. The students will acquire conceptual and
practical knowledge about:
Course Layout
Prerequisite: having passed 15-112 with a C (minimum).
The basic notions of linear algebra, calculus, and probability theory necessary for understanding the formal concepts will be explained, assuming little or no previous knowledge.
Assignments:
Grading scheme:
In addition to the lecture handouts and Python notebooks (made available after each lecture), the instructor will provide additional material during the course to cover specific topics.
A number of optional textbooks can be consulted to ease the understanding of the different topics (the relevant chapters will be pointed out by the teacher); these include, but are not restricted to:
Date  Topics  Handouts  References  

1/7  General concepts, ML pipeline: Machine learning for data-driven decision making, extracting information from data, finding structures; basic ML concepts and applications; ML pipeline: from data sources to final model learning and deployment; course information and organization  
1/9  ML tasks and application problems: Taxonomy of ML tasks and problems; Supervised Learning (Classification, Regression); feature spaces; geometric view; workflow of SL; Unsupervised Learning (finding patterns and relations, clustering, compression, dimensionality reduction); Reinforcement Learning (sequential decision-making); advantages and issues of learning with a teacher / supervisor; data labels and error quantification; preparing a labeled dataset.  
1/11  SL task flow, model hypotheses, canonical SL problem: A complete example of SL task flow: problem definition, design choices, features, hypothesis class, loss function; empirical error; canonical supervised ML problem.  


1/16  Workflow for regression, Generalization and Overfitting, Model selection and Cross-Validation (CV): Workflow for a regression example; overfitting vs. generalization; model selection and validation sets to minimize expected generalization errors; holdout method; cross-validation methods: k-fold CV, leave-one-out CV, random subsampling; design issues in CV; model selection; model selection using CV.  
1/18  LabTest 1: Fundamentals of ML, NumPy, Jupyter notebooks and Markdown  


1/23  Regression with linear models, OLS, CV at work: Regression and linear regression concepts and models; linearity of models in prediction and in training; role of feature weights; Ordinary Least Squares (OLS); analytic solution of OLS using linearity in coefficients, partial derivatives and linear system of equations; matrix form; solution using numpy matrix manipulation vs. sklearn methods; issues with matrix inversion; numeric examples with generation of instances; use of cross-validation and sklearn methods for model validation.  Notebook  
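The lecture's comparison of the analytic OLS solution with scikit-learn can be sketched as follows (synthetic data; `np.linalg.solve` is used instead of an explicit inverse, one way to sidestep the matrix-inversion issues mentioned above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

# Analytic OLS: w = (X'X)^{-1} X'y, with a column of ones for the bias term
Xb = np.hstack([np.ones((200, 1)), X])
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)   # solve() instead of inverting X'X

# The same fit with scikit-learn
lr = LinearRegression().fit(X, y)
print("analytic :", w)
print("sklearn  :", lr.intercept_, lr.coef_)
```

Both routes minimize the same quadratic loss, so the coefficients agree up to numerical precision.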
1/25  LabTest 2: ML pipeline and python tools for a regression task  


1/30  kNN, model selection, measures of performance: Model validation and expected generalization error; holdout and cross-validation methods for validation and for model selection; general operational scheme for using a dataset (training, validation, testing, model optimization and selection); scikit-learn methods and iterators for model selection, dataset splitting, CV-based grid search for parameter setting; recall, precision, F1, confusion matrix  Notebook  
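A minimal sketch of the CV-based grid search and performance measures from this lecture (dataset and parameter grid are assumptions made for the example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# CV-based grid search over the number of neighbors
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
grid.fit(X_tr, y_tr)

print("best k:", grid.best_params_["n_neighbors"])
print(classification_report(y_te, grid.predict(X_te)))  # precision, recall, F1
```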
2/1  Break, no classes  


2/6  Data cleaning, Dealing with Outliers: Concept of outlier in data; reasons for an outlier; types of outliers (global, contextual, collective); detection of outliers; removing or keeping? parametric approaches: Gaussians and z-score, univariate vs. multivariate; non-parametric approaches: quantiles, IQR, box plots; sklearn methods.  Notebook  
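The parametric (z-score) and non-parametric (IQR) detection approaches can be sketched as follows, on synthetic data with two planted outliers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 2, size=500), [40.0, -25.0]])  # planted outliers

# Parametric: z-score under a Gaussian assumption
z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 3]

# Non-parametric: interquartile range (IQR) rule
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = x[(x < lo) | (x > hi)]

print("z-score flags:", z_outliers)
print("IQR flags    :", iqr_outliers)
```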
2/8  LabTest 3: A classification task with kNN  


2/13  National Sports Day, no classes  
2/15  Feature Engineering 1, Feature selection, Feature transformations: Correlations among data and feature selection; correlation coefficients; correlation matrix and heatmap visualization; transformation and use of features using polynomial maps; examples in regression and classification tasks; linearity of the models in prediction and in training; sklearn methods for manipulating polynomial features; weighting feature importance; use of pipelines to automate and store processes.  Notebook  
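A sketch of polynomial feature maps inside a pipeline, on a synthetic quadratic regression problem (data and degree chosen for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + 1 + rng.normal(scale=0.1, size=200)

# Pipeline: polynomial feature map followed by a linear model;
# the model stays linear in the transformed features
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```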


2/20  Feature Engineering 3, Time series data: Examples and properties of time series; time series and supervised learning: the need for defining good features; date time features, pandas methods; time-lagged features, notion of autocorrelation, pandas methods; features based on rolling window statistics; features based on expanding window statistics; linear regression on engineered features; cross-validation for time series, sklearn methods for splitting and cross-validating the dataset; components of a time series: level, trend, seasonality, random residuals; importance of stationarity; subtraction of components to get stationarity; techniques to check stationarity; STL method from statsmodels; statistical tests.  Notebook 
Daily min temperatures Train passengers Airline passengers 
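The time-series feature engineering and order-preserving cross-validation from this lecture can be sketched on a toy daily series (a synthetic sinusoid stands in for the real datasets linked above):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# A toy daily series standing in for a real dataset
idx = pd.date_range("2020-01-01", periods=120, freq="D")
s = pd.Series(np.sin(np.arange(120) / 7.0), index=idx)

# Supervised-learning features engineered from the series
df = pd.DataFrame({"y": s})
df["lag_1"] = s.shift(1)                 # time-lagged feature
df["roll_mean_7"] = s.rolling(7).mean()  # rolling-window statistic
df["month"] = df.index.month             # date/time feature
df = df.dropna()

# Order-preserving cross-validation splits
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(df):
    assert train_idx.max() < test_idx.min()  # training always precedes testing
```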

2/22  Homework in class  


2/25  Spring break  
2/27  Spring break  
2/29  Spring break  


3/3  Feature engineering 4, Image data 1: Image data and task-relevant features; properties of image data; RGB encoding; skimage methods for handling and processing images; raw pixel intensities as features; grayscale transformation; feature extraction by binning, properties and issues using histograms and pie charts; features extracted by aggregation statistics.  Notebook  Cat image Dog image Panda image Sea image Coder survey dataset 
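The grayscale transformation and binning-based features can be sketched with NumPy alone (a random array stands in for a real image; the luminance weights follow the common ITU-R 709 convention also used by skimage):

```python
import numpy as np

# A synthetic 64x64 RGB image standing in for a real photo
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3)).astype(float)

# Grayscale transformation (standard luminance weights)
gray = 0.2125 * img[..., 0] + 0.7154 * img[..., 1] + 0.0721 * img[..., 2]

# Feature extraction by binning: a 16-bin intensity histogram
hist, _ = np.histogram(gray, bins=16, range=(0, 255))
features = hist / hist.sum()             # normalized histogram as feature vector

# Aggregate statistics as additional features
stats = [gray.mean(), gray.std(), gray.min(), gray.max()]
print("histogram features:", np.round(features, 3))
print("aggregate features:", np.round(stats, 1))
```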

3/5  Feature engineering 5, Image data 2: Features from edge detection; Canny edge detector; Gaussian filters; Sobel filter as edge operator; convolution operator; examples of use of convolution filters for image processing and feature extraction: blurring, smoothing, embossing, sharpening, edge extraction.  Notebook  
3/7  Feature engineering 6, Image data 3: Filter algebra; nonlinear filters; image gradients, function gradients; Histogram of Oriented Gradients (HOG) as image feature descriptor; step-by-step process for computing the HOG; skimage methods; use of HOG in image classification and object detection; localized feature extraction: SIFT, SURF, ORB; main ideas behind SIFT; example of use with OpenCV; example of use of ORB with skimage.  Notebook  Sliding window image Colosseum image Sitting dog image 



3/12  Decision trees 2: properties of ID3; overfitting issues; pruning approaches, C4.5; dealing with continuous attributes; thresholding with binary branching; axis-parallel boundaries; regression with decision trees; purity of a set: discrete vs. continuous labels; use of variance / standard deviation as measure of purity; (extras) other measures of dispersion / purity in labeled sets (e.g., Gini index); criteria to decide to stop splitting a partition; examples of regression trees vs. max depth constraint; sklearn code.  
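The effect of the max-depth constraint on regression trees can be sketched on synthetic data (the sine-plus-noise target is an assumption made for the example):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Depth controls model complexity: deeper trees fit (and overfit) more
for depth in (2, 5, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    print("max_depth =", depth, " training R^2 =", round(tree.score(X, y), 3))
```

The unconstrained tree memorizes the training set (training R^2 of 1.0), which is exactly the overfitting risk the depth constraint is meant to control.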
3/14  LabTest 4: Decision Trees  


3/19  Linear Models for Classification, Definitions, Properties, Training: General form and properties of linear models for classification and for regression; formulation of a linear model, bias, scalar product; basic geometrical properties; linear decision boundaries; from a linear function to discrete labels for classification; feature transformations and linear models; score and behavior of a linear classifier; functional and geometric margin; loss functions for classification; optimization problem and its challenges  
3/21  Gradient methods for model training: Solving optimization problems for linear models using different loss functions; gradients and systems of equations; difficulty of optimization problems; properties of gradients; gradient descent/ascent as a numeric iterative approach for finding minima/maxima; properties of GD; role of step size; batch-mode GD; sum functions and stochastic/incremental GD; examples of use and sklearn code (in Notebook)  Notebook  
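Batch-mode gradient descent on the mean squared error can be sketched in a few lines of NumPy (synthetic data; the step size and iteration count are choices made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])  # bias + 1 feature
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.05, size=100)

# Batch gradient descent on the mean squared error
w = np.zeros(2)
eta = 0.1                                  # step size
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ w - y)  # gradient of the MSE w.r.t. w
    w -= eta * grad                        # move against the gradient

print("estimated weights:", w)             # close to the true [2.0, -3.0]
```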


3/26  Support Vector Machines (SVM), Kernelization: Notion of classifier margin; SVMs as max-margin classifiers. SVM optimization problem; SVM hard-margin formulation and properties; support vectors; formulation of primal and dual problems; solution of dual; soft-margin SVM for non-linearly separable data; slack variables and penalty factor; support vectors in soft-margin SVM; Hinge loss and regularization; Support Vector Regression (SVR): general concepts and basic formulation, loss function, example of application; Kernel functions and inner products; kernelization of SVM problem formulation; kernel trick; examples of kernel functions; properties of kernels; examples of applications; kernelization of Logistic regression.  
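The payoff of the kernel trick can be sketched on a dataset that is not linearly separable in the input space (the concentric-circles dataset and the kernel choices are assumptions made for the example):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# A dataset that is not linearly separable in the input space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)    # kernel trick: implicit feature map

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy   :", rbf_svm.score(X, y))
print("support vectors (RBF) :", rbf_svm.n_support_.sum())
```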
3/28  LabTest 5: Ensemble methods, Gradient methods, Logistic Regression, SVMs  


4/2  Neural Networks 1, Perceptron, Multi-Layer Perceptrons (MLPs): From linear classifiers to the Perceptron model; abstraction of neuron processing, threshold models; Perceptron algorithm: iterative adjustment of weights, use of gradients to minimize quadratic loss; from single perceptrons to multi-layer perceptrons (MLPs); neural networks as nonlinear parametric models for function approximation; perceptron units with nonlinear activations; hidden and visible layers; feedforward (FF) multilayer architectures; activation functions; hidden layer as feature extractor / feature map; NN and automatic feature learning through the hidden layers; loss function and NN optimization problem; stochastic gradient descent approaches; idea of backpropagation for computing the gradients; choice of the activation function and optimization problem.  
4/4  Neural Networks 2, Design and Training, Convolutional Neural Networks (CNNs): Design choices; overfitting issues; general theoretical properties; use of Keras for implementing and testing neural networks: creating a sequential layout, compile a model, test and visualize learning evolution, inspect/visualize the activations from the layers and relate to feature extractions; epochs; softmax; MLP examples with numeric and image data; MLPs vs. Convolutional Neural Networks (CNNs); issues with fully connected networks; core reasons behind the success of CNNs; recap of convolution operator and SIFT feature extraction in images; role and rationale behind convolutional and pooling layers; locality of processing, hierarchical dimensionality reduction; number of trainable parameters; typical CNN architectures: sequence of (convolution filters, activations, pooling); constructing convolutional layers; role of stride; feature maps and filter banks; constructing pooling layers; max pooling; softmax output layer; examples; visualization of features extracted at the different layers; notes about optimization in CNNs and transfer learning; keras for CNNs. 
Notebook MLPs Notebook CNNs Pima Indians dataset 



4/7  Eid al-Fitr, no classes  
4/9  Eid al-Fitr, no classes  
4/11  Eid al-Fitr, no classes  


4/14  Break, no classes  
4/16  Regularization techniques: Control of model complexity, minimize risk of overfitting: explicit feature selection; noise and limitations during training; use of regularized loss functions (implicit control of weight magnitude); bias-variance of a model; effect of large weights on variance / overfitting; loss function with additive regularization terms; use of L-norms; Ridge regression; Lasso regression; Ridge regression and Lasso regression at work; comparative analysis over a number of test scenarios with linear and polynomial features; Elastic Net regression; real-world regression scenario from data wrangling to model selection; study of the effect of the λ parameter for training and testing errors.  Notebook  Mart sales dataset  
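The contrast between L2 (Ridge) and L1 (Lasso) penalties can be sketched on synthetic data where only two of ten features are relevant (data and regularization strengths are chosen for the example):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)  # 2 relevant features

# L2 penalty shrinks all weights; L1 penalty drives many to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge nonzero coefficients:", np.sum(np.abs(ridge.coef_) > 1e-6))
print("lasso nonzero coefficients:", np.sum(np.abs(lasso.coef_) > 1e-6))
```

The sparsity induced by the L1 norm is what makes Lasso usable as an implicit feature-selection mechanism.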
4/18  LabTest 6: Clustering, Neural Networks, Regularization  


4/23  Course review  
4/25  Final Exam  

Topic  Files  Due Dates 

Homework 1: A Classification task: from data ingestion to model selection and testing  
Homework 2: Data cleaning and model selection for regression tasks  
Homework 3: Feature engineering and model selection for supervised image classification  
Homework 4: Unsupervised image classification, Time series analysis 
Topic  Files  Due Dates 

LabTest 1: Fundamentals of ML, NumPy, Jupyter notebooks and Markdown  
LabTest 2: ML pipeline and python tools for a regression task  
LabTest 3: Classification pipeline using kNN  
LabTest 4: Data cleaning: missing data and outliers  
LabTest 5: Scaling, Feature transformations, Feature engineering of numeric data  
LabTest 6: Decision Trees  
LabTest 7: Ensemble methods, Gradient methods, Logistic Regression, SVMs  
LabTest 8: Clustering, Neural Networks, Regularization  
LabTest 9: Recommender systems and ChatGPT 
Deliverables  Files  Due Dates 

D1 [Report]: Initial Proposal and Dataset  
D2 [Dataset and Notebook]: Final Dataset  
D3 [Software and Notebook]: Query Answering Machine (QuAM)  
D4 [Presentation]: Final report  
Homework is due on Gradescope by the posted deadline. Assignments submitted past the deadline will incur the use of late days.
You have 3 late days in total, but cannot use more than 1 late day per homework. No credit will be given for homework submitted more than 2 days after the due date. After your 3 late days have been used you will receive 30% off for each additional day late.
For homework, you can discuss the exercises with your classmates, but you should write up your own solutions. If you find a solution in any source other than the material provided on the course website or the textbook, you must mention the source.
In general, for all types of assignments and tests, CMU's directives for academic integrity apply and must be duly followed.
Name  Email  Hours  Location  

Gianni Di Caro  gdicaro@cmu.edu  By appointment, drop by  M 1007 
Zhijie Xu  zhijiex@andrew.cmu.edu  TBD  
Devang Acharya  devanga@andrew.cmu.edu  TBD 