15-488 - Spring 2020

Machine Learning
in a Nutshell




Key Information

Lectures: UT 4:30 - 5:50pm - Room 2052

Labs/Recitations: W 4:30pm - 5:50pm, Room 2062

Units: 9.0

Grading: 35% In-class assessments (Quizzes, Labs), 35% Homework, 30% Project (Two Tasks)

Prerequisites: 15-112 or 15-110 passed with a C or a higher letter grade


Overview

This course is about the application of machine learning (ML) concepts and models to solve challenging real-world problems.
The emphasis of the course is on the methodological and practical aspects of designing, implementing, and using ML solutions.
Course topics are organized around the notion of an ML process pipeline, the multi-stage process of building and deploying an ML solution. An ML pipeline includes:

  • definition of the problem, objectives, and performance metrics;
  • collection and management of relevant operational data;
  • data wrangling (transforming, cleaning, filtering, scaling);
  • feature engineering on the available data (feature selection, feature extraction, feature processing);
  • selection of appropriate ML models based on problem requirements and available data;
  • implementation, application, testing, and evaluation of the selected model(s);
  • deployment of the final ML model.

The process proceeds both forward and backward, iterating each stage until a satisfactory solution model is built.
The workflow of an ML pipeline is illustrated in the figure below (source: Practical ML with Python).

The course tackles all the stages of the ML pipeline, presenting conceptual insights and providing the algorithmic and software tools needed to address the challenges of each stage.
The Python ecosystem for data science and ML (pandas, NumPy, Matplotlib, scikit-learn, Keras, notebooks) is introduced and used to retrieve, store, manipulate, visualize, and perform exploratory analysis of the data.
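As a rough illustration of how the pipeline stages listed above map onto the Python ecosystem, the sketch below chains scaling, model fitting, and held-out evaluation with scikit-learn. The dataset (the built-in iris data), the choice of logistic regression, and the function name `run_pipeline` are assumptions made for the example, not course material.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_pipeline(seed=0):
    """Hypothetical helper: one forward pass through a tiny ML pipeline."""
    # Data collection: a small built-in dataset stands in for real operational data.
    X, y = load_iris(return_X_y=True)

    # Hold out a test set so the evaluation uses data the model never saw.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )

    # Data wrangling (scaling) and the chosen model, chained into one estimator.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

    # Implementation and evaluation: fit on training data, score on held-out data.
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


accuracy = run_pipeline()
print(f"held-out accuracy: {accuracy:.2f}")
```

In a real project each stage would be revisited several times, for example by changing the scaler or the model after inspecting the held-out score.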

Course workflow:

The first part of the course addresses the data part of the pipeline, from data mining and collection, to data filtering and processing, to feature engineering for different types of data (numeric, categorical, textual, image, temporal).
Next, unsupervised learning (UL) techniques are introduced to further operate on the data by learning effective representations and performing dimensionality reduction and data compression. UL techniques include clustering models, Principal Component Analysis (PCA), and autoencoders.
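A minimal sketch of two of these UL techniques, assuming scikit-learn and synthetic data: PCA reduces 10-dimensional points to 2 dimensions, and k-means then clusters the projected points. The function name `compress_and_cluster` and the generated blobs are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def compress_and_cluster(X, n_components=2, n_clusters=3, seed=0):
    """Hypothetical helper: PCA projection followed by k-means clustering."""
    # Dimensionality reduction: project onto the top principal components.
    pca = PCA(n_components=n_components)
    X_low = pca.fit_transform(X)

    # Clustering: group the projected points into n_clusters clusters.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(X_low)
    return X_low, labels, pca.explained_variance_ratio_


# Synthetic data (an assumption): three Gaussian blobs in 10 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(50, 10)) for c in centers])

X_low, labels, ratio = compress_and_cluster(X)
print(X_low.shape, np.unique(labels), round(float(ratio.sum()), 3))
```

The explained variance ratio indicates how much of the original variance survives the compression, which is one way to decide how many components to keep.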
Moving from data to techniques for classification and regression, a number of supervised ML models are presented, including:

  • Decision Trees,
  • k-Nearest Neighbors,
  • Naive Bayes,
  • Logistic Regression,
  • Support Vector Machines (SVMs),
  • Least Squares Linear Regression,
  • Regularization,
  • Feature maps,
  • Kernelization,
  • Deep / Convolutional Neural Networks.

The different models are introduced by presenting the main underlying ideas and by providing the algorithmic and software tools necessary to experiment with each model on different datasets.
A discussion of generalization, bias and variance, and model evaluation and selection using cross-validation techniques completes the ML pipeline.
The different techniques are tested and evaluated in problem scenarios from different domains and based on different data types. Selected problem domains include: natural language processing, machine vision, financial forecasting, logistics, production planning, diagnosis and prediction for bio-medical data.
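The model evaluation and selection step mentioned above can be sketched as follows, assuming scikit-learn: two candidate classifiers are compared by their mean k-fold cross-validation accuracy. The dataset (built-in breast cancer data), the two candidates, and the helper name `compare_models` are assumptions for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, folds=5):
    """Hypothetical helper: mean cross-validated accuracy per candidate model."""
    candidates = {
        "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
        # Distance-based models are sensitive to feature scale, so scale first.
        "5-nearest neighbors": make_pipeline(
            StandardScaler(), KNeighborsClassifier(n_neighbors=5)
        ),
    }
    scores = {}
    for name, model in candidates.items():
        # Each fold is held out once while the model trains on the others.
        fold_scores = cross_val_score(model, X, y, cv=folds)
        scores[name] = float(fold_scores.mean())
    return scores


X, y = load_breast_cancer(return_X_y=True)
scores = compare_models(X, y)
best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```

Placing the scaler inside the pipeline means it is refit on the training folds only, so no information from the held-out fold leaks into preprocessing.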

Learning Objectives

Students who successfully complete the course will have acquired a general knowledge of the main concepts and techniques of data science and ML, and will be able to apply ML in different fields of application.
The course will provide the students with a toolkit of different skills needed to effectively go through the entire ML pipeline. The students will acquire conceptual and practical knowledge about:

  • collecting, handling, exploring, and wrangling data in different formats and originating from different sources;
  • selecting, extracting and engineering data features using both manual and learning techniques;
  • identifying the most appropriate ML techniques for the problem and the data at hand;
  • implementing and using a set of core ML models;
  • testing and evaluating ML models;
  • using the Python ecosystem for ML and data science;
  • applying ML to problems from a range of different application domains.

Course Layout

  • The course is based on two lectures per week where the different problems, solution models, and algorithms are formally introduced. The introduction of a new concept is always accompanied by the presentation of practical use cases.

  • Each week, a third class is used as a laboratory or for recitation. In laboratory classes, students complete graded assignments that require both hands-on programming and conceptual understanding of the course subjects. Recitation classes review the concepts introduced in the lectures and demonstrate the use of the different software tools.

Prerequisites

Students must have passed either 15-112 or 15-110 with a grade of C or higher.

The basic notions of linear algebra, calculus, and probability theory that are necessary for understanding the formal concepts will be explained assuming little or no previous knowledge.

Assignments and Grading

  • Laboratory assessments: Students take bi-weekly laboratory classes where they have to answer questions involving both hands-on programming and conceptual aspects.
  • Homework: Outside of the classroom, students practice with bi-weekly homework consisting of programming tasks integrated into presentation notebooks. In the homework, students implement and experiment with the different algorithmic solutions, work with different types of data, answer conceptual questions, and learn how to present material and results combining text, data, images, and code.
  • Project: Students have to complete a project that addresses the full ML pipeline and iteration cycle. Project work is staged in three sub-projects and is reported as a notebook. The first sub-project deals with data, the second adds classification techniques and model evaluation and selection, and the third adds regression models. The project is done in small groups and the results are presented at the end of the course.

Grade: 35% Laboratory Assessments, 35% Homework, 30% Project
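As a worked example of the weighting above, the sketch below computes a final percentage as a weighted sum of the three components. It assumes each component is already expressed on a 0-100 scale; the function name `final_grade` and the sample scores are hypothetical.

```python
# Weights taken from the grade breakdown above.
WEIGHTS = {"laboratory": 0.35, "homework": 0.35, "project": 0.30}


def final_grade(scores):
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores, for illustration only.
example = {"laboratory": 80.0, "homework": 90.0, "project": 70.0}
grade = final_grade(example)
print(round(grade, 1))  # 0.35*80 + 0.35*90 + 0.30*70, approximately 80.5
```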

Readings

In addition to the lecture handouts and Python notebooks (which will be made available after each lecture), the instructor will provide additional material during the course to cover specific parts of the course.

A number of (optional) textbooks can be consulted to ease the understanding of the different topics (the relevant chapters will be pointed out by the instructor). These include, but are not restricted to:

  • Machine Learning, Tom Mitchell (in the library)
  • Machine Learning: The Art and Science of Algorithms that Make Sense of Data, P. Flach
  • A Course in Machine Learning, Hal Daumé III, available online

Schedule (subject to change)



Date Topics Handouts References
1/12 General concepts, ML pipeline: Machine learning for data-driven decision making, extracting information from data, finding structures. Machine learning pipeline: from data sources to final model learning and deployment. pdf
1/14 General ML scheme, Learning with a teacher (Supervised Learning): Overview of the general ML scheme (from data to features, ML task, ML problem); Supervised Learning as learning with a teacher / supervisor; data labels and error quantification; preparing a labeled dataset: issues, processes; definition of classification and regression tasks; practical examples. pdf
1/15 ML Tasks: Supervised Learning (Classification, Regression); feature spaces; geometric view; Unsupervised Learning (Finding patterns and relations, clustering, compression, dimensionality reduction); Reinforcement Learning (Sequential decision-making). pdf

1/19 Model hypothesis, Loss functions: A complete example of SL task flow; design choice: hypothesis class, parametric model functions; design choice: how to evaluate a model, loss functions; examples and properties of basic loss functions for classification and for regression. pdf
1/21 Optimization problems, Generalization, ML Workflow: Optimization problem for SL; empirical error; model complexity and overfitting; expected generalization (out-of-sample) error; validation sets and estimation of the generalization error; canonical SL problem; SL workflow; building a model in the ML pipeline; promoting generalization. pdf
1/22 Laboratory - Complete example with Python ecosystem for the ML pipeline: Formats (CSV, JSON, XML) and sources of data, accessing and finding data on the Internet (HTML/HTTP, Web scraping, Kaggle datasets); a first complete example with data; introduction to NumPy, scikit-learn, pandas, Matplotlib, Jupyter notebooks.

1/26 Data description and Exploratory Data Analysis: Types of data (numerical, categorical, textual, temporal), use of python tools for visualization and statistical exploratory analysis
1/28 Data processing: Data cleaning and filtering, dealing with missing values, duplicates, and outliers, data scaling, aggregation and summarization
1/29 Laboratory:

2/2 Feature extraction, selection and engineering 1: From raw data to features that better represent the underlying problem to the predictive ML models (numeric, categorical, temporal data)
2/4 Feature extraction, selection and engineering 2: From raw data to features that better represent the underlying problem to the predictive ML models (numeric, categorical, temporal data)
2/5 Laboratory:

2/9 Unsupervised learning for representation learning and finding structure in data: Clustering models
2/11 Unsupervised learning for representation learning and finding structure in data: Clustering at work.
2/12 Laboratory:

2/16 Unsupervised learning models for data compression and dimensionality reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA)
2/18 Unsupervised learning models for data compression and dimensionality reduction: Vector quantization, Autoencoders
2/19 Laboratory:

2/23 Classification with human-readable models: Decision trees
2/25 Classification with human-readable models: Decision trees at work in the ML pipeline
2/26 Laboratory:
3/1 Spring break
3/3 Spring break
3/4 Spring break
3/8 Classification with distance-based methods: Nearest neighbors
3/10 Classification with distance-based methods: Nearest neighbors at work in the ML pipeline
3/11 Laboratory:

3/15 Model evaluation and selection: Generalization, bias-variance
3/17 Model evaluation and selection: Model selection techniques, cross-validation
3/18 Laboratory:

3/22 Classification with parametric models: Linear models, review of optimization notions
3/24 Classification with parametric models: Support Vector Machines (SVMs)
3/25 Laboratory:

3/29 Classification with probabilistic models: Review of probability notions, Bayes rule, Naive Bayes
3/31 Classification with probabilistic models: Logistic regression
4/1 Laboratory:

4/5 Regression with linear models: Least squares, regularization concepts
4/7 Regression with non-linear models: Feature maps, kernelization
4/8 Laboratory:

4/12 Deep learning models: Neural networks as non-linear supervised models, Convolutional neural networks (CNNs)
4/14 Deep learning models: CNN case study with image/text data
4/15 Laboratory:

4/19 Course Wrap-up, Q&A
4/21 Student Project Presentations I
4/22 Student Project Presentations II

Homework Assignments

Topic Files Due Dates
Homework 1: -
Homework 2: -
Homework 4: -
Homework 5: -


Policies for Assignments

  • Homework is due on Autolab by the posted deadline. Assignments submitted past the deadline will incur the use of late days.

  • You have 6 late days in total, but cannot use more than 2 late days per homework. No credit will be given for homework submitted more than 2 days after the due date. After your 6 late days have been used you will receive 20% off for each additional day late.

  • You can discuss the exercises with your classmates, but you should write up your own solutions. If you find a solution in any source other than the material provided on the course website or the textbook, you must mention the source.

  • In general, for all types of assignments and tests, CMU's directives for academic integrity apply and must be duly followed.
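The late-day rule stated in the policies above can be read as a small calculation. The sketch below encodes one possible interpretation and is not an official statement of the policy: it assumes that a submission more than 2 days late receives no credit, that late days are drawn from the remaining budget first, and that each late day not covered by the budget costs 20% of the score. The function name `apply_late_policy` and its parameters are hypothetical.

```python
TOTAL_LATE_DAYS = 6               # late-day budget for the whole course
MAX_LATE_DAYS_PER_HOMEWORK = 2    # no credit beyond this many days late
PENALTY_PER_UNCOVERED_DAY = 0.20  # assumed: 20% of the score per uncovered day


def apply_late_policy(raw_score, days_late, late_days_left):
    """Return (final_score, late_days_left_afterwards) under the assumed reading."""
    if days_late <= 0:
        return raw_score, late_days_left
    if days_late > MAX_LATE_DAYS_PER_HOMEWORK:
        # Too late: no credit, and no late days are consumed.
        return 0.0, late_days_left
    covered = min(days_late, late_days_left)  # days paid for from the budget
    uncovered = days_late - covered           # days that incur the penalty
    factor = max(0.0, 1.0 - PENALTY_PER_UNCOVERED_DAY * uncovered)
    return raw_score * factor, late_days_left - covered


# One day late with the full budget available: no penalty, one late day used.
print(apply_late_policy(100.0, 1, TOTAL_LATE_DAYS))
# Two days late with a single late day left: one day is penalized.
print(apply_late_policy(100.0, 2, 1))
```

Students should treat the written policy, not this sketch, as authoritative wherever the two could differ.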

Office Hours

Name Email Hours Location
Gianni Di Caro gdicaro@cmu.edu Thursdays 4:30-5:30pm + pass by my office at any time ... M 1007
Aliaa Essameldin aeahmed@andrew.cmu.edu TBD M 1004