This course focuses on the application of machine learning (ML) concepts and models to solve challenging real-world problems.
The emphasis is on the methodological and practical aspects of designing, implementing, and utilizing ML solutions.
Course topics are structured around the concept of the ML process pipeline, which outlines the multi-stage process of building and deploying an ML solution. An ML pipeline includes stages such as data acquisition and exploratory analysis, data wrangling, feature engineering, model selection and training, and model evaluation.
The process iterates through these stages, proceeding forward and backward, until a satisfactory model is developed.
The workflow of an ML pipeline is illustrated in the figure below (source: Practical ML with Python).
This course covers all stages of the ML pipeline, offering conceptual insights and providing algorithmic and software tools to address challenges effectively at each stage.
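The pipeline idea has a direct software counterpart. As an illustrative sketch (not part of the course materials, using a toy dataset), scikit-learn's `Pipeline` chains a preprocessing stage and a model into a single estimator:

```python
# Illustrative sketch: the ML-pipeline idea expressed with scikit-learn's
# Pipeline, chaining scaling and a classifier on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),          # preprocessing stage
    ("clf", LogisticRegression(max_iter=200)),  # model stage
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # accuracy on held-out data
```

Fitting the pipeline applies each stage in order, which mirrors the forward-and-backward iteration through pipeline stages described above.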
The Python ecosystem for data science and ML, including tools like pandas, numpy, matplotlib, scikit-learn, keras, pytorch, and Jupyter Notebooks, will be introduced and used to retrieve, store, manipulate, and visualize data, and to perform exploratory analysis.
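To give a flavor of this workflow, here is a minimal exploratory-analysis sketch with numpy and pandas (the dataset below is synthetic and purely illustrative):

```python
# Minimal exploratory-analysis sketch with numpy and pandas.
# The data frame here is synthetic, for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height_cm": rng.normal(170, 10, size=100),
    "weight_kg": rng.normal(70, 12, size=100),
})

print(df.describe())  # summary statistics per column
print(df.corr())      # pairwise correlations
```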
The course introduces each model conceptually and provides algorithmic and software tools necessary for experimentation with different datasets.
Discussions on generalization, bias and variance, model evaluation, and selection using cross-validation techniques complete the ML pipeline.
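Model selection via cross-validation can be sketched in a few lines with scikit-learn (a hedged example on a toy dataset; the course will use its own datasets and models):

```python
# Sketch of model selection with 5-fold cross-validation in scikit-learn:
# compare k-NN classifiers with different k on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")
```

Each candidate model is scored on held-out folds, and the cross-validated score, rather than training accuracy, guides the choice, which is the core idea behind controlling bias and variance.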
Techniques will be tested in diverse problem domains and data types, including: natural language processing, computer vision, financial forecasting, logistics, production planning, and biomedical data analysis.
Learning Objectives
Students who successfully complete the course will have acquired a general knowledge of the main concepts and techniques of data science and ML, and will be adept at using ML in different fields of application.
The course will provide students with a toolkit of skills needed to work effectively through the entire ML pipeline. Students will acquire conceptual and practical knowledge about:
Course Layout
Prerequisite: having passed 15-112 with a C (minimum).
The basic notions of linear algebra, calculus, and probability theory necessary for understanding the formal concepts will be explained, assuming little or no previous knowledge.
Assignments:
Grading scheme:
In addition to the lecture handouts and Python notebooks (made available after each lecture), the instructor will provide additional material during the course to cover specific topics.
A number of (optional) textbooks can be consulted to ease the understanding of the different topics; the relevant chapters will be pointed out by the instructor. These include, but are not restricted to:
| Date | Topics | Handouts | Extra Material |
|---|---|---|---|
| 1/6 | Course introduction, General ML concepts, ML pipeline, ML & Data Science. Software ecosystem, introduction to software tools | pdf (up to slide 18) | |
| 1/8 | More on ML concepts, ML pipeline, ML & Data Science | pdf (from slide 18) | |
| 1/13 | Supervised Learning (SL) flow, model hypotheses, canonical SL problem | | |
| 1/15 | Generalization error, Empirical risk minimization | pdf (up to slide 32) | |
| 1/20 | More on Generalization error, Loss functions | pdf (from slide 32) | |
| 1/22 | Workflow for Regression, Generalization and Overfitting | pdf (up to slide 22) | |
| 1/27 | From data to models with sklearn and pandas: a complete regression example, model selection, loss functions | Notebook | Datasets: |
| 1/29 | Break, no classes | | |
| 2/3 | Scikit-learn methods for OLSR, Cross-Validation, Visualization | Notebook | |
| 2/10 | National Sports Day, no classes | | |
| 2/12 | Data wrangling 1 - Fill in and Drop out: Missing values | Notebook | Datasets: |
| 2/17 | Data Wrangling 3 - Outliers 2 | | |
| 2/19 | Labtest 3: k-NN, Data Wrangling | | |
| 2/22 | Spring break | | |
| 2/24 | Spring break | | |
| 2/26 | Spring break | | |
| 3/3 | Feature engineering 1 - Feature Selection, Correlations, Feature Transformations | Notebook | Datasets: |
| 3/5 | Feature engineering 2 - Feature Transformations (lecture interrupted and cancelled because of war) | Notebook of 3/3 | |
| 3/10 | Feature engineering 3 - Feature extraction methods for different data types | Notebook | Datasets: |
| 3/12 | Feature engineering 4 - Image data 1 | Notebook | Cat image, Dog image, Panda image, Sea image, Lena b/w, Petra image, Imperia image, Palm tree image, Coder survey |
| 3/17 | Feature engineering 6 - Image data 3 | Notebook | |
| 3/17 | Linear Models 1 | | |
| 3/22 | Eid al-Fitr, no classes | | |
| 3/24 | Eid al-Fitr, no classes | | |
| 3/26 | Eid al-Fitr, no classes | | |
| 3/31 | Linear Models 3 - Examples of gradient methods for linear classifiers, SVM 1. Introduction to Project assignment | pdf (SVM 1+2) | Gradient descent notebook |
| 4/2 | Linear Models 4 - SVM 2 | Notebook | |
| 4/7 | Kernelization of Linear Models | Notebook | |
| 4/14 | Neural Networks 2 - Convolutional Neural Networks (CNNs). Project checkpoint | | |
| 4/16 | From Text Sequences to Generative AI: Attention, Transformers, and LLMs | | |
| Topic | Files | Due Dates |
|---|---|---|
| Homework 1: A Classification task: from data ingestion to model selection and testing | | |
| Homework 2: Data wrangling and Feature engineering for regression tasks | | |
| Homework 3: Comparison of ML models for supervised classification | | |
| Homework 4: Neural networks and unsupervised learning for image and time series data | | |
| Topic | Files | Due Dates |
|---|---|---|
| LabTest 1: Fundamentals of ML, NumPy, Jupyter notebooks and Markdown | | |
| LabTest 2: ML pipeline and Python tools for a regression task | | |
| LabTest 3: Classification pipeline using k-NN | | |
| LabTest 4: Missing values, Outliers, Scaling | | |
| LabTest 5: Feature engineering | | |
| LabTest 6: Linear models, Gradient methods | | |
| LabTest 7: Neural networks | | |
| LabTest 8: Unsupervised learning | | |
| LabTest 9: Generative AI | | |
| Deliverables | Files | Due Dates |
|---|---|---|
| D1 [Report]: Initial Proposal and Dataset | | |
| D2 [Dataset and Notebook]: Final Dataset | | |
| D3 [Software and Notebook]: Query Answering Machine (QuAM) | | |
| D4 [Presentation]: Final report | | |
Homework is due on Gradescope by the posted deadline. Assignments submitted past the deadline will incur the use of late days.
You have 2 late days in total, and cannot use more than 1 late day per homework. After your late days have been used, you will receive a 30% deduction for each additional day late; no credit will be given for homework submitted more than 2 days after the due date.
For homework, you can discuss the exercises with your classmates, but you should write up your own solutions. If you find a solution in any source other than the material provided on the course website or the textbook, you must mention the source.
In general, for all types of assignments and tests, CMU's directives for academic integrity apply and must be duly followed.
| Name | Email | Hours | Location |
|---|---|---|---|
| Gianni Di Caro | gdicaro@cmu.edu | By appointment, drop by | M 1007 |
| Yusuf Ansari | ma1@andrew.cmu.edu | TBD | Office |
| Salman Hajizada | shajizad@andrew.cmu.edu | TBD | ARC |