{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "\n", "**15-448: Machine Learning in a Nutshell**, *CMU-Qatar* Spring'20\n", "\n", "**Gianni A. Di Caro**, www.giannidicaro.com\n", "\n", "Lab Test \n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "# Lab Test 1: Practice with the ML pipeline and Python tools for a regression task\n", "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the handout you will find the file `top50_spotify19.csv`, which contains the top 50 most-listened songs in the world on Spotify. The dataset includes several attributes of the songs.\n", "\n", "Download the file and store it in a folder on your computer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read and inspect the data" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [], "source": [ "# import pandas, numpy, scipy\n" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [], "source": [ "# read the file with the dataset into a pandas DataFrame\n", "# Note: you'll need to pass the argument encoding='latin1' to the pandas read_csv() function\n" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [], "source": [ "# check your data: print them out, get the names of the column labels, and so on\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check for non-numeric entries" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [], "source": [ "# check the data for the presence of possible NaN or missing entries \n", "# use the pandas method .isnull() \n", "# don't know what .isnull() does? 
-> Use help()!\n" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [], "source": [ "# show how many NaN entries are in the dataset: use isnull() in combination with np.sum()\n" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [], "source": [ "# the dataset contains a set of columns with string data and a set of columns with numeric data\n", "# as an exercise for practicing with indexes and ranges, select the columns with numeric data \n", "# and store them in a numpy array \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory Data Analysis by Visualization" ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [], "source": [ "# let's start doing some EDA by visualizing the data\n", "# For this purpose, import matplotlib and pyplot, and set the resolution to 100 dpi\n" ] }, { "cell_type": "code", "execution_count": 155, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# plot a histogram of the values of the 'popularity' attribute\n", "# remember to add meaningful axis labels and a plot title\n", "# a histogram is plotted using the hist() function of matplotlib's pyplot\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scientific / practical question: predictor for popularity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "The **scientific question** we want to address is whether there is any meaningful relationship between any of the numeric attributes and the **popularity** index of a song.\n", "\n", "To this end, you first have to visually explore the relationship between a few selected attributes and popularity by creating and analyzing different *scatter 
plots*.\n", "\n", "Afterwards, you have to select one attribute and try to fit polynomials of different degrees to it.\n", "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scatter plots between pairs of features to check correlations" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [], "source": [ "# make a scatter plot of Danceability vs. Popularity \n" ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [], "source": [ "# make a scatter plot of Liveness vs. Popularity \n" ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [], "source": [ "# make a scatter plot of Energy vs. Popularity \n" ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, "outputs": [], "source": [ "# make a scatter plot of Acousticness vs. Popularity \n" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [], "source": [ "# make a scatter plot of Speechiness vs. Popularity \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make a selection of a predictor attribute" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Select an attribute that seems to be a mild/good predictor for popularity!**" ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [], "source": [ "# let's assume that you have selected speechiness, for the sake of clarity\n", "# extract the speechiness and popularity columns into two named arrays\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression model: a linear relationship" ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [], "source": [ "# try to find a linear regression model for the selected pair of attributes\n", "# use sp.polyfit()\n", "# print out the resulting model parameters, the sum of squared errors (SSE), and the root mean squared error, sqrt(SSE / n)\n" ] }, { "cell_type": "code", "execution_count": 163, "metadata": { "scrolled": true }, "outputs": [], "source": 
[ "# use sp.poly1d() to store the learned model parameters in a polynomial object\n", "# print out the model\n" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "# plot the linear model together with the data\n", "# the model is plotted using plot(), while the data are plotted using scatter()\n", "# remember labels and titles\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are you satisfied with the linear model? Probably not!\n", "\n", "**Let's try out polynomial models of degree 2 and 3**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression models: polynomials of higher degree" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "# In the following cells you have to repeat the steps done for the linear hypothesis,\n", "# this time considering quadratic and cubic models\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "Look at the results, both visually and in terms of losses. Provide a written summary of the results and decide which model seems most suitable. 
Add your final considerations.\n", "***" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }