{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# I/O and file processing\n", "\n", "I/O stands for input/output, and it encompasses every operation a program makes with the outside world. Examples of such operations are:\n", "- printing or displaying something on the screen (the screen is outside the program)\n", "- reacting to mouse or keyboard commands (the user is outside the program)\n", "- sending information through the network (the network is outside the program)\n", "- opening, writing, and saving files (files are outside the program)\n", "\n", "So far in the course, the programs you wrote were completely self-contained. Functions returned values depending on the parameters, and nothing more.\n", "\n", "In this lecture we will see how you can write python programs that interact with the outside world.\n", "\n", "## Print\n", "\n", "One of the simplest I/O operations is printing, which you are already familiar with." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello world!\n" ] } ], "source": [ "print(\"Hello world!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that the string `\"Hello world!\"` is printed on the screen, differently from:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Bye world!'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Bye world!\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which is shown on the screen for our convenience.\n", "\n", "If this difference is hard to grasp, think of these two commands inside functions:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is printed, but the function returns nothing\n", "x = None\n" ] } ], "source": [ "def f1():\n", " print(\"This is printed, but the function returns nothing\")\n", " \n", "x = f1()\n", "print(\"x =\", x) # x is None, because the function returns nothing" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x = This is what the function returns.\n" ] } ], "source": [ "def f1():\n", " return \"This is what the function returns.\"\n", " \n", "x = f1()\n", "print(\"x =\", x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Printing can be quite useful, but it is very passive. It only allows the program to thrown some information into the outside world (*output*), but there is no way the outside world can influence the program. In order to achieve that, we need a way for the outside world to provide *inputs* to programs.\n", "\n", "## Input\n", "\n", "One of the simplest ways of obtaining input from the outside world is to read it from whatever is typed on the screen. This can be done using the python command `input()`.\n", "\n", "When a program is running, and it encounters `input()`, it will stop and wait for the user to type something on the screen. After typing, the user presses `enter` to indicate to python that it can read what was typed, and continue the program. Python reads the input as a string, and this needs to be assigned to a variable." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Something I typed\n" ] }, { "data": { "text/plain": [ "False" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def isInputNumber():\n", " x = input()\n", " if x.isdigit():\n", " return True\n", " else:\n", " return False\n", " \n", "isInputNumber()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1979\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "isInputNumber()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `input()` command is useful for writing programs that interact with the user. Like the following one:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hi! What is your name?\n", "Laila\n", "Hello Laila! Nice to meet you. My name is Eliza.\n" ] } ], "source": [ "def meetEliza():\n", " print(\"Hi! What is your name?\")\n", " name = input()\n", " if name == \"Eliza\":\n", " print(\"What a coincidence! I am also called Eliza. Nice to meet you!\")\n", " else:\n", " print(\"Hello \" + name + \"! Nice to meet you. My name is Eliza.\")\n", " \n", "meetEliza()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hi! What is your name?\n", "Eliza\n", "What a coincidence! I am also called Eliza. Nice to meet you!\n" ] } ], "source": [ "meetEliza()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading files\n", "\n", "Prints and inputs are good when we are dealing with small amounts of information, but when things become too large, files come to the rescue.\n", "\n", "Python can open files using the `open(filename)` function. The parameter `filename` is the path of the file as a string. This function returns a `TextIOWrapper` object. You do not need to know what this object is, but you need to be aware that the `open(filename)` function will not return the content of the file as a string. To get the whole content of the file as a string, you need to use the `.read()` function (invoked using the dot notation).\n", "\n", "If this file is in the same location as your program, it can be opened like this:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is the file sample.txt.\n", "There is nothing interesting to see here... move on.\n", "\n", "<_io.TextIOWrapper name='sample.txt' mode='r' encoding='UTF-8'>\n" ] } ], "source": [ "file = open(\"sample.txt\")\n", "content = file.read()\n", "print(content)\n", "print(file) # A weird object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the file is somewhere else, you need to call `open` using the file *path*, which is the address of the file in your computer. For example, if this file is in your user folder, it could look something like this (for MacOS):" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "ename": "FileNotFoundError", "evalue": "[Errno 2] No such file or directory: '/Users/greis/sample.txt'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Wil raise FileNotFoundError because this path does not exist on my computer\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mfile\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"/Users/greis/sample.txt\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '/Users/greis/sample.txt'" ] } ], "source": [ "# Wil raise FileNotFoundError because this path does not exist on my computer\n", "file = open(\"/Users/greis/sample.txt\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many times it will be convenient to process a file line by line. In this case, we do not need to read the whole file at once into a variable using `.read()`. Instead, we can loop through the lines of a file like this:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "680\n" ] } ], "source": [ "file = open(\"numbers.txt\")\n", "s = 0\n", "for line in file:\n", " n = int(line) # We know the line has one int\n", " s += n\n", "print(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to get all the lines in a list, we can use `.readlines()`:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['34\\n', '65\\n', '23\\n', '87\\n', '3\\n', '65\\n', '8\\n', '14\\n', '56\\n', '73\\n', '69\\n', '93\\n', '4\\n', '14\\n', '27\\n', '45\\n']\n" ] } ], "source": [ "file = open(\"numbers.txt\")\n", "lines = file.readlines()\n", "print(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CSV files\n", "\n", "`csv` is a very popular file format for storing information. `csv` stands for *comma separated values*, and it is a file that contains values separated by commas. Each line corresponds to an entry, like a table row, and the values for that entry are separated by commas. So the table:\n", "\n", "| Movie | Year | Rating |\n", "|:-------|------|--------|\n", "|The Grand Budapest Hotel | 2014 | 88 |\n", "|Little Miss Sunshine | 2006 | 91 |\n", "|The Darjeeling Limited | 2007 | 67 |\n", "| Moonrise Kingdom | 2012 | 84 |\n", "| Mary and Max | 2009 | 81 |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "corresponds to the file:\n", "\n", "```\n", "Movie,Year,Rating\n", "The Grand Budapest Hotel,2014,88\n", "Little Miss Sunshine,2006,91\n", "The Darjeeling Limited,2007,67\n", "Moonrise Kingdom,2012,84\n", "Mary and Max,2009,81\n", "```\n", "\n", "In fact, any spreadsheet can be saved in this format.\n", "\n", "Lines that start with # in `csv` files are comments and should be ignored (like in python).\n", "\n", "Most of the datasets available for data analysis come in this format (see for example https://archive.ics.uci.edu/ml/index.php or https://www.kaggle.com/datasets), so it is useful to learn how to read and represent this kind of file.\n", "\n", "If I ask you what is the rating of Moonrise Kingdom according to the table above, you will most likely find the Moonrise Kingdom line in the table, and then go to the \"Rating\" column. We would like to do the same in python, namely, if our dataset is stored in a variable `ds`, we would like to do:\n", "```\n", "ds[\"Moonrise Kingdom\"][\"Rating\"]\n", "```\n", "\n", "The data structure that allows us to do that are dictionaries. Therefore, we will read the `csv` file as a dictionary of dictionaries. The outer dictionary is indexed by the rows (the movie titles), and its values are dictionaries. The inner dictionaries are indexed by the columns, and their values are the values on each position of the table.\n", "\n", "This double dictionary is created by reading the file line by line, and, for each line we create the inner dictionary. Remember that the first line should not be processed.\n", "\n", "One way of doing that is:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Little Miss Sunshine': {'Rating': '91', 'Year': '2006'},\n", " 'Mary and Max': {'Rating': '81', 'Year': '2009'},\n", " 'Moonrise Kingdom': {'Rating': '84', 'Year': '2012'},\n", " 'The Darjeeling Limited': {'Rating': '67', 'Year': '2007'},\n", " 'The Grand Budapest Hotel': {'Rating': '88', 'Year': '2014'}}" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file = open(\"movies.csv\")\n", "lines = file.readlines()\n", "# strip removes newline char, split splits the values at the commas\n", "first_line_vals = lines[0].strip().split(\",\")\n", "col_names = first_line_vals[1:] \n", "\n", "ds = {}\n", "for line in lines[1:]:\n", " vals = line.strip().split(\",\")\n", " line_name = vals[0]\n", " col_vals = vals[1:]\n", " inner_dict = {}\n", " for i in range(len(col_vals)):\n", " inner_dict[col_names[i]] = col_vals[i]\n", " ds[line_name] = inner_dict\n", " \n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can get any value from the table using an intuitive indexing:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2006'" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds[\"Little Miss Sunshine\"][\"Year\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `csv` is a very popular format, python has its own library for handling this kind of file: https://docs.python.org/3/library/csv.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing files\n", "\n", "If we are computing something that we want to save, we can write the result into a file. To write into files, we first need to open it. But this time we need to indicate we are opening a file for writing. This is done by passing an extra parameter to the `open(filename)` function." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": true }, "outputs": [], "source": [ "file1 = open(\"results1.txt\", \"w\") # w indicates the file is opened for writing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `filename` parameter works as before: if you want to write to a file that is not in the same location as your python program, you need to write the path.\n", "\n", "**Attention:** If the file does not exist, it will be created. If it exists, it will be *overwritten*.\n", "\n", "If you want to write to a file that exists, but you do not want its content to be overwritten, you can open the file using `\"a\"`, for append:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true }, "outputs": [], "source": [ "file2 = open(\"results2.txt\", \"a\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the file is open, we can write to it using the `.write(s)` function, which takes a string as a paramter." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "25" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file1.write(\"Stuff to go into the file\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function returns the number of characters written. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "\n", "The file [https://web2.qatar.cmu.edu/~mhhammou/15110-f20/lectures/zoo.csv](https://web2.qatar.cmu.edu/~mhhammou/15110-f20/lectures/zoo.csv) is a dataset about animals and their characteristics. Download and open it in a text editor (notes or notepad) to inspect it and find out what are the column names. For this file in particular, they are not given on the first line.\n", "\n", "Implement the function `readZoo()` that reads the `zoo.csv` file and returns a dictionary of dictionaries corresponding to this dataset.\n", "\n", "Once you have this dictionary, write functions that take it as input and return:\n", "- all mammals that lay eggs\n", "- all mammals that are aquatic\n", "- which animals have five legs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def readZoo():\n", " return {}" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }