I/O and file processing

I/O stands for input/output, and it encompasses every interaction a program has with the outside world. Examples of such operations are printing to the screen, reading what the user types, and reading and writing files.

So far in the course, the programs you wrote were completely self-contained. Functions returned values depending on the parameters, and nothing more.

In this lecture we will see how you can write Python programs that interact with the outside world.

Print

One of the simplest I/O operations is printing, which you are already familiar with.

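Observe that print("Hello world!") prints the string on the screen as an output of the program. This is different from evaluating a value such as "Hello world!" on its own in a notebook or interactive session, where the value is only shown on the screen for our convenience.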

If this difference is hard to grasp, think of these two commands inside functions:
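A minimal sketch of the contrast, using two hypothetical functions (the names print_hello and return_hello are made up for this illustration): one produces output, the other produces a value.

```python
def print_hello():
    # Output: the text goes to the screen, and the function returns None.
    print("Hello world!")


def return_hello():
    # No output: the string is handed back to the caller.
    return "Hello world!"


a = print_hello()    # prints Hello world!
b = return_hello()   # prints nothing

print(a)   # None
print(b)   # Hello world!
```

Only the value stored in b can be used later in the program. The printed text is gone as soon as it reaches the screen.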

Printing can be quite useful, but it is very passive. It only allows the program to throw some information into the outside world (output), but there is no way for the outside world to influence the program. In order to achieve that, we need a way for the outside world to provide inputs to programs.

Input

One of the simplest ways of obtaining input from the outside world is to read it from the standard input. In practice, this means whatever the user types on the keyboard. This can be done using the Python function input().

When a running program encounters input(), it stops and waits for the user to type something. After typing, the user presses enter to indicate to Python that it can read what was typed and continue the program. input() always returns what was typed as a string, so the result should be assigned to a variable, and converted with int() or float() if a number is expected.

The input() function is useful for writing programs that interact with the user, like the following one:
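A small sketch of such a program. The function names, the questions and the year 2024 are all assumptions made for this example. The calculation is kept in a separate function so that it can be tried without typing anything.

```python
def describe_age(name, birth_year_text, current_year):
    # input() returns text, so the year must be converted before doing arithmetic.
    age = current_year - int(birth_year_text)
    return "Hello " + name + "! You turn " + str(age) + " in " + str(current_year) + "."


def main():
    # The program stops at each input() until the user presses enter.
    name = input("What is your name? ")
    birth_year_text = input("In which year were you born? ")
    print(describe_age(name, birth_year_text, 2024))
```

Calling main() starts the interaction. If the user types something that is not a number for the year, int() raises a ValueError.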

Reading files

Prints and inputs are good when we are dealing with small amounts of information, but when things become too large, files come to the rescue.

Python can open files using the open(filename) function. The parameter filename is the path of the file as a string. This function returns a TextIOWrapper object. You do not need to know what this object is, but you need to be aware that open(filename) does not return the content of the file as a string. To get the whole content of the file as a string, you need to call the .read() method on the returned object (using the dot notation).

If this file is in the same location as your program, it can be opened like this:
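A minimal sketch, assuming a hypothetical file called example.txt. So that the sketch runs on its own, it first creates that file next to the program, using the "w" mode that is explained in the section on writing files.

```python
# Setup only: create a small file in the same location as the program.
with open("example.txt", "w") as f:
    f.write("first line\nsecond line\n")

f = open("example.txt")   # a file object, not the text itself
content = f.read()        # the whole file as a single string
f.close()                 # release the file when we are done with it

print(content)
```

Closing the file with .close() is good practice. The with statement used in the setup closes the file automatically at the end of the indented block.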

If the file is somewhere else, you need to call open using the file path, which is the address of the file on your computer. For example, if this file is in your user folder, it could look something like this (for macOS):
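On macOS, a path inside the user folder typically has the form /Users/ followed by the user name and the file name. Since that folder differs between computers, the sketch below uses a temporary folder as a stand-in for "somewhere else". The folder and the file name are assumptions made only so that the example runs anywhere.

```python
import os
import tempfile

# Stand-in for "somewhere else": a temporary folder created just for this sketch.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")   # full path = folder + file name

# Setup only: put some text in the file.
with open(path, "w") as f:
    f.write("stored elsewhere\n")

print(path)          # the full path; it is different on every computer

f = open(path)       # open takes the full path as a string
content = f.read()
f.close()
```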

Many times it will be convenient to process a file line by line. In this case, we do not need to read the whole file at once into a variable using .read(). Instead, we can loop through the lines of a file like this:
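A sketch of such a loop, assuming a hypothetical file lines.txt that the example creates for itself. Each line read from the file still carries its final newline character, which is removed here with .rstrip("\n").

```python
# Setup only: create a small file to loop over.
with open("lines.txt", "w") as f:
    f.write("first line\nsecond line\nthird line\n")

lengths = []
f = open("lines.txt")
for line in f:                  # one line per iteration
    text = line.rstrip("\n")    # drop the newline at the end of the line
    lengths.append(len(text))
    print(text)
f.close()

print(lengths)   # [10, 11, 10]
```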

If we want to get all the lines in a list, we can use .readlines():
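A short sketch with the same kind of hypothetical file. Note that .readlines() keeps the newline at the end of each line, so it is often followed by a cleaning step.

```python
# Setup only: create a small file to read.
with open("lines.txt", "w") as f:
    f.write("first line\nsecond line\n")

f = open("lines.txt")
lines = f.readlines()    # a list with one string per line
f.close()

print(lines)             # ['first line\n', 'second line\n']

# Remove the trailing newline from every line.
stripped = [line.rstrip("\n") for line in lines]
print(stripped)          # ['first line', 'second line']
```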

CSV files

csv is a very popular file format for storing information. csv stands for comma separated values. Each line of the file corresponds to an entry, like a table row, and the values for that entry are separated by commas. So the table:

Movie                     Year  Rating
The Grand Budapest Hotel  2014  88
Little Miss Sunshine      2006  91
The Darjeeling Limited    2007  67
Moonrise Kingdom          2012  84
Mary and Max              2009  81

corresponds to the file:

Movie,Year,Rating
The Grand Budapest Hotel,2014,88
Little Miss Sunshine,2006,91
The Darjeeling Limited,2007,67
Moonrise Kingdom,2012,84
Mary and Max,2009,81

In fact, any spreadsheet can be saved in this format.

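Some csv files contain lines that start with #. By convention these are comments and should be ignored (like in Python). This convention is not part of the csv format itself, so such lines have to be skipped explicitly when reading the file.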

Most of the datasets available for data analysis come in this format (see for example https://archive.ics.uci.edu/ml/index.php or https://www.kaggle.com/datasets), so it is useful to learn how to read and represent this kind of file.

If I ask you what the rating of Moonrise Kingdom is according to the table above, you will most likely find the Moonrise Kingdom line in the table, and then go to the "Rating" column. We would like to do the same in Python, namely, if our dataset is stored in a variable ds, we would like to do:

ds["Moonrise Kingdom"]["Rating"]

The data structure that allows us to do that is the dictionary. Therefore, we will read the csv file as a dictionary of dictionaries. The outer dictionary is indexed by the rows (the movie titles), and its values are dictionaries. The inner dictionaries are indexed by the columns, and their values are the values at each position of the table.

This double dictionary is created by reading the file line by line and creating one inner dictionary for each line. Remember that the first line holds the column names, so it should not be processed as a data row.

One way of doing that is:
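The following is one possible sketch, not necessarily the version used in class. The function name read_table and the file name movies.csv are assumptions, and the example writes the file itself so that it can be run directly. It splits each line on commas, so it would not handle values that themselves contain commas.

```python
def read_table(filename):
    """Read a csv file into a dictionary of dictionaries keyed by the first column."""
    table = {}
    f = open(filename)
    header = f.readline().strip().split(",")   # column names from the first line
    for line in f:
        line = line.strip()
        if line == "" or line.startswith("#"):
            continue                            # skip blank lines and comments
        values = line.split(",")
        row = {}
        for i in range(1, len(header)):
            row[header[i]] = values[i]          # values are kept as strings
        table[values[0]] = row                  # the first value indexes the row
    f.close()
    return table


# Setup only: write the movie table from the text into a file.
with open("movies.csv", "w") as f:
    f.write("Movie,Year,Rating\n")
    f.write("The Grand Budapest Hotel,2014,88\n")
    f.write("Little Miss Sunshine,2006,91\n")
    f.write("The Darjeeling Limited,2007,67\n")
    f.write("Moonrise Kingdom,2012,84\n")
    f.write("Mary and Max,2009,81\n")

ds = read_table("movies.csv")
print(ds["Mary and Max"])   # {'Year': '2009', 'Rating': '81'}
```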

Now we can get any value from the table using an intuitive indexing:
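As an illustration, here is a small hand-written dictionary of dictionaries with two of the rows from the table. The values are strings because that is how they come out of a text file, so they need to be converted before doing arithmetic.

```python
ds = {
    "Moonrise Kingdom": {"Year": "2012", "Rating": "84"},
    "Mary and Max": {"Year": "2009", "Rating": "81"},
}

rating = ds["Moonrise Kingdom"]["Rating"]   # first the row, then the column
print(rating)                               # 84

# Convert to a number before calculating with it.
years_apart = int(ds["Moonrise Kingdom"]["Year"]) - int(ds["Mary and Max"]["Year"])
print(years_apart)                          # 3
```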

Reading files using the csv library

Since csv is a very popular format, Python has its own library for handling this kind of file: https://docs.python.org/3/library/csv.html

We have different options to open a csv file using this library. The first one is the csv.reader function, which takes as input a file (already opened using the open function) and returns a reader object. This reader object can be iterated to get each line of the file as a list of strings.
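A minimal sketch of csv.reader, assuming a hypothetical file movies.csv that the example creates itself. The file is opened with newline="" because the csv documentation recommends it.

```python
import csv

# Setup only: a small csv file with a header and two rows.
with open("movies.csv", "w") as f:
    f.write("Movie,Year,Rating\n")
    f.write("Moonrise Kingdom,2012,84\n")
    f.write("Mary and Max,2009,81\n")

rows = []
with open("movies.csv", newline="") as f:
    reader = csv.reader(f)
    for row in reader:       # each row is a list of strings
        rows.append(row)

print(rows[0])   # ['Movie', 'Year', 'Rating']
print(rows[1])   # ['Moonrise Kingdom', '2012', '84']
```

The header line is returned like any other row, so it is up to the program to treat it differently.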

If the separators in the file are not commas, you can specify them with the optional delimiter parameter:
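For example, here is a sketch with a hypothetical file whose values are separated by semicolons:

```python
import csv

# Setup only: the same kind of table, but separated by semicolons.
with open("movies_semicolon.csv", "w") as f:
    f.write("Movie;Year;Rating\n")
    f.write("Moonrise Kingdom;2012;84\n")

with open("movies_semicolon.csv", newline="") as f:
    reader = csv.reader(f, delimiter=";")
    rows = list(reader)      # read all the rows at once

print(rows)   # [['Movie', 'Year', 'Rating'], ['Moonrise Kingdom', '2012', '84']]
```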

Notice that we could have achieved the same thing without the csv library with a few extra lines of code:
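A sketch of the manual version, again with a hypothetical movies.csv. It gives the same lists for simple files like this one. Unlike csv.reader, it does not handle values written between quotes that contain commas themselves.

```python
# Setup only: a small csv file.
with open("movies.csv", "w") as f:
    f.write("Movie,Year,Rating\n")
    f.write("Moonrise Kingdom,2012,84\n")

rows = []
f = open("movies.csv")
for line in f:
    values = line.rstrip("\n").split(",")   # remove the newline, then split
    rows.append(values)
f.close()

print(rows)   # [['Movie', 'Year', 'Rating'], ['Moonrise Kingdom', '2012', '84']]
```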

The csv library provides a function called DictReader that reads the file into dictionaries, but in a different way from what we did above. In this case, each line becomes a dictionary whose keys are the column names.
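A minimal sketch of csv.DictReader with a hypothetical movies.csv. The result is a sequence of dictionaries, one per line, rather than a dictionary indexed by movie title.

```python
import csv

# Setup only: a small csv file whose first line holds the column names.
with open("movies.csv", "w") as f:
    f.write("Movie,Year,Rating\n")
    f.write("Moonrise Kingdom,2012,84\n")
    f.write("Mary and Max,2009,81\n")

with open("movies.csv", newline="") as f:
    reader = csv.DictReader(f)   # the first line is used for the keys
    entries = list(reader)       # one dictionary per data line

print(len(entries))              # 2
print(entries[0]["Movie"])       # Moonrise Kingdom
print(entries[0]["Rating"])      # 84
```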

If the first line of the file does not contain the column names, these can be specified by the optional fieldnames parameter (see this and other options at https://docs.python.org/3/library/csv.html#csv.DictReader).
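A sketch of the fieldnames parameter, assuming a hypothetical file without a header line. When fieldnames is given, the first line of the file is treated as data.

```python
import csv

# Setup only: a csv file with no header line.
with open("movies_no_header.csv", "w") as f:
    f.write("Moonrise Kingdom,2012,84\n")
    f.write("Mary and Max,2009,81\n")

with open("movies_no_header.csv", newline="") as f:
    reader = csv.DictReader(f, fieldnames=["Movie", "Year", "Rating"])
    entries = list(reader)

print(len(entries))            # 2
print(entries[0]["Movie"])     # Moonrise Kingdom
```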

Writing files

If we are computing something that we want to save, we can write the result into a file. To write into a file, we first need to open it, but this time we need to indicate that we are opening it for writing. This is done by passing the mode "w" as an extra parameter to the open function: open(filename, "w").

The filename parameter works as before: if you want to write to a file that is not in the same location as your Python program, you need to write the path.

Attention: If the file does not exist, it will be created. If it exists, it will be overwritten.
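The sketch below illustrates both cases with a hypothetical file results.txt. It uses the .write method, which puts a string into the file.

```python
f = open("results.txt", "w")   # the file is created if it does not exist
f.write("first version\n")
f.close()                      # closing makes sure the text reaches the file

f = open("results.txt", "w")   # opening again with "w" erases the old content
f.write("second version\n")
f.close()

f = open("results.txt")
content = f.read()
f.close()

print(content)                 # only the second version is left
```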

If you want to write to a file that exists, but you do not want its content to be overwritten, you can open the file using "a", for append:
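A small sketch with a hypothetical file log.txt, showing that text written in "a" mode is added at the end of the existing content:

```python
# Setup only: start from a file with one line.
f = open("log.txt", "w")
f.write("line 1\n")
f.close()

f = open("log.txt", "a")   # "a" keeps the existing content
f.write("line 2\n")        # the new text goes at the end of the file
f.close()

f = open("log.txt")
content = f.read()
f.close()

print(content)
```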

Once the file is open, we can write to it using the .write(s) method, which takes a string as a parameter.

This function returns the number of characters written.
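A quick sketch with a hypothetical file count.txt. Unlike print, .write does not add a newline on its own, and it only accepts strings, so numbers need to be converted with str() first.

```python
f = open("count.txt", "w")
n = f.write("Hello world!\n")     # 12 characters plus the newline
m = f.write(str(42))              # numbers must be converted to strings
f.close()

print(n)   # 13
print(m)   # 2
```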

Exercise

The file zoo.csv is a dataset about animals and their characteristics. Download it and open it in a text editor (notes or notepad) to inspect it and find out what the column names are. For this file in particular, they are not given on the first line.

Implement the function readZoo() that reads the zoo.csv file and returns a dictionary of dictionaries corresponding to this dataset.

Once you have this dictionary, write functions that take it as input and return: