{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Functions on Strings\n", "\n", "Python is a language particularly useful when it comes to string processing [1]. This is because it provides a variety of useful functions for string processing and transformation. In this lecture we will look at a few of them that will make our lives much easier when programming. To know more about other string functions available in python, check the documentation [here](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str).\n", "\n", "All the functions we will see are part of the string *class*, which means they are called using the dot `.` notation: if `s` is a variable containing a string, and `f()` is a string function, we call `f` on `s` like this: `s.f()`. Of course, depending on the function, there might be arguments that need to be passed to the function.\n", "\n", "Observe that **none of the functions below modify the string**.\n", "\n", "[1] As a consequence, it is also a great language for file processing, since files are, most of the times, nothing more than really long strings.\n", "\n", "## split\n", "\n", "The `split()` function splits the string into parts. If we pass no arguments to it, it will split the string at whitespaces and linebreaks, and return a list of the parts. Alternatively, we can pass to it a string argument indicating what it should split on. For example, if `s = \"hello,world\"` then calling `s.split(\",\")` will return the list `[\"hello\", \"world\"]`.\n", "\n", "Another related function that might be useful is `splitlines()`, which will return a list with all the lines of the string." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Philosophy is the study of general and fundamental questions\\nabout existence, knowledge, values, reason, mind, and language', ' \\nSuch questions are often posed as problems to be studied or resolved', '\\nPhilosophical methods include questioning, critical discussion, \\nrational argument, and systematic presentation', ' Classic philosophical \\nquestions include: Is it possible to know anything and to prove it?\\nWhat is most real? Philosophers also pose more practical and concrete \\nquestions such as: Is there a best way to live? Is it better to be \\njust or unjust (if one can get away with it)? Do humans have free will?']\n", "['Philosophy', 'is', 'the', 'study', 'of', 'general', 'and', 'fundamental', 'questions', 'about', 'existence,', 'knowledge,', 'values,', 'reason,', 'mind,', 'and', 'language.', 'Such', 'questions', 'are', 'often', 'posed', 'as', 'problems', 'to', 'be', 'studied', 'or', 'resolved.', 'Philosophical', 'methods', 'include', 'questioning,', 'critical', 'discussion,', 'rational', 'argument,', 'and', 'systematic', 'presentation.', 'Classic', 'philosophical', 'questions', 'include:', 'Is', 'it', 'possible', 'to', 'know', 'anything', 'and', 'to', 'prove', 'it?', 'What', 'is', 'most', 'real?', 'Philosophers', 'also', 'pose', 'more', 'practical', 'and', 'concrete', 'questions', 'such', 'as:', 'Is', 'there', 'a', 'best', 'way', 'to', 'live?', 'Is', 'it', 'better', 'to', 'be', 'just', 'or', 'unjust', '(if', 'one', 'can', 'get', 'away', 'with', 'it)?', 'Do', 'humans', 'have', 'free', 'will?']\n", "['Philosophy is the study of general and fundamental questions', 'about existence, knowledge, values, reason, mind, and language. ', 'Such questions are often posed as problems to be studied or resolved.', 'Philosophical methods include questioning, critical discussion, ', 'rational argument, and systematic presentation. Classic philosophical ', 'questions include: Is it possible to know anything and to prove it?', 'What is most real? Philosophers also pose more practical and concrete ', 'questions such as: Is there a best way to live? Is it better to be ', 'just or unjust (if one can get away with it)? Do humans have free will?']\n" ] } ], "source": [ "s = \"\"\"Philosophy is the study of general and fundamental questions\n", "about existence, knowledge, values, reason, mind, and language. \n", "Such questions are often posed as problems to be studied or resolved.\n", "Philosophical methods include questioning, critical discussion, \n", "rational argument, and systematic presentation. Classic philosophical \n", "questions include: Is it possible to know anything and to prove it?\n", "What is most real? Philosophers also pose more practical and concrete \n", "questions such as: Is there a best way to live? Is it better to be \n", "just or unjust (if one can get away with it)? Do humans have free will?\"\"\"\n", "\n", "sentences = s.split(\".\") # Splits at final stops\n", "words = s.split() # Splits at blank spaces and linebreaks\n", "lines = s.split(\"\\n\") # Same as s.splitLines()\n", "\n", "print(sentences)\n", "print(words)\n", "print(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## strip\n", "\n", "The `strip()` function removes spurious content from the beginning and end of a string. When called without arguments, it removes whitespaces." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello world\n" ] } ], "source": [ "s = \"\"\" Hello world\n", "\n", "\"\"\"\n", "\n", "s_clean = s.strip()\n", "print(s_clean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When called with a string as an argument, it removes all characters that are in that string. Intuitively, if we use `r` as an argument, as in `s.strip(r)`, then `r` acts like a set of characters that must be removed from the beginning or end of `s`. Note that, since it is a set, they will be removed in any order. As soon as a character that is not in `r` is encountered, the stripping stops." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "anAwesomeSite\n" ] } ], "source": [ "s = \"www.anAwesomeSite.com\"\n", "s_stripped = s.strip(\"w.com\") # Note that the inner w and m are not removed\n", "print(s_stripped)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## lower & upper\n", "\n", "The `lower` and `upper` functions return the string where its aphabetic characters are replaced by their lower- or upper-case versions, respectively. All other characters remain the same." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1984 IS A DYSTOPIAN NOVEL BY ENGLISH NOVELIST GEORGE ORWELL.\n" ] } ], "source": [ "s = \"1984 is a dystopian novel by English novelist George Orwell.\"\n", "sU = s.upper()\n", "print(sU)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1984 is a dystopian novel by english novelist george orwell.\n" ] } ], "source": [ "sl = s.lower()\n", "print(sl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## replace\n", "\n", "The `replace` function replaces substrings in a string (I bet you did not see that one coming!)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "These are interesting times!\n" ] } ], "source": [ "s = \"These are the times.\"\n", "s1 = s.replace(\"the\", \"interesting\") # Note that capitalization matters!\n", "s2 = s1.replace(\".\", \"!\")\n", "print(s2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can pass an integer `n` as the third argument of `replace`, and it will only replace the first `n` occurrences of the substring." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "These are interesting times to stay together.\n" ] } ], "source": [ "s = \"These are the times to stay together.\"\n", "s1 = s.replace(\"the\", \"interesting\", 1) # Replaces only the first 'the' (and leaves the one in togeTHEr alone)\n", "print(s1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## startswith & endswith\n", "\n", "The functions `startswith(prefix)` and `endswith(suffix)` return a boolean indicating whether the string starts with `prefix` or ends with `suffix`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starts with 'In'? True\n", "Ends with 'ever end'? True\n" ] } ], "source": [ "s = \"In the beginning it seemed that it would never end\"\n", "sw = s.startswith(\"In\")\n", "ew = s.endswith(\"ever end\")\n", "print(\"Starts with 'In'?\", sw)\n", "print(\"Ends with 'ever end'?\", ew)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## find\n", "\n", "The function `find(ss)` returns the first index where substring `ss` occurs. Alternatively, you can use `find(ss,start,end)`, where start and end are integers indicating the slice where `ss` should be searched for. If you want to search until the end of the string, you can ommit `end`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First occurence of TG: 2\n", "Second occurence of TG: 6\n" ] } ], "source": [ "s = \"ACTGTCTGATGTACCCT\"\n", "\n", "tg_first = s.find(\"TG\")\n", "print(\"First occurence of TG:\", tg_first)\n", "\n", " # Starting to search from the next position where we found the first occurrence\n", "tg_second = s.find(\"TG\", tg_first+2)\n", "print(\"Second occurence of TG:\", tg_second)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the substring `ss` does not occur in `s`, the function `find` returns -1. Another function which serves the same purpose is `index` (like the one for lists). The only difference is that `index` raises a `ValueError` if the substring is not found." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1: count occurrences \n", "\n", "The function `count(ss)` returns the number of times the substring `ss` occurs in a string, however, it only counts *non-overlapping* occurrences. That means that, for example, `\"AAA\".count(\"AA\")` will return 1. Implement the function `countOverlapping(s, ss)` that counts the number of times `ss` occurrs in `s` considering overlapping occurrences.\n", "\n", "For example, `countOverlapping(\"AAA\", \"AA\")` should return 2." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def countOverlapping(s, ss):\n", " return 42" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2: fixing messages\n", "\n", "People have become sloppy when it comes to typing on the internet. There are a lot of abbreviations, capitalization is not respected, nor is punctuation. Your job in this task is to fix internet messages. Implement the function `fix(M)` that takes a message `M` as a string for the input, and returns another message where the content of `M` is the same, but the following is fixed:\n", " \n", "- There is a period at the end of the message.\n", "- All punctutations (.,!?) must be followed by one space.\n", "- The first letter of each sentence is capitalized.\n", "- The following abbreviations are spelled out in full:\n", " - u -> you\n", " - ofc -> of course\n", " - thru -> through\n", " - ppl -> people\n", " - 2 -> too\n", " - 8 -> ate\n", " \n", "For example, let\n", " \n", "`M1 = \"it is already halfway thru the semester and u have no idea what is going on? rest assured u are not alone!a lot of ppl go thru the same desperation ofc.What u need to do is take a step back and figure out what u don't understand and why...it is not 2 l8 to recover. U can do this\"`\n", "\n", "then `fix(M1)` should return:\n", "\n", "`\"It is already halfway through the semester and you have no idea what is going on? Rest assured you are not alone! A lot of people go through the same desperation of course. What you need to do is take a step back and figure out what you don't understand and why... It is not too late to recover. You can do this.\"`\n", "\n", "You may find it easier to split this task into functions, one for each fix, and use them to implement `fix(M)`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def fix(M):\n", " return \"\"" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }