{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 2, Practical 4\n", "\n", "In this practical we will learn how to deal with regular expressions.\n", "\n", "## Regular expressions\n", "\n", "Regular expressions, or *regex* are a powerful language for text pattern matching. They are extremely useful in searching for complex patterns of characters to filter, replace and validate user inputs. Regex are employed in search engines, word processors find and replace functions, and text processing utilities such as ```awk``` or ```grep```. \n", "\n", "*e.g. How do I check that a strings contains only alphabetic characters? \n", " How do I check that an email is properly formatted (i.e. user@domain.com)? \n", " How do I find all strings containing years from 2010 to 2017?*\n", "\n", "\n", "Pyhton provides a module, ```re```, to write and match regular expressions. But how do I write a regex?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Patterns \n", "*(from [https://developers.google.com/edu/python/regular-expressions](https://developers.google.com/edu/python/regular-expressions) )*\n", "\n", "The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:\n", "\n", "```a, X, 9, <``` -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: *. ^ $ * + ? { [ ] \\ | ( )* \n", " \n", "```. (a period)``` -- matches any single character except newline '\\n' \n", "\n", "```+``` means at least one instance of the preceding character \n", "\n", "```*``` means zero or more instances of the preceding character \n", "\n", "```\\w``` -- (lowercase w) matches a \"word\" character: a letter or digit or underbar [[a-zA-Z0-9_]]. Note that although \"word\" is the mnemonic for this, it only matches a single word char, not a whole word. \n", "\n", "```\\W``` (upper case W) matches any non-word character. \n", "\n", "```\\b``` -- boundary between word and non-word \n", "\n", "```\\s``` -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form [[\\n\\r\\t\\f]]. \n", "\n", "```\\S``` (upper case S) matches any non-whitespace character. \n", "\n", "```\\t, \\n, \\r``` -- tab, newline, return \n", "\n", "```\\d``` -- decimal digit [[0-9]] (some older regex utilities do not support but \\d, but they all support \\w and \\s) \n", "\n", "```^``` = start, ```$``` = end -- match the start or end of the string \n", "\n", "```\\``` -- inhibit the \"specialness\" of a character. So, for example, use \\. to match a period or \\\\ to match a slash. If you are unsure if a character has special meaning, such as '@', you can put a slash in front of it, \\@, to make sure it is treated just as a character.\n", "\n", "```{m,n}``` -- causes the resulting regex to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.\n", "\n", "```[]``` -- used to indicate a set of characters. e.g. integers from 0 to 9 = [[0-9]]\n", "\n", "\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see an example. Write a regex that matches *all a's followed by zero or more b's* :" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "None\n", "0\n", "3\n" ] } ], "source": [ "import re\n", "\n", "pattern = \"ab*\"\n", "\n", "print(re.search(pattern, \"ac\"))\n", "print(re.search(pattern, \"abc\"))\n", "print(re.search(pattern, \"abbc\"))\n", "\n", "print(re.search(pattern, \"cdfegh\")) # this will answer None...\n", "\n", "# another way of searching...\n", "m = re.search(pattern, \"abbc\")\n", "print(m.start())\n", "print(m.end())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```match``` objects always have a boolean value of ```True```. If no match is found, ```search``` returns ```None```, allowing you to test for the presence of matches like this:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Match found!\n", "No match of pattern ab* in defkklic\n" ] } ], "source": [ "import re\n", "\n", "pattern = \"ab*\"\n", "\n", "def doMatch(pattern, searchString):\n", " if re.search(pattern, searchString):\n", " print(\"Match found!\")\n", " else:\n", " print(\"No match of pattern\", pattern, \"in\", searchString)\n", "\n", "doMatch(pattern, \"defabbbc\") # this will print \"Match found!\"\n", "doMatch(pattern, \"defkklic\") # this will print \"No match of pattern ab* in defkklic\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we then extract information for all matches ?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "a\n", "b\n", "b\n" ] } ], "source": [ "# what happens here ?\n", "\n", "import re\n", "\n", "pattern = \"ab*\"\n", "\n", "myMatches = re.search(pattern, \"defabbckipabpoaoccdabbbb\")\n", "print(myMatches)\n", "for match in myMatches.group():\n", " print(match)\n", " " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['abb', 'ab', 'a', 'abbbb']\n", "abb\n", "ab\n", "a\n", "abbbb\n" ] } ], "source": [ "# use findall instead of search\n", "\n", "import re\n", "\n", "pattern = \"ab*\"\n", "\n", "myMatches = re.findall(pattern, \"defabbckipabpoaoccdabbbb\")\n", "print(myMatches)\n", "for match in myMatches:\n", " print(match)\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Match at positions: 3-6\n", "Match at positions: 10-12\n", "Match at positions: 14-15\n", "Match at positions: 19-24\n" ] } ], "source": [ "# or even better, use finditer\n", "\n", "import re\n", "\n", "pattern = \"ab*\"\n", "\n", "myMatches = re.finditer(pattern, \"defabbckipabpoaoccdabbbb\")\n", "for match in myMatches:\n", " print(\"Match <{}> at positions: {}-{}\".format(match.group(0), match.start(), match.end())) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Group extraction\n", "\n", "The ```group``` feature allows you to extract the different parts of a matched substring. \n", "Suppose we want to decompose an email address into username and domain. We can create a group by adding ```(``` and ```)``` parenthesis around the parts of the regex matching the username and domain, and then extract groups from the matches. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "john.doe@nih.gov\n", "john.doe\n", "nih.gov\n" ] } ], "source": [ "import re\n", "pattern = \"([\\w.-]+)@([\\w.-]+)\" # patter marching (everything in lowercase)@(everything in lowercase)\n", "\n", "m = re.match(pattern, \"john.doe@nih.gov\")\n", "print(m.group(0))\n", "print(m.group(1))\n", "print(m.group(2))\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "john.doe@nih.gov\n", "john.doe\n", "nih.gov\n" ] } ], "source": [ "import re\n", "# groups can also be named for ease of extraction\n", "pattern = \"(?P[\\w.-]+)@(?P[\\w.-]+)\"\n", "\n", "m = re.match(pattern, \"john.doe@nih.gov\")\n", "print(m.group(0))\n", "print(m.group(\"username\"))\n", "print(m.group(\"domain\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercises\n", "\n", "1. Write a regex to check that a string is alphanumeric, i.e. it contains only a the ```a-z```, ```A-Z``` and ```0-9``` set of characters.\n", "\n", "
Show/Hide Solution\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "False\n" ] } ], "source": [ "import re\n", "\n", "def isStringAlphanumeric(string):\n", " isMatch = re.search(\"[^a-zA-Z\\s0-9.]\", string)\n", " return not isMatch\n", "\n", "print(isStringAlphanumeric(\"The lazy dog jumped over the quick brown fox.\")) \n", "print(isStringAlphanumeric(\"*&%@#!}{\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "\n", "2. Write a regex that matches a word containing the letter ```z```.\n", "\n", "
Show/Hide Solution\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found a match!\n", "No match found!\n" ] } ], "source": [ "import re\n", "\n", "def matchZ(text):\n", " pattern = '\\w*z\\w*'\n", " if re.search(pattern, text):\n", " return 'Found a match!'\n", " else:\n", " return 'No match found!'\n", "\n", "print(matchZ(\"The quick brown fox jumps over the lazy dog.\"))\n", "print(matchZ(\"Python Exercises.\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "\n", "### re.compile\n", "\n", "Searching can also be performed by *compiling* the regex with ```re.compile(pattern)```, and using the returned object of the ```re.Pattern``` type to call ```search``` without having to specify the pattern every time search is called" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "myRegex = re.compile(r'do.+')\n", "myRegex.search(\"this animal is a donkey\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Substitution\n", "\n", "The ```re.sub``` function allows you to search for matches to a given regex and replace them by something else in your input string. The replacement string can include '\\1', '\\2', ..., '\\X' which refer to the text from group(1), group(2), ..., group(X) from the original matching text." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "john.doe@unitn.it\n", "luke.skywalker@nih.gov\n" ] } ], "source": [ "import re\n", "\n", "pattern = \"([\\w\\.-]+)@([\\w\\.-]+)\"\n", "print(re.sub(pattern, r\"\\1@unitn.it\", \"john.doe@nih.gov\"))\n", "print(re.sub(pattern, r\"luke.skywalker@\\2\", \"john.doe@nih.gov\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "Write a regex to remove all whitespaces from the input string.\n", "\n", "
Show/Hide Solution\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original string: The quick brown fox jumps over the lazy dog.\n", "Without spaces: Thequickbrownfoxjumpsoverthelazydog.\n" ] } ], "source": [ "import re\n", "\n", "text = \"The quick brown fox jumps over the lazy dog.\"\n", "print(\"Original string:\", text)\n", "print(\"Without spaces:\", re.sub(r'\\s+', '', text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "\n", "## Exercises\n", "\n", "1. Write a regex to convert a date of yyyy-mm-dd format to dd-mm-yyyy format.\n", "\n", "
Show/Hide Solution\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "def changeDateFormat(dt):\n", " return re.sub(r'(\\d{4})-(\\d{1,2})-(\\d{1,2})', '\\\\3-\\\\2-\\\\1', dt)\n", "\n", "myDate = \"2020-11-16\"\n", "\n", "print(\"Date in the YYY-MM-DD format: \", myDate)\n", "print(\"Date in the DD-MM-YYYY format: \", changeDateFormat(myDate))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "\n", "2. Use a regex to find all words starting with 'a' or 'e' in a given string.\n", "\n", "
Show/Hide Solution\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "text = \"\"\"The following example creates an ArrayList with a capacity of 50 elements. \n", "Four elements are then added to the ArrayList and the ArrayList is trimmed accordingly.\"\"\"\n", "\n", "list = re.findall(\"\\b[ae]\\w*\", text)\n", "\n", "print(list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "\n", "3. Write a regex to insert spaces between words starting with capital letters (e.g. \"CamelCase\" should become \"Camel Case\")\n", "\n", "
Show/Hide Solution\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "test = 'CamelCaseTestPhrase'\n", "splittedTmp = re.sub('([A-Z]+)', r' \\1', test)\n", "splittedFinal = splittedTmp[1:] # to remove first space...\n", "\n", "print(splittedFinal)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
" ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "3.10.14", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 2 }