Module 2, Practical 4¶
In this practical we will learn how to deal with regular expressions.
Regular expressions¶
Regular expressions, or regex, are a powerful language for text pattern matching. They are extremely useful for finding complex patterns of characters in order to filter, replace, and validate user input. Regex are employed in search engines, in the find-and-replace functions of word processors, and in text processing utilities such as awk or grep.
For example: How do I check that a string contains only alphabetic characters? How do I check that an email address is properly formatted (i.e. user@domain.com)? How do I find all strings containing years from 2010 to 2017?
Python provides a module, re, to write and match regular expressions. But how do I write a regex?
Basic Patterns¶
(from https://developers.google.com/edu/python/regular-expressions )
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:
a, X, 9, <
– ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( )
. (a period)
– matches any single character except newline ‘\n’
+
means at least one instance of the preceding character
*
means zero or more instances of the preceding character
\w
– (lowercase w) matches a “word” character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although “word” is the mnemonic for this, it only matches a single word char, not a whole word.
\W
(upper case W) matches any non-word character.
\b
– boundary between word and non-word
\s
– (lowercase s) matches a single whitespace character – space, newline, return, tab, form feed [ \n\r\t\f].
\S
(upper case S) matches any non-whitespace character.
\t, \n, \r
– tab, newline, return
\d
– decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
^
= start, $
= end – match the start or end of the string
\
– inhibit the “specialness” of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as ‘@’, you can put a backslash in front of it, \@, to make sure it is treated just as a character.
{m,n}
– causes the resulting regex to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.
[]
– used to indicate a set of characters, e.g. the digits from 0 to 9 = [0-9]
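The patterns listed above can be tried out one by one. The following is a minimal sketch, assuming made-up sample strings; each pattern is written as a raw string (the r prefix) so that Python does not interpret the backslashes itself.

```python
import re

# Each tuple is (pattern, text); the sample texts are invented for illustration.
examples = [
    (r"\d+", "room 42"),               # one or more digits
    (r"\w+", "hello, world"),          # first run of word characters
    (r"\s", "no_space here"),          # a single whitespace character
    (r"c.t", "a cat and a cot"),       # '.' matches any character except newline
    (r"3\.14", "pi is 3.14"),          # escaped '.' matches a literal period
    (r"a{2,3}", "caaaandy"),           # two to three a's, as many as possible
    (r"[aeiou]", "rhythm and blues"),  # any one character from the set
    (r"^abc", "abcdef"),               # 'abc' only at the start of the string
    (r"def$", "abcdef"),               # 'def' only at the end of the string
    (r"\bcat\b", "concatenate cat"),   # 'cat' delimited by word boundaries
]

results = []
for pattern, text in examples:
    m = re.search(pattern, text)
    found = m.group(0) if m else None  # matched text, or None if no match
    results.append(found)
    print(repr(pattern), "in", repr(text), "->", repr(found))
```

Note that search reports only the first match in each string, so the last example finds the standalone word and skips the "cat" inside "concatenate".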
Let’s see an example. Write a regex that matches an a followed by zero or more b’s:
[6]:
import re
pattern = "ab*"
print(re.search(pattern, "ac"))
print(re.search(pattern, "abc"))
print(re.search(pattern, "abbc"))
print(re.search(pattern, "cdfegh")) # this will answer None...
# another way of searching...
m = re.search(pattern, "abbc")
print(m.start())
print(m.end())
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='abb'>
None
0
3
match objects always have a boolean value of True. If no match is found, search returns None, allowing you to test for the presence of matches like this:
[ ]:
import re
pattern = "ab*"
def doMatch(pattern, searchString):
    if re.search(pattern, searchString):
        print("Match found!")
    else:
        print("No match of pattern", pattern, "in", searchString)
doMatch(pattern, "defabbbc") # this will print "Match found!"
doMatch(pattern, "defkklic") # this will print "No match of pattern ab* in defkklic"
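The same truth test works with the other matching functions of the re module. As a small sketch (the variable names are only illustrative): search looks for the pattern anywhere in the string, match only at its beginning, and fullmatch requires the whole string to match.

```python
import re

pattern = r"ab*"

# search: the pattern may occur anywhere in the string
found_anywhere = re.search(pattern, "defabbbc")
print(found_anywhere)

# match: the pattern must match at the very start of the string
at_start_fail = re.match(pattern, "defabbbc")
at_start_ok = re.match(pattern, "abbbcdef")
print(at_start_fail, at_start_ok)

# fullmatch: the pattern must match the entire string
whole_fail = re.fullmatch(pattern, "abbbc")
whole_ok = re.fullmatch(pattern, "abbb")
print(whole_fail, whole_ok)
```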
Can we then extract information for all matches?
[8]:
# what happens here ?
import re
pattern = "ab*"
myMatches = re.search(pattern, "defabbckipabpoaoccdabbbb")
print(myMatches)
for match in myMatches.group():
    print(match)
<re.Match object; span=(3, 6), match='abb'>
a
b
b
search returns only the first match ('abb'), and group() returns it as a string, so the loop iterates over its characters one at a time.
[9]:
# use findall instead of search
import re
pattern = "ab*"
myMatches = re.findall(pattern, "defabbckipabpoaoccdabbbb")
print(myMatches)
for match in myMatches:
    print(match)
['abb', 'ab', 'a', 'abbbb']
abb
ab
a
abbbb
[10]:
# or even better, use finditer
import re
pattern = "ab*"
myMatches = re.finditer(pattern, "defabbckipabpoaoccdabbbb")
for match in myMatches:
    print("Match <{}> at positions: {}-{}".format(match.group(0), match.start(), match.end()))
Match <abb> at positions: 3-6
Match <ab> at positions: 10-12
Match <a> at positions: 14-15
Match <abbbb> at positions: 19-24
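With findall and finditer we can also answer one of the opening questions: how to find all years from 2010 to 2017. The sketch below is one possible approach, assuming years are written as four-digit numbers separated from the surrounding text by word boundaries; the sample text is made up.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Papers from 2009, 2010, 2013 and 2017; 2018 is excluded, and so is 20155."

# 201 followed by one digit from 0 to 7, delimited by word boundaries
year_pattern = r"\b201[0-7]\b"

years = re.findall(year_pattern, text)
print(years)

# finditer additionally tells us where each year was found
for match in re.finditer(year_pattern, text):
    print("Year {} at positions: {}-{}".format(match.group(0), match.start(), match.end()))
```

The \b boundaries are what keep 20155 out of the results: without them, the pattern would match the first four digits of that number.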
Group extraction¶
The group feature allows you to extract the different parts of a matched substring. For example, to pull the username and domain out of an email address, put ( and ) parentheses around the parts of the regex matching the username and domain, and then extract the groups from the matches.
[11]:
import re
pattern = r"([\w.-]+)@([\w.-]+)"  # pattern matching (word characters, dots, hyphens)@(word characters, dots, hyphens)
m = re.match(pattern, "john.doe@nih.gov")
print(m.group(0))
print(m.group(1))
print(m.group(2))
john.doe@nih.gov
john.doe
nih.gov
[12]:
import re
# groups can also be named for ease of extraction
pattern = r"(?P<username>[\w.-]+)@(?P<domain>[\w.-]+)"
m = re.match(pattern, "john.doe@nih.gov")
print(m.group(0))
print(m.group("username"))
print(m.group("domain"))
john.doe@nih.gov
john.doe
nih.gov
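Besides group, match objects offer groups() and groupdict() to get all groups at once. The following sketch also shows that findall returns one tuple per match when the pattern contains groups; the input strings are made up for illustration.

```python
import re

pattern = r"(?P<username>[\w.-]+)@(?P<domain>[\w.-]+)"

m = re.search(pattern, "contact: john.doe@nih.gov")
if m:  # guard against None in case nothing matches
    print(m.groups())     # all groups as a tuple
    print(m.groupdict())  # named groups as a dictionary

# With groups in the pattern, findall returns a list of tuples
emails = "john.doe@nih.gov, jane@unitn.it"
pairs = re.findall(pattern, emails)
for username, domain in pairs:
    print(username, "is at", domain)
```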
Exercises¶
Write a regex to check that a string is alphanumeric, i.e. it contains only characters from the a-z, A-Z and 0-9 sets.
Write a regex that matches a word containing the letter z.
re.compile¶
Searching can also be performed by compiling the regex with re.compile(pattern), and using the returned object of the re.Pattern type to call search without having to specify the pattern every time search is called.
[29]:
import re
myRegex = re.compile(r'do.+')
myRegex.search("this animal is a donkey")
[29]:
<re.Match object; span=(17, 23), match='donkey'>
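A compiled pattern pays off when the same regex is applied to many strings. The sketch below is only an illustration: it uses the slightly different pattern do\w+ (so that matches stop at the end of a word) and made-up sentences, and it shows that flags such as re.IGNORECASE can be passed to re.compile.

```python
import re

# Compile once, reuse on every sentence
word_with_do = re.compile(r"do\w+")

sentences = ["this animal is a donkey", "a dolphin and a dog", "nothing here"]
found_per_sentence = [word_with_do.findall(sentence) for sentence in sentences]
for sentence, found in zip(sentences, found_per_sentence):
    print(sentence, "->", found)

# Flags are given at compile time and apply to every later call
case_insensitive = re.compile(r"do\w+", re.IGNORECASE)
mixed_case = case_insensitive.findall("Donkeys do not like dogs")
print(mixed_case)
```

The standalone word "do" is not reported in the last call, because \w+ requires at least one more word character after "do".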
Substitution¶
The re.sub function allows you to search for matches to a given regex and replace them by something else in your input string. The replacement string can include ‘\1’, ‘\2’, …, ‘\X’ which refer to the text from group(1), group(2), …, group(X) from the original matching text.
[34]:
import re
pattern = r"([\w.-]+)@([\w.-]+)"
print(re.sub(pattern, r"\1@unitn.it", "john.doe@nih.gov"))
print(re.sub(pattern, r"luke.skywalker@\2", "john.doe@nih.gov"))
john.doe@unitn.it
luke.skywalker@nih.gov
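re.sub has a few more options worth knowing. As a hedged sketch with made-up input: named groups can be referenced with \g<name>, the replacement can be a function that receives each match object, and the count argument limits the number of replacements. The helper mask_username is hypothetical and exists only for this example.

```python
import re

pattern = r"(?P<username>[\w.-]+)@(?P<domain>[\w.-]+)"
text = "write to john.doe@nih.gov or jane@unitn.it"

# \g<name> refers to a named group in the replacement string
readable = re.sub(pattern, r"\g<username> at \g<domain>", text)
print(readable)

# A function can compute the replacement from each match object
def mask_username(match):
    return "***@" + match.group("domain")

masked = re.sub(pattern, mask_username, text)
print(masked)

# count limits how many matches are replaced (here: only the first)
first_only = re.sub(pattern, "<hidden>", text, count=1)
print(first_only)
```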
Exercise¶
Write a regex to remove all whitespace characters from the input string.
Exercises¶
Write a regex to convert a date of yyyy-mm-dd format to dd-mm-yyyy format.
Use a regex to find all words starting with ‘a’ or ‘e’ in a given string.
Write a regex to insert spaces between words starting with capital letters (e.g. “CamelCase” should become “Camel Case”)