Introduction to Regular Expressions
Regular expressions, or regex, are used to identify patterns in text. In this tutorial we will focus on regex in Python, but during your degree you may see them in other languages such as JavaScript and Scala.
Syntax
Before starting, here’s a list of useful special characters in Python’s re module to refer back to.
. - any character except newline.
\n - new line.
\s - white space.
\t - tab.
\d - digit.
^ - the start of a string or line.
$ - the end of a string or line.
* - repeats the previous character 0 or more times.
*? - repeats the previous character 0 or more times (non-greedily).
+ - repeats the previous character 1 or more times.
+? - repeats the previous character 1 or more times (non-greedily).
[abc] - will match one of the characters in the set.
[^abc] - will match any character not in the set.
[a-z] - will match any character between a and z (inclusive).
a|b - matches either a or b.
{n} - matches exactly n occurrences.
{m,n} - matches as many as possible of m to n occurrences.
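A few entries in this list (\d, {n}, a|b, ^ and $) are not exercised in the sections that follow, so here is a minimal sketch of how they might be tried out. The sample string and variable names are made up for illustration only.

```python
import re

# Hypothetical sample text, chosen only for this illustration.
text = "cat 2024 dog 42"

# \d{4} matches exactly four digits in a row.
years = re.findall(r'\d{4}', text)

# a|b matches either alternative.
animals = re.findall(r'cat|dog', text)

# ^ anchors the pattern at the start of the string, $ at the end.
starts_with_cat = re.findall(r'^cat', text)
ends_with_digits = re.findall(r'\d+$', text)

print(years)             # ['2024']
print(animals)           # ['cat', 'dog']
print(starts_with_cat)   # ['cat']
print(ends_with_digits)  # ['42']
```

The patterns are written as raw strings (r'...') so that backslashes such as \d reach the re module unchanged.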
Getting started
To use regex in Python, import the built-in re module. There is also a third-party regex library, but it is not needed yet. To find all occurrences of a pattern, use re.findall(); to also get the location of each match, use re.finditer().
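import re

text = "Using regex for the first time"

# findall returns every match as a list of strings
match = re.findall('e', text)
print(match)

# finditer yields Match objects, which also carry the position
for m in re.finditer('i.', text):
    print(m)  # Match object, position, match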
Running the above code in Python prints:
['e', 'e', 'e', 'e']
<re.Match object; span=(2, 4), match='in'>
<re.Match object; span=(21, 23), match='ir'>
<re.Match object; span=(27, 29), match='im'>
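Printing a Match object shows everything at once, but the individual pieces can also be read separately. The sketch below reuses the sentence from the example above and relies on the standard Match methods group(), start() and end(); the variable name positions is just an illustrative choice.

```python
import re

text = "Using regex for the first time"

# Collect the matched text together with its start and end index.
positions = [(m.group(), m.start(), m.end())
             for m in re.finditer('i.', text)]

for matched, start, end in positions:
    # The slice text[start:end] is the same string as the match itself.
    print(matched, start, end, text[start:end])
```

As with slicing, the end index is exclusive, so a span of (2, 4) covers the characters at positions 2 and 3.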
Match a collection
A [...] set matches any single character listed inside the brackets, while [^...] matches any single character that is not listed.
import re

text = "H0w 0ur m1nds c4n d0 4m4z1ng 7h1ng5!"

# Match any one of the characters m, w or n
match = re.findall('[mwn]', text)
print(match)

# If we want to match lower case letters instead
match = re.findall('[a-z]', text)
print(match)

# The above expression won't match the 'H', space and '!'
# To get them, match everything except the numbers
match = re.findall('[^0-9]', text)
print(match)
Running the above code in Python prints:
['w', 'm', 'n', 'n', 'm', 'n', 'n']
['w', 'u', 'r', 'm', 'n', 'd', 's', 'c', 'n', 'd', 'm', 'z', 'n', 'g', 'h', 'n', 'g']
['H', 'w', ' ', 'u', 'r', ' ', 'm', 'n', 'd', 's', ' ', 'c', 'n', ' ', 'd', ' ', 'm',
'z', 'n', 'g', ' ', 'h', 'n', 'g', '!']
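Several ranges can be placed inside the same set. As a small sketch on the same sentence, [A-Za-z] picks up the capital 'H' that [a-z] missed, and [0-9] collects the digits that were excluded above. The variable names are illustrative only.

```python
import re

text = "H0w 0ur m1nds c4n d0 4m4z1ng 7h1ng5!"

# Two ranges in one set: any upper or lower case English letter.
letters = re.findall('[A-Za-z]', text)

# The opposite of [^0-9]: only the digits.
digits = re.findall('[0-9]', text)

print(''.join(letters))  # Hwurmndscndmznghng
print(''.join(digits))   # 00140441715
```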
Match repeating characters
The + character matches 1 or more occurrences of the previous character, while * matches 0 or more occurrences. Both are greedy (they match as much as possible) unless directly followed by a ?.
import re

text = "The given student ids are 46125656, 87654321, 46464321, 46561256"

# One or more digits in a row
match = re.findall('[0-9]+', text)
print(match)

# Ending in 56
match = re.findall('[0-9]+56', text)
print(match)

# Ending in 56 (non-greedy)
match = re.findall('[0-9]+?56', text)
print(match)

# Matches 0 or more occurrences of the numbers before 56
match = re.findall('[0-9]*?56', text)
print(match)
Running the above code in Python prints:
['46125656', '87654321', '46464321', '46561256']
['46125656', '46561256']
['461256', '4656', '1256']
['461256', '56', '4656', '1256']
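The same greedy and non-greedy behaviour applies to the {m,n} quantifier from the syntax list. The following sketch, which uses one of the ids above as an assumed sample, shows the greedy form taking as many digits as allowed and the non-greedy form taking as few as allowed.

```python
import re

# One of the student ids from the example above, used as sample input.
student_id = "46125656"

# Greedy: each match takes up to 4 digits.
greedy = re.findall('[0-9]{2,4}', student_id)

# Non-greedy: each match stops as soon as it has 2 digits.
lazy = re.findall('[0-9]{2,4}?', student_id)

print(greedy)  # ['4612', '5656']
print(lazy)    # ['46', '12', '56', '56']
```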
Exercises
This link can be helpful to debug your code.
- Find all the occurrences of b. Example sentence: he made a bad bid for the bed.
- Find all the occurrences of the word ever. Example sentence: We are never, ever, ever, ever getting back together. Make sure not to include everyone.
- Find all the occurrences of any of the characters in the word ever. Example sentence: We are never, ever, ever, ever getting back together. Make sure not to include everyone.
- Find all the occurrences of the phrase ev or re. Example sentence: We are never, ever, ever, ever getting back together. Make sure not to include everyone.
- Find all occurrences of substrings starting with b and ending with d. Example sentence: Abid made a baaad bid for the bed and got hit in the abdomen.
- Advanced: Find all words in a text. The text can contain multiple sentences ending in one of !?. Only words containing letters from the English alphabet should be included.