Python RegEx
In this tutorial, you will learn about regular expressions (RegEx) and how to use Python's re
module to work with RegEx.
A Regular Expression (RegEx) is a sequence of characters that represents a search pattern.
RegEx can be used to verify if a string contains the specified search pattern.
Let us see an example of a RegEx.
^w........n$
The above code defines a RegEx pattern. The pattern matches any 10-character string starting with w and ending with n.
A pattern created using RegEx can be used to match against a string.
Expression | String | Is Matched? |
---|---|---|
^w........n$ | watern | No |
^w........n$ | watermelon | Yes |
^w........n$ | whats | No |
^w........n$ | wabcdef | No |
^w........n$ | waaabbccddn | No (11 characters) |
RegEx Module
Python offers a built-in package called re to work with regular expressions.
Let us see the following example:
import re
pattern = '^w........n$'
test = 'watermelon'
result = re.match(pattern, test)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")
Output
Search successful.
As we can see above, we used the re.match() function to check whether the pattern matches at the beginning of the test string. The re.match() function returns a match object if the match succeeds; otherwise, it returns None.
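To tie this back to the table, the sketch below runs the same pattern against a few of the sample strings. The names samples and results are illustrative and not part of the tutorial.

```python
import re

pattern = r'^w........n$'  # w, any eight characters, then n
samples = ['watern', 'watermelon', 'whats', 'wabcdef']

# re.match() returns a Match object or None; bool() turns that into True/False
results = {s: bool(re.match(pattern, s)) for s in samples}

for s, ok in results.items():
    print(f"{s}: {'Yes' if ok else 'No'}")
```

Only watermelon has exactly ten characters with w first and n last, so it is the only string reported as a match.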
Before going deeper into the functions that the re module offers, let us first learn about regular expressions themselves.
Specify Pattern Using RegEx
Metacharacters are used to define regular expressions. In the above example, ^ and $ are metacharacters.
Metacharacters
Metacharacters are characters with a special meaning which are interpreted by the RegEx engine.
Here's a list of metacharacters: [] . ^ $ * + ? {} () \ |
Square brackets - []
Square brackets define a set of characters you want to match.
Expression | String | Matched? |
---|---|---|
[xyz] | x | 1 match |
[xyz] | hello | No match |
[xyz] | xz | 2 matches |
[xyz] | xyz is xyz | 6 matches |
Above, the expression [xyz] matches if the string contains any of the characters x, y, or z.
We can also define a range of characters using a hyphen - inside square brackets.
[a-g] is the same as [abcdefg].
[1-7] is the same as [1234567].
[1-58] is the same as [123458].
We can invert the set by placing a caret ^ at the start of the square brackets.
[^abcd] means any character except a, b, c, or d.
[^0-9] means any non-digit character.
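The three forms of character set described above (explicit set, range, and negated set) can be tried out with re.findall(), which returns every match as a list. This is a minimal sketch; the variable names are illustrative.

```python
import re

# Explicit set: any one of x, y, z
in_set = re.findall(r'[xyz]', "xyz is xyz")

# Range plus a single character: 1 to 5, or 8
in_range = re.findall(r'[1-58]', "0123456789")

# Negated set: any character that is not a digit
not_digits = re.findall(r'[^0-9]', "a1b2")

print(len(in_set))   # 6
print(in_range)      # ['1', '2', '3', '4', '5', '8']
print(not_digits)    # ['a', 'b']
```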
Period - .
A period . matches any single character (except the newline \n).
Expression | String | Matched? |
---|---|---|
... | a | No match |
... | ad | No match |
... | abcd | 1 match |
... | abcdefg | 2 matches (contains 7 characters) |
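As a quick check of the table, the sketch below counts non-overlapping three-character matches and also shows that a period skips the newline. The variable names are illustrative.

```python
import re

three_chars = r'...'  # any three consecutive characters, none of them a newline

short = re.findall(three_chars, "ad")        # too short to match
four = re.findall(three_chars, "abcd")       # one match, 'd' is left over
seven = re.findall(three_chars, "abcdefg")   # two matches, 'g' is left over
across_newline = re.findall(r'.', "a\nb")    # the newline is not matched

print(short)            # []
print(four)             # ['abc']
print(seven)            # ['abc', 'def']
print(across_newline)   # ['a', 'b']
```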
Caret - ^
The caret symbol ^ is used to check whether a string starts with a certain character.
Expression | String | Matched? |
---|---|---|
^c | c | 1 match |
^c | study | No match |
^c | car | 1 match |
^ho | hope | 1 match |
^ho | hello | No match |
^ho | hold | 1 match |
Dollar - $
The dollar symbol $ is used to check whether a string ends with a certain character.
Expression | String | Matched? |
---|---|---|
d$ | word | 1 match |
d$ | d | 1 match |
d$ | reminder | No match |
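The two anchors can be compared side by side. The sketch below filters a small, made-up word list with ^c (starts with c) and d$ (ends with d); note that the dollar sign goes after the character it anchors.

```python
import re

words = ["word", "d", "reminder", "car", "study"]

# ^ anchors the pattern to the start of the string
starts_with_c = [w for w in words if re.search(r'^c', w)]

# $ anchors the pattern to the end of the string
ends_with_d = [w for w in words if re.search(r'd$', w)]

print(starts_with_c)  # ['car']
print(ends_with_d)    # ['word', 'd']
```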
Star - *
The star symbol * is used to match zero or more occurrences of the pattern to its left.
Expression | String | Matched? |
---|---|---|
hu*n | hn | 1 match |
hu*n | hun | 1 match |
hu*n | huuun | 1 match |
hu*n | human | No match (`u` is not followed by `n`) |
hu*n | carhun | 1 match |
Plus - +
The plus symbol + is used to match one or more occurrences of the pattern to its left.
Expression | String | Matched? |
---|---|---|
hu+n | hn | No match (no `u` character) |
hu+n | hun | 1 match |
hu+n | huuun | 1 match |
hu+n | human | No match (`u` is not followed by `n`) |
hu+n | carhun | 1 match |
Question Mark - ?
The question mark symbol ? is used to match zero or one occurrence of the pattern to its left.
Expression | String | Matched? |
---|---|---|
hu?n | hn | 1 match |
hu?n | hun | 1 match |
hu?n | huuun | No match (more than one `u` character) |
hu?n | human | No match (`u` is not followed by `n`) |
hu?n | carhun | 1 match |
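To see how the three quantifiers differ, the sketch below runs hu*n, hu+n, and hu?n over the same sample strings. The dictionary keys and variable names are illustrative.

```python
import re

samples = ["hn", "hun", "huuun", "human"]
patterns = {
    "star": r'hu*n',      # zero or more u
    "plus": r'hu+n',      # one or more u
    "question": r'hu?n',  # zero or one u
}

# For each quantifier, keep the samples in which the pattern is found
matched = {
    name: [s for s in samples if re.search(p, s)]
    for name, p in patterns.items()
}

for name, hits in matched.items():
    print(name, hits)
```

Only * accepts all of hn, hun, and huuun; + rejects hn, and ? rejects huuun. None of them match human, because the u there is not followed by n.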
Braces - {}
The braces symbol {} is used to specify how many times the pattern to its left may repeat.
The code {a,b} means at least a, and at most b, repetitions of the pattern. Do not put a space after the comma: Python treats {a, b} as literal text rather than as a quantifier.
Expression | String | Matched? |
---|---|---|
e{2,3} | cdef bef | No match |
e{2,3} | cdef beef | 1 match (ee in beef) |
e{2,3} | cdeef beeef | 2 matches (ee in cdeef and eee in beeef) |
e{2,3} | cdeef beeeef | 2 matches (ee in cdeef and eee in beeeef) |
Let us try another example. The RegEx [0-9]{3,4} matches at least 3 digits but not more than 4 digits.
Expression | String | Matched? |
---|---|---|
[0-9]{3,4} | abc957mn | 1 match (957) |
[0-9]{3,4} | 578 and 9863127 | 3 matches (578, 9863, 127) |
[0-9]{3,4} | 45 and 96 | No match |
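The repetition counts can be checked with re.findall(). This is a minimal sketch with illustrative variable names; the last call shows what Python's re module does when a space follows the comma inside the braces.

```python
import re

# Two or three consecutive e characters (greedy: takes three when available)
e_runs = re.findall(r'e{2,3}', "cdeef beeeef")

# Runs of three or four digits
digit_runs = re.findall(r'[0-9]{3,4}', "578 and 9863127")

# With a space after the comma the braces are treated as literal text,
# so this looks for the literal characters "e{2, 3}" and finds nothing
spaced = re.findall(r'e{2, 3}', "cdeef beeeef")

print(e_runs)      # ['ee', 'eee']
print(digit_runs)  # ['578', '9863', '127']
print(spaced)      # []
```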
Alternation - |
The vertical bar symbol | is used for alternation (the or operator).
Expression | String | Matched? |
---|---|---|
`c|d` | abef | No match |
`c|d` | abcef | 1 match (c) |
`c|d` | abcdef | 2 matches (c and d) |
Above, the RegEx c|d matches any string that contains either c or d.
Group - ()
The parentheses symbol () is used to group sub-patterns. For example, (m|n|o)ab matches any string that contains m, n, or o followed by ab.
Expression | String | Matched? |
---|---|---|
`(m|n|o)ab` | mn ab | No match |
`(m|n|o)ab` | mnab | 1 match (nab) |
`(m|n|o)ab` | mab denab | 2 matches (mab and nab) |
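Alternation and grouping can be explored together. The sketch below uses re.finditer(), which yields one match object per match, because re.findall() returns only the group contents when a pattern contains a single group. The variable names are illustrative.

```python
import re

text = "mab denab"
pattern = r'(m|n|o)ab'

# group(0) is the whole match; group(1) is what the parentheses captured
full = [m.group(0) for m in re.finditer(pattern, text)]
captured = [m.group(1) for m in re.finditer(pattern, text)]

# Plain alternation without a group
either = re.findall(r'c|d', "abcdef")

print(full)      # ['mab', 'nab']
print(captured)  # ['m', 'n']
print(either)    # ['c', 'd']
```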
Backslash - \
The backslash symbol \ is used to escape various characters, including all metacharacters.
For example, the RegEx \$c matches if a string contains $ followed by c. Here, $ is not interpreted specially by the RegEx engine (it is escaped).
If you are unsure whether a non-alphanumeric character has a special meaning, you can put \ in front of it. This ensures the character is treated literally.
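The escaping described above can be written by hand or delegated to re.escape(), a standard-library helper that inserts the backslashes for you. This is a small sketch; the sample text is made up.

```python
import re

text = "price: $c and c$"

# Hand-written escape: a literal dollar sign followed by c
escaped = re.findall(r'\$c', text)

# re.escape() produces the same escaped pattern from plain text
auto = re.escape("$c")
auto_found = re.findall(auto, text)

print(escaped)     # ['$c']
print(auto)        # \$c
print(auto_found)  # ['$c']
```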
Special Sequences
Special sequences make commonly used patterns easier to write. Here is a list of special sequences:
The RegEx \A matches if the specified characters are at the start of a string.
Expression | String | Matched? |
---|---|---|
\Athe | the earth | Match |
\Athe | In the earth | No match |
The RegEx \b matches if the specified characters are at the beginning or end of a word.
Expression | String | Matched? |
---|---|---|
\bbasket | basketball | Match |
\bbasket | a basketball | Match |
\bbasket | abasketball | No match |
basket\b | the basket | Match |
basket\b | the abasket test | Match |
basket\b | the abaskettest | No match |
The RegEx \B is the opposite of \b. It matches if the specified characters are not at the beginning or end of a word.
Expression | String | Matched? |
---|---|---|
\Bbasket | basketball | No match |
\Bbasket | a basketball | No match |
\Bbasket | abasketball | Match |
basket\B | the basket | No match |
basket\B | the abasket test | No match |
basket\B | the abaskettest | Match |
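The difference between \b and \B is easiest to see by running both against the same strings. The sketch below reuses three sample strings from the tables; the list names are illustrative.

```python
import re

samples = ["basketball", "a basketball", "abasketball"]

# \b requires a word boundary just before "basket"
at_word_start = [s for s in samples if re.search(r'\bbasket', s)]

# \B requires that there is no word boundary just before "basket"
inside_word = [s for s in samples if re.search(r'\Bbasket', s)]

print(at_word_start)  # ['basketball', 'a basketball']
print(inside_word)    # ['abasketball']
```

Raw strings matter here: in an ordinary Python string literal, '\b' is a backspace character rather than a word boundary.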
The RegEx \d matches any decimal digit. For ASCII text, it is equivalent to [0-9].
Expression | String | Matched? |
---|---|---|
\d | 65abcdef9 | 3 matches (6, 5, and 9) |
\d | Hello | No match |
The RegEx \D matches any character that is not a decimal digit. For ASCII text, it is equivalent to [^0-9].
Expression | String | Matched? |
---|---|---|
\D | 6abc19_4 | 4 matches (a, b, c, and _) |
\D | 5454 | No match |
The RegEx \s matches where a string contains any whitespace character. It is equivalent to [ \t\n\r\f\v].
Expression | String | Matched? |
---|---|---|
\s | Hello World | 1 match |
\s | HelloWorld | No match |
The RegEx \S matches where a string contains any non-whitespace character. It is equivalent to [^ \t\n\r\f\v].
Expression | String | Matched? |
---|---|---|
\S | Hello World | 10 matches (every letter of Hello and World) |
\S | (a string of spaces only) | No match |
The RegEx \w matches any alphanumeric character (digits and letters). For ASCII text, it is equivalent to [a-zA-Z0-9_]. The underscore _ is also considered an alphanumeric character.
Expression | String | Matched? |
---|---|---|
\w | 63&@; :d | 3 matches (6, 3, and d) |
\w | ?!"@ | No match |
The RegEx \W matches any non-alphanumeric character. For ASCII text, it is equivalent to [^a-zA-Z0-9_].
Expression | String | Matched? |
---|---|---|
\W | 45sh?fdg4 | 1 match (?) |
\W | Hello | No match |
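The six shorthand classes come in complementary pairs, which the following sketch illustrates on one made-up string. Each lowercase class and its uppercase counterpart together account for every character in the text.

```python
import re

text = "a1 b_2?"

digits = re.findall(r'\d', text)      # decimal digits
non_digits = re.findall(r'\D', text)  # everything else
spaces = re.findall(r'\s', text)      # whitespace
non_spaces = re.findall(r'\S', text)  # everything except whitespace
word_chars = re.findall(r'\w', text)  # letters, digits, underscore
non_word = re.findall(r'\W', text)    # everything else

print(digits)      # ['1', '2']
print(non_digits)  # ['a', ' ', 'b', '_', '?']
print(spaces)      # [' ']
print(word_chars)  # ['a', '1', 'b', '_', '2']
print(non_word)    # [' ', '?']
```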
The RegEx \Z matches if the specified characters are at the end of a string.
Expression | String | Matched? |
---|---|---|
World\Z | Hello World | 1 match |
World\Z | Hello World from the other side | No match |
World\Z | The world is huge | No match |
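The start and end anchors \A and \Z can be checked with a short sketch like the one below. The variable names are illustrative.

```python
import re

text = "Hello World"

# \A pins the pattern to the very start of the string
starts = bool(re.search(r'\AHello', text))

# \Z pins the pattern to the very end of the string
ends = bool(re.search(r'World\Z', text))

# "World" is present here, but not at the end
not_at_end = bool(re.search(r'World\Z', "Hello World from the other side"))

print(starts, ends, not_at_end)  # True True False
```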
Tip: To create and test regular expressions, you can use RegEx tester tools like regex101. Using this tool can make it easy to create regular expressions and understand them.
Now after learning the basics of RegEx, let us see how to use RegEx in Python code.
Python RegEx
To work with regular expressions, Python has a module named re. To use it, we need to import it.
import re
The re module defines several functions and constants for working with RegEx.
re.findall()
The re.findall() function returns a list of strings containing all matches. If the pattern is not found, an empty list is returned.
Example: re.findall()
In the following example, we will create a program to extract numbers from a string.
import re
text = "Hello 45 World. 7 python is awesome."
# raw string, so that \d reaches the RegEx engine unchanged
pattern = r"\d+"
res = re.findall(pattern, text)
print(res)
Output
['45', '7']
re.split()
The re.split() function splits the string wherever the pattern matches and returns a list of the pieces between the matches. If the pattern is not found, a list containing the original string is returned.
Example: re.split()
In the following example, we will create a program to split the string where there is a match.
import re
text = "Hello 45 World. 7 python is awesome."
pattern = r"\d+"
res = re.split(pattern, text)
print(res)
Output
['Hello ', ' World. ', ' python is awesome.']
The re.split() function accepts a maxsplit argument, which is the maximum number of splits that will occur.
The default value of maxsplit is 0, which means all possible splits.
import re
text = "Hello 45 World. 7 python is awesome."
pattern = r"\d+"
# split at the first match only
res = re.split(pattern, text, maxsplit=1)
print(res)
Output
['Hello ', ' World. 7 python is awesome.']
re.sub()
The syntax of re.sub() is as follows:
re.sub(pattern, replace, string)
The re.sub() function returns a string where matched occurrences are replaced with the content of the replace variable. If the pattern is not found, the original string is returned.
Example: re.sub()
In the following example, we will create a program to remove all whitespaces.
import re
text = "hello 4544 de 7747 \n jajhs 6"
# matches each run of whitespace characters
pattern = r"\s+"
replace = ""
res = re.sub(pattern, replace, text)
print(res)
Output
hello4544de7747jajhs6
The re.sub() function accepts an optional count argument as its fourth parameter. If it is omitted, it defaults to 0, which replaces all occurrences.
In the following example, we will create a program to remove only the first run of whitespace.
import re
text = "hello 4544 de 7747 \n jajhs 6"
# matches each run of whitespace characters
pattern = r"\s+"
replace = ""
# replace the first match only
res = re.sub(pattern, replace, text, count=1)
print(res)
Output
hello4544 de 7747
 jajhs 6
re.subn()
The re.subn() function is similar to re.sub(), except that it returns a tuple of two items: the new string and the number of substitutions made.
Example: re.subn()
In the following example, we will create a program to remove all the whitespace.
import re
text = "hello 4544 de 7747 \n jajhs 6"
# matches each run of whitespace characters
pattern = r"\s+"
replace = ""
res = re.subn(pattern, replace, text)
print(res)
Output
('hello4544de7747jajhs6', 5)
re.search()
The re.search() function looks for the first location where the RegEx pattern produces a match with the string. It takes two arguments: a pattern and a string.
The re.search() function returns a match object if the search is successful; otherwise, it returns None.
match = re.search(pattern, string)
Example: re.search()
In the following example, we will create a program to check if "Hello" is at the beginning.
import re
text = "Hello World"
# check if "Hello" is at the beginning
match = re.search(r"\AHello", text)
if match:
    print("pattern found inside the string")
else:
    print("pattern not found")
Output
pattern found inside the string
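Since re.search() scans the whole string while re.match() only tries the beginning, the two can give different results for the same pattern. The sketch below illustrates this on a made-up string.

```python
import re

text = "say Hello"

anywhere = re.search(r'Hello', text)  # scans forward until a match is found
at_start = re.match(r'Hello', text)   # only tries position 0

print(anywhere is not None)  # True
print(at_start)              # None
```

Here re.search() finds Hello at index 4, whereas re.match() returns None because the string does not begin with Hello.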
Match object
A Match object is an object containing information about the search and the result.
If there is no match, the value None is returned instead of a Match object.
Some of the most used methods and attributes of the Match object are:
match.group()
match.start()
match.end()
match.span()
match.re
match.string
match.group()
The match.group() method returns the part of the string where there is a match.
Example: Match object
import re
text = "47965 478, 111785 457"
# A four-digit number followed by a space and a three-digit number
pattern = r"(\d{4}) (\d{3})"
# match variable contains a Match object
match = re.search(pattern, text)
if match:
    print(match.group())
else:
    print("pattern not found")
Output
7965 478
In the above code, the match variable contains a Match object.
The pattern (\d{4}) (\d{3}) has two subgroups, (\d{4}) and (\d{3}). We can also get the parts of the string matched by these parenthesized subgroups by executing the following code:
>>> match.group(1)
'7965'
>>> match.group(2)
'478'
>>> match.group(1, 2)
('7965', '478')
>>> match.groups()
('7965', '478')
match.start(), match.end() and match.span()
The start() method returns the index of the start of the matched substring. In the same way, the end() method returns the end index of the matched substring.
>>> match.start()
1
>>> match.end()
9
The span() method returns a tuple containing the start and end index of the matched part.
>>> match.span()
(1, 9)
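The indices returned by span() are ordinary slice indices, so slicing the original string with them gives back the matched text. The sketch below repeats the search from the example above; span() also accepts a group number to locate an individual subgroup.

```python
import re

text = "47965 478, 111785 457"
match = re.search(r'(\d{4}) (\d{3})', text)

start, end = match.span()

# Slicing with the span reproduces match.group()
same = text[start:end] == match.group()

# Spans of the two parenthesized subgroups
group_spans = (match.span(1), match.span(2))

print(same)         # True
print(group_spans)  # ((1, 5), (6, 9))
```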
match.re and match.string
The re attribute of a match object returns the regular expression object that produced the match. In the same way, the string attribute returns the string that was passed to the search.
>>> match.re
re.compile('(\\d{4}) (\\d{3})')
>>> match.string
'47965 478, 111785 457'
Using r prefix before RegEx
When the r or R prefix is used before a string literal, it denotes a raw string. For example, '\t' is a tab character, whereas r'\t' is two characters: a backslash \ followed by t.
The backslash \ is used to escape various characters, including all metacharacters. However, using the r prefix makes Python treat \ as a normal character in the string literal, so the RegEx engine receives it unchanged.
Example: Raw string using r prefix
import re
text = "\t and \n are escape sequences"
result = re.findall(r'[\n\t]', text)
print(result)
Output
['\t', '\n']
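The effect of the r prefix can be verified by comparing string lengths. The sketch below also shows that the RegEx engine interprets \t itself, which is why the raw pattern still finds a real tab. The variable names are illustrative.

```python
import re

tab = '\t'   # one character: a tab
raw = r'\t'  # two characters: a backslash and the letter t
lengths = (len(tab), len(raw))

# The RegEx engine turns the two-character pattern \t into a tab
tab_found = re.findall(raw, "a\tb")

# To match a literal backslash followed by t, escape the backslash
literal_found = re.findall(r'\\t', r'a\tb')

print(lengths)        # (1, 2)
print(tab_found)      # ['\t']
print(literal_found)  # ['\\t']
```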