Python RegEx



In this tutorial, you will learn about regular expressions (RegEx) and how to use Python's re module to work with RegEx.

A Regular Expression (RegEx) is a sequence of characters that represents a search pattern.

RegEx can be used to verify if a string contains the specified search pattern.

Let us see an example of a RegEx.

^w........n$

The above code defines a RegEx pattern. The pattern matches any ten-character string starting with w and ending with n.

A pattern created using RegEx can be used to match against a string.

Expression      String         Is Matched?
^w........n$    watern         No
                watermelon     Yes
                whats          No
                wabcdef        No
                waaabbccddn    No (11 characters)

RegEx Module

Python offers a built-in package called re to work with Regular Expressions.

Let us see the following example:

import re

pattern = '^w........n$'
test = 'watermelon'
result = re.match(pattern, test)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Output

Search successful.

As we can see above, we used the re.match() function to check whether the pattern matches the test string. The re.match() function looks for a match only at the beginning of the string and returns a match object if one is found; otherwise, it returns None.
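To see how the result depends on the input, the same check can be run against several strings. The following is a minimal sketch; the helper name is_match is hypothetical and is not part of the re module.

```python
import re

pattern = r'^w........n$'

def is_match(text):
    # Hypothetical helper: True when re.match() finds the pattern in text.
    return re.match(pattern, text) is not None

for word in ['watern', 'watermelon', 'whats']:
    print(word, is_match(word))
```

Only watermelon has exactly ten characters, starts with w and ends with n, so it is the only string reported as True.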


Before going deeper into the functions that the re module offers, let us first learn about regular expressions themselves.


Specify Pattern Using RegEx

Metacharacters are used to define regular expressions. In the above example, ^ and $ are metacharacters.


Metacharacters

Metacharacters are characters with a special meaning which are interpreted by the RegEx engine.

Here's a list of metacharacters: [] . ^ $ * + ? {} () \ |


Square Brackets - []

Square brackets define a set of characters you want to match.

Expression    String        Matched?
[xyz]         x             1 match
              hello         No match
              xz            2 matches
              xyz is xyz    6 matches

Above, the expression [xyz] matches if the string contains any of the characters x, y, or z. Each occurrence counts as a separate match.

We can also define a range of characters using a hyphen - inside square brackets.

  • [a-g] is the same as [abcdefg].
  • [1-7] is the same as [1234567].
  • [1-58] is the same as [123458].

We can take the opposite match of the characters by using a caret ^ symbol at the start of a square bracket.

  • [^abcd] means any character except a or b or c or d.
  • [^0-9] means any non-digit character.
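The character sets above can be tried out with re.findall(), which returns every match as a list. This is a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

# A plain set: any one of x, y or z.
in_set = re.findall(r"[xyz]", "xyz is 42")

# A range combined with a single character: 1 to 5, or 8.
in_range = re.findall(r"[1-58]", "0123456789")

# A negated set: anything that is not a digit.
negated = re.findall(r"[^0-9]", "a1b2")

print(in_set)    # ['x', 'y', 'z']
print(in_range)  # ['1', '2', '3', '4', '5', '8']
print(negated)   # ['a', 'b']
```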

Period - .

A period . matches any single character (except newline \n).

Expression    String     Matched?
...           a          No match
              ad         No match
              abcd       1 match
              abcdefg    2 matches (contains 7 characters)

Caret - ^

The caret symbol ^ is used to verify if a string starts with a certain character.

Expression    String    Matched?
^c            c         1 match
              study     No match
              car       1 match
^ho           hope      1 match
              hello     No match
              hold      1 match

Dollar - $

The dollar symbol $ is used to verify if a string ends with a certain character.

Expression    String      Matched?
d$            word        1 match
              d           1 match
              reminder    No match
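Both anchors, ^ and $, can be checked with re.search(). The sketch below filters a made-up list of words; the variable names are chosen only for this illustration.

```python
import re

words = ["c", "study", "car", "word", "reminder"]

# ^c matches only when the string starts with c.
starts_with_c = [w for w in words if re.search(r"^c", w)]

# d$ matches only when the string ends with d.
ends_with_d = [w for w in words if re.search(r"d$", w)]

print(starts_with_c)  # ['c', 'car']
print(ends_with_d)    # ['word']
```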

Star - *

The star symbol * is used to match zero or more occurrences of the pattern to its left.

Expression    String    Matched?
hu*n          hn        1 match
              hun       1 match
              huuun     1 match
              human     No match (u is not followed by n)
              carhun    1 match

Plus - +

The plus symbol + is used to match one or more occurrences of the pattern to its left.

Expression    String    Matched?
hu+n          hn        No match
              hun       1 match
              huuun     1 match
              human     No match (u is not followed by n)
              carhun    1 match

Question Mark - ?

The question mark symbol ? is used to match zero or one occurrence of the pattern to its left.

Expression    String    Matched?
hu?n          hn        1 match
              hun       1 match
              huuun     No match (more than one u character)
              human     No match (u is not followed by n)
              carhun    1 match
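The three quantifiers *, + and ? differ only in how many repetitions they allow, which is easiest to see by running them against the same words. A minimal sketch, using a hypothetical results dictionary to collect the outcome:

```python
import re

words = ["hn", "hun", "huuun", "human", "carhun"]

# For each pattern, keep the words in which it is found.
results = {}
for pattern in [r"hu*n", r"hu+n", r"hu?n"]:
    results[pattern] = [w for w in words if re.search(pattern, w)]
    print(pattern, results[pattern])
```

Here hu*n accepts hn, hun, huuun and carhun; hu+n rejects hn because at least one u is required; hu?n rejects huuun because at most one u is allowed.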

Braces - {}

The braces symbol {} is used to match a specified number of occurrences of the pattern to its left.

The code {a,b} means at least a and at most b repetitions of the pattern, and {a} means exactly a repetitions. Write it without a space after the comma: Python's re module treats {a, b} as literal text, not as a quantifier.

Expression    String          Matched?
e{2,3}        cdef bef        No match
              cdef beef       1 match (ee in beef)
              cdeef beeef     2 matches (ee in cdeef, eee in beeef)
              cdeef beeeef    2 matches (ee in cdeef, eee in beeeef)

Let us try another example. The RegEx [0-9]{3,4} matches at least 3 digits but not more than 4 digits.

Expression    String             Matched?
[0-9]{3,4}    abc957mn           1 match (957)
              578 and 9863127    3 matches (578, 9863, 127)
              45 and 96          No match
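The brace quantifier can be checked with re.findall(). The sketch below also shows what happens when a space is written after the comma: Python then reads the braces as ordinary characters, so nothing in the sample string matches. The variable names are illustrative.

```python
import re

# Two or three consecutive e characters.
e_runs = re.findall(r"e{2,3}", "cdeef beeeef")

# Three or four consecutive digits.
digit_runs = re.findall(r"[0-9]{3,4}", "578 and 9863127")

# With a space after the comma the braces are literal text.
with_space = re.findall(r"e{2, 3}", "cdeef beeeef")

print(e_runs)      # ['ee', 'eee']
print(digit_runs)  # ['578', '9863', '127']
print(with_space)  # []
```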

Alternation - |

The vertical bar symbol | is used for alternation (the or operator).

Expression    String    Matched?
c|d           abef      No match
              abcef     1 match (c)
              abcdef    2 matches (c and d)

Above, the RegEx c|d matches any string that contains either c or d.


Group - ()

The parentheses symbol () is used to group sub-patterns. For example, (m|n|o)ab matches any string that contains m, n, or o followed by ab.

Expression    String       Matched?
(m|n|o)ab     mn ab        No match
              mnab         1 match (nab)
              mab denab    2 matches (mab and nab)
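Alternation and grouping can be tried together in Python. One detail worth knowing: when a pattern contains a group, re.findall() returns the text captured by the group rather than the whole match, so this sketch uses re.finditer() to collect the full matches. The variable names are illustrative.

```python
import re

text = "mab denab"

# Full matches: iterate over Match objects and take the whole match.
whole = [m.group() for m in re.finditer(r"(m|n|o)ab", text)]

# With one group in the pattern, findall() returns only the group.
groups = re.findall(r"(m|n|o)ab", text)

# Plain alternation without a group.
either = re.findall(r"c|d", "abcdef")

print(whole)   # ['mab', 'nab']
print(groups)  # ['m', 'n']
print(either)  # ['c', 'd']
```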

Backslash - \

The backslash symbol \ is used to escape different characters, including all metacharacters.

For example, the RegEx \$c matches if a string contains $ followed by c. Here, $ is not interpreted specially by the RegEx engine (it is escaped).

If we are unsure whether a punctuation character has a special meaning, we can put \ in front of it. This way, the character is sure to be treated literally. This applies to punctuation only: a backslash followed by a letter or digit can form a special sequence such as \d.
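The effect of escaping is easy to observe by comparing an escaped and an unescaped metacharacter. A short sketch with made-up sample strings and variable names:

```python
import re

# \$ matches a literal dollar sign instead of "end of string".
prices = re.findall(r"\$\d+", "a $45 book and a $7 pen")

# An unescaped period matches any character, an escaped one only a period.
any_char = re.findall(r"3.5", "3.5 and 345")
literal_dot = re.findall(r"3\.5", "3.5 and 345")

print(prices)       # ['$45', '$7']
print(any_char)     # ['3.5', '345']
print(literal_dot)  # ['3.5']
```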


Special Sequences

Special sequences make commonly used patterns easier to write. Here is a list of special sequences:


The RegEx \A matches if the specified characters are at the start of a string.

Expression    String          Matched?
\Athe         the earth       Match
              In the earth    No match

The RegEx \b matches if the specified characters are at the beginning or end of a word.

Expression    String              Matched?
\bbasket      basketball          Match
              a basketball        Match
              abasketball         No match
basket\b      the basket          Match
              the abasket test    Match
              the abaskettest     No match

The RegEx \B is the opposite of \b. It matches if the specified characters are not at the beginning or end of a word.

Expression    String              Matched?
\Bbasket      basketball          No match
              a basketball        No match
              abasketball         Match
basket\B      the basket          No match
              the abasket test    No match
              the abaskettest     Match
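The word-boundary sequences \b and \B can be compared by running both against the same strings. A minimal sketch; the list and variable names are chosen for illustration only.

```python
import re

texts = ["basketball", "a basketball", "abasketball"]

# \bbasket: "basket" must begin at a word boundary.
starts_word = [t for t in texts if re.search(r"\bbasket", t)]

# \Bbasket: "basket" must NOT begin at a word boundary.
inside_word = [t for t in texts if re.search(r"\Bbasket", t)]

print(starts_word)  # ['basketball', 'a basketball']
print(inside_word)  # ['abasketball']
```

Note the raw string prefix r: without it, Python would read "\b" as a backspace character before the pattern reaches the RegEx engine.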

The RegEx \d matches any decimal digit. It is equal to [0-9].

Expression    String       Matched?
\d            65abcdef9    3 matches (6, 5, 9)
              Hello        No match

The RegEx \D matches any character that is not a decimal digit. It is equal to [^0-9].

Expression    String      Matched?
\D            6abc19_4    4 matches (a, b, c, _)
              5454        No match

The RegEx \s matches where a string contains any whitespace character. It is equal to [ \t\n\r\f\v].

Expression    String         Matched?
\s            Hello World    1 match
              HelloWorld     No match

The RegEx \S matches where a string contains any non-whitespace character. It is equal to [^ \t\n\r\f\v].

Expression    String           Matched?
\S            Hello World      10 matches (every letter)
              (spaces only)    No match

The RegEx \w matches any alphanumeric character (digits and letters). It is equal to [a-zA-Z0-9_]. The underscore _ is also considered an alphanumeric character.

Expression    String      Matched?
\w            63&@; :d    3 matches (6, 3, d)
              ?!"@        No match

The RegEx \W matches any non-alphanumeric character. It is equal to [^a-zA-Z0-9_].

Expression    String       Matched?
\W            45sh?fdg4    1 match (?)
              Hello        No match
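Each lowercase sequence (\d, \s, \w) and its uppercase opposite (\D, \S, \W) split the characters of a string into two complementary groups. The sketch below shows this on one made-up sample string; the found dictionary is a hypothetical name used only here.

```python
import re

text = "Hi 42_!"

# Collect the characters matched by each special sequence.
found = {}
for seq in [r"\d", r"\D", r"\s", r"\S", r"\w", r"\W"]:
    found[seq] = re.findall(seq, text)
    print(seq, found[seq])
```

For this string, \d finds 4 and 2, \s finds the single space, and \w finds H, i, 4, 2 and the underscore; the uppercase versions find everything else.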

The RegEx \Z matches if the specified characters are at the end of a string.

Expression    String                             Matched?
World\Z       Hello World                        1 match
              Hello World from the other side    No match
              The world is huge                  No match
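The sequences \A and \Z can be verified in Python as follows. The sketch also shows one difference from $: in Python, $ still matches just before a trailing newline, whereas \Z matches only at the very end of the string. The variable names are illustrative.

```python
import re

at_start = bool(re.search(r"\Athe", "the earth"))
not_at_start = bool(re.search(r"\Athe", "In the earth"))

at_end = bool(re.search(r"World\Z", "Hello World"))
not_at_end = bool(re.search(r"World\Z", "The world is huge"))

# A trailing newline: $ matches before it, \Z does not.
dollar_newline = bool(re.search(r"World$", "Hello World\n"))
z_newline = bool(re.search(r"World\Z", "Hello World\n"))

print(at_start, not_at_start)     # True False
print(at_end, not_at_end)         # True False
print(dollar_newline, z_newline)  # True False
```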

Tip: To create and test regular expressions, you can use RegEx tester tools like regex101. Using this tool can make it easy to create regular expressions and understand them.

Now after learning the basics of RegEx, let us see how to use RegEx in Python code.


Python RegEx

To work with regular expressions, Python has a module named re. To use it, we need to import it.

import re

The re module defines different functions and constants to work with RegEx.


re.findall()

The re.findall() method returns a list of strings containing all matches. If the pattern is not found, an empty list is returned.


Example: re.findall()

In the following example, we will create a program to extract numbers from a string.

import re

text = "Hello 45 World. 7 python is awesome."
pattern = r"\d+"  # one or more digits

res = re.findall(pattern, text)
print(res)

Output

['45', '7']

re.split()

The re.split() method divides the string where there is a match and returns a list of strings where the splits have occurred. If the pattern is not found, a list containing the original string is returned.


Example: re.split()

In the following example, we will create a program to split the string where there is a match.

import re

text = "Hello 45 World. 7 python is awesome."
pattern = r"\d+"  # split wherever one or more digits occur

res = re.split(pattern, text)
print(res)

Output

['Hello ', ' World. ', ' python is awesome.']

The re.split() method accepts a maxsplit argument which is the maximum number of splits that will occur.

The default value of the maxsplit argument is 0, which means all possible splits.

import re

text = "Hello 45 World. 7 python is awesome."
pattern = r"\d+"

# split only at the first match
res = re.split(pattern, text, maxsplit=1)
print(res)

Output

['Hello ', ' World. 7 python is awesome.']

re.sub()

The syntax of re.sub() can be given as follows:

re.sub(pattern, replace, string)

The re.sub() method returns a string where matched occurrences are replaced with the content of the replace variable. If the pattern is not found, the original string is returned.


Example: re.sub()

In the following example, we will create a program to remove all whitespaces.

import re

text = "hello 4544 de 7747 \n jajhs 6"

# matches one or more whitespace characters
pattern = r"\s+"

replace = ""

res = re.sub(pattern, replace, text)
print(res)

Output

hello4544de7747jajhs6

The re.sub() method accepts an optional count argument as a fourth parameter. If it is omitted, it defaults to 0 and all occurrences are replaced.

In the following example, we will create a program to remove the first whitespace.

import re

text = "hello 4544 de 7747 \n jajhs 6"

# matches one or more whitespace characters
pattern = r"\s+"

replace = ""

# replace only the first match
res = re.sub(pattern, replace, text, count=1)
print(res)

Output

hello4544 de 7747 
 jajhs 6

re.subn()

The re.subn() is similar to re.sub() except it returns a tuple of 2 items holding the new string and the number of changes performed.


Example: re.subn()

In the following example, we will create a program to remove all the whitespace.

import re

text = "hello 4544 de 7747 \n jajhs 6"

# matches one or more whitespace characters
pattern = r"\s+"

replace = ""

res = re.subn(pattern, replace, text)
print(res)

Output

('hello4544de7747jajhs6', 5)
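Because re.subn() returns a tuple, the two items are often unpacked into separate variables. A brief sketch; the names new_text and count are chosen for illustration.

```python
import re

text = "hello 4544 de 7747 \n jajhs 6"

# Unpack the result: the new string and the number of replacements.
new_text, count = re.subn(r"\s+", "", text)

print(new_text)  # hello4544de7747jajhs6
print(count)     # 5
```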

re.search()

The re.search() method looks for the first location where the RegEx pattern produces a match with the string. It takes two arguments: a pattern and a string.

The re.search() method returns a match object if the search is successful; otherwise, it returns None.

match = re.search(pattern, string)

Example: re.search()

In the following example, we will create a program to check if "Hello" is at the beginning.

import re

text = "Hello World"

# check if "Hello" is at the beginning
match = re.search(r"\AHello", text)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")

Output

pattern found inside the string
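The difference between re.search() and the re.match() function used at the start of this tutorial is where they look: re.match() checks only the beginning of the string, while re.search() scans the whole string. A minimal sketch with illustrative variable names:

```python
import re

text = "Hello World"

# re.match() only tries the start of the string, so "World" is not found.
from_start = re.match(r"World", text)

# re.search() scans forward and finds "World" at index 6.
anywhere = re.search(r"World", text)

print(from_start)       # None
print(anywhere.span())  # (6, 11)
```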

Match object

A Match object is an object containing information about the search and the result.

If there is no match, the value None will be returned instead of the Match Object.

Some of the most used methods and attributes of the Match object are:

  • match.group()
  • match.start()
  • match.end()
  • match.span()
  • match.re
  • match.string

match.group()

The match.group() method returns the part of the string where there is a match.

Example: Match object

import re 

text = "47965 478, 111785 457"

# Four-digit number followed by a space followed by a three-digit number
pattern = r"(\d{4}) (\d{3})"

# match variable contains a Match object
match = re.search(pattern, text)

if match:
    print(match.group())
else:
    print("pattern not found")

Output

7965 478

In the above code, the match variable contains a Match object.

The pattern (\d{4}) (\d{3}) has two subgroups (\d{4}) and (\d{3}). We can also get the part of the string of these parenthesized subgroups by executing the following code:

>>> match.group(1)
'7965'
>>> match.group(2)
'478'
>>> match.group(1, 2)
('7965', '478')
>>> match.groups()
('7965', '478')

match.start(), match.end() and match.span()

The start() method returns the index of the start of the matched substring. In the same way, the end() method returns the end index of the matched substring.

>>> match.start()
1
>>> match.end()
9

The span() method returns a tuple containing the start and end of the matched part.

>>> match.span()
(1, 9)
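The indices returned by start(), end() and span() can be used to slice the original string. The sketch below repeats the search so that it runs on its own; it also passes a group number to span(), which returns the position of that subgroup.

```python
import re

text = "47965 478, 111785 457"
match = re.search(r"(\d{4}) (\d{3})", text)

# Slicing with the span gives the same text as match.group().
start, end = match.span()
sliced = text[start:end]

# span() also accepts a group number.
first_group_span = match.span(1)
second_group_span = match.span(2)

print(sliced)             # 7965 478
print(first_group_span)   # (1, 5)
print(second_group_span)  # (6, 9)
```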

match.re and match.string

The re attribute of a Match object returns the regular expression object that produced it. In the same way, the string attribute returns the string passed to the search.

>>> match.re
re.compile('(\\d{4}) (\\d{3})')

>>> match.string
'47965 478, 111785 457'

Using r prefix before RegEx

When the r or R prefix is used before a string literal, it marks a raw string. For example, '\t' is a single tab character, whereas r'\t' is two characters: a backslash \ followed by t.

Python string literals also use the backslash \ for escape sequences, so a pattern such as \d would otherwise have to be written as '\\d'. With the r prefix, Python leaves each backslash unchanged and passes it on to the RegEx engine.


Example: Raw string using r prefix

import re

text = "\t and \n are escape sequences"

result = re.findall(r'[\n\t]', text)

print(result)

Output

['\t', '\n']
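The effect of the r prefix can be checked directly by comparing string lengths and by writing the same pattern with and without it. A short sketch with illustrative variable names:

```python
import re

# A normal literal turns \t into one tab character; a raw literal keeps both characters.
normal_len = len("\t")
raw_len = len(r"\t")

# Without the r prefix, the backslash must be doubled to reach the RegEx engine.
same_pattern = ("\\d+" == r"\d+")

doubled = re.findall("\\d+", "a1b22")
raw = re.findall(r"\d+", "a1b22")

print(normal_len, raw_len)  # 1 2
print(same_pattern)         # True
print(doubled, raw)         # ['1', '22'] ['1', '22']
```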


Copyright 2020-2021 by ExpectoCode. All Rights Reserved.