Python RegEx

In this tutorial, you will learn about regular expressions (RegEx) and how to use Python's re module to work with RegEx.

A Regular Expression (RegEx) is a sequence of characters that represents a search pattern.

RegEx can be used to verify if a string contains the specified search pattern.

Let us see an example of a RegEx.

^w........n$

The above code defines a RegEx pattern. The pattern matches any 10-character string starting with w and ending with n.

A pattern created using RegEx can be used to match against a string.

Expression	String	Is Matched?
`^w........n$`	`watern`	No
	`watermelon`	Yes
	`whats`	No
	`wabcdef`	No
	`waaabbccddn`	No (11 characters)

RegEx Module

Python offers a built-in package called re to work with Regular Expressions.

Let us see the following example:

import re

pattern = '^w........n$'
test = 'watermelon'
result = re.match(pattern, test)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Output

Search successful.

As we can see above, we used the re.match() function to check whether the pattern matches at the beginning of the test string. re.match() returns a match object if the match is successful; otherwise, it returns None.

Before going deeper into the functions that the re module offers to work with RegEx, let us first learn about regular expressions.

Specify Pattern Using RegEx

Metacharacters are used to define regular expressions. In the above example, ^ and $ are metacharacters.

Metacharacters

Metacharacters are characters with a special meaning which are interpreted by the RegEx engine.

Here's a list of metacharacters: [] . ^ $ * + ? {} () \ |

Square brackets define a set of characters you want to match.

Expression	String	Matched?
`[xyz]`	`x`	1 match
	`hello`	No matches
	`xz`	2 matches
	`xyz is xyz`	6 matches

Above, the expression [xyz] will match if the string we are trying to match contains any of the x, y, z characters.

We can also define a range of characters using a hyphen - inside square brackets.

[a-g] is the same as [abcdefg].
[1-7] is the same as [1234567].
[1-58] is the same as [123458].

We can take the opposite match of the characters by using a caret ^ symbol at the start of a square bracket.

[^abcd] means any character except a or b or c or d.
[^0-9] means any non-digit character.
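
As a quick check of the sets above, here is a minimal sketch using re.findall(), which is covered later in this tutorial. The variable names and test strings are only illustrative.

```python
import re

# [xyz] matches each x, y, or z individually
set_matches = re.findall(r"[xyz]", "xyz is xyz")
print(len(set_matches))   # 6

# [1-58] is the range 1-5 plus the single character 8
range_matches = re.findall(r"[1-58]", "0123456789")
print(range_matches)      # ['1', '2', '3', '4', '5', '8']

# [^0-9] matches every character that is not a digit
non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)         # ['a', 'b']
```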

Period - `.`

A period . matches any single character (except newline \n).

Expression	String	Matched?
`...`	`a`	No match
	`ad`	No match
	`abcd`	1 match
	`abcdefg`	2 matches (contains 7 characters)
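
The counts in the table can be reproduced with a small sketch like the one below. It assumes re.findall() (described later in this tutorial) and uses illustrative variable names.

```python
import re

# Each match consumes three characters, and matches do not overlap
short = re.findall(r"...", "ad")        # too short, so []
one = re.findall(r"...", "abcd")        # ['abc']
two = re.findall(r"...", "abcdefg")     # ['abc', 'def']
print(short, one, two)
```

The last character of `abcdefg` is left over because only one character remains after the second match.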

Caret - `^`

The caret symbol ^ is used to check whether a string starts with a certain character or sequence of characters.

Expression	String	Matched?
`^c`	`c`	1 match
	`study`	No match
	`car`	1 match
`^ho`	`hope`	1 match
	`hello`	No match
	`hold`	1 match
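
A minimal sketch of the ^ho rows, assuming re.search() (covered later in this tutorial); the dictionary name is only illustrative.

```python
import re

words = ["hope", "hello", "hold"]

# re.search() returns None when there is no match
starts_with_ho = {w: re.search(r"^ho", w) is not None for w in words}
print(starts_with_ho)   # {'hope': True, 'hello': False, 'hold': True}
```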

Dollar - `$`

The dollar symbol $ is used to verify if a string ends with a certain character.

Expression	String	Matched?
`d$`	`word`	1 match
	`d`	1 match
	`reminder`	No match
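
The same kind of check works for the end of a string. In this sketch the anchor is written after the character, as in d$; the names are illustrative.

```python
import re

words = ["word", "d", "reminder"]

# $ anchors the match to the end of the string
ends_with_d = {w: re.search(r"d$", w) is not None for w in words}
print(ends_with_d)   # {'word': True, 'd': True, 'reminder': False}
```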

Star - `*`

The star symbol * is used to match zero or more occurrences of the pattern.

Expression	String	Matched?
`hu*n`	`hn`	1 match
	`hun`	1 match
	`huuun`	1 match
	`human`	No match (`u` is not followed by `n`)
	`carhun`	1 match

Plus - `+`

The plus symbol + is used to match one or more occurrences of the pattern.

Expression	String	Matched?
`hu+n`	`hn`	No match
	`hun`	1 match
	`huuun`	1 match
	`human`	No match (`u` is not followed by `n`)
	`carhun`	1 match

Question Mark - `?`

The question mark symbol ? is used to match zero or one occurrence of the pattern.

Expression	String	Matched?
`hu?n`	`hn`	1 match
	`hun`	1 match
	`huuun`	No match (more than one `u` character)
	`human`	No match (`u` is not followed by `n`)
	`carhun`	1 match
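
To compare the three quantifiers *, + and ? side by side, here is a small sketch that runs each one against the same words. The results dictionary is just an illustrative name.

```python
import re

words = ["hn", "hun", "huuun", "human", "carhun"]

results = {}
for pattern in (r"hu*n", r"hu+n", r"hu?n"):
    # keep the words in which the pattern is found somewhere
    results[pattern] = [w for w in words if re.search(pattern, w)]
    print(pattern, results[pattern])
```

Only `hu*n` and `hu?n` accept `hn` (zero occurrences of `u`), and only `hu*n` and `hu+n` accept `huuun` (more than one `u`).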

Braces - `{}`

The braces symbol {} is used to match a specified number of occurrences of the pattern.

The code {a,b} means at least a, and at most b, repetitions of the pattern. Python's re module expects no space after the comma; with a space, the braces are matched as literal text.

Expression	String	Matched?
`e{2,3}`	`cdef bef`	No match
	`cdef beef`	1 match (at `beef`)
	`cdeef beeef`	2 matches (at `cdeef` and `beeef`)
	`cdeef beeeef`	2 matches (at `cdeef` and `beeeef`)

Let us try another example. The RegEx [0-9]{3,4} matches at least 3 digits but not more than 4 digits.

Expression	String	Matched?
`[0-9]{3,4}`	`abc957mn`	1 match (at `abc957mn`)
	`578 and 9863127`	3 matches (at `578`, `9863`, `127`)
	`45 and 96`	No match
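
The following sketch tries the brace quantifier with re.findall(). Note that the quantifier is written without a space after the comma; in Python's re module a space makes the braces literal text. The variable names are illustrative.

```python
import re

# two or three consecutive e characters
e_runs = re.findall(r"e{2,3}", "cdeef beeeef")
print(e_runs)        # ['ee', 'eee']

# three or four consecutive digits, taking four when possible
digit_runs = re.findall(r"[0-9]{3,4}", "578 and 9863127")
print(digit_runs)    # ['578', '9863', '127']

# with a space inside the braces, no repetition is recognized
with_space = re.findall(r"e{2, 3}", "cdeef beeeef")
print(with_space)    # []
```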

Alternation - `|`

The vertical bar symbol | is used for alternation (or operator).

Expression	String	Matched?
`c|d`	`abef`	No match
	`abcef`	1 match (at `c`)
	`abcdef`	2 matches (at `c` and `d`)

Above, the RegEx c|d matches any string that contains either c or d.
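
A short sketch of alternation with re.findall(), using illustrative variable names:

```python
import re

none_found = re.findall(r"c|d", "abef")     # []
one_found = re.findall(r"c|d", "abcef")     # ['c']
two_found = re.findall(r"c|d", "abcdef")    # ['c', 'd']
print(none_found, one_found, two_found)
```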

Group - `()`

The parentheses symbol () is used to group sub-patterns. For example, (m|n|o)ab matches any string that contains m, n, or o followed by ab.

Expression	String	Matched?
`(m|n|o)ab`	`mn ab`	No match
	`mnab`	1 match (at `mnab`)
	`mab denab`	2 matches (at `mab denab`)
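
Here is a minimal sketch of the grouped pattern. It uses re.finditer() to collect the full matches, because re.findall() returns only the group contents when a pattern contains a single group. The variable names are illustrative.

```python
import re

pattern = r"(m|n|o)ab"

# full text of every match
full_matches = [m.group() for m in re.finditer(pattern, "mab denab")]
print(full_matches)      # ['mab', 'nab']

# with one group in the pattern, findall() returns only what the group matched
group_only = re.findall(pattern, "mab denab")
print(group_only)        # ['m', 'n']

# the space between n and ab prevents a match
no_match = re.findall(pattern, "mn ab")
print(no_match)          # []
```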

Backslash - `\`

The backslash symbol \ is used to escape different characters, including all metacharacters.

For example, the RegEx \$c matches if a string contains $ followed by c. Here, $ is not interpreted specially by the RegEx engine (it is escaped).

If we are unsure whether a punctuation character has a special meaning, we can put \ in front of it. This way, we are sure the character is matched literally.
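
The effect of escaping can be seen in the sketch below. The sample text and variable names are made up for illustration; re.escape() is a helper in the re module that adds the backslashes for us.

```python
import re

text = "price: $c and $d"

# escaped: $ is a literal dollar sign
escaped = re.findall(r"\$c", text)
print(escaped)         # ['$c']

# unescaped: $ means "end of string", so nothing can follow it
unescaped = re.findall(r"$c", text)
print(unescaped)       # []

# re.escape() builds the escaped pattern from a literal string
auto_escaped = re.escape("$c")
print(auto_escaped)    # \$c
```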

Special Sequences

Special sequences make commonly used patterns easier to write. Here is a list of special sequences:

The RegEx \A matches if the specified characters are at the start of a string.

Expression	String	Matched?
`\Athe`	`the earth`	Match
`\Athe`	`In the earth`	No match

The RegEx \b matches if the specified characters are at the beginning or end of a word.

Expression	String	Matched?
`\bbasket`	`basketball`	Match
	`a basketball`	Match
	`abasketball`	No match
`basket\b`	`the basket`	Match
	`the abasket test`	Match
	`the abaskettest`	No match

The RegEx \B is the opposite of \b. It matches if the specified characters are not at the beginning or end of a word.

Expression	String	Matched?
`\Bbasket`	`basketball`	No match
	`a basketball`	No match
	`abasketball`	Match
`basket\B`	`the basket`	No match
	`the abasket test`	No match
	`the abaskettest`	Match
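
The word-boundary rows can be checked with a sketch like the following. The helper found() is a hypothetical name introduced here for illustration. The patterns are written as raw strings because, in a normal Python string, "\b" is the backspace character.

```python
import re

def found(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

print(found(r"\bbasket", "a basketball"))       # True
print(found(r"\bbasket", "abasketball"))        # False
print(found(r"basket\b", "the abasket test"))   # True
print(found(r"\Bbasket", "abasketball"))        # True
print(found(r"basket\B", "the basket"))         # False
```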

The RegEx \d matches any decimal digit. It is equivalent to [0-9].

Expression	String	Matched?
`\d`	`65abcdef9`	3 matches (at `65abcdef9`)
`\d`	`Hello`	No match

The RegEx \D matches any character that is not a decimal digit. It is equivalent to [^0-9].

Expression	String	Matched?
`\D`	`6abc19_4`	4 matches (at `6abc19_4`)
`\D`	`5454`	No match

The RegEx \s matches where a string contains any whitespace character. It is equal to [ \t\n\r\f\v].

Expression	String	Matched?
`\s`	`Hello World`	1 match
`\s`	`HelloWorld`	No match

The RegEx \S matches where a string contains any non-whitespace character. It is equal to [^ \t\n\r\f\v].

Expression	String	Matched?
`\S`	`Hello World`	10 matches (every character except the space)
`\S`	`   `	No match

The RegEx \w matches any alphanumeric character (letters and digits). It is equivalent to [a-zA-Z0-9_]. The underscore _ is also considered an alphanumeric character.

Expression	String	Matched?
`\w`	`63&@; :d`	3 matches (at `63&@; :d`)
`\w`	`?!"@`	No match

The RegEx \W matches any non-alphanumeric character. It is equivalent to [^a-zA-Z0-9_].

Expression	String	Matched?
`\W`	`45sh?fdg4`	1 match (at `45sh?fdg4`)
`\W`	`Hello`	No match
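
The character-class sequences \d, \D, \s, \S, \w and \W can be compared in one sketch. It reuses the strings from the tables and uses re.findall(), covered later in this tutorial; the variable names are illustrative.

```python
import re

digits = re.findall(r"\d", "65abcdef9")         # ['6', '5', '9']
non_digits = re.findall(r"\D", "6abc19_4")      # ['a', 'b', 'c', '_']
spaces = re.findall(r"\s", "Hello World")       # [' ']
non_spaces = re.findall(r"\S", "Hello World")   # one entry per visible character
word_chars = re.findall(r"\w", "63&@; :d")      # ['6', '3', 'd']
non_word = re.findall(r"\W", "45sh?fdg4")       # ['?']

print(digits, non_digits, spaces)
print(len(non_spaces), word_chars, non_word)
```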

The RegEx \Z matches if the specified characters are at the end of a string.

Expression	String	Matched?
`World\Z`	`Hello World`	1 match
	`Hello World from the other side`	No match
	`The world is huge`	No match
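
As a rough illustration of the two string anchors \A and \Z, the sketch below uses re.search(), covered later in this tutorial. The variable names are illustrative.

```python
import re

start_hit = re.search(r"\Athe", "the earth")         # match object
start_miss = re.search(r"\Athe", "In the earth")     # None

end_hit = re.search(r"World\Z", "Hello World")       # match object
end_miss = re.search(r"World\Z", "Hello World from the other side")   # None

print(start_hit is not None, start_miss is None)     # True True
print(end_hit is not None, end_miss is None)         # True True
```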

Tip: To create and test regular expressions, you can use RegEx tester tools like regex101. Using this tool can make it easy to create regular expressions and understand them.

Now after learning the basics of RegEx, let us see how to use RegEx in Python code.

Python RegEx

To work with regular expressions, Python has a module named re. To use it, we need to import it.

import re

The re module defines various functions and constants to work with RegEx.

re.findall()

The re.findall() method returns a list of strings containing all matches. If the pattern is not found, an empty list is returned.

Example: re.findall()

In the following example, we will create a program to extract numbers from a string.

import re

text = "Hello 45 World. 7 python is awesome."
pattern = r"\d+"  # one or more digits

res = re.findall(pattern, text)
print(res)

Output

['45', '7']

re.split()

The re.split() method divides the string where there is a match and returns a list of strings where the splits have occurred. If the pattern is not found, a list containing the original string is returned.

Example: re.split()

In the following example, we will create a program to split the string where there is a match.

import re

text = "Hello 45 World. 7 python is awesome."
pattern = r"\d+"  # one or more digits

res = re.split(pattern, text)
print(res)

Output

['Hello ', ' World. ', ' python is awesome.']

The re.split() method accepts a maxsplit argument which is the maximum number of splits that will occur.

The default value of the maxsplit argument is 0, which means all possible splits.

import re

text = "Hello 45 World. 7 python is awesome."
pattern = r"\d+"  # one or more digits

# maxsplit=1 stops after the first split
res = re.split(pattern, text, maxsplit=1)
print(res)

Output

['Hello ', ' World. 7 python is awesome.']

re.sub()

The syntax of re.sub() can be given as follows:

re.sub(pattern, replace, string)

The re.sub() method returns a string where matched occurrences are replaced with the content of the replace variable. If the pattern is not found, the original string is returned.

Example: re.sub()

In the following example, we will create a program to remove all whitespace.

import re

text = "hello 4544 de 7747 \n jajhs 6"

# matches one or more whitespace characters
pattern = r"\s+"

replace = ""

res = re.sub(pattern, replace, text)
print(res)

Output

hello4544de7747jajhs6

The re.sub() method accepts an optional count argument as its fourth parameter. If it is omitted, it defaults to 0, which replaces all occurrences.

In the following example, we will create a program to remove only the first run of whitespace.

import re

text = "hello 4544 de 7747 \n jajhs 6"

# matches one or more whitespace characters
pattern = r"\s+"

replace = ""

# count=1 replaces only the first match
res = re.sub(pattern, replace, text, count=1)
print(res)

Output

hello4544 de 7747 
 jajhs 6

re.subn()

re.subn() is similar to re.sub(), except that it returns a tuple of two items: the new string and the number of substitutions made.

Example: re.subn()

In the following example, we will create a program to remove all the whitespace.

import re

text = "hello 4544 de 7747 \n jajhs 6"

# matches one or more whitespace characters
pattern = r"\s+"

replace = ""

res = re.subn(pattern, replace, text)
print(res)

Output

('hello4544de7747jajhs6', 5)

re.search()

The re.search() method looks for the first location where the RegEx pattern produces a match with the string. It takes two arguments: a pattern and a string.

re.search() returns a match object if the search is successful; otherwise, it returns None.

match = re.search(pattern, string)

Example: re.search()

In the following example, we will create a program to check if "Hello" is at the beginning.

import re

text = "Hello World"

# check if "Hello" is at the beginning
match = re.search(r"\AHello", text)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")

Output

pattern found inside the string
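
For comparison with re.match() from the start of this tutorial, the following sketch shows that re.search() scans the whole string, while re.match() only tries the beginning. The variable names are illustrative.

```python
import re

text = "Hello World"

# re.match() only looks at the start of the string
at_start = re.match(r"World", text)
print(at_start)            # None

# re.search() finds the first match anywhere in the string
anywhere = re.search(r"World", text)
print(anywhere.group())    # World
```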

Match object

A Match object is an object containing information about the search and the result.

If there is no match, the value None will be returned instead of the Match Object.

Some of the most used methods and attributes of the Match object are:

match.group()
match.start()
match.end()
match.span()
match.re
match.string

match.group()

The match.group() method returns the part of the string where there is a match.

Example: Match object

import re 

text = "47965 478, 111785 457"

# A four-digit number, followed by a space, followed by a three-digit number
pattern = r"(\d{4}) (\d{3})"

# match variable contains a Match object
match = re.search(pattern, text)

if match:
    print(match.group())
else:
    print("pattern not found")

Output

7965 478

In the above code, the match variable contains a Match object.

The pattern (\d{4}) (\d{3}) has two subgroups, (\d{4}) and (\d{3}). We can get the part of the string matched by each parenthesized subgroup with the following code:

>>> match.group(1)
'7965'
>>> match.group(2)
'478'
>>> match.group(1, 2)
('7965', '478')
>>> match.groups()
('7965', '478')

match.start(), match.end() and match.span()

The start() method returns the index of the start of the matched substring. In the same way, the end() method returns the end index of the matched substring.

>>> match.start()
1
>>> match.end()
9

The span() method returns a tuple containing the start and end of the matched part.

>>> match.span()
(1, 9)
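
The indexes returned by span() can be used to slice the original string, as this minimal sketch shows. It repeats the pattern from the example above so that it runs on its own; span() also accepts a group number.

```python
import re

text = "47965 478, 111785 457"
match = re.search(r"(\d{4}) (\d{3})", text)

# slicing with the span gives the same text as match.group()
start, end = match.span()
sliced = text[start:end]
print(sliced)                          # 7965 478
print(sliced == match.group())         # True

# span of each subgroup
group_spans = (match.span(1), match.span(2))
print(group_spans)                     # ((1, 5), (6, 9))
```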

match.re and match.string

The re attribute of a match object returns the regular expression object that produced the match. In the same way, the string attribute returns the passed string.

>>> match.re
re.compile('(\\d{4}) (\\d{3})')

>>> match.string
'47965 478, 111785 457'

Using r prefix before RegEx

When the r or R prefix is used before a string literal, it means a raw string. For example, '\t' is a single tab character, whereas r'\t' is two characters: a backslash \ followed by t.

The backslash \ is used to escape various characters, including all metacharacters. However, Python's string literals also use \ for their own escape sequences. The r prefix makes Python treat \ as a normal character, so the pattern reaches the RegEx engine unchanged.
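
The difference between a normal and a raw string literal can be seen without the re module at all. This is a small sketch with illustrative variable names.

```python
plain = "\t"     # one character: a tab
raw = r"\t"      # two characters: a backslash and the letter t

print(len(plain), len(raw))    # 1 2

# without the r prefix, the backslash itself has to be doubled
same_pattern = r"\d+" == "\\d+"
print(same_pattern)            # True
```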

Example: Raw string using r prefix

import re

text = "\t and \n are escape sequences"

result = re.findall(r'[\n\t]', text)

print(result)

Output

['\t', '\n']
