Python Regular Expressions | Python Regex Tutorial


1. Python Regex Tutorial

I’m a fan of words, and after a long wait in our journey, the day has come for my favorite topic: regular expressions. Let’s delve in without wasting a moment. This Python 3 regex tutorial covers match(), search(), and findall() with examples, the IGNORECASE, MULTILINE, and DOTALL options, greedy versus non-greedy matching, and substitution. In short, it doubles as a Python regex cheat sheet.

2. What is a Python Regular Expression or Regex?


Essentially, a regular expression in Python is a sequence of characters that defines a search pattern. We can then use this pattern in a string-searching algorithm to “find” or “find and replace” on strings. You may have seen this feature in Microsoft Word as well.

In this Python Regex tutorial, we will learn the basics of regular expressions with Python. For this, we will use the ‘re’ module. Let’s import it before we begin.

>>> import re

 

3. Metacharacters

Each character in a Python Regex is either a metacharacter or a regular character. A metacharacter has a special meaning, while a regular character matches itself. Python has the following metacharacters:

Metacharacter    Description

^    Matches the start of the string

.    Matches any single character except a newline
     Inside square brackets, a dot is literal and matches only a dot

[ ]    A bracket expression matches a single character from the ones inside it
       [abc] matches ‘a’, ‘b’, or ‘c’
       [a-z] matches any character from ‘a’ to ‘z’
       [a-cx-z] matches ‘a’, ‘b’, ‘c’, ‘x’, ‘y’, or ‘z’

[^ ]    Matches a single character other than the ones mentioned in the brackets
        [^abc] matches any character except ‘a’, ‘b’, and ‘c’

( )    Parentheses define a marked subexpression, also called a block, or a capturing group

\t, \n, \r, \f    Tab, newline, carriage return, form feed

*    Matches the preceding element (a character, bracket expression, or group) zero or more times
     ab*c matches ‘ac’, ‘abc’, ‘abbc’, and so on
     [ab]* matches the empty string, ‘a’, ‘b’, ‘ab’, ‘ba’, ‘aba’, and so on
     (ab)* matches the empty string, ‘ab’, ‘abab’, ‘ababab’, and so on

{m,n}    Matches the preceding element at least m times and at most n times
         a{2,4} matches ‘aa’, ‘aaa’, and ‘aaaa’

{m}    Matches the preceding element exactly m times

?    Matches the preceding element zero or one times
     ab?c matches ‘ac’ or ‘abc’

+    Matches the preceding element one or more times
     ab+c matches ‘abc’, ‘abbc’, ‘abbbc’, and so on, but not ‘ac’

|    The choice operator matches either the expression before it, or the one after
     abc|def matches ‘abc’ or ‘def’

\w    Matches a single word character (a-z, A-Z, 0-9, and the underscore; for Unicode strings in Python 3, also letters and digits from other scripts)
      \W matches a single non-word character

\b    Matches the boundary between word and non-word characters

\s    Matches a single whitespace character
      \S matches a single non-whitespace character

\d    Matches a single decimal digit character (0-9)

\    A single backslash inhibits a character’s specialness
     Examples: \.    \\    \*
     When unsure whether a punctuation character has a special meaning, put a \ before it: \@

$    Matches the end of the string, or just before a newline at the very end of the string


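A raw string literal does not treat backslashes in any special way. To write one, put an ‘r’ before the opening quote of the pattern. Without it, you have to write '\\\\' to match a single backslash character, because Python and the regex engine each consume one level of escaping. With it, you only need r'\\'. (r'\' on its own is not valid Python, since a raw string cannot end in a single backslash.)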
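As a small sketch of the difference, the two searches below look for one literal backslash. The sample text is made up for illustration.

```python
import re

text = r"C:\temp"   # made-up sample text containing one literal backslash

# Without a raw string, the backslash is doubled twice:
# once for the Python string literal, once for the regex engine.
plain = re.search('\\\\', text)

# With a raw string, only the regex-level escaping remains.
raw = re.search(r'\\', text)

print(plain.group() == raw.group())   # True
print(raw.span())                     # (2, 3)
```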

Regular characters match themselves.
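To check a few of the quantifier examples from the table, the sketch below runs them through re.fullmatch(), which only succeeds when the pattern covers the entire text. The list of cases is chosen here for illustration.

```python
import re

# Each pair is (pattern, text); fullmatch() must cover the whole text.
cases = [
    (r'ab*c', 'ac'),
    (r'ab*c', 'abbc'),
    (r'(ab)*', 'abab'),
    (r'a{2,4}', 'aaa'),
    (r'ab?c', 'abc'),
    (r'ab+c', 'ac'),       # fails: + needs at least one b
    (r'abc|def', 'def'),
    (r'[^abc]', 'd'),
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text in cases]

for (pattern, text), ok in zip(cases, results):
    print(pattern, repr(text), ok)
```

Every case prints True except ab+c against 'ac'.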

4. Rules for a Match

So, how does this work? The following rules must be met:

  1. The search scans the string start to end.
  2. The whole pattern must match, but not necessarily the whole string.
  3. The search stops at the first match.

If a match is found, search() returns a match object, and its group() method returns the matching text. If not, search() returns None.

>>> print(re.search('na','no'))

None
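The three rules can be seen in a short sketch. The sample word is an assumption chosen because the pattern occurs in it twice.

```python
import re

text = 'banana'

# 'na' occurs twice; search() reports only the first occurrence.
first = re.search('na', text)
print(first.group())   # na
print(first.span())    # (2, 4)

# No occurrence at all: search() returns None, so test before calling group().
missing = re.search('xy', text)
if missing is None:
    print('no match')
```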

Let’s look at a couple of important functions now.

5. Functions

We have a few functions to help us use Python regular expressions.

a. match()

match() takes two arguments: a pattern and a string. If the pattern matches at the beginning of the string, it returns a match object. Otherwise, it returns None. Let’s take a few Python regular expression match examples.

>>> print(re.match('center','centre'))

None

>>> print(re.match(r'...\w\we','centre'))

<_sre.SRE_Match object; span=(0, 6), match='centre'>

(Python 3.7 and later print this as <re.Match object; span=(0, 6), match='centre'>.)
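Because match() only looks at the start of the string, it can fail where a search over the whole string succeeds. Here is a minimal comparison, with re.fullmatch() added for contrast; the sample word is made up.

```python
import re

text = 'recentre'

at_start = re.match('centre', text)       # None: 'centre' does not begin at index 0
anywhere = re.search('centre', text)      # found later in the string
prefix = re.match('re', text)             # 're' is at index 0
whole = re.fullmatch('re', text)          # None: the pattern must cover the whole string

print(at_start)          # None
print(anywhere.span())   # (2, 8)
print(prefix.group())    # re
print(whole)             # None
```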

b. search()

search(), like match(), takes two arguments: the pattern and the string to be searched. Unlike match(), it looks for the pattern anywhere in the string, not just at the start. Let’s take a few examples.

>>> match=re.search('aa?yushi','ayushi')
>>> match.group()

'ayushi'

>>> match=re.search('aa?yushi?','ayush ayushi')
>>> match.group()

'ayush'

>>> match=re.search(r'\w*end','Hey! What are your plans for the weekend?')
>>> match.group()

'weekend'

>>> match=re.search(r'^\w*end','Hey! What are your plans for the weekend?')
>>> match.group()
Traceback (most recent call last):
  File "<pyshell#337>", line 1, in <module>
    match.group()
AttributeError: 'NoneType' object has no attribute 'group'

Here, an AttributeError was raised because search() found no match and returned None. This is because ^ requires the pattern to be at the beginning of the string, and the string begins with ‘Hey!’, not with a word ending in ‘end’. Let’s try searching for a space.

>>> match=re.search(r'i\sS','Ayushi Sharma')
>>> match.group()

'i S'

>>> match=re.search(r'\w+c{2}\w*','Occam\'s Razor')
>>> match.group()

'Occam'

It takes some practice to get into the habit of remembering what the metacharacters mean. But since there aren’t that many, this should hardly take an hour.

6. More Python Regex Examples

Let’s try crafting a Python regex for an email address. Hmm, so what does one look like? It looks like this: abc-def@ghi.com

Let’s try the following code:

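>>> match=re.search(r'[\w.-]+@[\w-]+\.[\w]+','Please mail it to ayushiwashere@gmail.com')
>>> match.group()

'ayushiwashere@gmail.com'

It worked! Note the \. before the last part of the address: outside square brackets, an unescaped dot would match any character, not just a literal dot.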

Here, if you had typed [\w-.] instead of [\w.-], it would have raised the following error:

>>> match=re.search(r'[\w-.]+@[\w-]+\.[\w]+','Please mail it to ayushiwashere@gmail.com')
Traceback (most recent call last):
  File "<pyshell#347>", line 1, in <module>
    match=re.search(r'[\w-.]+@[\w-]+\.[\w]+','Please mail it to ayushiwashere@gmail.com')
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 547, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range \w-. at position 1

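This is because inside square brackets, a dash (-) between two characters indicates a range, and \w cannot be the start of a range. To match a literal dash, put it first or last in the brackets, or escape it as \-.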
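The safe placements for a literal dash can be compared side by side. This is a sketch using the sample address from this section; the variable names are chosen here for illustration.

```python
import re

text = 'abc-def@ghi.com'

# Dash last, dash first, and dash escaped: all three classes accept
# word characters, dots, and literal dashes.
patterns = (r'[\w.-]+', r'[-\w.]+', r'[\w\-.]+')
matched = [re.match(pattern, text).group() for pattern in patterns]
print(matched)   # ['abc-def', 'abc-def', 'abc-def']

# A dash between two plain characters is a range: [a-c] is a, b, or c.
ranged = re.findall(r'[a-c]', text)
print(ranged)    # ['a', 'b', 'c', 'c']
```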

7. Group Extraction

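Let’s continue with the example on emails. What if you only want the username? For this, you can pass a group number to the group() method. Group 0 (the default) is the whole match, and groups 1, 2, and so on are the parenthesized parts, counted from left to right. Take a look at this: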

>>> match=re.search(r'([\w.-]+)@([\w-]+)\.([\w]+)','Please mail it to ayushiwashere@gmail.com')
>>> match.group()

'ayushiwashere@gmail.com'

>>> match.group(1)

'ayushiwashere'

>>> match.group(2)

'gmail'

>>> match.group(3)

'com'

Parentheses let you extract the parts you want. Note that for this, we divided the pattern into groups using parentheses:

r'([\w.-]+)@([\w-]+)\.([\w]+)'
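When all the groups are needed at once, the match object's groups() method returns them together. A minimal sketch with the same address; the names user, domain, and tld are chosen here for illustration.

```python
import re

m = re.search(r'([\w.-]+)@([\w-]+)\.(\w+)',
              'Please mail it to ayushiwashere@gmail.com')

# groups() returns every captured group as a tuple, in order.
user, domain, tld = m.groups()
print(user, domain, tld)   # ayushiwashere gmail com

# group() also accepts several group numbers at once.
print(m.group(1, 3))       # ('ayushiwashere', 'com')
```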

8. Python findall()

Above, we saw that search() stops at the first match. Python’s findall(), on the other hand, returns a list of all non-overlapping matches found.

>>> match=re.findall(r'advi[cs]e','I could advise you on your poem, but you would disparage my advice')

We can then iterate on it.

>>> for i in match:
...     print(i)
...
advise
advice

>>> type(match)

<class 'list'>

9. findall() with Files

We have worked with files, and we know how to read and write them. Why not make life easier by using Python findall() with files? We’ll first use the os module to get to the desktop. Let’s see.

>>> import os
>>> os.chdir('C:\\Users\\lifei\\Desktop')
>>> f=open('Today.txt')

We have a file called Today.txt on our Desktop. These are its contents:

OS, DBMS, DS, ADA

HTML, CSS, jQuery, JavaScript

Python, C++, Java

This sem’s subjects

Now, let’s call findall().

>>> match=re.findall(r'Java[\w]*',f.read())

Finally, let’s iterate on it.

>>> for i in match:
...     print(i)
...
JavaScript
Java
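A with block is a common way to make sure the file gets closed after reading. The sketch below assumes a stand-in copy of Today.txt written to a temporary folder, so it does not depend on a particular Desktop path.

```python
import os
import re
import tempfile

# Stand-in for Today.txt (an assumption for this sketch).
contents = ('OS, DBMS, DS, ADA\n'
            'HTML, CSS, jQuery, JavaScript\n'
            'Python, C++, Java\n')

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'Today.txt')
    with open(path, 'w') as f:
        f.write(contents)

    # The file is closed automatically when the with block ends.
    with open(path) as f:
        matches = re.findall(r'Java\w*', f.read())

print(matches)   # ['JavaScript', 'Java']
```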

10. findall() with Groups

We saw how we can divide a pattern into groups using parentheses. Watch what happens when we call Python Regex findall().

>>> match=re.findall(r'([\w]+)\s([\w]+)','Ayushi Sharma, Fluffy Sharma, Leo Sharma, Candy Sharma')
>>> for i in match:
...     print(i)
...
('Ayushi', 'Sharma')
('Fluffy', 'Sharma')
('Leo', 'Sharma')
('Candy', 'Sharma')
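What findall() returns depends on how many groups the pattern has. A short comparison on a shortened version of the same names:

```python
import re

names = 'Ayushi Sharma, Fluffy Sharma'

# No groups: each item is the whole match.
no_groups = re.findall(r'\w+\s\w+', names)
# One group: each item is just that group's text.
one_group = re.findall(r'(\w+)\s\w+', names)
# Two or more groups: each item is a tuple of the groups.
two_groups = re.findall(r'(\w+)\s(\w+)', names)

print(no_groups)    # ['Ayushi Sharma', 'Fluffy Sharma']
print(one_group)    # ['Ayushi', 'Fluffy']
print(two_groups)   # [('Ayushi', 'Sharma'), ('Fluffy', 'Sharma')]
```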

11. Options

The functions we discussed can take an optional flags argument as well. These flags are:

a. Regular expression IGNORECASE

The IGNORECASE flag makes the pattern match letters regardless of case.

Take this example of Python Regex IGNORECASE:

>>> match=re.findall(r'hi','Hi, did you ship it, Hillary?',re.IGNORECASE)
>>> for i in match:
...     print(i)
...
Hi
hi
Hi
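For comparison, here is the same search with and without the flag. re.I is the short alias for re.IGNORECASE.

```python
import re

text = 'Hi, did you ship it, Hillary?'

sensitive = re.findall(r'hi', text)                    # only the 'hi' inside 'ship'
insensitive = re.findall(r'hi', text, re.IGNORECASE)   # all three occurrences
alias = re.findall(r'hi', text, re.I)                  # same flag, shorter name

print(sensitive)     # ['hi']
print(insensitive)   # ['Hi', 'hi', 'Hi']
print(alias)         # ['Hi', 'hi', 'Hi']
```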

b. Python MULTILINE

Working with a string of multiple lines, this allows ^ and $ to match the start and end of each line, not just the whole string.

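>>> match=re.findall(r'^Hi','Hi, did you ship it, Hillary?\nNo, I didn\'t, but Hi',re.MULTILINE)
>>> for i in match:
...     print(i)
...
Hi

Only the first line starts with Hi, so there is a single match. The Hi at the end of the second line is not at the start of a line.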
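The effect of the flag is easier to see on a string where more than one line starts with the pattern. A sketch with made-up text:

```python
import re

text = 'Hi there\nHi again\nbye'

starts_default = re.findall(r'^Hi', text)                # start of the string only
starts_multi = re.findall(r'^Hi', text, re.MULTILINE)    # start of every line
ends_default = re.findall(r'\w+$', text)                 # end of the string only
ends_multi = re.findall(r'\w+$', text, re.MULTILINE)     # end of every line

print(starts_default)   # ['Hi']
print(starts_multi)     # ['Hi', 'Hi']
print(ends_default)     # ['bye']
print(ends_multi)       # ['there', 'again', 'bye']
```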

c. Python DOTALL

By default, .* does not scan everything in a multiline string; each match stops at the end of a line. This is because . does not match a newline. To allow this, we use DOTALL.

>>> match=re.findall(r'.*','Hi, did you ship it, Hillary?\nNo, I didn\'t, but Hi',re.DOTALL)
>>> for i in match:
...     print(i)
...
Hi, did you ship it, Hillary?
No, I didn't, but Hi
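To see what the flag changes, compare the same patterns with and without it. The two-line text is made up for this sketch.

```python
import re

text = 'first line\nsecond line'

one_line = re.search(r'.*', text).group()               # stops before the newline
all_lines = re.search(r'.*', text, re.DOTALL).group()   # runs through the newline
blocked = re.search(r'first.*second', text)             # None: . cannot cross the newline
crossed = re.search(r'first.*second', text, re.DOTALL)

print(repr(one_line))    # 'first line'
print(repr(all_lines))   # 'first line\nsecond line'
print(blocked)           # None
print(crossed.span())    # (0, 17)
```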

12. Greedy vs Non-Greedy

The quantifiers *, +, and ? are greedy. This means that they match as much text as they can. Let’s take an example.

>>> match=re.findall(r'(<.*>)','<em>Strong</em> <i>Italic</i>')
>>> for i in match:
...     print(i)
...
<em>Strong</em> <i>Italic</i>

This gave us the whole string, because .* greedily extends the match to the last > it can reach. What if we just want the opening and closing tags? Look:


>>> match=re.findall(r'(<.*?>)','<em>Strong</em> <i>Italic</i>')
>>> for i in match:
...     print(i)
...
<em>
</em>
<i>
</i>

The .* is greedy, and the ? makes it non-greedy.

Alternatively, we could also do this:

>>> match=re.findall(r'</?\w+>','<em>Strong</em> <i>Italic</i>')
>>> for i in match:
...     print(i)
...
<em>
</em>
<i>
</i>

Here’s another example:

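>>> match=re.findall('(a*?)b','aaabbc')
>>> for i in match:
...     print(i)
...
aaa

Here, the ? makes * non-greedy, but a*? must still consume all three a’s before the first b can match. findall() also reports a second, empty match (printed as a blank line) for the second b, which has no a directly before it. If we had left out the b after the ?, every match would have been an empty string, because a non-greedy quantifier takes as little as it can unless something after it forces it to take more. The same non-greedy behavior applies to all three: *?, +?, and ??.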

Similarly, {m,n}? makes it non-greedy, and matches as few occurrences as possible.
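A quick sketch of greedy versus non-greedy quantifiers on the same made-up input:

```python
import re

text = 'aaaa'

greedy = re.match(r'a{2,4}', text).group()        # as many as allowed
lazy = re.match(r'a{2,4}?', text).group()         # as few as allowed
lazy_plus = re.match(r'a+?', text).group()        # one is the minimum for +
lazy_optional = re.match(r'a??', text).group()    # zero is the minimum for ?

print(greedy)                # aaaa
print(lazy)                  # aa
print(lazy_plus)             # a
print(repr(lazy_optional))   # ''
```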

13. Substitution

We can use the sub() function to substitute a part of a string with another. sub() takes three arguments: the pattern, the replacement, and the string.

>>> re.sub('^a','an','a apple')

'an apple'

Here, we used ^ so it won’t change apple to anpple. The grammar police approve.
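To see what the ^ prevents, compare the unanchored substitution with the anchored one. The third call is an alternative sketch that uses sub()'s count argument to limit the number of replacements.

```python
import re

unanchored = re.sub('a', 'an', 'a apple')         # replaces every 'a'
anchored = re.sub('^a', 'an', 'a apple')          # replaces only an 'a' at the start
limited = re.sub('a', 'an', 'a apple', count=1)   # replaces only the first 'a' found

print(unanchored)   # an anpple
print(anchored)     # an apple
print(limited)      # an apple
```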

14. Applications

So, we learned so much about regular expressions, but where do we use them? They find use in these places:

Search engines

Find and Replace dialogs of word processors and text editors

Text processing utilities like sed and AWK

Lexical analysis


15. Conclusion

These were the basics of Python regular expressions. Honestly, we think it is really cool to have such a tool in hand. If you love English, try experimenting, and make a small project with it.

If you have a question about this Python Regex Tutorial, feel free to ask in the comments.
