Learn Python Regex Tutorial – Python Regular Expression Functions

1. Python Regex Tutorial

Regular expressions are one of my favourite topics in Python, so let’s get straight into this Python Regex Tutorial. Here, we will discuss the metacharacters, functions, and examples of Python Regex. Along with this, we will cover Python findall() and the MULTILINE option.

So, let’s start a short Python Regex Cheat Sheet.

Python Regular Expression / Python Regex

2. What is the Python Regular Expression (Regex)?

Essentially, a Python regular expression is a sequence of characters that defines a search pattern. We can then use this pattern in a string-searching algorithm to “find” or “find and replace” on strings. You may have seen this feature in Microsoft Word as well.

In this Python Regex tutorial, we will learn the basics of regular expressions in Python. For this, we will use the ‘re’ module. Let’s import it before we begin.

>>> import re

3. Python Regex – Metacharacters

Each character in a Python Regex is either a metacharacter or a regular character. A metacharacter has a special meaning, while a regular character matches itself. Python has the following metacharacters:

 

Metacharacter    Description
^                Matches the start of the string
.                Matches any single character except a newline
                 Inside square brackets, a dot matches a literal dot
[ ]              A bracket expression matches a single character from the ones inside it
                 [abc] matches ‘a’, ‘b’, or ‘c’
                 [a-z] matches any character from ‘a’ to ‘z’
                 [a-cx-z] matches ‘a’, ‘b’, ‘c’, ‘x’, ‘y’, or ‘z’
[^ ]             Matches a single character other than the ones inside the brackets
                 [^abc] matches any character except ‘a’, ‘b’, and ‘c’
( )              Parentheses define a marked subexpression, also called a block or a capturing group
\t, \n, \r, \f   Tab, newline, carriage return, form feed
*                Matches the preceding element zero or more times
                 ab*c matches ‘ac’, ‘abc’, ‘abbc’, and so on
                 [ab]* matches ‘’, ‘a’, ‘b’, ‘ab’, ‘ba’, ‘aba’, and so on
                 (ab)* matches ‘’, ‘ab’, ‘abab’, ‘ababab’, and so on
{m,n}            Matches the preceding element at least m times and at most n times
                 a{2,4} matches ‘aa’, ‘aaa’, and ‘aaaa’
{m}              Matches the preceding element exactly m times
?                Matches the preceding element zero or one times
                 ab?c matches ‘ac’ or ‘abc’
+                Matches the preceding element one or more times
                 ab+c matches ‘abc’, ‘abbc’, ‘abbbc’, and so on, but not ‘ac’
|                The choice operator matches either the expression before it or the one after
                 abc|def matches ‘abc’ or ‘def’
\w               Matches a single word character (a letter, a digit, or an underscore)
                 \W matches a single non-word character
\b               Matches the boundary between a word character and a non-word character
\s               Matches a single whitespace character
                 \S matches a single non-whitespace character
\d               Matches a single decimal digit character (0-9)
\                A single backslash removes a character’s special meaning
                 Examples: \.    \\     \*
                 When unsure whether a punctuation character has a special meaning, put a \ before it: \@
$                Matches the end of the string
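To see a few of these metacharacters in action, here is a small sketch. It uses re.fullmatch(), which this tutorial does not cover; it succeeds only when the whole string matches the pattern. The sample strings are our own picks for illustration.

```python
import re

# Each entry pairs a pattern from the table with a string it should match in full.
examples = [
    (r'ab*c', 'abbc'),      # * : zero or more 'b'
    (r'ab+c', 'abbbc'),     # + : one or more 'b'
    (r'ab?c', 'ac'),        # ? : zero or one 'b'
    (r'a{2,4}', 'aaa'),     # {m,n} : between 2 and 4 'a'
    (r'abc|def', 'def'),    # | : either alternative
    (r'[a-cx-z]', 'y'),     # bracket expression with two ranges
    (r'[^abc]', 'd'),       # negated bracket expression
    (r'\d\s\w', '7 _'),     # digit, whitespace, word character (underscore counts)
]

# True for every pair whose string matches its pattern completely.
results = [re.fullmatch(pattern, text) is not None for pattern, text in examples]
print(results)

# + needs at least one 'b', so 'ac' is rejected.
plus_rejects_ac = re.fullmatch(r'ab+c', 'ac') is None
print(plus_rejects_ac)
```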

A raw string literal does not treat backslashes in any special way. To create one, prepend an ‘r’ to the pattern. Without it, you have to write '\\\\' to match a single backslash character. With it, you only need r'\\'.

Regular characters match themselves.
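Here is a quick sketch of why raw strings help. To match one literal backslash, the regex engine must receive a two-character pattern (a backslash followed by a backslash), and the two literals below are just two ways of spelling it. The sample path is made up for illustration.

```python
import re

path = 'C:\\Users'          # a regular string literal holding C:\Users (one backslash)

plain_pattern = '\\\\'      # regular literal: four backslashes in source, two in the pattern
raw_pattern = r'\\'         # raw literal: two backslashes in source, two in the pattern

print(plain_pattern == raw_pattern)          # both spell the same pattern
print(re.search(raw_pattern, path).group())  # the single backslash found in the path
```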

4. Rules for a Match

So, how does this work? The following rules must be met:

  1. The search scans the string start to end.
  2. The whole pattern must match, but not necessarily the whole string.
  3. The search stops at the first match.

If a match is found, search() returns a match object, and its group() method returns the matching text. If not, search() returns None.

>>> print(re.search('na','no'))

None
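
Since a failed search gives back None, it is safer to check the result before calling group() on it. Here is a minimal sketch; first_match is a hypothetical helper of our own, not something the re module provides.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None if the pattern is not found.

    first_match is a hypothetical helper, not part of the re module.
    """
    match = re.search(pattern, text)
    # re.search() returns None on failure, so check before calling group().
    if match is None:
        return None
    return match.group()

print(first_match(r'na', 'banana'))   # the leftmost match only
print(first_match(r'na', 'no'))       # no match, so None
```
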
Let’s look at a couple of important functions now.

5. Python Regular Expression Functions

We have a few functions to help us use Python regex.

a. match()

match() takes two arguments: a pattern and a string. If the pattern matches at the beginning of the string, it returns a match object. Otherwise, it returns None. Let’s take a few Python regular expression match examples.

>>> print(re.match('center','centre'))

None

>>> print(re.match(r'...\w\we','centre'))

<_sre.SRE_Match object; span=(0, 6), match='centre'>
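
One thing worth seeing side by side: match() only looks at the start of the string, while search() (which we used in the rules section) looks anywhere in it. A short sketch with a sample sentence of our own:

```python
import re

text = 'the centre of town'

# match() only succeeds if the pattern matches at position 0.
at_start = re.match(r'centre', text)

# search() scans forward and finds the pattern anywhere in the string.
anywhere = re.search(r'centre', text)

print(at_start)          # None, because the string starts with 'the'
print(anywhere.span())   # start and end index of the match
```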

b. search()

search(), like match(), takes two arguments: the pattern and the string to be searched. Unlike match(), it looks for the pattern anywhere in the string, not just at the start. Let’s take a few examples.

>>> match=re.search('aa?yushi','ayushi')
>>> match.group()

'ayushi'

>>> match=re.search('aa?yushi?','ayush ayushi')
>>> match.group()

'ayush'

>>> match=re.search(r'\w*end','Hey! What are your plans for the weekend?')
>>> match.group()

'weekend'

>>> match=re.search(r'^\w*end','Hey! What are your plans for the weekend?')
>>> match.group()

Traceback (most recent call last):
  File "<pyshell#337>", line 1, in <module>
    match.group()
AttributeError: 'NoneType' object has no attribute 'group'

Here, an AttributeError was raised because search() found no match and returned None. This is because ^ requires the pattern to be at the beginning of the string, and this string begins with ‘Hey!’. Let’s try searching for a space.

>>> match=re.search(r'i\sS','Ayushi Sharma')
>>> match.group()

'i S'

>>> match=re.search(r'\w+c{2}\w*','Occam\'s Razor')
>>> match.group()

'Occam'

It will take some practice to remember what each metacharacter means. But since there aren’t that many, it shouldn’t take long.

6. Python Regex Examples

Let’s try crafting a Python regex for an email address. Hmm, so what does one look like? It looks like this: abc-def@ghi.com

Let’s try the following code:

>>> match=re.search(r'[\w.-]+@[\w-]+\.[\w]+','Please mail it to ayushiwasthere@gmail.com')
>>> match.group()

'ayushiwasthere@gmail.com'

It worked perfectly!

Here, if you had typed [\w-.] instead of [\w.-], it would have raised the following error:

>>> match=re.search(r'[\w-.]+@[\w-]+\.[\w]+','Please mail it to ayushiwasthere@gmail.com')

Traceback (most recent call last):
  File "<pyshell#347>", line 1, in <module>
    match=re.search(r'[\w-.]+@[\w-]+\.[\w]+','Please mail it to ayushiwasthere@gmail.com')
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 547, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range \w-. at position 1

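This is because inside square brackets, a dash (-) between two characters indicates a range, and \w-. is not a valid range. To match a literal dash, place it at the start or end of the brackets, or escape it with a backslash.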
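
Here is a small sketch of the safe ways to put a literal dash inside brackets, plus a way to catch the error instead of crashing. The variable names are ours, and the sample address is the one from the start of this section.

```python
import re

text = 'Please mail it to abc-def@ghi.com'

# Two ways to include a literal dash in a bracket expression:
dash_last = r'[\w.-]+@[\w-]+\.\w+'       # dash at the end of the brackets
dash_escaped = r'[\w\-.]+@[\w-]+\.\w+'   # dash escaped with a backslash

found = [re.search(pattern, text).group() for pattern in (dash_last, dash_escaped)]
print(found)

# A dash between \w and . is read as a range and fails to compile.
try:
    re.compile(r'[\w-.]+')
    compiled = True
except re.error as exc:
    compiled = False
    print('invalid pattern:', exc)
```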

7. Group Extraction

Let’s continue with the example on emails. What if you only want the username? For this, you can pass a group number (like an index) to the group() method. Take a look at this:

>>> match=re.search(r'([\w.-]+)@([\w-]+)\.([\w]+)','Please mail it to ayushiwasthere@gmail.com')
>>> match.group()

'ayushiwasthere@gmail.com'

>>> match.group(1)

'ayushiwasthere'

>>> match.group(2)

'gmail'

>>> match.group(3)

'com'

Parentheses let you extract the parts you want. Note that for this, we divided the pattern into groups using parentheses:

r'([\w.-]+)@([\w-]+)\.([\w]+)'
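
If you want all the groups at once, the match object also has a groups() method, which this tutorial has not used so far. A minimal sketch with the same email pattern (the names user, domain, and tld are our own):

```python
import re

match = re.search(r'([\w.-]+)@([\w-]+)\.(\w+)',
                  'Please mail it to ayushiwasthere@gmail.com')

# groups() returns every captured group as a tuple, in order.
user, domain, tld = match.groups()
print(user, domain, tld)

# group() also accepts several group numbers at once and returns a tuple.
first_and_last = match.group(1, 3)
print(first_and_last)
```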

8. Python findall()

Above, we saw that Python regex search() stops at the first match. But Python findall() returns a list of all matches found.

>>> match=re.findall(r'advi[cs]e','I could advise you on your poem, but you would disparage my advice')

We can then iterate on it.

>>> for i in match:
     print(i)

advise

advice

>>> type(match)

<class 'list'>

9. findall() with Files

We have worked with files, and we know how to read and write them. Why not make life easier by using Python findall() with files? We’ll first use the os module to get to the desktop. Let’s see.

>>> import os
>>> os.chdir('C:\\Users\\lifei\\Desktop')
>>> f=open('Today.txt')

We have a file called Today.txt on our Desktop. These are its contents:

OS, DBMS, DS, ADA

HTML, CSS, jQuery, JavaScript

Python, C++, Java

This sem’s subjects

Now, let’s call findall().

>>> match=re.findall(r'Java[\w]*',f.read())

Finally, let’s iterate on it.

>>> for i in match:
      print(i)

JavaScript

Java
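
The same idea can be written with a with block, which closes the file for us once we are done reading. This is only a sketch: so that it runs anywhere, it first writes the sample contents to a temporary folder instead of relying on a real Desktop path.

```python
import os
import re
import tempfile

contents = 'OS, DBMS, DS, ADA\nHTML, CSS, jQuery, JavaScript\nPython, C++, Java\n'

# Write the sample text to a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'Today.txt')
    with open(path, 'w') as f:
        f.write(contents)

    # The with block closes the file automatically after reading.
    with open(path) as f:
        matches = re.findall(r'Java\w*', f.read())

print(matches)
```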

10. findall() with Groups

We saw how we can divide a pattern into groups using parentheses. Watch what happens when we call Python Regex findall().

>>> match=re.findall(r'([\w]+)\s([\w]+)','Ayushi Sharma, Fluffy Sharma, Leo Sharma, Candy Sharma')
>>> for i in match:
   print(i)

('Ayushi', 'Sharma')

('Fluffy', 'Sharma')

('Leo', 'Sharma')

('Candy', 'Sharma')

11. Python Regex Options

The functions we discussed can also take an optional flags argument. These options are:

a. Python Regular Expression IGNORECASE

This flag makes the pattern match regardless of upper or lower case.

Take this example of Python Regex IGNORECASE:

>>> match=re.findall(r'hi','Hi, did you ship it, Hillary?',re.IGNORECASE)
>>> for i in match:
      print(i)

Hi

hi

Hi

b. Python MULTILINE

When working with a string of multiple lines, this flag allows ^ and $ to match the start and end of each line, not just of the whole string.

>>> match=re.findall(r'^Hi','Hi, did you ship it, Hillary?\nNo, I didn\'t, but Hi',re.MULTILINE)
>>> for i in match:
      print(i)

Hi
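
To make the effect of the flag easier to see, here is a sketch where both lines begin with ‘Hi’ (a sample string of our own). Without the flag, ^ matches only at the very start of the string.

```python
import re

text = 'Hi, did you ship it, Hillary?\nHi again, no reply yet'

# Without the flag, ^ matches only at the start of the whole string.
without_flag = re.findall(r'^Hi', text)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.findall(r'^Hi', text, re.MULTILINE)

print(without_flag)
print(with_flag)
```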

c. Python DOTALL

.* does not scan everything in a multiline string; a single match stops at the end of a line. This is because . does not match a newline. To allow this, we use DOTALL.

>>> match=re.findall(r'.*','Hi, did you ship it, Hillary?\nNo, I didn\'t, but Hi',re.DOTALL)
>>> for i in match:
     print(i)

Hi, did you ship it, Hillary?

No, I didn't, but Hi
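
Here is a side-by-side sketch of the same pattern with and without the flag. We use search() here so that each call gives exactly one match to compare.

```python
import re

text = "Hi, did you ship it, Hillary?\nNo, I didn't, but Hi"

# Without DOTALL, . stops at the newline, so .* covers only the first line.
first_line = re.search(r'.*', text).group()

# With DOTALL, . also matches the newline, so .* covers the whole string.
everything = re.search(r'.*', text, re.DOTALL).group()

print(repr(first_line))
print(repr(everything))
```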

12. Greedy vs Non-Greedy

The metacharacters *, +, and ? are greedy. This means that they match as much text as they can. Let’s take an example.

>>> match=re.findall(r'(<.*>)','<em>Strong</em> <i>Italic</i>')
>>> for i in match:
     print(i)

<em>Strong</em> <i>Italic</i>

This gave us the whole string, because .* greedily matches as much as it can. What if we just want the individual opening and closing tags? Look:

>>> match=re.findall(r'(<.*?>)','<em>Strong</em> <i>Italic</i>')
>>> for i in match:
       print(i)

<em>

</em>

<i>

</i>

The .* is greedy, and the ? makes it non-greedy.

Alternatively, we could also do this:

>>> match=re.findall(r'</?\w+>','<em>Strong</em> <i>Italic</i>')
>>> for i in match:
     print(i)

<em>

</em>

<i>

</i>

Here’s another example:

>>> match=re.findall('(a*?)b','aaabbc')
>>> for i in match:
     print(i)

aaa

Here, the ? makes * non-greedy. Also, if we had left out the b after the ?, a*? would have matched as little as possible, which is an empty string. A non-greedy quantifier needs something after it in the pattern to stop at. This works for all three: *?, +?, and ??.

Similarly, adding a ? after {m,n} makes it non-greedy, so it matches as few occurrences as possible.
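
A short sketch comparing the greedy and non-greedy forms of {m,n} and of ?. The input strings are our own picks for illustration.

```python
import re

# Greedy {2,4} takes up to four 'a' at a time; the non-greedy form stops at two.
greedy_range = re.findall(r'a{2,4}', 'aaaaaa')
lazy_range = re.findall(r'a{2,4}?', 'aaaaaa')

# Greedy ? takes the optional 'b' when it can; ?? prefers to skip it.
greedy_optional = re.findall(r'ab?', 'abab')
lazy_optional = re.findall(r'ab??', 'abab')

print(greedy_range, lazy_range)
print(greedy_optional, lazy_optional)
```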

13. Substitution

We can use the sub() function to substitute a part of a string with another. sub() takes three arguments: the pattern, the replacement, and the string.

>>> re.sub('^a','an','a apple')

'an apple'

Here, we used ^ so it won’t change apple to anpple. The grammar police approve.
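
By default, sub() replaces every match it finds, and it accepts an optional count argument to limit that. Here is a sketch using \b from the metacharacter table to pick out the standalone word ‘a’. The sentence is our own, and the rule is deliberately naive, since it ignores whether the next word starts with a vowel.

```python
import re

text = 'a apple and a orange'

# \ba\b matches 'a' only when it stands alone as a word.
replace_all = re.sub(r'\ba\b', 'an', text)

# count=1 stops after the first replacement.
replace_first = re.sub(r'\ba\b', 'an', text, count=1)

print(replace_all)
print(replace_first)
```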

14. Python Regex Applications

So, we learned so much about Python regular expressions, but where do we use them? They find use in these places:

  • Search engines
  • Find and Replace dialogues of word processors and text editors
  • Text processing utilities like sed and AWK
  • Lexical analysis

This was all about the Python Regex Tutorial

15. Python Regex – Conclusion

These were the basics of Python regular expressions. Honestly, we think it is really cool to have such a tool in hand. If you love English, try experimenting, and make a small project with it.

Furthermore, if you have any doubts about this Python Regex Tutorial, feel free to ask in the comments.
