Learn Python Regex Tutorial – Python Regular Expression Functions
Python regular expressions are one of my favourite topics, so let’s get straight into this Python Regex tutorial.
Here, we will discuss Metacharacters, examples & functions of Python Regex. Along with this, we will cover Python findall, Python multiline.
So, let’s start a short Python Regex Cheat Sheet.
What is the Python Regular Expression (Regex)?
Essentially, a Python regular expression is a sequence of characters that defines a search pattern.
We can then use this pattern in a string-searching algorithm to “find” or “find and replace” on strings. You would’ve seen this feature in Microsoft Word as well.
In this Python Regex tutorial, we will learn the basics of regular expressions in Python. For this, we will use the ‘re’ module.
Let’s import it before we begin.
>>> import re
Python Regex – Metacharacters
Each character in a Python Regex is either a metacharacter or a regular character. A metacharacter has a special meaning, while a regular character matches itself.
Python has the following metacharacters:
Metacharacter | Description |
^ | Matches the start of the string |
. | Matches any single character except a newline. Inside square brackets, a dot matches a literal dot |
[ ] | A bracket expression matches a single character from the ones inside it. [abc] matches ‘a’, ‘b’, or ‘c’; [a-z] matches any character from ‘a’ to ‘z’; [a-cx-z] matches ‘a’, ‘b’, ‘c’, ‘x’, ‘y’, or ‘z’ |
[^ ] | Matches a single character that is not one of those listed in the brackets. [^abc] matches any character except ‘a’, ‘b’, and ‘c’ |
( ) | Parentheses define a marked subexpression, also called a block or a capturing group |
\t, \n, \r, \f | Tab, newline, carriage return, form feed |
* | Matches the preceding element zero or more times. ab*c matches ‘ac’, ‘abc’, ‘abbc’, and so on; [ab]* matches ‘’, ‘a’, ‘b’, ‘ab’, ‘ba’, ‘aba’, and so on; (ab)* matches ‘’, ‘ab’, ‘abab’, ‘ababab’, and so on |
{m,n} | Matches the preceding element at least m times and at most n times. a{2,4} matches ‘aa’, ‘aaa’, and ‘aaaa’ |
{m} | Matches the preceding element exactly m times |
? | Matches the preceding element zero or one time. ab?c matches ‘ac’ or ‘abc’ |
+ | Matches the preceding element one or more times. ab+c matches ‘abc’, ‘abbc’, ‘abbbc’, and so on, but not ‘ac’ |
| | The choice operator matches either the expression before it or the one after. abc|def matches ‘abc’ or ‘def’ |
\w | Matches a single word character (a-z, A-Z, 0-9, and the underscore; for Unicode strings in Python 3, other letters and digits as well). \W matches a single non-word character |
\b | Matches the boundary between a word character and a non-word character |
\s | Matches a single whitespace character. \S matches a single non-whitespace character |
\d | Matches a single decimal digit character (0-9) |
\ | A single backslash removes a character’s special meaning. Examples: \. \\ \* When unsure whether a punctuation character has a special meaning, put a \ before it: \@ |
$ | A dollar matches the end of the string |
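The quantifier rows in the table can be checked with a few lines of code. This is only a minimal sketch: the helper name fully_matches is our own, and it relies on re.fullmatch() so that the whole string has to fit the pattern.

```python
import re

def fully_matches(pattern, text):
    """Return True when the entire text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# * allows zero or more of the preceding element
star = [fully_matches(r'ab*c', s) for s in ('ac', 'abc', 'abbc')]

# + needs at least one b, so 'ac' fails
plus = [fully_matches(r'ab+c', s) for s in ('ac', 'abc', 'abbc')]

# ? allows zero or one b, so 'abbc' fails
optional = [fully_matches(r'ab?c', s) for s in ('ac', 'abc', 'abbc')]

# {2,4} allows between two and four repetitions
counted = [fully_matches(r'a{2,4}', s) for s in ('a', 'aa', 'aaaa', 'aaaaa')]

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # [True, True, False]
print(counted)   # [False, True, True, False]
```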
A raw string literal does not handle backslashes in any special way. To write one, put an ‘r’ before the opening quote of the pattern.
Without this, you have to write '\\\\' to match a single backslash character. With a raw string, you only need r'\\'.
Regular characters match themselves.
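To see the raw string point in action, here is a small sketch that matches one literal backslash both ways. The sample path is made up for illustration.

```python
import re

path = 'C:\\Users'  # made-up sample; the string holds ONE backslash character

# Ordinary string: Python halves the backslashes first, then the regex engine
# reads the remaining two as "one literal backslash".
plain = re.search('\\\\', path)

# Raw string: Python leaves the backslashes alone, so two are enough.
raw = re.search(r'\\', path)

same_match = plain.group() == raw.group()
print(same_match)               # True
print(len('\\\\'), len(r'\\'))  # 2 2
```

Both patterns reach the regex engine as the same two characters; the raw string is just easier to read and type.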
Rules for a Match
So, how does this work? The following rules must be met:
- The search scans the string start to end.
- The whole pattern must match, but not necessarily the whole string.
- The search stops at the first match.
If a match is found, search() returns a match object, and its group() method returns the matching text. If not, search() returns None.
>>> print(re.search('na','no'))
Output
None
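The three rules can be seen on a tiny input. This is a sketch with a made-up sample string; span() is the match object method that reports where the match starts and ends.

```python
import re

text = 'banana'  # made-up sample string

first = re.search(r'na', text)
print(first.group())  # na
print(first.span())   # (2, 4): the first 'na', the search stops there

missing = re.search(r'xy', text)
print(missing)        # None: the whole pattern has to match somewhere
```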
Let’s look at a couple of important functions now.
Python Regular Expression Functions
We have a few functions to help us use Python regex.
1. match()
match() takes two arguments: a pattern and a string. If the pattern matches at the beginning of the string, it returns a match object. Otherwise, it returns None.
Let’s take a few Python regular expression match examples.
>>> print(re.match('center','centre'))
Output
None
>>> print(re.match(r'...\w\we','centre'))
Output
<_sre.SRE_Match object; span=(0, 6), match='centre'>
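It helps to compare match() with its siblings on the same input. The sketch below uses a made-up sentence; re.fullmatch() is the stricter variant that requires the pattern to cover the whole string.

```python
import re

text = 'the centre of town'  # made-up sample string

at_start = re.match(r'centre', text)    # match() only looks at the beginning
anywhere = re.search(r'centre', text)   # search() scans the whole string
whole = re.fullmatch(r'the \w+', text)  # fullmatch() must consume everything

print(at_start)          # None
print(anywhere.start())  # 4
print(whole)             # None
```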
2. search()
search(), like match(), takes two arguments- the pattern and the string to be searched.
Let’s take a few examples.
>>> match=re.search('aa?yushi','ayushi')
>>> match.group()
Output
'ayushi'
>>> match=re.search('aa?yushi?','ayush ayushi')
>>> match.group()
Output
'ayush'
>>> match=re.search(r'\w*end','Hey! What are your plans for the weekend?')
>>> match.group()
Output
'weekend'
>>> match=re.search(r'^\w*end','Hey! What are your plans for the weekend?')
>>> match.group()
Output
Traceback (most recent call last):
File "<pyshell#337>", line 1, in <module>
match.group()
AttributeError: 'NoneType' object has no attribute 'group'
Here, an AttributeError was raised because search() found no match and returned None. This is because we specified that this pattern should be at the beginning of the string.
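One way to avoid that AttributeError is to check the result before calling group(). The helper below is a sketch of that idea; the name first_match and its default argument are our own, not part of the re module.

```python
import re

def first_match(pattern, text, default=''):
    """Return the first matching text, or a default when nothing matches."""
    match = re.search(pattern, text)
    return match.group() if match else default

question = 'Hey! What are your plans for the weekend?'

found = first_match(r'\w*end', question)
not_found = first_match(r'^\w*end', question, default='no match')

print(found)      # weekend
print(not_found)  # no match
```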
Let’s try searching for space.
>>> match=re.search(r'i\sS','Ayushi Sharma')
>>> match.group()
Output
'i S'
>>> match=re.search(r'\w+c{2}\w*','Occam\'s Razor')
>>> match.group()
Output
'Occam'
It takes some practice to remember what each metacharacter means.
But since there are not many of them, this should hardly take an hour.
Python Regex Examples
Let’s try crafting a Python regex for an email address. Hmm, so what does one look like? It looks like this: ayushiwasthere@gmail.com
Let’s try the following code:
>>> match=re.search(r'[\w.-]+@[\w-]+\.[\w]+','Please mail it to ayushiwasthere@gmail.com')
>>> match.group()
Output
'ayushiwasthere@gmail.com'
It worked perfectly!
Here, if you had typed [\w-.] instead of [\w.-], it would have raised the following error:
>>> match=re.search(r'[\w-.]+@[\w-]+\.[\w]+','Please mail it to ayushiwasthere@gmail.com')
Output
Traceback (most recent call last):
File "<pyshell#347>", line 1, in <module>
match=re.search(r'[\w-.]+@[\w-]+\.[\w]+','Please mail it to ayushiwasthere@gmail.com')
File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 182, in search
return _compile(pattern, flags).search(string)
File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 856, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 415, in _parse_sub
itemsappend(_parse(source, state, verbose))
File "C:\Users\lifei\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 547, in _parse
raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range \w-. at position 1
This is because a dash (-) between two characters inside brackets normally indicates a range. Placing the dash last, as in [\w.-], makes it a literal dash.
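Here is a short sketch of the safe ways to put a dash in a bracket expression, along with catching the error instead of letting it crash the program. The address is a made-up one for illustration; re.error is the exception the re module raises for an invalid pattern.

```python
import re

sentence = 'Please mail it to user.name@my-domain.org'  # made-up address

# Dash placed last in the brackets: treated as a literal dash.
dash_last = re.search(r'[\w.-]+@[\w-]+\.\w+', sentence)

# Dash escaped with a backslash: also a literal dash, wherever it sits.
dash_escaped = re.search(r'[\w\-.]+@[\w-]+\.\w+', sentence)

# Dash between \w and . looks like a range and is rejected.
try:
    re.compile(r'[\w-.]+')
    compiled = True
except re.error as problem:
    compiled = False
    print('Invalid pattern:', problem)

print(dash_last.group())     # user.name@my-domain.org
print(dash_escaped.group())  # user.name@my-domain.org
```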
Group Extraction
Let’s continue with the example on emails. What if you only want the username?
For this, you can pass an argument (the index of a group) to the group() method.
Take a look at this:
>>> match=re.search(r'([\w.-]+)@([\w-]+)\.([\w]+)','Please mail it to ayushiwasthere@gmail.com')
>>> match.group()
Output
'ayushiwasthere@gmail.com'
>>> match.group(1)
Output
'ayushiwasthere'
>>> match.group(2)
Output
'gmail'
>>> match.group(3)
Output
'com'
Parentheses let you extract the parts you want. Note that for this, we divided the pattern into groups using parentheses:
r'([\w.-]+)@([\w-]+)\.([\w]+)'
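If you want all the groups at once, the match object also has a groups() method that returns them as a tuple. A quick sketch, using a made-up address:

```python
import re

match = re.search(r'([\w.-]+)@([\w-]+)\.(\w+)',
                  'Write to someone@example.com today')  # made-up address

# groups() returns every captured group; group(0) is the whole match.
user, domain, tld = match.groups()
whole = match.group(0)

print(user)    # someone
print(domain)  # example
print(tld)     # com
print(whole)   # someone@example.com
```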
Python findall()
Above, we saw that Python regex search() stops at the first match.
But Python findall() returns a list of all matches found.
>>> match=re.findall(r'advi[cs]e','I could advise you on your poem, but you would disparage my advice')
We can then iterate on it.
>>> for i in match: print(i)
Output
advise
advice
>>> type(match)
Output
<class 'list'>
findall() with Files
We have worked with files, and we know how to read and write them. Why not make life easier by using Python findall() with files?
We’ll first use the os module to get to the desktop. Let’s see.
>>> import os
>>> os.chdir('C:\\Users\\lifei\\Desktop')
>>> f=open('Today.txt')
We have a file called Today.txt on our Desktop. These are its contents:
OS, DBMS, DS, ADA
HTML, CSS, jQuery, JavaScript
Python, C++, Java
This sem’s subjects
Now, let’s call findall().
>>> match=re.findall(r'Java[\w]*',f.read())
Finally, let’s iterate on it.
>>> for i in match: print(i)
Output
JavaScript
Java
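If you want to try this without touching your Desktop, here is a self-contained sketch. It writes the same three lines to a temporary folder (an assumption made for the demo, not what the steps above do) and uses a with block so the file is closed automatically.

```python
import os
import re
import tempfile

contents = ('OS, DBMS, DS, ADA\n'
            'HTML, CSS, jQuery, JavaScript\n'
            'Python, C++, Java\n')

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'Today.txt')

    # Create the sample file.
    with open(path, 'w') as f:
        f.write(contents)

    # Read it back and search the whole text in one go.
    with open(path) as f:
        matches = re.findall(r'Java\w*', f.read())

print(matches)  # ['JavaScript', 'Java']
```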
findall() with Groups
We saw how we can divide a pattern into groups using parentheses. Watch what happens when we call Python Regex findall().
>>> match=re.findall(r'([\w]+)\s([\w]+)','Ayushi Sharma, Fluffy Sharma, Leo Sharma, Candy Sharma')
>>> for i in match: print(i)
Output
('Ayushi', 'Sharma')
('Fluffy', 'Sharma')
('Leo', 'Sharma')
('Candy', 'Sharma')
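What findall() returns depends on how many groups the pattern has. The sketch below runs three variants of the pattern on the same string so you can compare them side by side.

```python
import re

names = 'Ayushi Sharma, Fluffy Sharma, Leo Sharma, Candy Sharma'

pairs = re.findall(r'(\w+)\s(\w+)', names)      # two groups: list of tuples
first_names = re.findall(r'(\w+)\s\w+', names)  # one group: list of strings
full_names = re.findall(r'\w+\s\w+', names)     # no groups: whole matches

print(pairs[0])     # ('Ayushi', 'Sharma')
print(first_names)  # ['Ayushi', 'Fluffy', 'Leo', 'Candy']
print(full_names[0])  # Ayushi Sharma
```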
Python Regex Options
The functions we discussed may take an optional argument as well. These options are:
1. Python Regular Expression IGNORECASE
The IGNORECASE option makes the match case-insensitive.
Take this example of Python Regex IGNORECASE:
>>> match=re.findall(r'hi','Hi, did you ship it, Hillary?',re.IGNORECASE)
>>> for i in match: print(i)
Output
Hi
hi
Hi
2. Python MULTILINE
Working with a string of multiple lines, this allows ^ and $ to match the start and end of each line, not just the whole string.
>>> match=re.findall(r'^Hi','Hi, did you ship it, Hillary?\nNo, I didn\'t, but Hi',re.MULTILINE)
>>> for i in match: print(i)
Output
Hi
Only the first line starts with Hi, so there is a single match.
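To see what MULTILINE actually changes, it helps to use a string where more than one line starts with the pattern. This is a sketch with a made-up three-line string:

```python
import re

text = 'Hi, did you ship it\nHi again\nNo, but Hi'  # made-up sample

# Without the flag, ^ only matches at the very start of the string.
without_flag = re.findall(r'^Hi', text)

# With the flag, ^ also matches right after each newline.
with_flag = re.findall(r'^Hi', text, re.MULTILINE)

# Likewise, $ matches at the end of every line.
line_ends = re.findall(r'\w+$', text, re.MULTILINE)

print(without_flag)  # ['Hi']
print(with_flag)     # ['Hi', 'Hi']
print(line_ends)     # ['it', 'again', 'Hi']
```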
3. Python DOTALL
Without this option, .* stops at the end of a line instead of scanning a whole multiline string. This is because . does not match a newline.
To allow this, we use DOTALL.
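>>> match=re.findall(r'.*','Hi, did you ship it, Hillary?\nNo, I didn\'t, but Hi',re.DOTALL)
>>> for i in match: print(i)
Output
Hi, did you ship it, Hillary?
No, I didn't, but Hi
The list also ends with an empty string, because .* can match zero characters at the end of the string; it prints as a blank line.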
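The effect of DOTALL is easier to see with search(), which returns just one match. The sketch below uses a made-up two-line string and also shows that options can be combined with the | operator.

```python
import re

text = 'first line\nsecond line'  # made-up sample

one_line = re.search(r'.*', text).group()               # stops at the newline
everything = re.search(r'.*', text, re.DOTALL).group()  # crosses the newline

# Options can be combined with |
combined = re.search(r'FIRST.*line$', text, re.IGNORECASE | re.DOTALL).group()

print(repr(one_line))    # 'first line'
print(repr(everything))  # 'first line\nsecond line'
print(combined == text)  # True
```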
Greedy vs Non-Greedy
The quantifiers *, +, and ? are greedy. This means that they match as many characters as possible. Let’s take an example.
>>> match=re.findall(r'(<.*>)','<em>Strong</em> <i>Italic</i>')
>>> for i in match: print(i)
Output
<em>Strong</em> <i>Italic</i>
This gave us the whole string, because it greedily keeps searching. What if we just want the opening and closing tags? Look:
>>> match=re.findall(r'(<.*?>)','<em>Strong</em> <i>Italic</i>')
>>> for i in match: print(i)
Output
<em>
</em>
<i>
</i>
On its own, .* is greedy; the ? after it makes it non-greedy.
Alternatively, we could also do this:
>>> match=re.findall(r'</?\w+>','<em>Strong</em> <i>Italic</i>')
>>> for i in match: print(i)
Output
<em>
</em>
<i>
</i>
Here’s another example:
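>>> match=re.findall('(a*?)b','aaabbc')
>>> for i in match: print(i)
Output
aaa
The loop prints aaa and then a blank line: the list is ['aaa', ''], because the second b has no a before it.
Here, the ? makes * non-greedy. Also, if we had left out the b after the ?, it would have returned only empty strings.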
A non-greedy quantifier needs something after it in the pattern to stop at. This works for all three: *?, +?, and ??.
Similarly, {m,n}? makes it non-greedy, and matches as few occurrences as possible.
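Here is a quick side-by-side sketch of greedy and non-greedy quantifiers on the same made-up string. With nothing after the quantifier in the pattern, the non-greedy versions stop as early as they are allowed to.

```python
import re

text = 'aaaa'  # made-up sample

greedy_star = re.match(r'a*', text).group()        # as many as possible
lazy_star = re.match(r'a*?', text).group()         # zero is enough
lazy_plus = re.match(r'a+?', text).group()         # one is the minimum
greedy_range = re.match(r'a{2,4}', text).group()   # takes the maximum, four
lazy_range = re.match(r'a{2,4}?', text).group()    # takes the minimum, two

print(repr(greedy_star))   # 'aaaa'
print(repr(lazy_star))     # ''
print(repr(lazy_plus))     # 'a'
print(repr(greedy_range))  # 'aaaa'
print(repr(lazy_range))    # 'aa'
```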
Substitution
We can use the sub() function to replace part of a string with another string. sub() takes three arguments: the pattern, the replacement, and the string.
>>> re.sub('^a','an','a apple')
Output
'an apple'
Here, we used ^ so it won’t change apple to anpple. The grammar police approve.
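To see why the ^ matters, compare the anchored call with an unanchored one. The sketch also shows the optional count argument of sub(), which limits how many replacements are made.

```python
import re

anchored = re.sub(r'^a', 'an', 'a apple')          # only the leading a
unanchored = re.sub(r'a', 'an', 'a apple')         # every a is replaced
limited = re.sub(r'a', 'an', 'a apple', count=1)   # stop after one replacement

print(anchored)    # an apple
print(unanchored)  # an anpple
print(limited)     # an apple
```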
Python Regex Applications
So, we have learned a lot about Python regular expressions, but where do we use them? They find use in these places:
- Search engines
- Find and Replace dialogs of word processors and text editors
- Text processing utilities like sed and AWK
- Lexical analysis
This was all about the Python Regex Tutorial
Python Interview Questions on Regular Expressions
- What is a regular expression in Python? Explain with an example.
- How do you use regular expressions in Python?
- What is the meaning of the question mark in a regular expression in Python?
- How do you split a string with a regular expression in Python?
- How do you check whether a string matches a regular expression in Python?
Conclusion
These were the basics of Python regular expressions. Honestly, we think it is really cool to have such a tool in hand.
If you love English, try experimenting, and make a small project with it.