Mastering Regular Expressions in Python
Regular expressions (regex or regexp) are a compact notation for describing patterns in strings. Python’s re module provides a comprehensive interface for working with them, covering matching, searching, substitution, and splitting. This tutorial walks through the module's essential functions and pattern syntax so you can apply regular expressions in your own Python projects.
Table of Contents
- re.match(): Matching at the Beginning
- re.search(): Finding the First Match
- re.compile(): Optimizing Performance
- Flags: Modifying Matching Behavior
- Character Sets: Defining Allowed Characters
- Search and Replace with re.sub()
- re.findall(): Extracting All Matches
- re.finditer(): Iterating Through Matches
- re.split(): Splitting Strings by Pattern
- Basic Patterns: Anchors, Character Classes
- Repetition: Quantifiers and Greedy vs. Non-Greedy Matching
- Special Sequences: Digits, Whitespace, Word Characters
- re.escape(): Handling Special Characters
- Capturing Groups and the group() Method
1. re.match(): Matching at the Beginning
The re.match() function attempts to match the pattern only at the very beginning of the string. It returns a match object if successful, otherwise None.
import re
text = "Hello World"
pattern = "Hello"
match = re.match(pattern, text)
if match:
    print("Match found:", match.group(0))
else:
    print("No match found")
2. re.search(): Finding the First Match
re.search() scans the entire string for the first occurrence of the pattern. Unlike re.match(), it doesn’t require the match to be at the beginning.
import re
text = "Hello World"
pattern = "World"
match = re.search(pattern, text)
if match:
    print("Match found:", match.group(0))
else:
    print("No match found")
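To make the difference between the two functions concrete, here is a minimal sketch that runs the same pattern through both. The variable names at_start and anywhere are illustrative only.

```python
import re

text = "Hello World"

# re.match() only tries the pattern at position 0, so "World" is not found.
at_start = re.match("World", text)

# re.search() scans forward and finds "World" later in the string.
anywhere = re.search("World", text)

print(at_start)            # None
print(anywhere.group(0))   # World
print(anywhere.start())    # 6
```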
3. re.compile(): Optimizing Performance
For better performance, especially with repeated use of the same pattern, compile it using re.compile(). This creates a reusable pattern object.
import re
compiled_pattern = re.compile(r"\d+") # Compile the pattern (one or more digits)
text1 = "There are 123 apples"
text2 = "And 456 oranges"
match1 = compiled_pattern.search(text1)
match2 = compiled_pattern.search(text2)
print(match1.group(0)) # Output: 123
print(match2.group(0)) # Output: 456
4. Flags: Modifying Matching Behavior
Flags modify the matching process. re.IGNORECASE performs case-insensitive matching, and re.MULTILINE makes the ^ and $ anchors match at the start and end of every line rather than only at the start and end of the whole string.
import re
text = "Hello world"
pattern = re.compile("hello", re.IGNORECASE)
match = pattern.search(text)
print(match.group(0)) # Output: Hello
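The effect of re.MULTILINE is easiest to see side by side with the default behavior. The following is a small sketch; the log string and variable names are made up for illustration.

```python
import re

# Hypothetical three-line log used only for this example.
log = "error: disk full\ninfo: retrying\nerror: timeout"

# Without re.MULTILINE, ^ and $ anchor to the whole string, so nothing matches.
default_hits = re.findall(r"^error: (.+)$", log)

# With re.MULTILINE, ^ and $ also anchor at each line boundary.
multiline_hits = re.findall(r"^error: (.+)$", log, re.MULTILINE)

print(default_hits)    # []
print(multiline_hits)  # ['disk full', 'timeout']
```

Flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.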
5. Character Sets: Defining Allowed Characters
Character sets ([]) specify allowed characters. For instance, [a-z] matches lowercase letters.
import re
text = "abc123XYZ"
pattern = re.compile("[a-z]+")
match = pattern.search(text)
print(match.group(0)) # Output: abc
6. Search and Replace with re.sub()
re.sub() replaces occurrences of a pattern with a replacement string.
import re
text = "Hello World"
new_text = re.sub("World", "Python", text)
print(new_text) # Output: Hello Python
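Beyond literal replacement, re.sub() also accepts backreferences, a count limit, and a function as the replacement. The sketch below shows each; the sample strings and the helper name double are illustrative assumptions.

```python
import re

# \1, \2, \3 in the replacement refer to the capturing groups of the pattern.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-01-15")
print(reordered)   # 15/01/2024

# count limits how many replacements are made.
first_only = re.sub(r"a", "A", "banana", count=1)
print(first_only)  # bAnana

# A function receives each match object and returns the replacement text.
def double(match):
    return str(int(match.group(0)) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)     # 6 apples and 20 pears
```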
7. re.findall(): Extracting All Matches
re.findall() returns a list of all non-overlapping matches.
import re
text = "123 abc 456 def"
numbers = re.findall(r"\d+", text)
print(numbers) # Output: ['123', '456']
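One detail worth knowing is that the shape of the result changes when the pattern contains capturing groups. This short sketch, using an invented key=value string, shows both cases.

```python
import re

text = "alice=30, bob=25"

# No capturing groups: each item is the full matched text.
whole = re.findall(r"\w+=\d+", text)
print(whole)  # ['alice=30', 'bob=25']

# Two capturing groups: each item is a tuple with one entry per group.
pairs = re.findall(r"(\w+)=(\d+)", text)
print(pairs)  # [('alice', '30'), ('bob', '25')]
```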
8. re.finditer(): Iterating Through Matches
re.finditer() returns an iterator that yields match objects one at a time. This is more memory-efficient than re.findall() when a large string contains many matches.
import re
text = "123 abc 456 def"
for match in re.finditer(r"\d+", text):
    print(match.group(0)) # Output: 123, 456 (on separate lines)
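Because each item is a full match object, re.finditer() also gives access to match positions, which re.findall() does not. A minimal sketch, with positions as an illustrative variable name:

```python
import re

text = "123 abc 456 def"

# Collect the matched text together with its start and end offsets.
positions = [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(positions)  # [('123', 0, 3), ('456', 8, 11)]
```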
9. re.split(): Splitting Strings by Pattern
re.split() splits a string based on a pattern.
import re
text = "apple,banana,cherry"
fruits = re.split(r",", text)
print(fruits) # Output: ['apple', 'banana', 'cherry']
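A single comma could be handled by str.split() as well; re.split() becomes useful when the delimiter varies. The sketch below assumes a made-up input with mixed separators and also shows the maxsplit argument.

```python
import re

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", "apple, banana;cherry  date")
print(parts)    # ['apple', 'banana', 'cherry', 'date']

# maxsplit caps the number of splits; the remainder is left intact.
limited = re.split(r",", "a,b,c,d", maxsplit=2)
print(limited)  # ['a', 'b', 'c,d']

# A capturing group in the pattern keeps the delimiters in the result.
with_delims = re.split(r"(,)", "a,b")
print(with_delims)  # ['a', ',', 'b']
```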
10. Basic Patterns: Anchors, Character Classes
- .: Matches any character except newline.
- ^: Matches the beginning of the string.
- $: Matches the end of the string.
- []: Matches a set of characters (e.g., [abc], [a-z]).
- [^...]: Matches any character *not* in the set (negated character set).
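The patterns listed above can be tried out with a few one-liners. This is only a sketch with sample strings chosen for illustration.

```python
import re

# . matches any single character except a newline, so "c\nt" is skipped.
dot_hits = re.findall(r"c.t", "cat cot c\nt cut")
print(dot_hits)  # ['cat', 'cot', 'cut']

# ^ anchors to the start of the string, $ to the end.
starts = re.search(r"^Hello", "Hello World") is not None
ends = re.search(r"World$", "Hello World") is not None
print(starts, ends)  # True True

# [aeiou] matches one vowel; [^aeiou] matches one character that is not a vowel.
vowels = re.findall(r"[aeiou]", "regex")
others = re.findall(r"[^aeiou]", "regex")
print(vowels)  # ['e', 'e']
print(others)  # ['r', 'g', 'x']
```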
11. Repetition: Quantifiers and Greedy vs. Non-Greedy Matching
- *: Zero or more occurrences.
- +: One or more occurrences.
- ?: Zero or one occurrence.
- {m}: Exactly m occurrences.
- {m,n}: From m to n occurrences.
- *?, +?, ??, {m,n}?: Non-greedy versions (match the shortest possible string).
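The difference between greedy and non-greedy quantifiers shows up clearly on text with repeated delimiters. The following sketch uses a small HTML-like string purely as sample input.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, from the first "<" to the last ">".
greedy = re.findall(r"<.+>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first ">" that completes a match.
lazy = re.findall(r"<.+?>", html)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# {m,n} bounds the repetition count: here, runs of two to three digits.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")
print(bounded)  # ['22', '333', '444']
```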
12. Special Sequences: Digits, Whitespace, Word Characters
- \d: Matches any digit (0-9; for str patterns this also includes other Unicode decimal digits unless re.ASCII is set).
- \D: Matches any non-digit character.
- \s: Matches any whitespace character (space, tab, newline, and similar).
- \S: Matches any non-whitespace character.
- \w: Matches any word character (letters, digits, underscore).
- \W: Matches any non-word character.
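As a quick illustration of these sequences, the sketch below applies several of them to one invented sample string. Raw strings (the r prefix) keep Python from interpreting the backslashes itself.

```python
import re

text = "Order 66: 3 items\tshipped_ok"

digits = re.findall(r"\d+", text)       # runs of digits
words = re.findall(r"\w+", text)        # runs of word characters
fields = re.split(r"\s+", text)         # split on any whitespace run
digits_only = re.sub(r"\D", "", text)   # drop every non-digit

print(digits)       # ['66', '3']
print(words)        # ['Order', '66', '3', 'items', 'shipped_ok']
print(fields)       # ['Order', '66:', '3', 'items', 'shipped_ok']
print(digits_only)  # 663
```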
13. re.escape(): Handling Special Characters
re.escape() escapes special characters in a string, allowing you to use it as a literal pattern without unintended regex interpretations.
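A typical use is searching for user-supplied text that happens to contain metacharacters. The sketch below assumes a hypothetical filename containing parentheses and a dot.

```python
import re

# Hypothetical literal text that contains the metacharacters ( ) and .
filename = "report (final).txt"
sentence = "Saved report (final).txt to disk"

# Used as-is, the parentheses form a group and the pattern fails to match.
unescaped = re.search(filename, sentence)

# After re.escape(), every special character is matched literally.
escaped = re.search(re.escape(filename), sentence)

print(unescaped)         # None
print(escaped.group(0))  # report (final).txt
```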
14. Capturing Groups and the group() Method
Parentheses () create capturing groups. The group() method accesses captured substrings.
import re
text = "My phone number is 123-456-7890"
match = re.search(r"(\d{3})-(\d{3})-(\d{4})", text)
if match:
    area_code = match.group(1)
    prefix = match.group(2)
    line_number = match.group(3)
    print(f"Area Code: {area_code}, Prefix: {prefix}, Line Number: {line_number}")
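Groups can also be given names with the (?P<name>...) syntax, which tends to read better than numeric indexes. This is a sketch of the same phone-number example; the group names area, prefix, and line are chosen here for illustration.

```python
import re

text = "My phone number is 123-456-7890"
match = re.search(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})", text)

if match:
    print(match.group(0))       # 123-456-7890 (the whole match)
    print(match.group("area"))  # 123
    print(match.groups())       # ('123', '456', '7890')
    print(match.groupdict())    # {'area': '123', 'prefix': '456', 'line': '7890'}
```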
This tutorial covers the core of Python’s re module: matching, searching, substitution, splitting, and the basic pattern syntax. For advanced techniques and a complete reference, consult the official Python documentation.