📅  Last modified: 2023-12-03 14:46:03.535000             🧑  Author: Mango
When processing text data, one of the most common tasks is extracting email addresses from a block of text. Even though email addresses have a well-defined format, matching them reliably with regular expressions is still hard. In this tutorial, we'll learn how to extract email addresses from a string of text and exclude any matches that end in ".jpg".
For this task, we'll use the re module, which is part of Python's standard library. It provides support for regular expressions and makes it easy to search for patterns in text data.
import re
In order to extract the email addresses from a string of text, we need to define a regular expression that matches the pattern of an email address. Here's what the expression looks like:
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b(?<!\.jpg)'
Let's break down the above expression:
\b matches a word boundary (i.e., the beginning or end of a word).
[A-Za-z0-9._%+-]+ matches one or more characters that are letters, digits, or one of the special characters ._%+-
@ matches the "@" symbol.
[A-Za-z0-9.-]+ matches one or more characters that are letters, digits, or one of the special characters .-
\. matches a literal "." character.
[A-Za-z]{2,} matches two or more uppercase or lowercase letters. (No "|" is needed inside a character class; there it would match a literal pipe rather than mean "or".)
\b again matches a word boundary.
(?<!\.jpg) is a negative lookbehind assertion that rejects any match whose last four characters are ".jpg".
Once we have defined the regular expression, we can use the re.findall() function to extract all email addresses from a string of text.
Here's an example:
text = 'my email address is me@example.com and my colleague\'s email address is you@example.com. However, we don\'t want to receive emails with .jpg attachment.'
emails = re.findall(email_pattern, text)
print(emails)
The output will be:
['me@example.com', 'you@example.com']
As you can see, both addresses are returned because neither ends with ".jpg". The bare ".jpg" in the last sentence is not part of an address, so it never becomes a candidate match.
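The sample text above contains no address-like token ending in ".jpg", so the exclusion is not really exercised there. Here is a minimal sketch that does exercise it, assuming image filenames such as logo@2x.jpg are the kind of false positive the lookbehind is meant to drop. The sample string is made up for illustration, and the pattern is restated (with the final class written as [A-Za-z]) so the snippet runs on its own.

```python
import re

# The tutorial's pattern, restated so this snippet is self-contained.
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b(?<!\.jpg)'

# Hypothetical sample text: one real address plus two image filenames
# that look like addresses because they contain "@".
sample = 'Contact me@example.com; the banner is logo@2x.jpg and the icon is icon@site.jpg.'

found = re.findall(email_pattern, sample)
print(found)  # ['me@example.com']
```

The two filenames are skipped because the trailing \b forces the match to end after "jpg", and at that position the lookbehind sees ".jpg" and rejects it.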
In this tutorial, we learned how to extract email addresses from a string of text using regular expressions in Python. We also learned how to exclude matches that end with the ".jpg" extension by appending a negative lookbehind. Regular expressions are a powerful tool for processing text data and can be used for many other text manipulation tasks as well.
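As one way to take the same idea a little further, the sketch below chains one negative lookbehind per unwanted extension and compiles the pattern case-insensitively, so ".JPG" is rejected as well as ".jpg". The helper build_email_pattern and its blocked_suffixes parameter are hypothetical names introduced only for this illustration, not part of the re module, and the sample string is made up.

```python
import re

def build_email_pattern(blocked_suffixes=('.jpg',)):
    """Compile the tutorial's pattern with one negative lookbehind per suffix.

    build_email_pattern and blocked_suffixes are hypothetical names used
    only for this sketch.
    """
    # Each suffix is escaped so "." is literal; every lookbehind has a fixed
    # width, which Python's re module requires.
    lookbehinds = ''.join(f'(?<!{re.escape(suffix)})' for suffix in blocked_suffixes)
    base = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    # re.IGNORECASE makes the lookbehinds reject ".JPG" as well as ".jpg".
    return re.compile(base + lookbehinds, re.IGNORECASE)

pattern = build_email_pattern(('.jpg', '.png'))
sample = 'a@b.com x@2x.PNG y@3x.jpg Z@Example.ORG'
matches = pattern.findall(sample)
print(matches)  # ['a@b.com', 'Z@Example.ORG']
```

Chaining separate lookbehinds keeps each assertion fixed-width, which avoids the restriction that a single lookbehind in re cannot contain alternatives of different lengths.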