Python – Substituting patterns in text using regex
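Regular expressions (regex) are used to extract required information from any text that follows a pattern. They are also widely used to manipulate such text, which makes them a core tool for text preprocessing and very helpful in fields such as Natural Language Processing (NLP).
This article demonstrates how to substitute patterns using regex through several examples, each covering a distinct scenario. To follow the solutions, it is essential to understand the re.sub() method of the re (regular expression) module.
The re.sub() method performs a global search and a global replace on the given string. It is used to substitute a specific pattern in the string. The function takes 5 parameters in total.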
Syntax: re.sub(pattern, repl, string, count=0, flags=0)
Parameters:
pattern – the pattern to be searched for and substituted
repl – the replacement for each match (a string, or a function that is called for each match)
string – the string in which the pattern is searched and substituted
count – the maximum number of matches to replace; the default, 0, replaces every match
flags – modifiers that change how the regex pattern is matched, such as re.I
count and flags are optional arguments.
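As a minimal sketch of how the optional arguments change the result, the snippet below calls re.sub() once with the defaults and once with count and flags set. The sample string and variable names are illustrative assumptions, not taken from the examples in this article.

```python
import re

# illustrative sample string (an assumption for this sketch)
text = "Red apple, red car, RED flag"

# pattern="red", repl="blue", string=text; with the defaults
# (count=0, flags=0) every case-sensitive match is replaced
all_replaced = re.sub(r"red", "blue", text)

# count=2 limits the call to the first two matches;
# flags=re.IGNORECASE lets the pattern match "Red" and "RED" as well
first_two = re.sub(r"red", "blue", text, count=2, flags=re.IGNORECASE)

print(all_replaced)  # Red apple, blue car, RED flag
print(first_two)     # blue apple, blue car, RED flag
```

Passing count and flags as keyword arguments keeps the call readable and avoids mixing up their positions.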
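Example 1: Substitution of a specific text pattern
In this example, a given text pattern is searched for and replaced with a string. The idea is to use re.sub() in its most basic form, with only the first 3 arguments.
Below is the implementation.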
# Python implementation of substituting a
# specific text pattern in a string using regex

# importing regex module
import re


# Function to perform
# operations on the strings
def substitutor():

    # a string variable
    sentence1 = "It is raining outside."

    # replacing text 'raining' in the string
    # variable sentence1 with 'sunny' thus
    # passing first parameter as raining
    # second as sunny, third as the
    # variable name in which string is stored
    # and printing the modified string
    print(re.sub(r"raining", "sunny", sentence1))

    # a string variable
    sentence2 = "Thank you very very much."

    # replacing text 'very' in the string
    # variable sentence2 with 'so' thus
    # passing parameters at their
    # appropriate positions and printing
    # the modified string
    print(re.sub(r"very", "so", sentence2))


# Driver Code:
substitutor()

Output:
It is sunny outside.
Thank you so so much.
No matter how many times the pattern occurs in the string, re.sub() replaces every occurrence with the given replacement. That is why both occurrences of ‘very’ are replaced by ‘so’ in the example above.
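One consequence of matching every occurrence is that the pattern also matches inside longer words. The sketch below uses an illustrative sentence (not from the example above) to show this, and uses the word-boundary token \b as one possible way to restrict the match to whole words.

```python
import re

# illustrative sentence (an assumption for this sketch)
sentence = "Thank you very much, everyone."

# plain pattern: also matches the "very" inside "everyone"
loose = re.sub(r"very", "so", sentence)

# \b marks a word boundary, so only the standalone word matches
strict = re.sub(r"\bvery\b", "so", sentence)

print(loose)   # Thank you so much, esoone.
print(strict)  # Thank you so much, everyone.
```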
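Example 2: Substituting a character set with a specific character
The task is to replace a character set with a given character. A character set here means a range of characters. In the pattern passed to re.sub(), a character set is written inside [ ] (square brackets).
In this example, the lowercase character set, i.e. [a-z], is replaced by the digit 0. Below is the implementation.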
# Python implementation of substituting
# a character set with a specific character

# importing regex module
import re


# Function to perform
# operations on the strings
def substitutor():

    # a string variable
    sentence = "22 April is celebrated as Earth Day."

    # replacing every lower case character
    # in the variable sentence with 0 and
    # printing the modified string
    print(re.sub(r"[a-z]", "0", sentence))


# Driver Code:
substitutor()

Output:
22 A0000 00 0000000000 00 E0000 D00.
To substitute both lowercase and uppercase characters, either add the uppercase range to the character set, i.e. [a-zA-Z], or, more conveniently, use flags.
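A minimal sketch of the first option, reusing the sentence from Example 2 with the combined character set [a-zA-Z]:

```python
import re

sentence = "22 April is celebrated as Earth Day."

# [a-zA-Z] lists both ranges explicitly, so no flag is needed
masked = re.sub(r"[a-zA-Z]", "0", sentence)

print(masked)  # 22 00000 00 0000000000 00 00000 000.
```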
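Example 3: Case-insensitive substitution of a character set with a specific character
In this example, both lowercase and uppercase characters are replaced with the given character. Flags make this task very easy.
The re.I flag is short for re.IGNORECASE. By passing this flag to re.sub() and specifying either one of the character sets, lowercase or uppercase, the task can be completed.
Below is the implementation.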
# Python implementation of case-insensitive substitution
# of a character set with a specific character

# importing regex module
import re


# Function to perform
# operations on the strings
def substitutor():

    # a string variable
    sentence = "22 April is celebrated as Earth Day."

    # replacing both lowercase and
    # uppercase characters with 0 in
    # the variable sentence by using
    # flag and printing the modified string
    print(re.sub(r"[a-z]", "0", sentence, flags=re.I))


# Driver Code:
substitutor()

Output:
22 00000 00 0000000000 00 00000 000.
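Example 4: Performing substitution on a limited number of characters
In this example, the substitution is applied only up to a certain number of characters rather than to the whole string. For this kind of substitution, re.sub() provides the count parameter.
count caps the number of matches that are replaced. Because the pattern [a-z] matches exactly one character, giving count a numeric value here controls how many characters are substituted. Below is the implementation.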
# Python implementation to perform substitution
# up to a certain number of characters

# importing regex module
import re


# Function to perform
# operations on the strings
def substitutor():

    # a string variable
    sentence = "Follow your Passion."

    # case-insensitive substitution on the
    # variable sentence, limited to the first
    # eight matches (count=8, one character
    # per match), and printing the modified string
    print(re.sub(r"[a-z]", "0", sentence, count=8, flags=re.I))


# Driver Code:
substitutor()

Output:
000000 00ur Passion.
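Note that count limits the number of matches, not the number of characters. The sketch below uses an illustrative sentence (not from the example above) with a pattern that is four characters long to make the difference visible.

```python
import re

# illustrative sentence (an assumption for this sketch)
sentence = "very very very good"

# count=2 replaces the first two matches of the pattern,
# even though each match spans four characters
result = re.sub(r"very", "so", sentence, count=2)

print(result)  # so so very good
```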
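Example 5: Substitution using shorthand character classes and preprocessing of text
The regex module provides shorthand character classes for the character sets that come up most often in text preprocessing. Using them keeps patterns concise and removes the need to remember the range of each character set.
The following are some commonly used shorthand character classes: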
\w: matches alphanumeric characters and the underscore
\W: matches non-word characters such as @, #, ‘, +, %, –
\d: matches digit characters
\s: matches whitespace characters
Meaning of some syntax:
adding a plus (+) symbol after a character class or set: matches one or more repetitions of the preceding character class or set.
adding an asterisk (*) symbol after a character class or set: matches zero or more repetitions of the preceding character class or set.
adding a caret (^) symbol before a character class or set: the match is anchored at the beginning of the string.
adding a dollar ($) symbol after a character class or set: the match is anchored at the end of the string.
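The shorthand classes, quantifiers and anchors listed above can be sketched in a few calls. The sample string and variable names are illustrative assumptions; repr() is used only to make the leading and trailing spaces visible.

```python
import re

# illustrative sample string (an assumption for this sketch)
text = "  Order #42:   3 items  "

# \d+ : one or more digits, so "42" is replaced as a single unit
digits_masked = re.sub(r"\d+", "N", text)

# \s+ : every run of whitespace collapses to one space
collapsed = re.sub(r"\s+", " ", text)

# ^\s+ and \s+$ : whitespace anchored at the start or end of the string
stripped = re.sub(r"^\s+|\s+$", "", text)

print(repr(digits_masked))  # '  Order #N:   N items  '
print(repr(collapsed))      # ' Order #42: 3 items '
print(repr(stripped))       # 'Order #42:   3 items'
```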
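This example uses the shorthand character classes mentioned above to substitute and preprocess text, producing clean, noise-free strings. Below is the implementation.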
# Python implementation of substitution using
# shorthand character classes and preprocessing of text

# importing regex module
import re


# Function to perform
# operations on the strings
def substitutor():

    # list of strings
    S = ["2020 Olympic games have @# been cancelled",
         "Dr Vikram Sarabhai was +%--the ISRO’s first chairman",
         "Dr Abdul Kalam, the father of India's missile programme"]

    # loop to iterate over every element of the list
    for i in range(len(S)):

        # replacing every non-word character with a white space
        S[i] = re.sub(r"\W", " ", S[i])

        # replacing every digit character with a white space
        S[i] = re.sub(r"\d", " ", S[i])

        # replacing one or more white spaces with a single white space
        S[i] = re.sub(r"\s+", " ", S[i])

        # replacing a single alphabetic character that has one or more
        # white spaces before and after it with a white space
        S[i] = re.sub(r"\s+[a-z]\s+", " ", S[i], flags=re.I)

        # substituting one or more white spaces at the
        # beginning of the string with an empty string
        S[i] = re.sub(r"^\s+", "", S[i])

        # substituting one or more white spaces at the
        # end of the string with an empty string
        S[i] = re.sub(r"\s+$", "", S[i])

    # loop to iterate over every element of the list
    for i in range(len(S)):

        # printing each modified string
        print(S[i])


# Driver Code:
substitutor()

Output:
Olympic games have been cancelled
Dr Vikram Sarabhai was the ISRO first chairman
Dr Abdul Kalam the father of India missile programme
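If the same cleaning steps are needed in more than one place, they can be wrapped in a function that takes a string and returns the cleaned result. The sketch below is one possible arrangement; the name clean_text is hypothetical and not part of the re module or of the example above.

```python
import re


def clean_text(text):
    """Hypothetical helper: apply the six substitutions from Example 5 in order."""
    text = re.sub(r"\W", " ", text)    # non-word characters -> space
    text = re.sub(r"\d", " ", text)    # digits -> space
    text = re.sub(r"\s+", " ", text)   # collapse whitespace runs
    # drop single letters that stand alone between spaces
    text = re.sub(r"\s+[a-z]\s+", " ", text, flags=re.I)
    text = re.sub(r"^\s+", "", text)   # strip leading whitespace
    text = re.sub(r"\s+$", "", text)   # strip trailing whitespace
    return text


samples = ["2020 Olympic games have @# been cancelled",
           "Dr Abdul Kalam, the father of India's missile programme"]

cleaned = [clean_text(s) for s in samples]

for line in cleaned:
    print(line)
```

Returning a new string instead of modifying a list in place makes the function easy to reuse on a single string or inside a list comprehension.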