📜  Regular Expressions in Python with Examples | Set 1

📅  Last modified: 2022-05-13 01:54:39.577000             🧑  Author: Mango


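A regular expression (RegEx) is a special sequence of characters that defines a search pattern for finding a string or a set of strings. It can detect the presence or absence of text by matching it against a particular pattern, and it can also split a pattern into one or more sub-patterns. Python provides the re module, which supports regular expressions. One of its primary functions is re.search(), which takes a regular expression and a string and returns either the first match or None.

Example: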

Python3
import re
 
s = 'GeeksforGeeks: A computer science portal for geeks'
 
match = re.search(r'portal', s)
 
print('Start Index:', match.start())
print('End Index:', match.end())


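Output
Start Index: 34
End Index: 40

The code above gives the start index and end index of the substring 'portal'.

Note: Here the r prefix (r'portal') stands for raw, not regex. A raw string differs slightly from a regular string: it does not treat the \ character as an escape character. This matters because the regular expression engine uses \ for its own escaping.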
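The difference between raw and regular string literals can be checked directly. The following is a small sketch; the variable names are illustrative and do not come from the original article.

```python
import re

# In a regular string literal, "\n" is a single newline character.
regular = "\n"
# In a raw string literal, r"\n" is two characters: a backslash and "n".
raw = r"\n"

print(len(regular))  # 1
print(len(raw))      # 2

# "\b" in a regular string is a backspace character, while r"\b"
# reaches the regex engine as the word-boundary sequence.
text = "portal for geeks"
with_raw = re.search(r"\bfor\b", text)
with_regular = re.search("\bfor\b", text)

print(with_raw is not None)      # True
print(with_regular is not None)  # False
```

Writing patterns as raw strings avoids this kind of silent mismatch.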

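Before working with the Python regex module, let's see how to actually write regular expressions using metacharacters or special sequences.

MetaCharacters

Metacharacters are the building blocks of regular expressions, and they are used throughout the functions of the re module. Below is the list of metacharacters.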

MetaCharacter   Description
\               Drops the special meaning of the character following it
[]              Represents a character class
^               Matches the beginning
$               Matches the end
.               Matches any character except newline
|               Means OR (matches any of the patterns separated by it)
?               Matches zero or one occurrence
*               Matches any number of occurrences (including 0 occurrences)
+               Matches one or more occurrences
{}              Indicates the number of occurrences of the preceding regex to match
()              Encloses a group of regex

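Let's discuss each of these metacharacters in detail.

\ – Backslash

The backslash (\) makes sure that the character following it is not treated in a special way. This can be considered a way of escaping metacharacters. For example, if you want to search for the dot (.) in a string, you will find that the dot is treated as a special character, because it is one of the metacharacters (shown in the table above). In this case, put a backslash (\) just before the dot (.) so that it loses its special meaning. See the example below for a better understanding.

Example: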

Python3

import re
 
s = 'geeks.forgeeks'
 
# without using \
match = re.search(r'.', s)
print(match)
 
# using \
match = re.search(r'\.', s)
print(match)
Output
<re.Match object; span=(0, 1), match='g'>
<re.Match object; span=(5, 6), match='.'>

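[] – Square Brackets

Square brackets ([]) represent a character class consisting of a set of characters that we wish to match. For example, the character class [abc] matches any single a, b, or c.

We can also specify a range of characters using - inside the square brackets. For example,

  • [0-3] is the same as [0123]
  • [a-c] is the same as [abc]

We can also invert the character class using the caret (^) symbol. For example,

  • [^0-3] means any character except 0, 1, 2, or 3
  • [^a-c] means any character except a, b, or c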
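The character classes described above can be tried out with a short sketch like the following. The sample string and variable names are made up for illustration.

```python
import re

text = "a1b2c3d4"

# [0-3] matches any one of the characters 0, 1, 2, 3.
digits_0_to_3 = re.findall(r"[0-3]", text)
# [a-c] matches any one of the characters a, b, c.
letters_a_to_c = re.findall(r"[a-c]", text)
# [^0-3] matches any character that is NOT 0, 1, 2 or 3,
# which includes letters as well as other digits.
not_0_to_3 = re.findall(r"[^0-3]", text)

print(digits_0_to_3)   # ['1', '2', '3']
print(letters_a_to_c)  # ['a', 'b', 'c']
print(not_0_to_3)      # ['a', 'b', 'c', 'd', '4']
```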

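^ – Caret

The caret (^) symbol matches the beginning of the string, i.e. it checks whether the string starts with the given character(s). For example:

  • ^g checks whether the string starts with g, such as geeks, globe, girl, g, etc.
  • ^ge checks whether the string starts with ge, such as geeks, geeksforgeeks, etc.

$ – Dollar

The dollar ($) symbol matches the end of the string, i.e. it checks whether the string ends with the given character(s). For example:

  • s$ checks for strings that end with s, such as geeks, ends, s, etc.
  • ks$ checks for strings that end with ks, such as geeks, geeksforgeeks, ks, etc.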
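As a minimal sketch of both anchors, the snippet below filters a small word list. The list itself is an assumption chosen for illustration.

```python
import re

words = ["geeks", "globe", "ends", "forge"]

# ^g succeeds only when the string starts with "g".
starts_with_g = [w for w in words if re.search(r"^g", w)]
# s$ succeeds only when the string ends with "s".
ends_with_s = [w for w in words if re.search(r"s$", w)]

print(starts_with_g)  # ['geeks', 'globe']
print(ends_with_s)    # ['geeks', 'ends']
```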

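. – Dot

The dot (.) symbol matches any single character except the newline character (\n). For example:

  • a.b checks for strings that contain any character in place of the dot, such as acb, acbd, abbb, etc.
  • .. checks whether the string contains at least 2 consecutive characters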

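| – Or

The or symbol works as the OR operator: it checks whether the pattern before or after the symbol is present in the string. For example:

  • a|b matches any string that contains a or b, such as acd, bcd, abcd, etc.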

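? – Question Mark

The question mark (?) checks whether the element before it in the regex occurs exactly once or not at all. For example:

  • ab?c matches the strings ac, acb, dabc, but does not match abbc because there are two b's. Similarly, it does not match abdc because b is not followed by c.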

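* – Star

The star (*) symbol matches zero or more occurrences of the regex preceding the * symbol. For example:

  • ab*c matches the strings ac, abc, abbbc, dabc, etc., but does not match abdc because b is not followed by c.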

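+ – Plus

The plus (+) symbol matches one or more occurrences of the regex preceding the + symbol. For example:

  • ab+c matches the strings abc, abbc, dabc, but does not match ac or abdc, because there is no b in ac and b is not followed by c in abdc.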

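{m,n} – Braces

Braces match the preceding regex repeated from m to n times, both inclusive. Write the bounds without a space after the comma. For example:

  • a{2,4} matches the strings aaab, baaaac, gaad, but does not match strings such as abc or bc, because in those cases there is only one a or no a at all.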
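The four quantifiers can be compared side by side. This is a small sketch under assumed inputs; the helper name matching and the candidate list are hypothetical and only serve the illustration.

```python
import re

candidates = ["ac", "abc", "abbc", "abbbc", "abdc"]

def matching(pattern):
    """Return the candidates in which the pattern is found (hypothetical helper)."""
    return [s for s in candidates if re.search(pattern, s)]

optional_b = matching(r"ab?c")          # zero or one b
any_b = matching(r"ab*c")               # zero or more b
some_b = matching(r"ab+c")              # one or more b
two_to_three_b = matching(r"ab{2,3}c")  # two to three b

print(optional_b)      # ['ac', 'abc']
print(any_b)           # ['ac', 'abc', 'abbc', 'abbbc']
print(some_b)          # ['abc', 'abbc', 'abbbc']
print(two_to_three_b)  # ['abbc', 'abbbc']
```

None of the patterns finds a match in abdc, since the b there is never followed directly by c.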

() – Group

The group symbol is used to group sub-patterns. For example:

  • (a|b)cd matches strings such as acd, abcd, gacd, etc.
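A short sketch of grouping combined with alternation follows. The input strings are taken from the bullet above plus one assumed non-matching string, ccd.

```python
import re

pattern = r"(a|b)cd"
results = {}

for s in ["acd", "abcd", "gacd", "ccd"]:
    m = re.search(pattern, s)
    # group(0) is the whole match, group(1) is what the parentheses captured.
    results[s] = (m.group(0), m.group(1)) if m else None

print(results["acd"])   # ('acd', 'a')
print(results["abcd"])  # ('bcd', 'b')
print(results["gacd"])  # ('acd', 'a')
print(results["ccd"])   # None
```

Note that for abcd the match is the substring bcd, so the group captures b rather than a.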

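Special Sequences

A special sequence does not stand for one literal character. Instead it stands for a predefined class of characters (such as \d) or for a specific position in the searched string where the match must occur (such as \b). Special sequences make commonly used patterns easier to write.

List of special sequences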

Special Sequence   Description and examples
\A   Matches if the string begins with the given characters.
     Example: \Afor matches in "for geeks" and "for the world".
\b   Matches at a word boundary. \b(string) checks for the beginning of a word and (string)\b checks for the end of a word.
     Example: \bge matches in "geeks" and "get".
\B   The opposite of \b, i.e. the match must not be at the start or end of a word.
     Example: \Bge matches in "together" and "forge".
\d   Matches any decimal digit; equivalent to the set class [0-9].
     Example: \d matches in "123" and "gee1".
\D   Matches any non-digit character; equivalent to the set class [^0-9].
     Example: \D matches in "geeks" and "geek1".
\s   Matches any whitespace character.
     Example: \s matches in "gee ks" and "a bc a".
\S   Matches any non-whitespace character.
     Example: \S matches in "a bd" and "abcd".
\w   Matches any alphanumeric character or underscore; equivalent to the class [a-zA-Z0-9_].
     Example: \w matches in "123" and "geeKs4".
\W   Matches any non-alphanumeric character.
     Example: \W matches in ">$" and "gee<>".
\Z   Matches if the string ends with the given regex.
     Example: ab\Z matches in "abcdab" and "abababab".
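A few of these sequences can be exercised with the sketch below. The sample strings reuse the examples from the list; the variable names are illustrative only.

```python
import re

sentence = "geeks get together forge"

# \bge : "ge" at the beginning of a word.
at_word_start = re.findall(r"\bge\w*", sentence)
# \Bge : "ge" that is not at the beginning of a word.
inside_word = re.findall(r"\w*\Bge\w*", sentence)

# \d and \s stand for whole character classes.
digits = re.findall(r"\d", "gee1 123")
spaces = re.findall(r"\s", "a bc a")

# \A and \Z anchor the match to the very start and end of the string.
starts_with_for = re.search(r"\Afor", "for geeks") is not None
ends_with_ab = re.search(r"ab\Z", "abcdab").span()

print(at_word_start)    # ['geeks', 'get']
print(inside_word)      # ['together', 'forge']
print(digits)           # ['1', '1', '2', '3']
print(spaces)           # [' ', ' ']
print(starts_with_for)  # True
print(ends_with_ab)     # (4, 6)
```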

The Regex Module in Python

Python has a module named re that is used for regular expressions. We can import this module with the import statement.

Example: Importing the re module in Python

Python3

import re

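Let's look at the various functions this module provides for working with regular expressions in Python.

re.findall()

Returns all non-overlapping matches of the pattern in the string, as a list of strings. The string is scanned from left to right, and matches are returned in the order found.

Example: Finding all occurrences of a pattern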

Python3

# A Python program to demonstrate working of
# findall()
import re
 
# A sample text string where regular expression
# is searched.
string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""
 
# A sample regular expression to find digits.
regex = r'\d+'
 
match = re.findall(regex, string)
print(match)
 
# This example is contributed by Ayush Saluja.
Output
['123456789', '987654321']
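When the pattern contains capturing groups, re.findall() returns the group contents instead of the whole match. The sketch below shows this on a one-line version of the sample text; the patterns with groups are additions for illustration.

```python
import re

text = "Hello my Number is 123456789 and my friend's number is 987654321"

# No group: each list item is the whole match.
no_group = re.findall(r"\d+", text)
# One group: each list item is the text captured by that group.
one_group = re.findall(r"[Nn]umber is (\d+)", text)
# Several groups: each list item is a tuple with one entry per group.
two_groups = re.findall(r"([Nn]umber) is (\d+)", text)

print(no_group)    # ['123456789', '987654321']
print(one_group)   # ['123456789', '987654321']
print(two_groups)  # [('Number', '123456789'), ('number', '987654321')]
```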

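re.compile()

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

Example 1: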

Python

# Import the regular expression module.
import re
 
# compile() creates regular expression
# character class [a-e],
# which is equivalent to [abcde].
# class [abcde] will match with string with
# 'a', 'b', 'c', 'd', 'e'.
p = re.compile('[a-e]')
 
# findall() searches for the Regular Expression
# and return a list upon finding
print(p.findall("Aye, said Mr. Gibenson Stark"))

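Output:

['e', 'a', 'd', 'b', 'e', 'a']

Understanding the output:

  • The first occurrence is 'e' in "Aye" and not 'A', because matching is case-sensitive.
  • The next occurrence is 'a' in "said", then 'd' in "said", followed by 'b' and 'e' in "Gibenson"; the last 'a' matches in "Stark".
  • The metacharacter backslash '\' has a very important role, as it signals various sequences. To use a backslash without its special meaning as a metacharacter, use '\\'.

Example 2: The set class [\s,.] matches any whitespace character, ',' or '.'.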

Python

import re
 
# \d is equivalent to [0-9].
p = re.compile(r'\d')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))
 
# \d+ matches a run of one or more
# characters from [0-9].
p = re.compile(r'\d+')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

Output:

['1', '1', '4', '1', '8', '8', '6']
['11', '4', '1886']

Example 3:

Python

import re
 
# \w is equivalent to [a-zA-Z0-9_].
p = re.compile(r'\w')
print(p.findall("He said * in some_lang."))
 
# \w+ matches a run of alphanumeric characters.
p = re.compile(r'\w+')
print(p.findall("I went to him at 11 A.M., he \
said *** in some_language."))
 
# \W matches non-alphanumeric characters.
p = re.compile(r'\W')
print(p.findall("he said *** in some_language."))

Output:

['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']
['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']
[' ', ' ', '*', '*', '*', ' ', ' ', '.']

Example 4:

Python

import re
 
# '*' matches zero or more occurrences
# of the preceding character.
p = re.compile('ab*')
print(p.findall("ababbaabbb"))

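Output:

['ab', 'abb', 'a', 'abbb']

Understanding the output:

  • Our RE is ab*, that is, 'a' followed by any number of 'b's, starting from 0.
  • The output 'ab' is valid because a single 'a' is followed by a single 'b'.
  • The output 'abb' is valid because a single 'a' is followed by 2 'b's.
  • The output 'a' is valid because a single 'a' is followed by 0 'b's.
  • The output 'abbb' is valid because a single 'a' is followed by 3 'b's.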
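A compiled pattern object can be reused for several operations without recompiling. The following sketch assumes the sentence from Example 2; the variable names are illustrative.

```python
import re

# Compile once, then call several methods on the same pattern object.
number = re.compile(r"\d+")
sentence = "I went to him at 11 A.M. on 4th July 1886"

all_numbers = number.findall(sentence)  # every run of digits
first = number.search(sentence)         # only the first run of digits
masked = number.sub("#", sentence)      # replace each run with '#'

print(all_numbers)    # ['11', '4', '1886']
print(first.group())  # 11
print(masked)         # I went to him at # A.M. on #th July #
```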

re.split()

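Splits the string at each occurrence of a character or pattern. The parts of the string between the matches are returned as the elements of the resulting list.

Syntax:

re.split(pattern, string, maxsplit=0, flags=0)

The first parameter, pattern, is the regular expression. string is the given string in which the pattern is searched for and in which the splitting occurs. maxsplit, if not provided, is taken to be zero (0), which means there is no limit; if a nonzero value is provided, at most that many splits occur. With maxsplit = 1 the string is split only once, resulting in a list of length 2. flags are optional and can help shorten code; for example, with flags = re.IGNORECASE the split ignores case, so lowercase and uppercase letters are treated alike.

Example 1: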

Python

from re import split
 
# r'\W+' matches one or more non-alphanumeric
# characters. On finding ',' or whitespace ' ',
# split() splits the string at that point.
print(split(r'\W+', 'Words, words , Words'))
print(split(r'\W+', "Word's words Words"))
 
# Here ':', ' ' and ',' are not alphanumeric, so
# splitting occurs at those points.
print(split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))
 
# r'\d+' matches one or more digits, so splitting
# occurs at '12', '2016', '11' and '02' only.
print(split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))

Output:

['Words', 'words', 'Words']
['Word', 's', 'words', 'Words']
['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']

Example 2:

Python

import re
 
# Splitting occurs only once, at '12';
# the returned list has length 2.
print(re.split(r'\d+', 'On 12th Jan 2016, at 11:02 AM', maxsplit=1))
 
# 'Boy' and 'boy' will be treated same when
# flags = re.IGNORECASE
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here', flags=re.IGNORECASE))
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here'))

Output:

['On ', 'th Jan 2016, at 11:02 AM']
['', 'y, ', 'oy oh ', 'oy, ', 'om', ' h', 'r', '']
['A', 'y, Boy oh ', 'oy, ', 'om', ' h', 'r', '']
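If the pattern is wrapped in a capturing group, re.split() also keeps the matched separators in the result. This is a brief sketch reusing the sample sentence; the grouped pattern is an addition for illustration.

```python
import re

text = 'On 12th Jan 2016, at 11:02 AM'

# Without a group the separators (the digits) are dropped.
without_group = re.split(r'\d+', text)
# With a capturing group the separators are kept as list items.
with_group = re.split(r'(\d+)', text)

print(without_group)
# ['On ', 'th Jan ', ', at ', ':', ' AM']
print(with_group)
# ['On ', '12', 'th Jan ', '2016', ', at ', '11', ':', '02', ' AM']
```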

re.sub()

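The 'sub' in the function name stands for substitution. The given string (third parameter) is searched for a regular expression pattern, and each match found is replaced by repl (second parameter). count limits the number of replacements; the default value 0 replaces every occurrence.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Example 1: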

Python

import re
 
# Regular Expression pattern 'ub' matches the
# string at "Subject" and "Uber". As the CASE
# has been ignored, using Flag, 'ub' should
# match twice with the string Upon matching,
# 'ub' is replaced by '~*' in "Subject", and
# in "Uber", 'Ub' is replaced.
print(re.sub('ub', '~*', 'Subject has Uber booked already',
             flags=re.IGNORECASE))
 
# Consider the Case Sensitivity, 'Ub' in
# "Uber", will not be replaced.
print(re.sub('ub', '~*', 'Subject has Uber booked already'))
 
# As count has been given value 1, the maximum
# times replacement occurs is 1
print(re.sub('ub', '~*', 'Subject has Uber booked already',
             count=1, flags=re.IGNORECASE))
 
# 'r' before the pattern marks a raw string; \s matches
# the whitespace before and after the word "AND".
print(re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam',
             flags=re.IGNORECASE))

Output

S~*ject has ~*er booked already
S~*ject has Uber booked already
S~*ject has Uber booked already
Baked Beans & Spam
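The repl argument can also refer back to captured groups, or be a function that receives each match object. The sketch below illustrates both; the date string, the sentence, and the helper name double are assumptions made for this example.

```python
import re

# Backreferences \1, \2, \3 in repl refer to the captured groups.
date = "2016-01-12"
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

def double(match):
    """Hypothetical helper: return the matched number doubled, as text."""
    return str(int(match.group()) * 2)

# When repl is a function, it is called once per match.
doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

print(swapped)  # 12/01/2016
print(doubled)  # 6 apples and 20 pears
```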

re.subn()

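subn() is similar to sub() in every way except in how it provides its output. It returns a tuple containing the new string and the total number of replacements, rather than just the string.

Syntax:

re.subn(pattern, repl, string, count=0, flags=0)

Example: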

Python

import re
 
print(re.subn('ub', '~*', 'Subject has Uber booked already'))
 
t = re.subn('ub', '~*', 'Subject has Uber booked already',
            flags=re.IGNORECASE)
print(t)
print(len(t))
 
# This will give same output as sub() would have
print(t[0])

Output

('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
2
S~*ject has ~*er booked already

re.escape()

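Returns the string with special characters backslash-escaped, which is useful if you want to match an arbitrary literal string that may contain regular expression metacharacters. Older Python 3 releases escaped every non-alphanumeric character (as in the output shown below); since Python 3.7 only characters that can have a special meaning in a regular expression are escaped, so ',' for example is left unchanged.

Syntax:

re.escape(string)

Example: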

Python

import re
 
# escape() returns a string with BackSlash '\',
# before every Non-Alphanumeric Character
# In 1st case only ' ', is not alphanumeric
# In 2nd case, ' ', caret '^', '-', '[]', '\'
# are not alphanumeric
print(re.escape("This is Awesome even 1 AM"))
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))
Output
This\ is\ Awesome\ even\ 1\ AM
I\ Asked\ what\ is\ this\ \[a\-9\]\,\ he\ said\ \    \ \^WoW
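A typical use of re.escape() is building a pattern from text that should be matched literally. In this minimal sketch, user_text is a hypothetical input and not part of the original example.

```python
import re

# Hypothetical user input; the '.' is meant literally.
user_text = "a.b"

unescaped = re.compile(user_text)             # '.' matches any character
escaped = re.compile(re.escape(user_text))    # '.' matches only a dot

print(bool(unescaped.search("axb")))  # True
print(bool(escaped.search("axb")))    # False
print(bool(escaped.search("a.b")))    # True
```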

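re.search()

This method returns either None (if the pattern does not match) or a match object (re.Match) that contains information about the matching part of the string. It stops after the first match, so it is better suited to testing a regular expression than to extracting data.

Example: Searching for an occurrence of the pattern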

Python3

# A Python program to demonstrate working of re.search().
import re
 
# Lets use a regular expression to match a date string
# in the form of Month name followed by day number
regex = r"([a-zA-Z]+) (\d+)"
 
match = re.search(regex, "I was born on June 24")
 
if match is not None:
 
    # We reach here when the expression "([a-zA-Z]+) (\d+)"
    # matches the date string.
 
    # This will print [14, 21), since it matches at index 14
    # and ends at 21.
    print ("Match at index %s, %s" % (match.start(), match.end()))
 
    # We use the group() method to get all the matches and
    # captured groups. The groups contain the matched values.
    # In particular:
    # match.group(0) always returns the fully matched string
    # match.group(1) match.group(2), ... return the capture
    # groups in order from left to right in the input string
    # match.group() is equivalent to match.group(0)
 
    # So this will print "June 24"
    print ("Full match: %s" % (match.group(0)))
 
    # So this will print "June"
    print ("Month: %s" % (match.group(1)))
 
    # So this will print "24"
    print ("Day: %s" % (match.group(2)))
 
else:
    print ("The regex pattern does not match.")
Output
Match at index 14, 21
Full match: June 24
Month: June
Day: 24
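Because re.search() may return None, it helps to wrap the check in a small function. The sketch below is one possible way to do that; find_date is a hypothetical helper name, not something provided by the re module.

```python
import re

def find_date(text):
    """Return (month, day) found in text, or None when nothing matches."""
    match = re.search(r"([a-zA-Z]+) (\d+)", text)
    if match is None:
        return None
    # group(1) is the month name, group(2) the day number as text.
    return match.group(1), int(match.group(2))

print(find_date("I was born on June 24"))  # ('June', 24)
print(find_date("no date here"))           # None
```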

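Match Object

A match object contains all the information about the search and its result; if no match is found, None is returned instead. Let's look at some commonly used methods and attributes of the match object.

Getting the string and the regex

The match.re attribute returns the regular expression object that produced the match, and the match.string attribute returns the string that was passed.

Example: Getting the string and the regex of the matched object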

Python3

import re
 
s = "Welcome to GeeksForGeeks"
 
# here res is the match object
res = re.search(r"\bG", s)
 
print(res.re)
print(res.string)
Output
re.compile('\\bG')
Welcome to GeeksForGeeks

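Getting the index of the matched object

  • The start() method returns the starting index of the matched substring
  • The end() method returns the ending index of the matched substring
  • The span() method returns a tuple containing the starting and ending index of the matched substring

Example: Getting the index of the matched object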

Python3

import re
 
s = "Welcome to GeeksForGeeks"
 
# here res is the match object
res = re.search(r"\bGee", s)
 
print(res.start())
print(res.end())
print(res.span())
Output
11
14
(11, 14)

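Getting the matched substring

The group() method returns the part of the string that the pattern matched. See the example below for a better understanding.

Example: Getting the matched substring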

Python3

import re
 
s = "Welcome to GeeksForGeeks"
 
# here res is the match object
res = re.search(r"\D{2} t", s)
 
print(res.group())
Output
me t

In the example above, the pattern looks for exactly two non-digit characters followed by a space, with that space followed by a t.
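The match object methods can also be applied to individual groups. The following sketch adds parentheses to the pattern above purely for illustration, so the group numbers are an assumption of this example.

```python
import re

s = "Welcome to GeeksForGeeks"

# Same pattern as above, with two capturing groups added.
res = re.search(r"(\D{2}) (t)", s)

whole = res.group()         # the full match
first_group = res.group(1)  # text captured by the first group
all_groups = res.groups()   # tuple of all captured groups
group_span = res.span(1)    # start and end index of the first group
sliced = s[res.start():res.end()]  # slicing with start()/end() gives group()

print(whole)        # me t
print(first_group)  # me
print(all_groups)   # ('me', 't')
print(group_span)   # (5, 7)
print(sliced)       # me t
```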

Related Article:
https://www.geeksforgeeks.org/regular-expressions-python-set-1-search-match-find/

Reference:
https://docs.python.org/2/library/re.html