📜  Perl - Regular Expressions

📅  Last modified: 2020-10-16 05:36:37             🧑  Author: Mango


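A regular expression is a string of characters that defines the pattern or patterns you are looking for. The syntax of regular expressions in Perl is very similar to what you will find in other programs that support regular expressions, such as sed, grep, and awk.

The basic method for applying a regular expression is to use the pattern binding operators =~ and !~. The first operator, =~, tests a string against a pattern (and, with s/// or tr///, modifies the string in place). The second, !~, is its negation and is true when the pattern does not match.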

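There are three regular expression operators within Perl.

  • Match Regular Expression - m//
  • Substitute Regular Expression - s///
  • Transliterate Regular Expression - tr///

The forward slashes in each case act as delimiters for the regular expression (regex) that you are specifying. If you are comfortable with any other delimiter, you can use it in place of the forward slash.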
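As a quick orientation, the three operators and the two binding operators can be sketched as follows. The sample string and variable names are illustrative assumptions, not part of the original tutorial.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "hello world";   # sample string, chosen for illustration

# Match: test whether the pattern occurs (m{} uses braces as delimiters)
my $matched = ($text =~ m{world}) ? "yes" : "no";

# Negated match: !~ is true when the pattern does not occur
my $absent = ($text !~ m/perl/) ? "yes" : "no";

# Substitute on a copy, using # as the delimiter
(my $swapped = $text) =~ s#world#Perl#;

# Transliterate on a copy: map lowercase letters to uppercase
(my $upper = $text) =~ tr/a-z/A-Z/;

print "$matched $absent $swapped | $upper\n";
# prints: yes yes hello Perl | HELLO WORLD
```

Working on a copy, as in `(my $copy = $text) =~ s...`, keeps the original string unchanged, which is often what you want when comparing results.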

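The Match Operator

The match operator, m//, is used to match a string or statement against a regular expression. For example, to match the character sequence "foo" against the scalar $bar, you might use a statement like this: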

#!/usr/bin/perl

$bar = "This is foo and again foo";
if ($bar =~ /foo/) {
   print "First time is matching\n";
} else {
   print "First time is not matching\n";
}

$bar = "foo";
if ($bar =~ /foo/) {
   print "Second time is matching\n";
} else {
   print "Second time is not matching\n";
}

When the above program is executed, it produces the following result:

First time is matching
Second time is matching

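m// actually works in the same fashion as the q// operator series: you can use any pair of naturally matching characters as delimiters for the expression. For example, m{}, m(), and m<> are all valid. So the above example can be rewritten as follows: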

#!/usr/bin/perl

$bar = "This is foo and again foo";
if ($bar =~ m[foo]) {
   print "First time is matching\n";
} else {
   print "First time is not matching\n";
}

$bar = "foo";
if ($bar =~ m{foo}) {
   print "Second time is matching\n";
} else {
   print "Second time is not matching\n";
}

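You can omit the m from m// if the delimiters are forward slashes, but for all other delimiters you must use the m prefix.

Note that the entire match expression, that is, the expression on the left of =~ or !~ together with the match operator, returns true (in a scalar context) if the expression matches. Therefore the statement: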

$true = ($foo =~ m/foo/);

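will set $true to 1 if $foo matches the regex, or to an empty string (which is false) if the match fails. In a list context, the match returns the contents of any grouped expressions. For example, when extracting the hours, minutes, and seconds from a time string, we can use: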

my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);
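The difference between scalar and list context can be seen side by side in a small sketch. The sample time string and the variable names `$ok` and `$failed` are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $time = "12:05:30";   # sample value, assumed for illustration

# Scalar context: the result is just a true/false value
my $ok = ($time =~ m/(\d+):(\d+):(\d+)/);

# List context: the result is the list of captured groups
my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);

# A failed match in scalar context yields an empty string (false)
my $failed = ("no digits" =~ m/\d/);

print "ok=$ok h=$hours m=$minutes s=$seconds failed='$failed'\n";
# prints: ok=1 h=12 m=05 s=30 failed=''
```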

Match Operator Modifiers

The match operator supports its own set of modifiers. The /g modifier allows for global matching. The /i modifier makes the match case insensitive. Here is the list of modifiers:

Sr.No.  Modifier  Description
1       i         Makes the match case insensitive.
2       m         Specifies that if the string has newline or carriage return characters, the ^ and $ operators will now match against a newline boundary, instead of a string boundary.
3       o         Evaluates the expression only once.
4       s         Allows use of . to match a newline character.
5       x         Allows you to use white space in the expression for clarity.
6       g         Globally finds all matches.
7       cg        Allows the search to continue even after a global match fails (the c flag keeps the current search position instead of resetting it).
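The following sketch exercises several of these modifiers on one two-line string. The text and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Perl is fun.\nperl is powerful.";   # illustrative two-line string

# /i: ignore case
my $ci = ($text =~ /^PERL/i) ? 1 : 0;

# /m with /g: ^ matches at the start of every line, /g collects all matches
my @starts = ($text =~ /^(\w+)/gm);

# /s: the dot is allowed to match the newline between the two lines
my $with_s    = ($text =~ /fun\..perl/s) ? 1 : 0;
my $without_s = ($text =~ /fun\..perl/)  ? 1 : 0;

# /x: whitespace inside the pattern is ignored, so it can be spaced out
my $spaced = ($text =~ / p \w+ \s+ is /x) ? 1 : 0;

print "ci=$ci starts=@starts with_s=$with_s without_s=$without_s spaced=$spaced\n";
# prints: ci=1 starts=Perl perl with_s=1 without_s=0 spaced=1
```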

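Matching Only Once

There is also a simpler version of the match operator: the ?PATTERN? operator. It is basically identical to the m// operator, except that it matches only once between calls to reset. Since Perl 5.22 the leading m is required, so the operator must be written m?PATTERN?.

For example, you can use this to get the first and last elements within a list: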

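#!/usr/bin/perl

@list = qw/food foosball subeo footnote terfoot canic footbridge/;

foreach (@list) {
   # m?...? succeeds only on its first match, so $first is set exactly once
   $first = $1 if m?(foo.*)?;
   # m/.../ succeeds on every matching element, so $last keeps the final one
   $last = $1 if /(foo.*)/;
}
print "First: $first, Last: $last\n";

When the above program is executed, it produces the following result:

First: food, Last: footbridge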
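To see the role of reset, here is a minimal sketch in which the same match-once operator is re-armed between two passes over a list. The word list and the `@hits` array are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @words = qw(food foosball footnote);   # illustrative data
my @hits;

for my $round (1, 2) {
    for my $word (@words) {
        # m?foo? succeeds only once until reset is called
        push @hits, "$round:$word" if $word =~ m?foo?;
    }
    reset;   # re-arm all m?...? searches in the current package
}

print "@hits\n";
# prints: 1:food 2:food
```

Without the call to reset, the second pass would record nothing, because the operator had already matched once.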

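Regular Expression Variables

Regular expression variables include $+, which contains whatever the last grouping match matched; $&, which contains the entire matched string; $`, which contains everything before the matched string; and $', which contains everything after the matched string. The following code demonstrates the result: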

#!/usr/bin/perl

$string = "The food is in the salad bar";
$string =~ m/foo/;
print "Before: $`\n";
print "Matched: $&\n";
print "After: $'\n";

When the above program is executed, it produces the following result:

Before: The
Matched: foo
After: d is in the salad bar
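The $+ variable is easiest to appreciate with alternation, where you do not know in advance which group matched. The following is a small sketch; the input line and variable names are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "Revision: 42";   # illustrative input
my $rev;

# Only one of the two groups can match; $+ holds whichever one did
if ($line =~ /Version: (\d+)|Revision: (\d+)/) {
    $rev = $+;
}

my $string = "The food is in the salad bar";
my ($before, $whole, $after);
if ($string =~ m/foo/) {
    # Copy the match variables right away; the next match overwrites them
    ($before, $whole, $after) = ($`, $&, $');
}

print "rev=$rev before='$before' whole='$whole' after='$after'\n";
# prints: rev=42 before='The ' whole='foo' after='d is in the salad bar'
```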

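The Substitution Operator

The substitution operator, s///, is really just an extension of the match operator that allows you to replace the matched text with some new text. The basic form of the operator is:

s/PATTERN/REPLACEMENT/;

PATTERN is the regular expression for the text that we are looking for. REPLACEMENT is a specification for the text or regular expression that we want to use to replace the found text. For example, we can replace the first occurrence of cat with dog using the following substitution:

#!/usr/bin/perl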

$string = "The cat sat on the mat";
$string =~ s/cat/dog/;

print "$string\n";

When the above program is executed, it produces the following result:

The dog sat on the mat
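Without a modifier, s/// replaces only the first match. The sketch below contrasts that with the /g modifier and also shows that the operator returns the number of substitutions made. The sample sentence is an assumption for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "The cat sat on the mat with another cat";   # illustrative

# No modifier: only the first occurrence is replaced
(my $first_only = $string) =~ s/cat/dog/;

# /g: every occurrence is replaced
(my $all = $string) =~ s/cat/dog/g;

# In scalar context s///g returns how many replacements were made
my $count = ($string =~ s/cat/dog/g);

print "$first_only\n$all\ncount=$count\n";
# prints:
# The dog sat on the mat with another cat
# The dog sat on the mat with another dog
# count=2
```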

Substitution Operator Modifiers

Here is the list of the modifiers used with the substitution operator.

Sr.No.  Modifier  Description
1       i         Makes the match case insensitive.
2       m         Specifies that if the string has newline or carriage return characters, the ^ and $ operators will now match against a newline boundary, instead of a string boundary.
3       o         Evaluates the expression only once.
4       s         Allows use of . to match a newline character.
5       x         Allows you to use white space in the expression for clarity.
6       g         Replaces all occurrences of the found expression with the replacement text.
7       e         Evaluates the replacement as if it were a Perl statement, and uses its return value as the replacement text.
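The /e modifier is the least obvious of these, so here is a brief sketch combining it with /g, plus a case-insensitive substitution. The sample strings are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $prices = "apple 3, pear 5";   # illustrative data

# /e evaluates the replacement as Perl code; /g applies it to every number
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/ge;

# /i lets "world" match "World"
(my $shout = "hello World") =~ s/world/PERL/i;

print "$doubled\n$shout\n";
# prints:
# apple 6, pear 10
# hello PERL
```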

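The Translation Operator

Translation is similar, but not identical, to substitution: unlike substitution, translation (or transliteration) does not use regular expressions for its search and replacement values. The translation operators are: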

tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds

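The translation replaces all occurrences of the characters in SEARCHLIST with the corresponding characters in REPLACEMENTLIST. For example, using the "The cat sat on the mat" string we have been using in this chapter:

#!/usr/bin/perl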

$string = 'The cat sat on the mat';
$string =~ tr/a/o/;

print "$string\n";

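When the above program is executed, it produces the following result:

The cot sot on the mot

Standard Perl ranges can also be used, allowing you to specify ranges of characters either by letter or by numerical value. To change the case of the string, you might use the following syntax in place of the uc function.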

$string =~ tr/a-z/A-Z/;
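Like s///, the tr operator returns a count, which makes it handy for counting characters. A minimal sketch, reusing the chapter's sample sentence (the variable names are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = 'The cat sat on the mat';

# Range-based translation on a copy: lowercase to uppercase
(my $upper = $string) =~ tr/a-z/A-Z/;

# With an empty REPLACEMENTLIST the string is left unchanged,
# and tr simply returns how many characters matched SEARCHLIST
my $vowels = ($string =~ tr/aeiou//);

print "$upper\nvowels=$vowels\n";
# prints:
# THE CAT SAT ON THE MAT
# vowels=6
```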

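Translation Operator Modifiers

Following is the list of modifiers related to translation.

Sr.No.  Modifier  Description
1       c         Complements SEARCHLIST.
2       d         Deletes found but unreplaced characters.
3       s         Squashes duplicate replaced characters.

The /d modifier deletes the characters matching SEARCHLIST that do not have a corresponding entry in REPLACEMENTLIST. For example: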

#!/usr/bin/perl 

$string = 'the cat sat on the mat.';
$string =~ tr/a-z/b/d;

print "$string\n";

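When the above program is executed, it produces the following result (note the leading space, left over after "the" is deleted):

 b b   b.

The last modifier, /s, removes the duplicate sequences of characters that were replaced, so: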

#!/usr/bin/perl

$string = 'food';
$string =~ tr/a-z/a-z/s;

print "$string\n";

When the above program is executed, it produces the following result:

fod
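The /c modifier is not demonstrated above, so here is a short sketch of it, alone and combined with /d, along with one more /s example. The strings and variable names are illustrative assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = 'the cat sat on the mat.';

# /c complements SEARCHLIST: everything that is NOT a-z becomes *
(my $masked = $string) =~ tr/a-z/*/c;

# /cd: delete everything that is NOT a-z
(my $letters = $string) =~ tr/a-z//cd;

# /s with an empty REPLACEMENTLIST: squeeze runs of the same letter
(my $squeezed = 'bookkeeper') =~ tr/a-z//s;

print "$masked\n$letters\n$squeezed\n";
# prints:
# the*cat*sat*on*the*mat*
# thecatsatonthemat
# bokeper
```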

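More Complex Regular Expressions

You don't just have to match on fixed strings. In fact, you can match on just about anything you could dream of by using more complex regular expressions. Here's a quick cheat sheet.

The following table lists the regular expression syntax that is available in Perl.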

Sr.No.  Pattern       Description
1       ^             Matches beginning of line.
2       $             Matches end of line.
3       .             Matches any single character except newline. Using the s option allows it to match newline as well.
4       [...]         Matches any single character in brackets.
5       [^...]        Matches any single character not in brackets.
6       *             Matches 0 or more occurrences of preceding expression.
7       +             Matches 1 or more occurrences of preceding expression.
8       ?             Matches 0 or 1 occurrence of preceding expression.
9       {n}           Matches exactly n occurrences of preceding expression.
10      {n,}          Matches n or more occurrences of preceding expression.
11      {n,m}         Matches at least n and at most m occurrences of preceding expression.
12      a|b           Matches either a or b.
13      \w            Matches word characters.
14      \W            Matches nonword characters.
15      \s            Matches whitespace. Equivalent to [ \t\n\r\f].
16      \S            Matches nonwhitespace.
17      \d            Matches digits. Equivalent to [0-9].
18      \D            Matches nondigits.
19      \A            Matches beginning of string.
20      \Z            Matches end of string. If a newline exists, it matches just before newline.
21      \z            Matches end of string.
22      \G            Matches point where last match finished.
23      \b            Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
24      \B            Matches nonword boundaries.
25      \n, \t, etc.  Matches newlines, carriage returns, tabs, etc.
26      \1...\9       Matches nth grouped subexpression.
27      \10           Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.
28      [aeiou]       Matches a single character in the given set.
29      [^aeiou]      Matches a single character outside the given set.

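The ^ metacharacter matches the beginning of the string and the $ metasymbol matches the end of the string. Here are some brief examples.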

# nothing in the string (start and end are adjacent)
/^$/   

# three digits, each followed by a whitespace
# character (eg "3 4 5 ")
/(\d\s){3}/

# matches a string in which every
# odd-numbered letter is a (eg "abacadaf")
/(a.)+/  

# string starts with one or more digits
/^\d+/

# string that ends with one or more digits
/\d+$/
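These patterns can be tried out directly. The following sketch applies four of them to small test strings; the inputs and variable names are chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each result is 1 if the pattern matched and 0 otherwise
my $empty = (""       =~ /^$/)        ? 1 : 0;   # empty string
my $three = ("3 4 5 " =~ /(\d\s){3}/) ? 1 : 0;   # three digit+space pairs
my $lead  = ("42abc"  =~ /^\d+/)      ? 1 : 0;   # starts with digits
my $trail = ("42abc"  =~ /\d+$/)      ? 1 : 0;   # does not end with digits

print "empty=$empty three=$three lead=$lead trail=$trail\n";
# prints: empty=1 three=1 lead=1 trail=0
```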

Let's have a look at another example.

#!/usr/bin/perl

$string = "Cats go Catatonic\nWhen given Catnip";
($start) = ($string =~ /\A(.*?) /);
@lines = $string =~ /^(.*?) /gm;
print "First word: $start\n","Line starts: @lines\n";

When the above program is executed, it produces the following result:

First word: Cats
Line starts: Cats When

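Matching Boundaries

The \b assertion matches at any word boundary, as defined by the difference between the \w class and the \W class. Because \w includes the characters for a word, and \W the opposite, this normally means the termination of a word. The \B assertion matches any position that is not a word boundary. For example:

/\bcat\b/ # Matches 'the cat sat' but not 'catatonic'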
/\Bcat\B/ # Matches 'verification' but not 'the cat on the mat'
/\bcat\B/ # Matches 'catatonic' but not 'polecat'
/\Bcat\b/ # Matches 'polecat' but not 'catatonic'
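To check the four boundary patterns against concrete words, a small sketch such as the following can help. The word list and the `%matches` hash are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @words = qw(cat catatonic polecat verification);   # illustrative words

# Each entry pairs a printable label with the compiled pattern
my @patterns = (
    [ '\bcat\b', qr/\bcat\b/ ],
    [ '\Bcat\B', qr/\Bcat\B/ ],
    [ '\bcat\B', qr/\bcat\B/ ],
    [ '\Bcat\b', qr/\Bcat\b/ ],
);

my %matches;
for my $pair (@patterns) {
    my ($label, $re) = @$pair;
    # Collect the words in which this pattern matches
    $matches{$label} = join ',', grep { /$re/ } @words;
    print "$label => $matches{$label}\n";
}
# prints:
# \bcat\b => cat
# \Bcat\B => verification
# \bcat\B => catatonic
# \Bcat\b => polecat
```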

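Selecting Alternatives

The | character is just like the standard or bitwise OR within Perl. It specifies alternate matches within a regular expression or group. For example, to match "cat" or "dog" in an expression, you might use this: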

if ($string =~ /cat|dog/)

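You can group individual elements of an expression together in order to support complex matches. Searching for two people's names could be achieved with two separate tests, like this:

if (($string =~ /Martin Brown/) ||  ($string =~ /Sharon Brown/))

This could be written as follows: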

if ($string =~ /(Martin|Sharon) Brown/)
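Grouping also controls how far an alternation reaches, which matters once anchors are involved. A minimal sketch, with made-up inputs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The group limits the alternation to the first name and captures it
my ($first_name) = ("Sharon Brown" =~ /(Martin|Sharon) Brown/);

# Without a group, | splits the whole pattern: "^cat" OR "dog$"
my $loose = ("catalog" =~ /^cat|dog$/) ? 1 : 0;

# With a group, the anchors apply to both alternatives
my $strict = ("catalog" =~ /^(cat|dog)$/) ? 1 : 0;

print "name=$first_name loose=$loose strict=$strict\n";
# prints: name=Sharon loose=1 strict=0
```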

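Grouping Matching

From a regular-expression matching point of view, there is no difference between the following two expressions, except perhaps that the former is slightly clearer.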

$string =~ /(\S+)\s+(\S+)/;

and 

$string =~ /\S+\s+\S+/;

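However, the benefit of grouping is that it allows us to extract a sequence from a regular expression. Groupings are returned as a list in the order in which they appear in the original. For example, in the following fragment we have pulled out the hours, minutes, and seconds from a string.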

my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);

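As well as this direct method, matched groups are also available within the special $x variables, where x is the number of the group within the regular expression. We could therefore rewrite the preceding example as follows: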

#!/usr/bin/perl

$time = "12:05:30";

$time =~ m/(\d+):(\d+):(\d+)/;
my ($hours, $minutes, $seconds) = ($1, $2, $3);

print "Hours : $hours, Minutes: $minutes, Second: $seconds\n";

When the above program is executed, it produces the following result:

Hours : 12, Minutes: 05, Second: 30

When groups are used in substitution expressions, the $x syntax can be used in the replacement text. Thus, we could reformat a date string using this:

#!/usr/bin/perl

$date = '03/26/1999';
$date =~ s#(\d+)/(\d+)/(\d+)#$3/$1/$2#;

print "$date\n";

When the above program is executed, it produces the following result:

1999/03/26

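The \G Assertion

The \G assertion allows you to continue searching from the point where the last match occurred. For example, in the following code, we have used \G so that we can search to the correct position and then extract some information, without having to create a more complex, single regular expression: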

#!/usr/bin/perl

$string = "The time is: 12:31:02 on 4/12/00";

$string =~ /:\s+/g;
($time) = ($string =~ /\G(\d+:\d+:\d+)/);
$string =~ /.+\s+/g;
($date) = ($string =~ m{\G(\d+/\d+/\d+)});

print "Time: $time, Date: $date\n";

When the above program is executed, it produces the following result:

Time: 12:31:02, Date: 4/12/00

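The \G assertion is actually just the metasymbol equivalent of the pos function, so between regular expression calls you can continue to use pos, and even modify the value of pos (and therefore \G) by using pos as an lvalue subroutine.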
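Two small sketches of this interplay follow: first setting pos by hand so that \G anchors at a chosen offset, then a tiny tokenizer that walks a string with \G and the gc modifiers. The inputs, the `NUM`/`OP` token labels, and the variable names are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# 1. Using pos as an lvalue: start matching right after "The time is: "
my $string = "The time is: 12:31:02";
pos($string) = 13;
my ($time) = ($string =~ /\G(\d+:\d+:\d+)/);

# 2. A tiny tokenizer: each match must start exactly where the last ended.
#    /c keeps pos in place when one alternative fails, so the next can try.
my $input = "12+345";
my @tokens;
pos($input) = 0;
while (pos($input) < length $input) {
    if    ($input =~ /\G(\d+)/gc)   { push @tokens, "NUM($1)"; }
    elsif ($input =~ /\G([+*-])/gc) { push @tokens, "OP($1)";  }
    else                            { last; }   # unknown character: stop
}

print "time=$time tokens=@tokens\n";
# prints: time=12:31:02 tokens=NUM(12) OP(+) NUM(345)
```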

Regular-expression Examples

Literal Characters

Sr.No.  Example  Description
1       Perl     Matches "Perl".

Character Classes

Sr.No.  Example      Description
1       [Pp]ython    Matches "Python" or "python"
2       rub[ye]      Matches "ruby" or "rube"
3       [aeiou]      Matches any one lowercase vowel
4       [0-9]        Matches any digit; same as [0123456789]
5       [a-z]        Matches any lowercase ASCII letter
6       [A-Z]        Matches any uppercase ASCII letter
7       [a-zA-Z0-9]  Matches any of the above
8       [^aeiou]     Matches anything other than a lowercase vowel
9       [^0-9]       Matches anything other than a digit

Special Character Classes

Sr.No.  Example  Description
1       .        Matches any character except newline
2       \d       Matches a digit: [0-9]
3       \D       Matches a nondigit: [^0-9]
4       \s       Matches a whitespace character: [ \t\r\n\f]
5       \S       Matches nonwhitespace: [^ \t\r\n\f]
6       \w       Matches a single word character: [A-Za-z0-9_]
7       \W       Matches a nonword character: [^A-Za-z0-9_]

Repetition Cases

Sr.No.  Example  Description
1       ruby?    Matches "rub" or "ruby": the y is optional
2       ruby*    Matches "rub" plus 0 or more ys
3       ruby+    Matches "rub" plus 1 or more ys
4       \d{3}    Matches exactly 3 digits
5       \d{3,}   Matches 3 or more digits
6       \d{3,5}  Matches 3, 4, or 5 digits

Nongreedy Repetition

This matches the smallest number of repetitions:

Sr.No.  Example  Description
1       <.*>     Greedy repetition: matches "<python>perl>"
2       <.*?>    Nongreedy: matches "<python>" in "<python>perl>"
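The difference is easy to observe by capturing what each form matches. A minimal sketch, using a made-up tag string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = "<b>bold</b>";   # illustrative input

# Greedy: .* grabs as much as possible, up to the LAST >
my ($greedy) = ($html =~ /(<.*>)/);

# Nongreedy: .*? grabs as little as possible, up to the FIRST >
my ($lazy) = ($html =~ /(<.*?>)/);

print "greedy=$greedy lazy=$lazy\n";
# prints: greedy=<b>bold</b> lazy=<b>
```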

Grouping with Parentheses

Sr.No.  Example            Description
1       \D\d+              No group: + repeats \d
2       (\D\d)+            Grouped: + repeats \D\d pair
3       ([Pp]ython(, )?)+  Matches "Python", "Python, python, python", etc.

Backreferences

This matches a previously matched group again:

Sr.No.  Example               Description
1       ([Pp])ython&\1ails    Matches python&pails or Python&Pails
2       (['"])(?:(?!\1).)*\1  Single or double-quoted string. \1 matches whatever the 1st group matched. \2 matches whatever the 2nd group matched, etc.
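Two common uses of backreferences, finding a doubled word and matching a quoted string with whichever quote opened it, can be sketched like this. The inputs and variable names are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# \1 must repeat exactly what group 1 captured, so this finds "is is"
my $text = "this is is a test";
my ($repeat) = ($text =~ /\b(\w+)\s+\1\b/);

# Group 1 captures the opening quote; \1 demands the same closing quote,
# so the apostrophe inside the double-quoted text does not end the match
my $quoted = q{say "it's fine" now};
my ($quote, $body) = ($quoted =~ /(['"])(.*?)\1/);

print "repeat=$repeat body=$body\n";
# prints: repeat=is body=it's fine
```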

Alternatives

Sr.No.  Example        Description
1       python|perl    Matches "python" or "perl"
2       rub(y|le)      Matches "ruby" or "ruble"
3       Python(!+|\?)  Matches "Python" followed by one or more ! or one ?

Anchors

These specify the position at which a match must occur.

Sr.No.  Example      Description
1       ^Python      Matches "Python" at the start of a string or, with the m modifier, an internal line
2       Python$      Matches "Python" at the end of a string or line
3       \APython     Matches "Python" at the start of a string
4       Python\Z     Matches "Python" at the end of a string
5       \bPython\b   Matches "Python" at a word boundary
6       \brub\B      \B is nonword boundary: matches "rub" in "rube" and "ruby" but not alone
7       Python(?=!)  Matches "Python", if followed by an exclamation point
8       Python(?!!)  Matches "Python", if not followed by an exclamation point
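The last two rows are lookahead assertions: they test what follows without consuming it. A short sketch with made-up inputs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @inputs = ("Python!", "Python?");   # illustrative strings

# Positive lookahead: "Python" must be followed by !
my @followed = grep { /Python(?=!)/ } @inputs;

# Negative lookahead: "Python" must NOT be followed by !
my @not_followed = grep { /Python(?!!)/ } @inputs;

# The lookahead is zero-width: the ! is checked but not part of the match
my ($matched) = ("Python!" =~ /(Python(?=!))/);

print "followed=@followed not_followed=@not_followed matched=$matched\n";
# prints: followed=Python! not_followed=Python? matched=Python
```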

Special Syntax with Parentheses

Sr.No.  Example       Description
1       R(?#comment)  Matches "R". All the rest is a comment
2       R(?i)uby      Case-insensitive while matching "uby"
3       R(?i:uby)     Same as above
4       rub(?:y|le)   Group only without creating \1 backreference
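The effect of these constructs shows up in what a match returns. The following sketch compares a capturing and a non-capturing group and checks the scope of the inline (?i) flag; the test strings are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Capturing group: list context returns the captured text
my @captured = ("ruble" =~ /rub(y|le)/);

# Non-capturing group: nothing is captured, so a successful
# match in list context returns the single value 1
my @plain = ("ruble" =~ /rub(?:y|le)/);

# (?i) applies only to what follows it: "R" stays case-sensitive
my $upper_r = ("RUBY" =~ /R(?i)uby/)  ? 1 : 0;
my $lower_r = ("rUBY" =~ /R(?i:uby)/) ? 1 : 0;

print "captured=@captured plain=@plain upper_r=$upper_r lower_r=$lower_r\n";
# prints: captured=le plain=1 upper_r=1 lower_r=0
```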