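Pattern searching is an important problem in computer science. When we search for a string in a notepad or Word file, a browser, or a database, pattern searching algorithms are used to show the search results. A typical problem statement is:
Given a text txt[0..n-1] and a pattern pat[0..m-1], where n is the length of the text and m is the length of the pattern, write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Examples: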
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
Input: txt[] = "AABAACAADAABAABA"
pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12
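As a baseline for the expected output, the following minimal sketch checks every alignment by brute force. The function name `naive_search` is an assumption made for illustration and is not part of the original listings.

```python
def naive_search(txt, pat):
    """Return every index at which pat occurs in txt (brute force)."""
    n, m = len(txt), len(pat)
    # Try each of the n - m + 1 possible alignments in turn.
    return [s for s in range(n - m + 1) if txt[s:s + m] == pat]


for index in naive_search("AABAACAADAABAABA", "AABA"):
    print("Pattern found at index", index)
```

Any faster algorithm, including Boyer Moore, should report the same indices as this reference.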
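In this post, we discuss the Boyer Moore pattern searching algorithm. Like the KMP and finite automata algorithms, the Boyer Moore algorithm also preprocesses the pattern.
Boyer Moore is a combination of the following two approaches.
1) Bad Character Heuristic
2) Good Suffix Heuristic
Both of the above heuristics can also be used independently to search for a pattern in a text. Let us first understand how the two independent approaches work together in the Boyer Moore algorithm. The naive algorithm slides the pattern over the text one position at a time. The KMP algorithm preprocesses the pattern so that the pattern can be shifted by more than one position. The Boyer Moore algorithm preprocesses for the same reason: it processes the pattern and creates a different array for each of the two heuristics. At every step, it slides the pattern by the larger of the two shifts suggested by the heuristics.
Unlike the previous pattern searching algorithms, the Boyer Moore algorithm starts matching from the last character of the pattern.
In this post, we discuss the bad character heuristic; the good suffix heuristic is discussed in the next post.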
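The right-to-left comparison at a single alignment can be sketched as follows. The helper name `mismatch_index` is hypothetical and is used here only for illustration.

```python
def mismatch_index(txt, pat, s):
    """Compare pat against txt[s:s+len(pat)] from right to left.

    Returns the pattern index of the first mismatch found,
    or -1 when the whole pattern matches at shift s.
    """
    j = len(pat) - 1
    while j >= 0 and pat[j] == txt[s + j]:
        j -= 1
    return j


print(mismatch_index("ABAAABCD", "ABC", 0))  # 2: 'C' vs 'A' fails immediately
print(mismatch_index("ABAAABCD", "ABC", 4))  # -1: full match at shift 4
```

Because the scan starts at the end of the pattern, a mismatch can be detected after a single comparison, which is what makes large shifts possible.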
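Bad Character Heuristic
The idea of the bad character heuristic is simple. The character of the text that does not match the current character of the pattern is called the bad character. Upon a mismatch, we shift the pattern until:
1) The mismatch becomes a match.
2) Pattern P moves past the mismatched character.
Case 1: Mismatch becomes a match
We look up the position of the last occurrence of the mismatched character in the pattern. If the mismatched character exists in the pattern, we shift the pattern so that this occurrence is aligned with the mismatched character in the text T.
Explanation: In the example, we get a mismatch at position 3. Here the mismatched character is "A". We now search for the last occurrence of "A" in the pattern. We find "A" at position 1 of the pattern (shown in blue), and this is its last occurrence. We now shift the pattern by 2 positions so that the "A" in the pattern is aligned with the "A" in the text.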
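The arithmetic of Case 1 can be reproduced with a short sketch. The text and pattern below are assumed values, chosen only because they give a mismatch at position 3 on the character "A" as in the explanation; they are not taken from the listings in this post.

```python
txt = "GCAATGCCTATGTGACC"  # assumed example text
pat = "TATGTG"             # assumed example pattern

s = 0                      # current shift of the pattern
j = len(pat) - 1
while j >= 0 and pat[j] == txt[s + j]:
    j -= 1                 # scan the pattern from right to left

bad = txt[s + j]           # mismatched (bad) character in the text
last = pat.rfind(bad)      # its last occurrence in the pattern, -1 if absent
shift = max(1, j - last)   # move that occurrence under the bad character
print(j, bad, last, shift)  # 3 A 1 2
```

The shift is the distance between the mismatch position in the pattern (3) and the last occurrence of the bad character (1).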
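Case 2: Pattern moves past the mismatched character
We look up the position of the last occurrence of the mismatched character in the pattern. If the character does not exist in the pattern, we shift the pattern past the mismatched character.
Explanation: Here we have a mismatch at position 7. The mismatched character "C" does not exist in the pattern before position 7, so we shift the pattern past position 7, and in the example we then get a perfect match of the pattern (shown in green). We do this because "C" does not exist in the pattern, so every shift that leaves the pattern overlapping position 7 would produce a mismatch and the comparison would be wasted.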
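Case 2 can be checked the same way. The sketch below uses an assumed text and pattern with the pattern placed at shift 2, so that the mismatch falls on text position 7 and the character "C"; these values are illustrative only.

```python
txt = "GCAATGCCTATGTGACC"  # assumed example text
pat = "TATGTG"             # assumed example pattern

s = 2                      # assumed current shift of the pattern
j = len(pat) - 1
while j >= 0 and pat[j] == txt[s + j]:
    j -= 1                 # scan the pattern from right to left

bad = txt[s + j]           # 'C', found at text position s + j = 7
last = pat.rfind(bad)      # -1 because 'C' does not occur in the pattern
s += max(1, j - last)      # jump completely past the bad character
print(s, txt[s:s + len(pat)] == pat)  # 8 True
```

With `last` equal to -1, the shift `j - last` places the start of the pattern one position after the bad character.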
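In the following implementation, we preprocess the pattern and store the last occurrence of every possible character in an array whose size equals the alphabet size. If the character is not present in the pattern at all, the result may be a shift by m (the length of the pattern). Therefore, the bad character heuristic takes O(n/m) time in the best case.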
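To make the contents of the preprocessing table concrete, here is a minimal sketch that builds it for the pattern "ABC". The function name `last_occurrence` is an assumption, and the table size of 256 assumes single-byte characters.

```python
NO_OF_CHARS = 256  # alphabet size, assuming single-byte characters


def last_occurrence(pat):
    """Return a table mapping each character code to its last index in pat."""
    table = [-1] * NO_OF_CHARS   # -1 marks characters absent from the pattern
    for i, ch in enumerate(pat):
        table[ord(ch)] = i       # later occurrences overwrite earlier ones
    return table


table = last_occurrence("ABC")
print(table[ord("A")], table[ord("B")], table[ord("C")], table[ord("D")])
# 0 1 2 -1
```

A lookup costs constant time, so the shift for any bad character can be computed without rescanning the pattern.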
C++
/* C++ Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
#include <algorithm>
#include <iostream>
#include <string>
using namespace std;

#define NO_OF_CHARS 256

// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic(const string& str, int size,
                      int badchar[NO_OF_CHARS])
{
    // Initialize all occurrences as -1
    for (int i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;

    // Fill the actual value of last occurrence of a character.
    // The cast to unsigned char keeps the index within 0..255
    // on platforms where plain char is signed.
    for (int i = 0; i < size; i++)
        badchar[(unsigned char)str[i]] = i;
}

/* A pattern searching function that uses Bad
   Character Heuristic of Boyer Moore Algorithm */
void search(const string& txt, const string& pat)
{
    int m = pat.size();
    int n = txt.size();

    int badchar[NO_OF_CHARS];

    /* Fill the bad character array by calling
       the preprocessing function badCharHeuristic()
       for given pattern */
    badCharHeuristic(pat, m, badchar);

    int s = 0; // s is shift of the pattern with
               // respect to text
    while (s <= (n - m))
    {
        int j = m - 1;

        /* Keep reducing index j of pattern while
           characters of pattern and text are
           matching at this shift s */
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        /* If the pattern is present at current
           shift, then index j will become -1 after
           the above loop */
        if (j < 0)
        {
            cout << "pattern occurs at shift = " << s << endl;

            /* Shift the pattern so that the next
               character in text aligns with the last
               occurrence of it in pattern.
               The condition s+m < n is necessary for
               the case when pattern occurs at the end
               of text */
            s += (s + m < n) ? m - badchar[(unsigned char)txt[s + m]] : 1;
        }
        else
            /* Shift the pattern so that the bad character
               in text aligns with the last occurrence of
               it in pattern. The max function is used to
               make sure that we get a positive shift.
               We may get a negative shift if the last
               occurrence of bad character in pattern
               is on the right side of the current
               character. */
            s += max(1, j - badchar[(unsigned char)txt[s + j]]);
    }
}

/* Driver code */
int main()
{
    string txt = "ABAAABCD";
    string pat = "ABC";
    search(txt, pat);
    return 0;
}
// This code is contributed by rathbhupendra
C
/* C Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
#include <stdio.h>
#include <string.h>

#define NO_OF_CHARS 256

// A utility function to get maximum of two integers
int max(int a, int b) { return (a > b) ? a : b; }

// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic(const char *str, int size,
                      int badchar[NO_OF_CHARS])
{
    int i;

    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;

    // Fill the actual value of last occurrence of a character.
    // The cast to unsigned char keeps the index within 0..255
    // on platforms where plain char is signed.
    for (i = 0; i < size; i++)
        badchar[(unsigned char)str[i]] = i;
}

/* A pattern searching function that uses Bad
   Character Heuristic of Boyer Moore Algorithm */
void search(const char *txt, const char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);

    int badchar[NO_OF_CHARS];

    /* Fill the bad character array by calling
       the preprocessing function badCharHeuristic()
       for given pattern */
    badCharHeuristic(pat, m, badchar);

    int s = 0; // s is shift of the pattern with
               // respect to text
    while (s <= (n - m))
    {
        int j = m - 1;

        /* Keep reducing index j of pattern while
           characters of pattern and text are
           matching at this shift s */
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        /* If the pattern is present at current
           shift, then index j will become -1 after
           the above loop */
        if (j < 0)
        {
            printf("\n pattern occurs at shift = %d", s);

            /* Shift the pattern so that the next
               character in text aligns with the last
               occurrence of it in pattern.
               The condition s+m < n is necessary for
               the case when pattern occurs at the end
               of text */
            s += (s + m < n) ? m - badchar[(unsigned char)txt[s + m]] : 1;
        }
        else
            /* Shift the pattern so that the bad character
               in text aligns with the last occurrence of
               it in pattern. The max function is used to
               make sure that we get a positive shift.
               We may get a negative shift if the last
               occurrence of bad character in pattern
               is on the right side of the current
               character. */
            s += max(1, j - badchar[(unsigned char)txt[s + j]]);
    }
}

/* Driver program to test above function */
int main()
{
    char txt[] = "ABAAABCD";
    char pat[] = "ABC";
    search(txt, pat);
    return 0;
}
Java
/* Java Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
class AWQ {

    static int NO_OF_CHARS = 256;

    // A utility function to get maximum of two integers
    static int max(int a, int b) { return (a > b) ? a : b; }

    // The preprocessing function for Boyer Moore's
    // bad character heuristic
    static void badCharHeuristic(char[] str, int size, int[] badchar)
    {
        // Initialize all occurrences as -1
        for (int i = 0; i < NO_OF_CHARS; i++)
            badchar[i] = -1;

        // Fill the actual value of last occurrence of a character
        // (indices of the table are ASCII codes and values are the
        // index of the occurrence)
        for (int i = 0; i < size; i++)
            badchar[(int) str[i]] = i;
    }

    /* A pattern searching function that uses Bad
       Character Heuristic of Boyer Moore Algorithm */
    static void search(char[] txt, char[] pat)
    {
        int m = pat.length;
        int n = txt.length;

        int[] badchar = new int[NO_OF_CHARS];

        /* Fill the bad character array by calling
           the preprocessing function badCharHeuristic()
           for given pattern */
        badCharHeuristic(pat, m, badchar);

        int s = 0; // s is shift of the pattern with
                   // respect to text
        // there are n-m+1 potential alignments
        while (s <= (n - m))
        {
            int j = m - 1;

            /* Keep reducing index j of pattern while
               characters of pattern and text are
               matching at this shift s */
            while (j >= 0 && pat[j] == txt[s + j])
                j--;

            /* If the pattern is present at current
               shift, then index j will become -1 after
               the above loop */
            if (j < 0)
            {
                System.out.println("Patterns occur at shift = " + s);

                /* Shift the pattern so that the next
                   character in text aligns with the last
                   occurrence of it in pattern.
                   The condition s+m < n is necessary for
                   the case when pattern occurs at the end
                   of text */
                // txt[s+m] is the character after the pattern in text
                s += (s + m < n) ? m - badchar[txt[s + m]] : 1;
            }
            else
                /* Shift the pattern so that the bad character
                   in text aligns with the last occurrence of
                   it in pattern. The max function is used to
                   make sure that we get a positive shift.
                   We may get a negative shift if the last
                   occurrence of bad character in pattern
                   is on the right side of the current
                   character. */
                s += max(1, j - badchar[txt[s + j]]);
        }
    }

    /* Driver program to test above function */
    public static void main(String[] args) {
        char[] txt = "ABAAABCD".toCharArray();
        char[] pat = "ABC".toCharArray();
        search(txt, pat);
    }
}
Python
# Python3 Program for Bad Character Heuristic
# of Boyer Moore String Matching Algorithm

NO_OF_CHARS = 256


def badCharHeuristic(string, size):
    '''
    The preprocessing function for
    Boyer Moore's bad character heuristic
    '''
    # Initialize all occurrences as -1
    badChar = [-1] * NO_OF_CHARS

    # Fill the actual value of last occurrence
    for i in range(size):
        badChar[ord(string[i])] = i

    # Return the initialized list
    return badChar


def search(txt, pat):
    '''
    A pattern searching function that uses Bad Character
    Heuristic of Boyer Moore Algorithm
    '''
    m = len(pat)
    n = len(txt)

    # Create the bad character list by calling
    # the preprocessing function badCharHeuristic()
    # for given pattern
    badChar = badCharHeuristic(pat, m)

    # s is shift of the pattern with respect to text
    s = 0
    while s <= n - m:
        j = m - 1

        # Keep reducing index j of pattern while
        # characters of pattern and text are matching
        # at this shift s
        while j >= 0 and pat[j] == txt[s + j]:
            j -= 1

        # If the pattern is present at current shift,
        # then index j will become -1 after the above loop
        if j < 0:
            print("Pattern occur at shift = {}".format(s))

            # Shift the pattern so that the next character in text
            # aligns with the last occurrence of it in pattern.
            # The condition s+m < n is necessary for the case when
            # pattern occurs at the end of text
            s += (m - badChar[ord(txt[s + m])] if s + m < n else 1)
        else:
            # Shift the pattern so that the bad character in text
            # aligns with the last occurrence of it in pattern. The
            # max function makes sure that we get a positive shift.
            # We may get a negative shift if the last occurrence of
            # the bad character in pattern is on the right side of
            # the current character.
            s += max(1, j - badChar[ord(txt[s + j])])


# Driver program to test above function
if __name__ == '__main__':
    txt = "ABAAABCD"
    pat = "ABC"
    search(txt, pat)
C#
/* C# Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
using System;

public class AWQ {

    static int NO_OF_CHARS = 256;

    // A utility function to get maximum of two integers
    static int max(int a, int b) { return (a > b) ? a : b; }

    // The preprocessing function for Boyer Moore's
    // bad character heuristic
    static void badCharHeuristic(char[] str, int size, int[] badchar)
    {
        int i;

        // Initialize all occurrences as -1
        for (i = 0; i < NO_OF_CHARS; i++)
            badchar[i] = -1;

        // Fill the actual value of last occurrence
        // of a character
        for (i = 0; i < size; i++)
            badchar[(int) str[i]] = i;
    }

    /* A pattern searching function that uses Bad
       Character Heuristic of Boyer Moore Algorithm */
    static void search(char[] txt, char[] pat)
    {
        int m = pat.Length;
        int n = txt.Length;

        int[] badchar = new int[NO_OF_CHARS];

        /* Fill the bad character array by calling
           the preprocessing function badCharHeuristic()
           for given pattern */
        badCharHeuristic(pat, m, badchar);

        int s = 0; // s is shift of the pattern with
                   // respect to text
        while (s <= (n - m))
        {
            int j = m - 1;

            /* Keep reducing index j of pattern while
               characters of pattern and text are
               matching at this shift s */
            while (j >= 0 && pat[j] == txt[s + j])
                j--;

            /* If the pattern is present at current
               shift, then index j will become -1 after
               the above loop */
            if (j < 0)
            {
                Console.WriteLine("Patterns occur at shift = " + s);

                /* Shift the pattern so that the next
                   character in text aligns with the last
                   occurrence of it in pattern.
                   The condition s+m < n is necessary for
                   the case when pattern occurs at the end
                   of text */
                s += (s + m < n) ? m - badchar[txt[s + m]] : 1;
            }
            else
                /* Shift the pattern so that the bad character
                   in text aligns with the last occurrence of
                   it in pattern. The max function is used to
                   make sure that we get a positive shift.
                   We may get a negative shift if the last
                   occurrence of bad character in pattern
                   is on the right side of the current
                   character. */
                s += max(1, j - badchar[txt[s + j]]);
        }
    }

    /* Driver program to test above function */
    public static void Main() {
        char[] txt = "ABAAABCD".ToCharArray();
        char[] pat = "ABC".ToCharArray();
        search(txt, pat);
    }
}
// This code is contributed by PrinciRaj19992
Output:
pattern occurs at shift = 4
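The bad character heuristic may take O(mn) time in the worst case. The worst case occurs when all characters of the text and the pattern are the same, for example txt[] = "AAAAAAAAAAAAAAAAAA" and pat[] = "AAAAA". In the best case, the bad character heuristic may take O(n/m) time. The best case occurs when all the characters of the text and the pattern are different.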
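The two extremes can be observed by counting character comparisons. The sketch below is an instrumented variant of the bad character search; the function name `count_comparisons` and the use of a dictionary in place of the fixed-size array are assumptions made for brevity.

```python
def count_comparisons(txt, pat):
    """Run the bad character search and count character comparisons."""
    n, m = len(txt), len(pat)
    last = {ch: i for i, ch in enumerate(pat)}  # last occurrence per character
    s = comparisons = 0
    while s <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if pat[j] != txt[s + j]:
                break
            j -= 1
        if j < 0:
            # Full match: align the next text character, if there is one.
            s += (m - last.get(txt[s + m], -1)) if s + m < n else 1
        else:
            # Mismatch: align the bad character, always moving at least 1.
            s += max(1, j - last.get(txt[s + j], -1))
    return comparisons


print(count_comparisons("A" * 18, "A" * 5))  # 70, i.e. (18 - 5 + 1) * 5
print(count_comparisons("B" * 18, "A" * 5))  # 3, roughly 18 / 5
```

In the first call every alignment matches in full and the pattern advances by one position, while in the second each alignment fails on its first comparison and the pattern jumps by its full length.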