📜  用于模式搜索的Boyer Moore算法

📅  最后修改于: 2021-04-24 04:31:34             🧑  作者: Mango

模式搜索是计算机科学中的一个重要问题。当我们在记事本/单词文件,浏览器或数据库中搜索字符串,将使用模式搜索算法来显示搜索结果。典型的问题陈述将是-
给定一个文本txt [0..n-1]和一个模式pat [0..m-1],其中n是文本的长度,m是模式的长度,编写一个函数search(char pat [] ,char txt [])打印出txt []中所有出现的pat []。您可以假设n> m。
例子:

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAABA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 12

在本文中,我们将讨论Boyer Moore模式搜索算法。像KMP和有限自动机算法一样,Boyer Moore算法也对模式进行预处理。
博耶·摩尔(Boyer Moore)是以下两种方法的组合。
1)坏字符启发式
2)良好的后缀启发式
以上两种启发式方法也可以单独用于搜索文本中的模式。让我们首先了解两种独立的方法在Boyer Moore算法中如何协同工作。如果我们看一下朴素算法,它会在文本上将模式一个接一个地滑动。 KMP算法会对模式进行预处理,以便可以将模式移动不止一个。出于相同的原因,Boyer Moore算法进行了预处理。它处理模式并为两种启发式方法中的每一种创建不同的数组。在每个步骤中,它都会根据两种启发式方法各自建议的最大滑动量来滑动模式。因此,它在每一步都使用两种启发式方法建议的最大偏移量
与以前的模式搜索算法不同, Boyer Moore算法从模式的最后一个字符开始匹配。
在本文中,我们将在下一篇文章中讨论不良字符启发式和良好后缀启发式。

错误字符启发式

坏字符启发式的想法很简单。不使用该模式的当前字符匹配的文本的字符被称为不良字符。在不匹配时,我们将模式更改为–
1)不匹配变成匹配
2)模式P移过不匹配的字符。
案例1 –不匹配成为匹配
我们将查找模式中不匹配字符的最后一次出现的位置,如果模式中存在不匹配字符,则我们将移动模式以使其与文本T中的不匹配字符对齐。

情况1

情况1

说明:在上面的示例中,我们在位置3处出现不匹配。此处,我们的不匹配字符为“ A”。现在,我们将搜索模式中最后出现的“ A”。我们在模式1的位置(以蓝色显示)获得了“ A”,这是它的最后一次出现。现在我们将模式改变2次,以使模式中的“ A”与文本中的“ A”对齐。
情况2 –模式越过不匹配字符
我们将在模式中查找最后一次出现不匹配字符的位置,如果不存在字符,我们将使模式移过不匹配字符。

案例2

案例2

说明:这里我们在位置7有一个不匹配。位置7之前的模式中不存在不匹配字符“ C”,因此我们将模式移到位置7,最终在上述示例中,我们获得了模式的完美匹配(以绿色显示)。我们这样做是因为模式中不存在“ C”,因此在位置7之前的每个班次,我们都将不匹配,并且搜索将无济于事。
在以下实现中,我们预处理模式并将每个可能出现的字符的最后一次出现存储在大小等于字母大小的数组中。如果根本不存在该字符,则可能导致移动m(图案的长度)。因此,不良字符启发式需要O(n/m)  最好的时间。

C++
/* C++ Program for Bad Character Heuristic of Boyer
Moore String Matching Algorithm */
#include 
using namespace std;
# define NO_OF_CHARS 256
 
// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic( string str, int size,
                        int badchar[NO_OF_CHARS])
{
    int i;
 
    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;
 
    // Fill the actual value of last occurrence
    // of a character
    for (i = 0; i < size; i++)
        badchar[(int) str[i]] = i;
}
 
/* A pattern searching function that uses Bad
Character Heuristic of Boyer Moore Algorithm */
void search( string txt, string pat)
{
    int m = pat.size();
    int n = txt.size();
 
    int badchar[NO_OF_CHARS];
 
    /* Fill the bad character array by calling
    the preprocessing function badCharHeuristic()
    for given pattern */
    badCharHeuristic(pat, m, badchar);
 
    int s = 0; // s is shift of the pattern with
                // respect to text
    while(s <= (n - m))
    {
        int j = m - 1;
 
        /* Keep reducing index j of pattern while
        characters of pattern and text are
        matching at this shift s */
        while(j >= 0 && pat[j] == txt[s + j])
            j--;
 
        /* If the pattern is present at current
        shift, then index j will become -1 after
        the above loop */
        if (j < 0)
        {
            cout << "pattern occurs at shift = " <<  s << endl;
 
            /* Shift the pattern so that the next
            character in text aligns with the last
            occurrence of it in pattern.
            The condition s+m < n is necessary for
            the case when pattern occurs at the end
            of text */
            s += (s + m < n)? m-badchar[txt[s + m]] : 1;
 
        }
 
        else
            /* Shift the pattern so that the bad character
            in text aligns with the last occurrence of
            it in pattern. The max function is used to
            make sure that we get a positive shift.
            We may get a negative shift if the last
            occurrence of bad character in pattern
            is on the right side of the current
            character. */
            s += max(1, j - badchar[txt[s + j]]);
    }
}
 
/* Driver code */
int main()
{
    string txt= "ABAAABCD";
    string pat = "ABC";
    search(txt, pat);
    return 0;
}
  
 // This code is contributed by rathbhupendra


C
/* C Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
# include 
# include 
# include 
 
# define NO_OF_CHARS 256
 
// A utility function to get maximum of two integers
int max (int a, int b) { return (a > b)? a: b; }
 
// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic( char *str, int size,
                        int badchar[NO_OF_CHARS])
{
    int i;
 
    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
         badchar[i] = -1;
 
    // Fill the actual value of last occurrence
    // of a character
    for (i = 0; i < size; i++)
         badchar[(int) str[i]] = i;
}
 
/* A pattern searching function that uses Bad
   Character Heuristic of Boyer Moore Algorithm */
void search( char *txt,  char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);
 
    int badchar[NO_OF_CHARS];
 
    /* Fill the bad character array by calling
       the preprocessing function badCharHeuristic()
       for given pattern */
    badCharHeuristic(pat, m, badchar);
 
    int s = 0;  // s is shift of the pattern with
                // respect to text
    while(s <= (n - m))
    {
        int j = m-1;
 
        /* Keep reducing index j of pattern while
           characters of pattern and text are
           matching at this shift s */
        while(j >= 0 && pat[j] == txt[s+j])
            j--;
 
        /* If the pattern is present at current
           shift, then index j will become -1 after
           the above loop */
        if (j < 0)
        {
            printf("\n pattern occurs at shift = %d", s);
 
            /* Shift the pattern so that the next
               character in text aligns with the last
               occurrence of it in pattern.
               The condition s+m < n is necessary for
               the case when pattern occurs at the end
               of text */
            s += (s+m < n)? m-badchar[txt[s+m]] : 1;
 
        }
 
        else
            /* Shift the pattern so that the bad character
               in text aligns with the last occurrence of
               it in pattern. The max function is used to
               make sure that we get a positive shift.
               We may get a negative shift if the last
               occurrence  of bad character in pattern
               is on the right side of the current
               character. */
            s += max(1, j - badchar[txt[s+j]]);
    }
}
 
/* Driver program to test above function */
int main()
{
    char txt[] = "ABAAABCD";
    char pat[] = "ABC";
    search(txt, pat);
    return 0;
}


Java
/* Java Program for Bad Character Heuristic of Boyer
Moore String Matching Algorithm */
 
 
class AWQ{
     
     static int NO_OF_CHARS = 256;
      
    //A utility function to get maximum of two integers
     static int max (int a, int b) { return (a > b)? a: b; }
 
     //The preprocessing function for Boyer Moore's
     //bad character heuristic
     static void badCharHeuristic( char []str, int size,int badchar[])
     {
 
      // Initialize all occurrences as -1
      for (int i = 0; i < NO_OF_CHARS; i++)
           badchar[i] = -1;
 
      // Fill the actual value of last occurrence
      // of a character (indeces of table are ascii and values are index of occurence)
      for (i = 0; i < size; i++)
           badchar[(int) str[i]] = i;
     }
 
     /* A pattern searching function that uses Bad
     Character Heuristic of Boyer Moore Algorithm */
     static void search( char txt[],  char pat[])
     {
      int m = pat.length;
      int n = txt.length;
 
      int badchar[] = new int[NO_OF_CHARS];
 
      /* Fill the bad character array by calling
         the preprocessing function badCharHeuristic()
         for given pattern */
      badCharHeuristic(pat, m, badchar);
 
      int s = 0;  // s is shift of the pattern with
                  // respect to text
       //there are n-m+1 potential allignments
      while(s <= (n - m))
      {
          int j = m-1;
 
          /* Keep reducing index j of pattern while
             characters of pattern and text are
             matching at this shift s */
          while(j >= 0 && pat[j] == txt[s+j])
              j--;
 
          /* If the pattern is present at current
             shift, then index j will become -1 after
             the above loop */
          if (j < 0)
          {
              System.out.println("Patterns occur at shift = " + s);
 
              /* Shift the pattern so that the next
                 character in text aligns with the last
                 occurrence of it in pattern.
                 The condition s+m < n is necessary for
                 the case when pattern occurs at the end
                 of text */
              //txt[s+m] is character after the pattern in text
              s += (s+m < n)? m-badchar[txt[s+m]] : 1;
 
          }
 
          else
              /* Shift the pattern so that the bad character
                 in text aligns with the last occurrence of
                 it in pattern. The max function is used to
                 make sure that we get a positive shift.
                 We may get a negative shift if the last
                 occurrence  of bad character in pattern
                 is on the right side of the current
                 character. */
              s += max(1, j - badchar[txt[s+j]]);
      }
     }
 
     /* Driver program to test above function */
    public static void main(String []args) {
         
         char txt[] = "ABAAABCD".toCharArray();
         char pat[] = "ABC".toCharArray();
         search(txt, pat);
    }
}


Python
# Python3 Program for Bad Character Heuristic
# of Boyer Moore String Matching Algorithm
 
NO_OF_CHARS = 256
 
def badCharHeuristic(string, size):
    '''
    The preprocessing function for
    Boyer Moore's bad character heuristic
    '''
 
    # Initialize all occurrence as -1
    badChar = [-1]*NO_OF_CHARS
 
    # Fill the actual value of last occurrence
    for i in range(size):
        badChar[ord(string[i])] = i;
 
    # retun initialized list
    return badChar
 
def search(txt, pat):
    '''
    A pattern searching function that uses Bad Character
    Heuristic of Boyer Moore Algorithm
    '''
    m = len(pat)
    n = len(txt)
 
    # create the bad character list by calling
    # the preprocessing function badCharHeuristic()
    # for given pattern
    badChar = badCharHeuristic(pat, m)
 
    # s is shift of the pattern with respect to text
    s = 0
    while(s <= n-m):
        j = m-1
 
        # Keep reducing index j of pattern while
        # characters of pattern and text are matching
        # at this shift s
        while j>=0 and pat[j] == txt[s+j]:
            j -= 1
 
        # If the pattern is present at current shift,
        # then index j will become -1 after the above loop
        if j<0:
            print("Pattern occur at shift = {}".format(s))
 
            '''   
                Shift the pattern so that the next character in text
                      aligns with the last occurrence of it in pattern.
                The condition s+m < n is necessary for the case when
                   pattern occurs at the end of text
               '''
            s += (m-badChar[ord(txt[s+m])] if s+m


C#
/* C# Program for Bad Character Heuristic of Boyer
Moore String Matching Algorithm */
 
using System;
public class AWQ{
     
    static int NO_OF_CHARS = 256;
     
    //A utility function to get maximum of two integers
    static int max (int a, int b) { return (a > b)? a: b; }
 
    //The preprocessing function for Boyer Moore's
    //bad character heuristic
    static void badCharHeuristic( char []str, int size,int []badchar)
    {
    int i;
 
    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;
 
    // Fill the actual value of last occurrence
    // of a character
    for (i = 0; i < size; i++)
        badchar[(int) str[i]] = i;
    }
 
    /* A pattern searching function that uses Bad
    Character Heuristic of Boyer Moore Algorithm */
    static void search( char []txt, char []pat)
    {
    int m = pat.Length;
    int n = txt.Length;
 
    int []badchar = new int[NO_OF_CHARS];
 
    /* Fill the bad character array by calling
        the preprocessing function badCharHeuristic()
        for given pattern */
    badCharHeuristic(pat, m, badchar);
 
    int s = 0; // s is shift of the pattern with
                // respect to text
    while(s <= (n - m))
    {
        int j = m-1;
 
        /* Keep reducing index j of pattern while
            characters of pattern and text are
            matching at this shift s */
        while(j >= 0 && pat[j] == txt[s+j])
            j--;
 
        /* If the pattern is present at current
            shift, then index j will become -1 after
            the above loop */
        if (j < 0)
        {
            Console.WriteLine("Patterns occur at shift = " + s);
 
            /* Shift the pattern so that the next
                character in text aligns with the last
                occurrence of it in pattern.
                The condition s+m < n is necessary for
                the case when pattern occurs at the end
                of text */
            s += (s+m < n)? m-badchar[txt[s+m]] : 1;
 
        }
 
        else
            /* Shift the pattern so that the bad character
                in text aligns with the last occurrence of
                it in pattern. The max function is used to
                make sure that we get a positive shift.
                We may get a negative shift if the last
                occurrence of bad character in pattern
                is on the right side of the current
                character. */
            s += max(1, j - badchar[txt[s+j]]);
    }
    }
 
    /* Driver program to test above function */
    public static void Main() {
         
        char []txt = "ABAAABCD".ToCharArray();
        char []pat = "ABC".ToCharArray();
        search(txt, pat);
    }
}
 
// This code is contributed by PrinciRaj19992


输出:

pattern occurs at shift = 4

错误字符启发法可能需要O(mn)  最坏情况下的时间。当文本和模式的所有字符都相同时,会发生最坏的情况。例如,txt [] =“ AAAAAAAAAAAAAAAAAA”和pat [] =“ AAAAA”。坏字符启发法在最佳情况下可能需要O(n / m)。最好的情况是文本和模式的所有字符都不同。