Boyer Moore Algorithm | Good Suffix Heuristic

📅 Last modified: 2021-04-23 07:47:50 🧑 Author: Mango

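We have already discussed the bad character heuristic variation of the Boyer Moore algorithm. This article discusses the good suffix heuristic for pattern searching. As with the bad character heuristic, a preprocessing table is generated for the good suffix heuristic.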

Good Suffix Heuristic

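Let t be a substring of the text T that matches a substring of the pattern P. We now shift the pattern until one of the following holds:
1) Another occurrence of t in P is aligned with t in T.
2) A prefix of P is aligned with a suffix of t.
3) P moves past t.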

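Case 1: Another occurrence of t in P matches t in T
The pattern P may contain more occurrences of t. In that case, we try to shift the pattern so that such an occurrence lines up with t in the text T. For example:

Explanation: In the illustrated example, a substring t of the text T matched the pattern P (shown in green) before a mismatch at index 2. We then search P for another occurrence of t ("AB"). One is found starting at position 1 (yellow background), so we shift the pattern right by 2 positions to align that occurrence of t in P with t in T. This is the weak rule of the original Boyer Moore algorithm and is not very effective; the strong good suffix rule is discussed later in this article.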
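Case 1 can be sketched by brute force in Python. This is only an illustration: the helper name `case1_shift` and the pattern "CABAB" are assumptions, chosen so that the numbers agree with the explanation above (mismatch at index 2, t = "AB", occurrence at position 1).

```python
def case1_shift(pat, j):
    """Weak rule, case 1 only (illustrative brute force).

    j is the index in pat where the mismatch occurred, so the matched
    suffix is t = pat[j+1:]. Returns the shift that aligns the rightmost
    other occurrence of t in pat with t in the text, or None if t does
    not occur anywhere else in pat.
    """
    t = pat[j + 1:]
    if not t:
        return None  # nothing matched, so there is no good suffix
    # Try start positions to the left of the matched suffix, rightmost first.
    for start in range(j, -1, -1):
        if pat[start:start + len(t)] == t:
            return (j + 1) - start
    return None


# Hypothetical pattern with a mismatch at index 2, so t = "AB".
print(case1_shift("CABAB", 2))  # 2
```

Scanning from right to left means the closest occurrence wins, which gives the smallest shift that is still safe.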

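Case 2: A prefix of P matches a suffix of t in T
It is not always possible to find another occurrence of t in P; sometimes there is none at all. In that situation we can sometimes look for a suffix of t that matches a prefix of P and shift P to align the two. For example:

Explanation: In the illustrated example, t ("BAB") matched P (shown in green) at indices 2-4 before the mismatch. Because t does not occur anywhere else in P, we search for a prefix of P that matches a suffix of t. The prefix "AB" (yellow background) starting at index 0 does not match the whole of t, but it does match the suffix "AB" of t that starts at index 3. We therefore shift the pattern by 3 positions to align the prefix with the suffix.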
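A small Python sketch of case 2, again by brute force. The helper name `case2_shift` and the pattern "ABBAB" are assumptions, picked so that t = "BAB" and the prefix "AB" reproduce the shift of 3 described above.

```python
def case2_shift(pat, j):
    """Case 2 only (illustrative brute force).

    j is the mismatch index in pat, so t = pat[j+1:]. Returns the shift
    that aligns the longest prefix of pat with an equal suffix of t, or
    None when no non-empty prefix of pat matches a suffix of t.
    """
    m = len(pat)
    t = pat[j + 1:]
    # Candidate lengths from longest to shortest. The full length of t is
    # excluded because a complete occurrence of t belongs to case 1.
    for k in range(len(t) - 1, 0, -1):
        if pat[:k] == t[-k:]:
            return m - k
    return None


# Hypothetical pattern with a mismatch at index 1, so t = "BAB".
print(case2_shift("ABBAB", 1))  # 3
```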

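Case 3: P moves past t
If neither of the two cases above applies, we shift the pattern past t. For example:

Explanation: In the illustrated example, t ("AB") does not occur elsewhere in P, and no prefix of P matches a suffix of t. No match can therefore be found while P still overlaps t (up to index 4), so we shift P past t, i.e. to index 5.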

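Strong Good Suffix Heuristic

Suppose the substring q = P[i to n] matched t in T and c = P[i-1] is the mismatching character. Unlike case 1, we now search P for an occurrence of t that is not preceded by the character c. The closest such occurrence is then aligned with t in T by shifting the pattern P. For example:

Explanation: In the illustrated example, q = P[7 to 8] matched t in T. The mismatching character c is "C" at position P[6]. Searching P for t from right to left, the first occurrence found starts at position 4. That occurrence is preceded by "C", which equals c, so we skip it and keep searching. At position 1 there is another occurrence of t (yellow background). It is preceded by "A" (blue), which differs from c, so we shift the pattern P by 6 positions to align this occurrence with t in T. The reason is that we already know the character c = "C" causes a mismatch. Any occurrence of t preceded by c would cause the same mismatch again once aligned with t, so it is better to skip it.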
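The extra test on the preceding character can be sketched as follows. The helper name `strong_occurrence_shift` and the pattern "AABCABCAB" are assumptions: the pattern was constructed to have t = "AB" at positions 1, 4 and 7 with "C" before the last two, which mirrors the positions in the explanation above.

```python
def strong_occurrence_shift(pat, j):
    """Strong good suffix rule, occurrence search only (brute force).

    j is the mismatch index, c = pat[j] is the mismatching pattern
    character and t = pat[j+1:] is the matched suffix. Returns the shift
    that aligns the closest occurrence of t not preceded by c, or None
    if no such occurrence exists.
    """
    t = pat[j + 1:]
    c = pat[j]
    if not t:
        return None
    for start in range(j, -1, -1):
        if pat[start:start + len(t)] != t:
            continue
        # An occurrence at the very start of the pattern has no preceding
        # character, so it cannot repeat the known mismatch.
        if start == 0 or pat[start - 1] != c:
            return (j + 1) - start
    return None


# Hypothetical pattern: mismatch at index 6 (c = "C"), t = "AB".
print(strong_occurrence_shift("AABCABCAB", 6))  # 6
```

For this pattern the weak rule would stop at the occurrence at position 4 and shift by only 3, after which the same "C" would mismatch again.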

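Preprocessing for the Good Suffix Heuristic

As part of preprocessing, an array shift is created. Each entry shift[i] holds the distance the pattern shifts when a mismatch occurs at position i-1, that is, when the suffix of the pattern starting at position i has matched and the mismatch occurs at position i-1. Preprocessing is done separately for the strong good suffix rule and for case 2 discussed above.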

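1) Preprocessing for the strong good suffix rule
Before discussing preprocessing, let us introduce the concept of a border. A border of a string is a substring that is both a proper prefix and a proper suffix of it. For example, in the string "ccacc", "c" is a border and "cc" is a border because each appears at both ends of the string, but "cca" is not a border.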
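The definition can be checked directly with a few lines of Python. The function name `borders` is an assumption made for this illustration.

```python
def borders(s):
    """Return every non-empty proper border of s, shortest first."""
    return [s[:k] for k in range(1, len(s)) if s[:k] == s[-k:]]


print(borders("ccacc"))  # ['c', 'cc']
```

"cca" is missing from the result because the last three characters of "ccacc" are "acc".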

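As part of preprocessing, an array bpos (border position) is computed. Each entry bpos[i] holds the starting index of the widest border of the suffix of P that starts at index i.
The empty suffix φ starting at position m has no border, so bpos[m] is set to m+1, where m is the length of the pattern.
The shift positions are obtained from borders that cannot be extended to the left. The preprocessing code follows: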

void preprocess_strong_suffix(int *shift, int *bpos,
                  char *pat, int m)
{
    int i = m, j = m+1;
    bpos[i] = j;
    while(i > 0)
    {
        while(j <= m && pat[i-1] != pat[j-1])
        {
            if (shift[j] == 0)
                shift[j] = j-i;
            j = bpos[j];
        }
        i--; j--;
        bpos[i] = j; 
    }
}

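Explanation: Consider the pattern P = "ABBABAB", m = 7.

The widest border of the suffix "AB" starting at position i = 5 is φ (empty), which starts at position 7, so bpos[5] = 7.

At position i = 2 the suffix is "BABAB". Its widest border is "BAB", starting at position 4, so j = bpos[2] = 4.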
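To make the meaning of bpos concrete, the sketch below computes it straight from the definition, without the linear-time trick used by the preprocessing code. The function name `bpos_by_definition` is an assumption; it is a slow reference version meant only for checking small examples.

```python
def bpos_by_definition(pat):
    """bpos[i] = start index of the widest border of the suffix pat[i:]."""
    m = len(pat)
    bpos = [0] * (m + 1)
    bpos[m] = m + 1  # the empty suffix has no border
    for i in range(m):
        suffix = pat[i:]
        width = 0  # width of the widest proper border, 0 if only the empty one
        for k in range(len(suffix) - 1, 0, -1):
            if suffix[:k] == suffix[-k:]:
                width = k
                break
        # A border of width w is the last w characters, so it starts at m - w.
        bpos[i] = m - width
    return bpos


print(bpos_by_definition("ABBABAB"))  # [5, 6, 4, 5, 6, 7, 7, 8]
```

The entries at indices 5 and 2 are 7 and 4, matching the two cases worked through above.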

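The assignment bpos[i] = j can be understood as follows.

If the character # at position i-1 equals the character ? at position j-1, the border is ? followed by the border of the suffix at position i, which starts at position j. Equivalently, the border of the suffix at i-1 begins at j-1, i.e. bpos[i-1] = j-1, or in code:

i--;
j--;
bpos[i] = j;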

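If the character # at position i-1 does not match the character ? at position j-1, we continue the search to the right. We now know that:

1) The border is narrower than the border starting at position j, i.e. narrower than x…φ.
2) The border must begin with # and end with φ, or it is empty (no border exists).

Given these two facts, we continue searching in the substring x…φ, from position j to m. The next candidate border is at j = bpos[j]. After updating j, we again compare the character at position j-1 (?) with #. If they are equal we have found the border; otherwise we keep searching to the right until j > m. This process is shown by the code: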

while(j <= m && pat[i-1] != pat[j-1])
{
    j = bpos[j];
}
i--; j--;
bpos[i]=j;
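The jumps j = bpos[j] are easier to follow with a trace. The Python sketch below is the same loop with the shift bookkeeping left out, plus a list that records every jump; the function name and the `jumps` list are assumptions added for illustration.

```python
def border_positions_with_trace(pat):
    """Compute bpos and record each jump (i, old_j, new_j) of the search."""
    m = len(pat)
    bpos = [0] * (m + 1)
    jumps = []
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        # Characters differ: fall back to the next narrower border.
        while j <= m and pat[i - 1] != pat[j - 1]:
            jumps.append((i, j, bpos[j]))
            j = bpos[j]
        # Either the characters are equal or j > m (empty border).
        i -= 1
        j -= 1
        bpos[i] = j
    return bpos, jumps


bpos, jumps = border_positions_with_trace("ABBABAB")
print(bpos)   # [5, 6, 4, 5, 6, 7, 7, 8]
print(jumps)  # [(6, 7, 8), (2, 4, 6), (2, 6, 7)]
```

For this pattern the search only falls back at i = 6 and i = 2; everywhere else the border simply grows by one character to the left.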

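Two conditions in the preprocessing code deserve attention. The first is:

pat[i-1] != pat[j-1]

This is the condition from the strong good suffix rule. When the character preceding an occurrence of t in the pattern P differs from the mismatching character in P, we stop skipping occurrences and shift the pattern. Here P[i..] and P[j..] agree over the width of the border, but P[i-1] != P[j-1], so we shift the pattern from i to j and record shift[j] = j-i for j. In the search, a mismatch at position j of the pattern therefore shifts the pattern right by shift[j+1] positions.
The second condition is just as important:

if (shift[j] == 0 )

This condition prevents shift[j] from being overwritten by a longer suffix that has the same border. For example, consider the pattern P = "addbddcdd". When bpos[i-1] is computed for i = 4, j = 7, and shift[7] is set to 3. When bpos[i-1] is computed for i = 1, j = 7 again, and without the test shift[j] == 0 the value shift[7] would be overwritten with 6. With the test in place, a mismatch at position 6 shifts the pattern P by 3 positions to the right rather than 6.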
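The effect of the guard can be reproduced with a small experiment. The sketch below is a Python transcription of the strong suffix preprocessing with one addition that is not part of the article's code: a hypothetical `keep_first` flag that switches the `shift[j] == 0` test off.

```python
def strong_shift(pat, keep_first=True):
    """Strong good suffix preprocessing; keep_first=False disables the guard."""
    m = len(pat)
    shift = [0] * (m + 1)
    bpos = [0] * (m + 1)
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        while j <= m and pat[i - 1] != pat[j - 1]:
            # With the guard, only the first (smallest) shift is recorded.
            if shift[j] == 0 or not keep_first:
                shift[j] = j - i
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j
    return shift


print(strong_shift("addbddcdd")[7])                    # 3
print(strong_shift("addbddcdd", keep_first=False)[7])  # 6
```

Keeping the first value matters because the loop runs from right to left, so the first value written is the smallest shift. A shift of 6 would jump over the alignment at shift 3, where a match is still possible.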

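2) Preprocessing for case 2
In the preprocessing for case 2, the widest border of the whole pattern that is contained in each suffix is determined.
The starting position of the widest border of the whole pattern is stored in bpos[0].
In the preprocessing algorithm below, this value bpos[0] is first stored in all free entries of the array shift. Once the suffix of the pattern becomes shorter than that border, the algorithm continues with the next-widest border of the pattern, i.e. with bpos[j].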
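The following Python sketch shows which entries of shift the second phase fills in. The function names `strong_phase` and `case2_phase` are assumptions, the pattern "ABBABAB" is reused from the earlier explanation, and the second function returns a copy instead of modifying the array in place so that both states can be printed.

```python
def strong_phase(pat):
    """Phase 1: strong good suffix rule. Returns (shift, bpos)."""
    m = len(pat)
    shift = [0] * (m + 1)
    bpos = [0] * (m + 1)
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        while j <= m and pat[i - 1] != pat[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j
    return shift, bpos


def case2_phase(shift, bpos):
    """Phase 2: fill the entries phase 1 left at 0 with border-based shifts."""
    m = len(shift) - 1
    filled = list(shift)
    j = bpos[0]  # start of the widest border of the whole pattern
    for i in range(m + 1):
        if filled[i] == 0:
            filled[i] = j
        if i == j:
            # The suffix is now shorter than this border: use the next one.
            j = bpos[j]
    return filled


shift, bpos = strong_phase("ABBABAB")
print(shift)                     # [0, 0, 0, 0, 2, 0, 4, 1]
print(case2_phase(shift, bpos))  # [5, 5, 5, 5, 2, 5, 4, 1]
```

Every entry that was still 0 receives 5, which is bpos[0]: the widest border "AB" of the whole pattern starts at index 5, so shifting by 5 aligns the prefix "AB" with it.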

The implementation of the search algorithm follows:

C++

/* C program for Boyer Moore Algorithm with 
   Good Suffix heuristic to find pattern in
   given text string */
  
#include <stdio.h>
#include <string.h>
  
// preprocessing for strong good suffix rule
void preprocess_strong_suffix(int *shift, int *bpos,
                                char *pat, int m)
{
    // m is the length of pattern 
    int i=m, j=m+1;
    bpos[i]=j;
  
    while(i>0)
    {
        /*if character at position i-1 is not equivalent to
          character at j-1, then continue searching to right
          of the pattern for border */
        while(j<=m && pat[i-1] != pat[j-1])
        {
            /* the character preceding the occurrence of t in 
               pattern P is different than the mismatching character in P, 
               we stop skipping the occurrences and shift the pattern
               from i to j */
            if (shift[j]==0)
                shift[j] = j-i;
  
            //Update the position of next border 
            j = bpos[j];
        }
        /* p[i-1] matched with p[j-1], border is found.
           store the  beginning position of border */
        i--;j--;
        bpos[i] = j; 
    }
}
  
//Preprocessing for case 2
void preprocess_case2(int *shift, int *bpos,
                      char *pat, int m)
{
    int i, j;
    j = bpos[0];
    for(i=0; i<=m; i++)
    {
        /* set the border position of the first character of the pattern
           to all indices in array shift having shift[i] = 0 */ 
        if(shift[i]==0)
            shift[i] = j;
  
        /* suffix becomes shorter than bpos[0], use the position of 
           next widest border as value of j */
        if (i==j)
            j = bpos[j];
    }
}
  
/*Search for a pattern in given text using
  Boyer Moore algorithm with Good suffix rule */
void search(char *text, char *pat)
{
    // s is shift of the pattern with respect to text
    int s=0, j;
    int m = strlen(pat);
    int n = strlen(text);
  
    int bpos[m+1], shift[m+1];
  
    //initialize all occurrence of shift to 0
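    for(int i=0;i<m+1;i++) shift[i]=0;
  
    //do preprocessing
    preprocess_strong_suffix(shift, bpos, pat, m);
    preprocess_case2(shift, bpos, pat, m);
  
    while(s <= n-m)
    {
        j = m-1;
  
        /* Keep reducing index j of pattern while characters of
           pattern and text are matching at this shift s */
        while(j >= 0 && pat[j] == text[s+j])
            j--;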
  
        /* If the pattern is present at the current shift, then index j
             will become -1 after the above loop */
        if (j<0)
        {
            printf("pattern occurs at shift = %d\n", s);
            s += shift[0];
        }
        else
            /* pat[j] != text[s+j], so shift the pattern
               by shift[j+1] positions */
            s += shift[j+1];
    }
  
}
  
//Driver 
int main()
{
    char text[] = "ABAAAABAACD";
    char pat[] = "ABA";
    search(text, pat);
    return 0;
}

Java

/* Java program for Boyer Moore Algorithm with 
Good Suffix heuristic to find pattern in
given text string */
class GFG 
{
  
// preprocessing for strong good suffix rule
static void preprocess_strong_suffix(int []shift, int []bpos,
                                      char []pat, int m)
{
    // m is the length of pattern 
    int i = m, j = m + 1;
    bpos[i] = j;
  
    while(i > 0)
    {
        /*if character at position i-1 is not 
        equivalent to character at j-1, then 
        continue searching to right of the
        pattern for border */
        while(j <= m && pat[i - 1] != pat[j - 1])
        {
            /* the character preceding the occurrence of t 
            in pattern P is different than the mismatching 
            character in P, we stop skipping the occurrences 
            and shift the pattern from i to j */
            if (shift[j] == 0)
                shift[j] = j - i;
  
            //Update the position of next border 
            j = bpos[j];
        }
        /* p[i-1] matched with p[j-1], border is found.
        store the beginning position of border */
        i--; j--;
        bpos[i] = j; 
    }
}
  
//Preprocessing for case 2
static void preprocess_case2(int []shift, int []bpos,
                              char []pat, int m)
{
    int i, j;
    j = bpos[0];
    for(i = 0; i <= m; i++)
    {
        /* set the border position of the first character 
        of the pattern to all indices in array shift
        having shift[i] = 0 */
        if(shift[i] == 0)
            shift[i] = j;
  
        /* suffix becomes shorter than bpos[0], 
        use the position of next widest border
        as value of j */
        if (i == j)
            j = bpos[j];
    }
}
  
/*Search for a pattern in given text using
Boyer Moore algorithm with Good suffix rule */
static void search(char []text, char []pat)
{
    // s is shift of the pattern 
    // with respect to text
    int s = 0, j;
    int m = pat.length;
    int n = text.length;
  
    int []bpos = new int[m + 1];
    int []shift = new int[m + 1];
  
    //initialize all occurrence of shift to 0
    for(int i = 0; i < m + 1; i++) 
        shift[i] = 0;
  
    //do preprocessing
    preprocess_strong_suffix(shift, bpos, pat, m);
    preprocess_case2(shift, bpos, pat, m);
  
    while(s <= n - m)
    {
        j = m - 1;
  
        /* Keep reducing index j of pattern while 
        characters of pattern and text are matching 
        at this shift s*/
        while(j >= 0 && pat[j] == text[s+j])
            j--;
  
        /* If the pattern is present at the current shift, 
        then index j will become -1 after the above loop */
        if (j < 0)
        {
            System.out.printf("pattern occurs at shift = %d\n", s);
            s += shift[0];
        }
        else
          
            /* pat[j] != text[s+j], so shift the pattern
            by shift[j+1] positions */
            s += shift[j + 1];
    }
  
}
  
// Driver Code
public static void main(String[] args) 
{
    char []text = "ABAAAABAACD".toCharArray();
    char []pat = "ABA".toCharArray();
    search(text, pat);
}
} 
  
// This code is contributed by 29AjayKumar

Python3

# Python3 program for Boyer Moore Algorithm with 
# Good Suffix heuristic to find pattern in 
# given text string
  
# preprocessing for strong good suffix rule
def preprocess_strong_suffix(shift, bpos, pat, m):
  
    # m is the length of pattern
    i = m
    j = m + 1
    bpos[i] = j
  
    while i > 0:
          
        '''if character at position i-1 is 
        not equivalent to character at j-1, 
        then continue searching to right 
        of the pattern for border '''
        while j <= m and pat[i - 1] != pat[j - 1]:
              
            ''' the character preceding the occurrence 
            of t in pattern P is different than the 
            mismatching character in P, we stop skipping
            the occurrences and shift the pattern 
            from i to j '''
            if shift[j] == 0:
                shift[j] = j - i
  
            # Update the position of next border
            j = bpos[j]
              
        ''' p[i-1] matched with p[j-1], border is found. 
        store the beginning position of border '''
        i -= 1
        j -= 1
        bpos[i] = j
  
# Preprocessing for case 2
def preprocess_case2(shift, bpos, pat, m):
    j = bpos[0]
    for i in range(m + 1):
          
        ''' set the border position of the first character 
        of the pattern to all indices in array shift
        having shift[i] = 0 '''
        if shift[i] == 0:
            shift[i] = j
              
        ''' suffix becomes shorter than bpos[0], 
        use the position of next widest border
        as value of j '''
        if i == j:
            j = bpos[j]
  
'''Search for a pattern in given text using 
Boyer Moore algorithm with Good suffix rule '''
def search(text, pat):
  
    # s is shift of the pattern with respect to text
    s = 0
    m = len(pat)
    n = len(text)
  
    bpos = [0] * (m + 1)
  
    # initialize all occurrence of shift to 0
    shift = [0] * (m + 1)
  
    # do preprocessing
    preprocess_strong_suffix(shift, bpos, pat, m)
    preprocess_case2(shift, bpos, pat, m)
  
    while s <= n - m:
        j = m - 1
          
        ''' Keep reducing index j of pattern while characters of 
            pattern and text are matching at this shift s'''
        while j >= 0 and pat[j] == text[s + j]:
            j -= 1
              
        ''' If the pattern is present at the current shift, 
            then index j will become -1 after the above loop '''
        if j < 0:
            print("pattern occurs at shift = %d" % s)
            s += shift[0]
        else:
              
            '''pat[j] != text[s+j], so shift the pattern 
            by shift[j+1] positions '''
            s += shift[j + 1]
  
# Driver Code
if __name__ == "__main__":
    text = "ABAAAABAACD"
    pat = "ABA"
    search(text, pat)
  
# This code is contributed by
# sanjeev2552

C#

/* C# program for Boyer Moore Algorithm with 
Good Suffix heuristic to find pattern in
given text string */
using System;
  
class GFG 
{
  
// preprocessing for strong good suffix rule
static void preprocess_strong_suffix(int []shift, 
                                     int []bpos,
                                     char []pat, int m)
{
    // m is the length of pattern 
    int i = m, j = m + 1;
    bpos[i] = j;
  
    while(i > 0)
    {
        /*if character at position i-1 is not 
        equivalent to character at j-1, then 
        continue searching to right of the
        pattern for border */
        while(j <= m && pat[i - 1] != pat[j - 1])
        {
            /* the character preceding the occurrence of t 
            in pattern P is different than the mismatching 
            character in P, we stop skipping the occurrences 
            and shift the pattern from i to j */
            if (shift[j] == 0)
                shift[j] = j - i;
  
            //Update the position of next border 
            j = bpos[j];
        }
        /* p[i-1] matched with p[j-1], border is found.
        store the beginning position of border */
        i--; j--;
        bpos[i] = j; 
    }
}
  
//Preprocessing for case 2
static void preprocess_case2(int []shift, int []bpos,
                             char []pat, int m)
{
    int i, j;
    j = bpos[0];
    for(i = 0; i <= m; i++)
    {
        /* set the border position of the first character 
        of the pattern to all indices in array shift
        having shift[i] = 0 */
        if(shift[i] == 0)
            shift[i] = j;
  
        /* suffix becomes shorter than bpos[0], 
        use the position of next widest border
        as value of j */
        if (i == j)
            j = bpos[j];
    }
}
  
/*Search for a pattern in given text using
Boyer Moore algorithm with Good suffix rule */
static void search(char []text, char []pat)
{
    // s is shift of the pattern 
    // with respect to text
    int s = 0, j;
    int m = pat.Length;
    int n = text.Length;
  
    int []bpos = new int[m + 1];
    int []shift = new int[m + 1];
  
    // initialize all occurrence of shift to 0
    for(int i = 0; i < m + 1; i++) 
        shift[i] = 0;
  
    // do preprocessing
    preprocess_strong_suffix(shift, bpos, pat, m);
    preprocess_case2(shift, bpos, pat, m);
  
    while(s <= n - m)
    {
        j = m - 1;
  
        /* Keep reducing index j of pattern while 
        characters of pattern and text are matching 
        at this shift s*/
        while(j >= 0 && pat[j] == text[s + j])
            j--;
  
        /* If the pattern is present at the current shift, 
        then index j will become -1 after the above loop */
        if (j < 0)
        {
            Console.Write("pattern occurs at shift = {0}\n", s);
            s += shift[0];
        }
        else
          
            /* pat[j] != text[s+j], so shift the pattern
            by shift[j+1] positions */
            s += shift[j + 1];
    }
}
  
// Driver Code
public static void Main(String[] args) 
{
    char []text = "ABAAAABAACD".ToCharArray();
    char []pat = "ABA".ToCharArray();
    search(text, pat);
}
} 
  
// This code is contributed by PrinciRaj1992

Output:

pattern occurs at shift = 0
pattern occurs at shift = 5
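One way to gain confidence in the shift table is to compare the search with a naive scan on many small inputs. The sketch below does that in Python. The function names, the two-letter alphabet, the fixed seed and the number of trials are all assumptions chosen for this check; the search returns a list of shifts instead of printing them.

```python
import random


def good_suffix_table(pat):
    """Both preprocessing phases combined; returns the final shift array."""
    m = len(pat)
    shift = [0] * (m + 1)
    bpos = [0] * (m + 1)
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        while j <= m and pat[i - 1] != pat[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j
    j = bpos[0]
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j
        if i == j:
            j = bpos[j]
    return shift


def find_all(text, pat):
    """Good suffix search; returns every shift at which pat occurs."""
    shift = good_suffix_table(pat)
    m, n = len(pat), len(text)
    hits, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pat[j] == text[s + j]:
            j -= 1
        s_next = shift[0] if j < 0 else shift[j + 1]
        if j < 0:
            hits.append(s)
        s += s_next
    return hits


def naive_find_all(text, pat):
    """Reference answer: test every alignment."""
    return [s for s in range(len(text) - len(pat) + 1)
            if text.startswith(pat, s)]


print(find_all("ABAAAABAACD", "ABA"))  # [0, 5]

# Randomized comparison on a small alphabet, where repeats are frequent.
rng = random.Random(0)
for _ in range(500):
    text = "".join(rng.choices("AB", k=rng.randint(0, 20)))
    pat = "".join(rng.choices("AB", k=rng.randint(1, 5)))
    assert find_all(text, pat) == naive_find_all(text, pat)
```

A small alphabet is used on purpose: it produces patterns with many borders, which exercises both preprocessing phases.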

References

http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm