Java Program to Implement the Zhu-Takaoka String Matching Algorithm

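The Zhu-Takaoka string matching algorithm is a variant of the Boyer-Moore algorithm for finding a pattern in a string. It changes the bad-character rule slightly. The good-suffix rule is the same as in Boyer-Moore, but the bad-character shift is no longer looked up from a single character: it is looked up from two consecutive characters.

Because a pair of characters can allow a longer shift than a single character, the algorithm can be slightly faster than Boyer-Moore. The good-suffix shift and the two-character bad-character shift can be used together in code for an extra gain in performance. This article discusses how the bad-character shift computation changes; the good-suffix rule carries over from Boyer-Moore unchanged.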

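How the algorithm works:

The algorithm starts out like Boyer-Moore: the pattern is aligned with the string and compared from right to left, one character at a time. The first comparison therefore happens at index (length of the pattern - 1).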

String :     ABCDEFGH

Pattern:    BCD

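So the comparison starts at the 'C' of the string, i.e. at index 2 (0-based indexing is used), which is the length of the pattern minus 1. As long as the characters match, the index is decremented and the next character to the left is compared. As soon as a mismatch is found, it is time to apply the bad-character shift.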

String :     ABCDEFGH

Pattern:    BCC

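a) At index 2 the string has the character 'C', and since Pattern[2] == 'C', the characters match. So we go on to check the previous indexes, i.e. 1 and 0. At String[1] (which is 'B'), Pattern[1] != 'B', so there is a mismatch and it is time to shift the pattern.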
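The right-to-left comparison described above can be sketched as a small helper. This is only an illustration: the class name RightToLeftCompare and the method firstMismatch are hypothetical and do not appear in the program later in this article.

```java
// Hypothetical helper, not part of the article's program.
class RightToLeftCompare {

    // Aligns the pattern at text position `start` and compares right to left.
    // Returns the pattern index of the first mismatch, or -1 on a full match.
    static int firstMismatch(String text, String pattern, int start) {
        int i = pattern.length() - 1;   // comparison starts at length - 1
        while (i >= 0 && pattern.charAt(i) == text.charAt(start + i)) {
            --i;                        // characters agree, move one step left
        }
        return i;
    }

    public static void main(String[] args) {
        // Pattern "BCC" aligned at index 0: 'C' matches, then 'C' vs 'B' fails.
        System.out.println(firstMismatch("ABCDEFGH", "BCC", 0));
        // Pattern "BCD" aligned at index 1 matches completely.
        System.out.println(firstMismatch("ABCDEFGH", "BCD", 1));
    }
}
```

The first call reports a mismatch at pattern index 1, as in step a) above; the second returns -1 because every character matches.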

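Computing the table of bad-character shifts (named the ZTBC table): this is a preprocessing stage, i.e. it must be completed before the comparisons start. In Boyer-Moore, the bad-character table is a hash map whose keys are the letters of the pattern and whose values give the number of positions to shift the pattern so that either:

The mismatch becomes a match.
The pattern moves past the mismatched character of the string.

In the Zhu-Takaoka algorithm, we instead maintain a 2-D array that gives the shift based on two consecutive characters of the string: the last two characters of the current window, where the comparison starts. Longer shifts mean fewer comparisons, which improves performance.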

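Procedure: building the logic. The idea behind computing the table is described below:

The table is a 2-D array whose rows and columns are both indexed by the characters of the alphabet. Every entry is initialized to the length of the pattern, because if a pair of characters does not occur in the pattern, the only option is to move the whole pattern past the mismatched characters.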

If pattern = "ABCD"

ZTBC =      A  B  C  D  E ...
        A   4  4  4  4  4
        B   4  4  4  4  4
        C   4  4  4  4  4
        D   4  4  4  4  4
        E   ...

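Now, if the second character of the pair is the first character of the pattern, shifting by the whole pattern length is not right: we should align that second character with the first character of the pattern. So the pattern is shifted by len-1.

So, for every row i of the array:

ZTBC[i][pattern[0]] = len-1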

So ZTBC now looks like:

ZTBC =      A  B  C  D  E ...
        A   3  4  4  4  4
        B   3  4  4  4  4
        C   3  4  4  4  4
        D   3  4  4  4  4
        E   ...

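Now, if the two characters occur consecutively in the pattern, we should shift the pattern only far enough for the pair in the string to line up with the pair in the pattern.

for all i from 1 to len-2

ZTBC[pattern[i-1]][pattern[i]] = len-i-1 ; // The shift that aligns the matching pair. The last pair (i = len-1) is skipped, because it would give a shift of 0.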

So finally ZTBC looks like:

ZTBC =      A  B  C  D  E ...
        A   3  2  4  4  4
        B   3  4  1  4  4
        C   3  4  4  4  4
        D   3  4  4  4  4
        E   ...
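The three construction steps can be sketched in Java and checked against the table for "ABCD". This is a minimal sketch that assumes the pattern contains only the uppercase letters 'A'..'Z'; the class name ZtbcTableDemo and the method buildZtbc are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;

// Hypothetical sketch of the table construction; names are illustrative.
class ZtbcTableDemo {

    // Builds the 26 x 26 table for a pattern made of letters 'A'..'Z'.
    static int[][] buildZtbc(String pattern) {
        int m = pattern.length();
        int[][] ztbc = new int[26][26];

        // Step 1: by default, shift by the whole pattern length.
        for (int[] row : ztbc) {
            Arrays.fill(row, m);
        }
        // Step 2: if the second character of the pair is pattern[0], shift by m - 1.
        for (int a = 0; a < 26; ++a) {
            ztbc[a][pattern.charAt(0) - 'A'] = m - 1;
        }
        // Step 3: pairs that occur inside the pattern. The last pair is skipped,
        // because it would give a shift of 0.
        for (int i = 1; i < m - 1; ++i) {
            ztbc[pattern.charAt(i - 1) - 'A'][pattern.charAt(i) - 'A'] = m - 1 - i;
        }
        return ztbc;
    }

    public static void main(String[] args) {
        int[][] ztbc = buildZtbc("ABCD");
        // Print the A..E corner of the table, one row per line.
        for (int a = 0; a < 5; ++a) {
            StringBuilder line = new StringBuilder().append((char) ('A' + a));
            for (int b = 0; b < 5; ++b) {
                line.append("  ").append(ztbc[a][b]);
            }
            System.out.println(line);
        }
    }
}
```

Rows A to D of the printed corner should agree with the final table above, for example ZTBC[A][B] = 2 and ZTBC[B][C] = 1.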

Illustration:

Suppose a string and a pattern are given as below:

String S  = "ABCABCDE"
Pattern P = "ABCD"


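Using 0-based indexing, the comparison begins at index 3.

S[3] != P[3]  // P[3] == 'D' and S[3] == 'A'

A mismatch occurs, so we shift the pattern by

ZTBC[C][A], since the last two consecutive characters of the window in the string are "CA".

So we now shift the pattern by 3.

Since ZTBC[C][A] == 3, the pattern now starts at index 3 and the comparison begins at index 6 (3+3).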
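The lookup in this step can be reproduced without building the whole table, by computing a single entry directly from the rules given earlier. The sketch below is only an illustration; the class name PairShiftDemo and the method shiftFor are hypothetical.

```java
// Hypothetical helper: computes one ZTBC entry directly from its definition.
class PairShiftDemo {

    // Shift for the pair (a, b) read from the last two characters of the window.
    static int shiftFor(String pattern, char a, char b) {
        int m = pattern.length();
        // Rightmost occurrence of the pair that does not end the pattern.
        for (int i = m - 2; i >= 1; --i) {
            if (pattern.charAt(i - 1) == a && pattern.charAt(i) == b) {
                return m - 1 - i;
            }
        }
        // Otherwise align b with the first pattern character if they are equal,
        // else move the whole pattern.
        return pattern.charAt(0) == b ? m - 1 : m;
    }

    public static void main(String[] args) {
        String s = "ABCABCDE";
        String p = "ABCD";
        int start = 0;                               // window covers s[0..3]
        char a = s.charAt(start + p.length() - 2);   // 'C'
        char b = s.charAt(start + p.length() - 1);   // 'A'
        int shift = shiftFor(p, a, b);
        System.out.println("ZTBC[" + a + "][" + b + "] = " + shift);
        System.out.println("next window starts at " + (start + shift));
    }
}
```

For this input the pair is "CA", the computed shift is 3, and the next window starts at index 3, which agrees with the step above.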

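We now compare the string and the pattern again, as in step 1. This time every character matches, so the pattern is found in the string and its position is printed: we have found one occurrence. To continue the search, we shift again using the last two characters of the window, here "CD". This pair ends the pattern, so its table entry keeps the default value ZTBC[C][D] == 4 and the pattern moves by 4 (a shift of 1 would also be safe, only slower). The same process then repeats.

The good-suffix rule can also be included in the program: taking the larger of the good-suffix shift and the ZTBC shift at each step improves performance. The good-suffix rule is the same as in Boyer-Moore.

As a general formula for the shift described above, suppose a mismatch occurs at some character of the string:

S[i+m-k] != P[m-k] // m is the size of the pattern, i is the index in S where the current window starts, and 1 <= k <= m.

Then the shift is given by:

ZTBC[S[i+m-2]][S[i+m-1]] // the two consecutive characters at the right end of the window, where the comparison starts.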
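To make the combination with good suffixes concrete, here is a minimal sketch that takes the larger of the two shifts at each mismatch. It rests on several assumptions: both inputs contain only 'A'..'Z', the pattern has at least two characters, and the good-suffix table follows the usual Boyer-Moore construction, computed here with a simple O(m^2) loop rather than the linear-time version. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names are illustrative. Inputs must use only 'A'..'Z'
// and the pattern must have at least two characters.
class ZhuTakaokaSketch {

    // Two-character bad-character table, as described in this article.
    static int[][] buildZtbc(String p) {
        int m = p.length();
        int[][] ztbc = new int[26][26];
        for (int[] row : ztbc) Arrays.fill(row, m);
        for (int a = 0; a < 26; ++a) ztbc[a][p.charAt(0) - 'A'] = m - 1;
        for (int i = 1; i < m - 1; ++i) {
            ztbc[p.charAt(i - 1) - 'A'][p.charAt(i) - 'A'] = m - 1 - i;
        }
        return ztbc;
    }

    // Boyer-Moore good-suffix table. suff[i] is the length of the longest
    // common suffix of p[0..i] and p, computed naively for clarity.
    static int[] buildGoodSuffix(String p) {
        int m = p.length();
        int[] suff = new int[m];
        for (int i = 0; i < m; ++i) {
            int k = 0;
            while (k <= i && p.charAt(i - k) == p.charAt(m - 1 - k)) ++k;
            suff[i] = k;
        }
        int[] gs = new int[m];
        Arrays.fill(gs, m);
        int j = 0;
        for (int i = m - 1; i >= 0; --i) {
            if (suff[i] == i + 1) {              // p[0..i] is also a suffix of p
                for (; j < m - 1 - i; ++j) {
                    if (gs[j] == m) gs[j] = m - 1 - i;
                }
            }
        }
        for (int i = 0; i <= m - 2; ++i) {
            gs[m - 1 - suff[i]] = m - 1 - i;
        }
        return gs;
    }

    // Returns the 0-based start positions of all occurrences of p in text.
    static List<Integer> search(String text, String p) {
        int n = text.length(), m = p.length();
        int[][] ztbc = buildZtbc(p);
        int[] gs = buildGoodSuffix(p);
        List<Integer> found = new ArrayList<>();
        int j = 0;
        while (j <= n - m) {
            int i = m - 1;
            while (i >= 0 && p.charAt(i) == text.charAt(j + i)) --i;
            if (i < 0) {
                found.add(j);
                j += gs[0];                      // shift after a full match
            } else {
                int pairShift = ztbc[text.charAt(j + m - 2) - 'A']
                                    [text.charAt(j + m - 1) - 'A'];
                j += Math.max(gs[i], pairShift); // larger of the two shifts
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(search("ABCABCDEABCDEA", "ABCD"));
    }
}
```

The positions returned are 0-based, so they are one less than the 1-based positions printed by the program in the Example section. Because the shift after a full match comes from the good-suffix table, overlapping occurrences are also reported.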

Example:

Java

// Java Program to Implement Zhu–Takaoka String Matching
// Algorithm
 
// No imports are required: the program uses only java.lang,
// which is imported implicitly
 
// Main class
public class GFG {
 
    // Declaring custom strings as inputs and patterns
    public static String string = "ABCABCDEABCDEA";
    public static String pattern = "ABCD";
 
    // And their lengths, derived from the strings
    // rather than hard-coded
    public static int stringlen = string.length();
    public static int patternlen = pattern.length();
 
    // Preprocessing and calculating the ZTBC for above
    // pattern by creating an integer array
 
    // The inputs are assumed to contain only the 26
    // uppercase letters 'A'..'Z', so the table is a
    // square matrix of 26 * 26
    public static int[][] ZTBC = new int[26][26];
 
    // Method 1
    // To calculate the ZTBC table, which the search uses
    // to find the indexes at which the matches occur
    public static void ZTBCCalculation()
    {
 
        // Declaring variables within this scope
        int i, j;
 
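        // Default: shift by the whole pattern length
        for (i = 0; i < 26; ++i)
            for (j = 0; j < 26; ++j)
                ZTBC[i][j] = patternlen;
 
        // If the second character of the pair is the first
        // character of the pattern, shift by patternlen - 1
        for (i = 0; i < 26; ++i)
            ZTBC[i][pattern.charAt(0) - 'A']
                = patternlen - 1;
 
        // Pairs that occur inside the pattern: shift so that
        // the pair lines up. The last pair is skipped, since
        // it would give a shift of 0
        for (i = 1; i < patternlen - 1; ++i)
            ZTBC[pattern.charAt(i - 1) - 'A']
                [pattern.charAt(i) - 'A']
                = patternlen - 1 - i;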
    }
 
    // Main driver method
    public static void main(String[] args)
    {
        // Declare variables in main() body
        int i, j;
 
        // Calling the above created Method 1
        ZTBCCalculation();
 
        // Lastly, searching for the pattern and printing the
        // position of each match
 
        j = 0;
 
        // Till condition holds true
        while (j <= stringlen - patternlen) {
 
            i = patternlen - 1;
            while (i >= 0
                   && pattern.charAt(i)
                          == string.charAt(i + j))
                --i;
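            if (i < 0) {
 
                // Pattern detected, report its 1-based position
                System.out.println("Pattern Found at "
                                   + (j + 1));
            }
 
            // Match or mismatch, shift by the ZTBC entry of the
            // last two characters of the current window, so that
            // overlapping occurrences are not skipped
            // (this lookup requires patternlen >= 2)
            j += ZTBC[string.charAt(j + patternlen - 2)
                      - 'A']
                     [string.charAt(j + patternlen - 1)
                      - 'A'];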
        }
    }
}

Output

Pattern Found at 4
Pattern Found at 9

Note:

Time complexity is O(stringlen * patternlen) for the search in the worst case, plus O(patternlen + 26*26) for preprocessing the table.
Space complexity is O(26 * 26), which is constant regardless of the input size.
