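A suffix array is a sorted array of all suffixes of a given string. The definition is similar to that of a suffix tree, which is a compressed trie of all suffixes of the given text.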
Let the given string be "banana".
0 banana                          5 a
1 anana     Sort the Suffixes     3 ana
2 nana      ---------------->     1 anana
3 ana        alphabetically       0 banana
4 na                              4 na
5 a                               2 nana
The suffix array for "banana" is {5, 3, 1, 0, 4, 2}
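A naive algorithm for constructing a suffix array has been discussed previously. The naive algorithm considers all suffixes, sorts them with an O(nLogn) comparison sort, and keeps the original indexes while sorting. Because comparing two suffixes can itself take O(n) time, the time complexity of the naive algorithm is O(n^2 Logn), where n is the number of characters in the input string.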
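For reference, the naive construction can be sketched in a few lines of Python. This is only an illustration of the idea described above; the function name `naive_suffix_array` is an assumption made for this sketch and does not appear in the listings later in the article.

```python
def naive_suffix_array(text):
    """Sort suffix start positions by comparing the suffix strings directly."""
    # Each comparison may scan up to n characters, which is where the
    # extra factor of n in O(n^2 Logn) comes from.
    return sorted(range(len(text)), key=lambda i: text[i:])


print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The slice `text[i:]` also copies each suffix, so this sketch uses O(n^2) extra memory; it is useful mainly as a correctness check for faster constructions.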
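In this post, an O(nLogn) algorithm for suffix array construction is discussed. For simplicity, let us first discuss an O(n * Logn * Logn) algorithm. The idea is to use the fact that the strings being sorted are all suffixes of a single string.
We first sort all suffixes according to the first character, then according to the first 2 characters, then the first 4 characters, and so on, while the number of characters considered is smaller than 2n. The important point is that, once the suffixes are sorted according to their first 2^i characters, they can be sorted according to their first 2^(i+1) characters in O(nLogn) time with any nLogn sorting algorithm such as Merge Sort. This is possible because two suffixes can then be compared in O(1) time: the first 2^(i+1) characters of the suffix at index j consist of the first 2^i characters of suffix j followed by the first 2^i characters of suffix j + 2^i, and both halves already have known ranks, so we only need to compare two values (see the example and code below).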
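The constant-time comparison can be illustrated with a small Python sketch. The helper name `pair_key` and the parameter `h` (the number of characters the current ranks already cover) are assumptions introduced for this example only.

```python
def pair_key(rank, i, h):
    """Key describing the first 2*h characters of the suffix starting at i.

    rank[j] is assumed to order suffix j by its first h characters.
    """
    n = len(rank)
    # -1 means the suffix has no characters beyond its first h.
    second = rank[i + h] if i + h < n else -1
    return (rank[i], second)


# Ranks of the suffixes of "banana" by first character, as str[i] - 'a'.
rank = [1, 0, 13, 0, 13, 0]
print(pair_key(rank, 1, 1))  # (0, 13) for "anana"
print(pair_key(rank, 5, 1))  # (0, -1) for "a"
print(pair_key(rank, 5, 1) < pair_key(rank, 1, 1))  # True: "a" sorts before "anana"
```

Comparing two such keys touches only two integers per suffix, regardless of how long the suffixes are.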
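The sort function is called O(Logn) times (note that the number of characters considered grows in powers of 2). Therefore, the overall time complexity becomes O(nLognLogn). See http://www.stanford.edu/class/cs97si/suffix-array.pdf for more details.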
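As a quick sanity check on the O(Logn) count, the following Python sketch counts how many times a sort would run if the number of characters considered starts at 2 and doubles while it is smaller than 2n. The function name `count_sort_calls` is hypothetical and exists only for this illustration.

```python
def count_sort_calls(n):
    """Count sort passes: one for the first 2 characters, then one per doubling."""
    calls = 1          # initial sort by the first 2 characters
    k = 4              # next pass considers the first 4 characters
    while k < 2 * n:
        calls += 1
        k *= 2
    return calls


print(count_sort_calls(6))     # 3 passes for "banana"
print(count_sort_calls(1024))  # 10 passes
```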
Let us build the suffix array for the example string "banana" using the above algorithm.
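Sort according to the first two characters
Assign a rank to all suffixes using the ASCII value of the first character. A simple way to assign ranks is to compute "str[i] - 'a'" for the i-th suffix of str[].
Index   Suffix   Rank
0       banana   1
1       anana    0
2       nana     13
3       ana      0
4       na       13
5       a        0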
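For every character, we also store the rank of the next adjacent character, i.e. the rank of the character at str[i + 1] (this is needed to sort the suffixes according to their first 2 characters). If a character is the last character, we store the next rank as -1.
Index   Suffix   Rank   Next Rank
0       banana   1      0
1       anana    0      13
2       nana     13     0
3       ana      0      13
4       na       13     0
5       a        0      -1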
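The table above can be reproduced with a short Python sketch. The function name `initial_pairs` is an assumption for this example, and the sketch assumes lowercase input, as the "str[i] - 'a'" rule does.

```python
def initial_pairs(text):
    """Return (rank, next rank) for every suffix, based on single characters."""
    n = len(text)
    pairs = []
    for i in range(n):
        rank = ord(text[i]) - ord("a")
        # The last suffix has no following character, so its next rank is -1.
        next_rank = ord(text[i + 1]) - ord("a") if i + 1 < n else -1
        pairs.append((rank, next_rank))
    return pairs


for i, pair in enumerate(initial_pairs("banana")):
    print(i, "banana"[i:], pair)
```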
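Sort all suffixes according to rank and adjacent rank. The rank is treated as the first digit, or MSD (most significant digit), and the adjacent rank as the second digit.
Index   Suffix   Rank   Next Rank
5       a        0      -1
1       anana    0      13
3       ana      0      13
0       banana   1      0
2       nana     13     0
4       na       13     0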
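In Python, sorting by a (rank, next rank) tuple gives exactly this "first digit, second digit" ordering, because tuples compare element by element. The sketch below hard-codes the pairs for "banana"; the name `sort_by_pairs` is an assumption made for illustration.

```python
def sort_by_pairs(pairs):
    """Return suffix indexes ordered by their (rank, next rank) pair."""
    # sorted() is stable, so suffixes with equal pairs keep their index order.
    return sorted(range(len(pairs)), key=lambda i: pairs[i])


# (rank, next rank) for suffixes 0..5 of "banana"
pairs = [(1, 0), (0, 13), (13, 0), (0, 13), (13, 0), (0, -1)]
print(sort_by_pairs(pairs))  # [5, 1, 3, 0, 2, 4]
```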
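Sort according to the first four characters
Assign new ranks to all suffixes. To assign new ranks, we consider the sorted suffixes one by one. Assign 0 as the new rank of the first suffix. For each remaining suffix, we compare its old rank pair (rank, next rank) with the old rank pair of the suffix just before it in sorted order. If the two pairs are the same, assign it the same new rank as the previous suffix. Otherwise, assign the new rank of the previous suffix plus one.
Index   Suffix   Rank
5       a        0      [Assign 0 to first]
1       anana    1      (0, 13) is different from previous
3       ana      1      (0, 13) is same as previous
0       banana   2      (1, 0) is different from previous
2       nana     3      (13, 0) is different from previous
4       na       3      (13, 0) is same as previous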
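The re-ranking rule can be sketched as follows. The function name `rerank` is hypothetical; its input is the list of old rank pairs in sorted order.

```python
def rerank(sorted_pairs):
    """Assign new ranks to suffixes given their old rank pairs in sorted order."""
    if not sorted_pairs:
        return []
    new_ranks = [0]  # the first suffix in sorted order gets rank 0
    for prev, cur in zip(sorted_pairs, sorted_pairs[1:]):
        # Same pair as the previous suffix: same rank. Otherwise one more.
        step = 0 if cur == prev else 1
        new_ranks.append(new_ranks[-1] + step)
    return new_ranks


# Old pairs of "banana" in sorted order: suffixes 5, 1, 3, 0, 2, 4
sorted_pairs = [(0, -1), (0, 13), (0, 13), (1, 0), (13, 0), (13, 0)]
print(rerank(sorted_pairs))  # [0, 1, 1, 2, 3, 3]
```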
For every suffix str[i], also store the rank of the suffix that starts at str[i + 2]. If there is no suffix at i + 2, we store the next rank as -1.
Index   Suffix   Rank   Next Rank
5       a        0      -1
1       anana    1      1
3       ana      1      0
0       banana   2      3
2       nana     3      3
4       na       3      -1
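Looking up the rank of the suffix at i + 2 requires a mapping from original index to current rank. A minimal Python sketch of that lookup is shown below; the names `next_ranks`, `order`, `new_rank`, and `offset` are assumptions for this example (the listings later in the article use an `ind` array for the same purpose).

```python
def next_ranks(order, new_rank, offset):
    """For each suffix in sorted order, fetch the rank of the suffix `offset` later."""
    n = len(order)
    # rank_of[i] is the current rank of the suffix starting at index i.
    rank_of = [0] * n
    for idx, r in zip(order, new_rank):
        rank_of[idx] = r
    return [rank_of[idx + offset] if idx + offset < n else -1 for idx in order]


order = [5, 1, 3, 0, 2, 4]      # suffix indexes in sorted order
new_rank = [0, 1, 1, 2, 3, 3]   # rank of each suffix in that order
print(next_ranks(order, new_rank, 2))  # [-1, 1, 0, 3, 3, -1]
```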
Sort all suffixes according to rank and next rank.
Index   Suffix   Rank   Next Rank
5       a        0      -1
3       ana      1      0
1       anana    1      1
0       banana   2      3
4       na       3      -1
2       nana     3      3
C++
// C++ program for building suffix array of a given text
#include <algorithm>
#include <cstring>
#include <iostream>
#include <vector>
using namespace std;

// Structure to store information of a suffix
struct suffix
{
    int index;   // To store original index
    int rank[2]; // To store rank and next rank pair
};

// A comparison function used by sort() to compare two suffixes.
// Compares two rank pairs, returns true if the first pair is smaller
bool cmp(const suffix &a, const suffix &b)
{
    if (a.rank[0] != b.rank[0])
        return a.rank[0] < b.rank[0];
    return a.rank[1] < b.rank[1];
}

// This is the main function that takes a string 'txt' of size n as an
// argument, builds and returns the suffix array for the given string.
// The caller owns the returned array and must release it with delete[].
int *buildSuffixArray(char *txt, int n)
{
    // A structure to store suffixes and their indexes
    vector<suffix> suffixes(n);

    // Store suffixes and their indexes in an array of structures.
    // The structure is needed to sort the suffixes alphabetically
    // and maintain their old indexes while sorting
    for (int i = 0; i < n; i++)
    {
        suffixes[i].index = i;
        suffixes[i].rank[0] = txt[i] - 'a';
        suffixes[i].rank[1] = ((i + 1) < n) ? (txt[i + 1] - 'a') : -1;
    }

    // Sort the suffixes using the comparison function
    // defined above.
    sort(suffixes.begin(), suffixes.end(), cmp);

    // At this point, all suffixes are sorted according to first
    // 2 characters. Let us sort suffixes according to first 4
    // characters, then first 8 and so on

    // This array is needed to get the index in suffixes[]
    // from original index. This mapping is needed to get
    // next suffix.
    vector<int> ind(n);

    for (int k = 4; k < 2 * n; k = k * 2)
    {
        // Assigning rank and index values to first suffix
        int rank = 0;
        int prev_rank = suffixes[0].rank[0];
        suffixes[0].rank[0] = rank;
        ind[suffixes[0].index] = 0;

        // Assigning rank to suffixes
        for (int i = 1; i < n; i++)
        {
            // If first rank and next ranks are same as that of previous
            // suffix in array, assign the same new rank to this suffix
            if (suffixes[i].rank[0] == prev_rank &&
                suffixes[i].rank[1] == suffixes[i - 1].rank[1])
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = rank;
            }
            else // Otherwise increment rank and assign
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = ++rank;
            }
            ind[suffixes[i].index] = i;
        }

        // Assign next rank to every suffix
        for (int i = 0; i < n; i++)
        {
            int nextindex = suffixes[i].index + k / 2;
            suffixes[i].rank[1] = (nextindex < n) ?
                                  suffixes[ind[nextindex]].rank[0] : -1;
        }

        // Sort the suffixes according to first k characters
        sort(suffixes.begin(), suffixes.end(), cmp);
    }

    // Store indexes of all sorted suffixes in the suffix array
    int *suffixArr = new int[n];
    for (int i = 0; i < n; i++)
        suffixArr[i] = suffixes[i].index;

    // Return the suffix array
    return suffixArr;
}

// A utility function to print an array of given size
void printArr(int arr[], int n)
{
    for (int i = 0; i < n; i++)
        cout << arr[i] << " ";
    cout << endl;
}

// Driver program to test above functions
int main()
{
    char txt[] = "banana";
    int n = strlen(txt);
    int *suffixArr = buildSuffixArray(txt, n);
    cout << "Following is suffix array for " << txt << endl;
    printArr(suffixArr, n);
    delete[] suffixArr; // buildSuffixArray allocates the result with new[]
    return 0;
}
Java
// Java program for building suffix array of a given text
import java.util.*;

class GFG
{
    // Class to store information of a suffix
    public static class Suffix implements Comparable<Suffix>
    {
        int index; // Original index of the suffix
        int rank;  // Rank of the suffix
        int next;  // Rank of the next suffix in the pair

        public Suffix(int ind, int r, int nr)
        {
            index = ind;
            rank = r;
            next = nr;
        }

        // A comparison function used by sort()
        // to compare two suffixes.
        // Compares two rank pairs; returns a negative
        // value if this pair is smaller, zero if the
        // pairs are equal, and a positive value otherwise
        @Override
        public int compareTo(Suffix s)
        {
            if (rank != s.rank) return Integer.compare(rank, s.rank);
            return Integer.compare(next, s.next);
        }
    }

    // This is the main function that takes a string 's'
    // as an argument, builds and returns the
    // suffix array for the given string
    public static int[] suffixArray(String s)
    {
        int n = s.length();
        Suffix[] su = new Suffix[n];

        // Store suffixes and their indexes in
        // an array of classes. The class is needed
        // to sort the suffixes alphabetically and
        // maintain their old indexes while sorting
        for (int i = 0; i < n; i++)
        {
            su[i] = new Suffix(i, s.charAt(i) - '$', 0);
        }
        for (int i = 0; i < n; i++)
            su[i].next = (i + 1 < n ? su[i + 1].rank : -1);

        // Sort the suffixes using the comparison function
        // defined above.
        Arrays.sort(su);

        // At this point, all suffixes are sorted
        // according to first 2 characters.
        // Let us sort suffixes according to first 4
        // characters, then first 8 and so on

        // This array is needed to get the index in su[]
        // from original index. This mapping is needed to get
        // next suffix.
        int[] ind = new int[n];

        for (int length = 4; length < 2 * n; length <<= 1)
        {
            // Assigning rank and index values to first suffix
            int rank = 0, prev = su[0].rank;
            su[0].rank = rank;
            ind[su[0].index] = 0;
            for (int i = 1; i < n; i++)
            {
                // If first rank and next ranks are same as
                // that of previous suffix in array,
                // assign the same new rank to this suffix
                if (su[i].rank == prev &&
                    su[i].next == su[i - 1].next)
                {
                    prev = su[i].rank;
                    su[i].rank = rank;
                }
                else
                {
                    // Otherwise increment rank and assign
                    prev = su[i].rank;
                    su[i].rank = ++rank;
                }
                ind[su[i].index] = i;
            }

            // Assign next rank to every suffix
            for (int i = 0; i < n; i++)
            {
                int nextP = su[i].index + length / 2;
                su[i].next = nextP < n ?
                             su[ind[nextP]].rank : -1;
            }

            // Sort the suffixes according
            // to first 'length' characters
            Arrays.sort(su);
        }

        // Store indexes of all sorted
        // suffixes in the suffix array
        int[] suf = new int[n];
        for (int i = 0; i < n; i++)
            suf[i] = su[i].index;

        // Return the suffix array
        return suf;
    }

    // A utility function to print an array of given size
    static void printArr(int arr[], int n)
    {
        for (int i = 0; i < n; i++)
            System.out.print(arr[i] + " ");
        System.out.println();
    }

    // Driver Code
    public static void main(String[] args)
    {
        String txt = "banana";
        int n = txt.length();
        int[] suff_arr = suffixArray(txt);
        System.out.println("Following is suffix array for banana:");
        printArr(suff_arr, n);
    }
}
// This code is contributed by AmanKumarSingh
Python3
# Python3 program for building suffix
# array of a given text

# Class to store information of a suffix
class suffix:

    def __init__(self):
        self.index = 0        # original index of the suffix
        self.rank = [0, 0]    # rank and next rank pair

# This is the main function that takes a
# string 'txt' of size n as an argument,
# builds and returns the suffix array for
# the given string
def buildSuffixArray(txt, n):

    # A structure to store suffixes
    # and their indexes
    suffixes = [suffix() for _ in range(n)]

    # Store suffixes and their indexes in
    # an array of structures. The structure
    # is needed to sort the suffixes alphabetically
    # and maintain their old indexes while sorting
    for i in range(n):
        suffixes[i].index = i
        suffixes[i].rank[0] = (ord(txt[i]) -
                               ord("a"))
        suffixes[i].rank[1] = (ord(txt[i + 1]) -
                               ord("a")) if ((i + 1) < n) else -1

    # Sort the suffixes according to the rank
    # and next rank
    suffixes = sorted(
        suffixes, key = lambda x: (
            x.rank[0], x.rank[1]))

    # At this point, all suffixes are sorted
    # according to first 2 characters. Let
    # us sort suffixes according to first 4
    # characters, then first 8 and so on

    # This array is needed to get the
    # index in suffixes[] from original
    # index. This mapping is needed to get
    # next suffix.
    ind = [0] * n

    k = 4
    while (k < 2 * n):

        # Assigning rank and index
        # values to first suffix
        rank = 0
        prev_rank = suffixes[0].rank[0]
        suffixes[0].rank[0] = rank
        ind[suffixes[0].index] = 0

        # Assigning rank to suffixes
        for i in range(1, n):

            # If first rank and next ranks are
            # same as that of previous suffix in
            # array, assign the same new rank to
            # this suffix
            if (suffixes[i].rank[0] == prev_rank and
                    suffixes[i].rank[1] == suffixes[i - 1].rank[1]):
                prev_rank = suffixes[i].rank[0]
                suffixes[i].rank[0] = rank

            # Otherwise increment rank and assign
            else:
                prev_rank = suffixes[i].rank[0]
                rank += 1
                suffixes[i].rank[0] = rank
            ind[suffixes[i].index] = i

        # Assign next rank to every suffix
        for i in range(n):
            nextindex = suffixes[i].index + k // 2
            suffixes[i].rank[1] = suffixes[ind[nextindex]].rank[0] \
                if (nextindex < n) else -1

        # Sort the suffixes according to
        # first k characters
        suffixes = sorted(
            suffixes, key = lambda x: (
                x.rank[0], x.rank[1]))

        k *= 2

    # Store indexes of all sorted
    # suffixes in the suffix array
    suffixArr = [0] * n
    for i in range(n):
        suffixArr[i] = suffixes[i].index

    # Return the suffix array
    return suffixArr

# A utility function to print an array
# of given size
def printArr(arr, n):
    for i in range(n):
        print(arr[i], end = " ")
    print()

# Driver code
if __name__ == "__main__":
    txt = "banana"
    n = len(txt)
    suffixArr = buildSuffixArray(txt, n)
    print("Following is suffix array for", txt)
    printArr(suffixArr, n)
# This code is contributed by debrc
Output:
Following is suffix array for banana
5 3 1 0 4 2
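Note that the above algorithm uses a standard comparison-based sort function, and therefore its time complexity is O(nLognLogn). We can use Radix Sort here to reduce the time complexity to O(nLogn).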
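One possible way to apply radix sort is sketched below in Python: each doubling round sorts the (rank, next rank) pairs with two stable counting-sort passes, first by next rank and then by rank, so a round costs O(n) instead of O(nLogn). This is an illustrative sketch, not the article's reference implementation. The names `counting_sort` and `suffix_array_radix` are assumptions, and unlike the listings above it uses ranks starting at 1 with 0 (rather than -1) for "no next suffix", so that every key is a valid bucket index.

```python
def counting_sort(order, keys, max_key):
    """Stable counting sort of the indexes in `order` by keys[i] in 0..max_key."""
    count = [0] * (max_key + 2)
    for i in order:
        count[keys[i] + 1] += 1
    for k in range(1, len(count)):
        count[k] += count[k - 1]   # count[k] = first output slot for key k
    out = [0] * len(order)
    for i in order:
        out[count[keys[i]]] = i
        count[keys[i]] += 1
    return out


def suffix_array_radix(text):
    n = len(text)
    if n == 0:
        return []
    # Compress characters to ranks 1..sigma; 0 is reserved for "past the end".
    code = {c: r + 1 for r, c in enumerate(sorted(set(text)))}
    rank = [code[c] for c in text]
    order = list(range(n))
    h = 1  # ranks currently describe the first h characters of each suffix
    while True:
        second = [rank[i + h] if i + h < n else 0 for i in range(n)]
        max_key = max(rank)
        # LSD radix sort: least significant key first, then most significant.
        order = counting_sort(order, second, max_key)
        order = counting_sort(order, rank, max_key)
        # Re-rank: equal pairs share a rank, otherwise the rank grows by one.
        new_rank = [0] * n
        new_rank[order[0]] = 1
        for prev, cur in zip(order, order[1:]):
            same = rank[prev] == rank[cur] and second[prev] == second[cur]
            new_rank[cur] = new_rank[prev] + (0 if same else 1)
        rank = new_rank
        if rank[order[-1]] == n:   # all ranks distinct: fully sorted
            return order
        h *= 2


print(suffix_array_radix("banana"))  # [5, 3, 1, 0, 4, 2]
```

The loop stops as soon as every suffix has a distinct rank, which happens after at most about Logn rounds because suffixes of different lengths must differ within their first n characters.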
Please note that suffix arrays can also be constructed in O(n) time. We will soon be discussing O(n) algorithms.
References:
http://www.stanford.edu/class/cs97si/suffix-array.pdf
http://www.cbcb.umd.edu/confcour/Fall2012/lec14b.pdf