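Longest Common Extension / LCE | Set 3 (Segment Tree Method)
Prerequisites: LCE (Set 1), LCE (Set 2), Suffix Array (n Log Log n), Kasai's algorithm and Segment Tree
The Longest Common Extension (LCE) problem considers a string s and computes, for each pair (L, R), the longest substring of s that starts at both L and R. In other words, each LCE query asks for the length of the longest common prefix of the two suffixes that start at indices L and R.
Example: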
String: "abbababba"
Queries: LCE(1, 2), LCE(1, 6) and LCE(0, 5)
Find the length of the longest common prefix starting at each of the index pairs (1, 2), (1, 6) and (0, 5).
[Figure: for each query, the longest common prefix starting at indices L and R is highlighted in green.]
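As a baseline for the example, here is a minimal sketch of the direct character-by-character comparison (the naive approach of Set 1). The function name naiveLCE is a hypothetical choice for this illustration and does not come from the article's program.

```cpp
#include <string>

// Naive LCE: walk forward from positions L and R while the characters
// agree and neither suffix has run out.
// naiveLCE is a hypothetical helper name used only in this sketch.
int naiveLCE(const std::string &s, int L, int R)
{
    int n = static_cast<int>(s.size());
    int len = 0;
    while (L + len < n && R + len < n && s[L + len] == s[R + len])
        len++;
    return len;
}
```

Called on "abbababba", naiveLCE(s, 1, 2), naiveLCE(s, 1, 6) and naiveLCE(s, 0, 5) match the prefixes "b", "bba" and "abba" respectively.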
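In this set we discuss the Segment Tree method for solving the LCE problem.
In Set 2 we saw that the LCE problem can be converted into an RMQ (range minimum query) problem.
To handle RMQ efficiently, we build a segment tree on the lcp array and then use it to answer the LCE queries.
To find low and high (the smaller and the larger of the two suffixes' positions in the suffix array), we must first compute the suffix array and then compute the inverse suffix array from it.
We also need the lcp array, so we use Kasai's algorithm to derive it from the suffix array.
Once this is done, each query reduces to finding the minimum value in the lcp array between the indices low and high, as in Set 2.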
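Without proof, we use the direct result (which follows from a mathematical proof):
LCE(L, R) = RMQ_lcp(invSuff[R], invSuff[L] - 1)
This form assumes invSuff[R] < invSuff[L]. In general, with low = min(invSuff[L], invSuff[R]) and high = max(invSuff[L], invSuff[R]), the query is RMQ_lcp(low, high - 1).
The subscript lcp means that the RMQ is performed on the lcp array, so we build a segment tree on the lcp array.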
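The identity can be checked on a small string with brute force. The sketch below is only an illustration under stated assumptions: the helper names (naiveSuffixArray, commonPrefix, lceByRMQ) are hypothetical, the suffix array is built by plain sorting, and a linear scan stands in for the segment tree query.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Brute-force suffix array: sort the start positions by comparing suffixes.
std::vector<int> naiveSuffixArray(const std::string &s)
{
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Length of the common prefix of the suffixes starting at a and b.
int commonPrefix(const std::string &s, int a, int b)
{
    int n = static_cast<int>(s.size());
    int len = 0;
    while (a + len < n && b + len < n && s[a + len] == s[b + len])
        len++;
    return len;
}

// LCE through the RMQ identity; the minimum is found by a linear scan
// here, which is the part a segment tree would replace.
int lceByRMQ(const std::string &s, int L, int R)
{
    int n = static_cast<int>(s.size());
    if (L == R)
        return n - L;

    std::vector<int> sa = naiveSuffixArray(s);
    std::vector<int> invSuff(n), lcp(n, 0);
    for (int i = 0; i < n; i++)
        invSuff[sa[i]] = i;

    // lcp[i] = common prefix of the suffixes ranked i and i + 1
    for (int i = 0; i + 1 < n; i++)
        lcp[i] = commonPrefix(s, sa[i], sa[i + 1]);

    int low = std::min(invSuff[L], invSuff[R]);
    int high = std::max(invSuff[L], invSuff[R]);

    // Minimum over lcp[low .. high - 1]
    return *std::min_element(lcp.begin() + low, lcp.begin() + high);
}
```

For "abbababba" this gives the suffix array 8 3 5 0 7 2 4 6 1 and the lcp array 1 2 4 0 2 3 1 3 0, so LCE(1, 2) is the minimum of lcp[5..7], which is 1. Rebuilding the arrays on every call keeps the sketch short; a real implementation builds them once.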
// A C++ Program to find the length of longest common
// extension using Segment Tree
#include <algorithm>
#include <climits>
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

// Structure to represent a query of form (L,R)
struct Query
{
    int L, R;
};

// Structure to store information of a suffix
struct suffix
{
    int index;   // To store original index
    int rank[2]; // To store ranks and next rank pair
};

// A utility function to get minimum of two numbers
int minVal(int x, int y)
{
    return (x < y)? x: y;
}

// A utility function to get the middle index from
// corner indexes.
int getMid(int s, int e)
{
    return s + (e - s)/2;
}

/* A recursive function to get the minimum value
   in a given range of array indexes. The following
   are parameters for this function.

   st      --> Pointer to segment tree
   index   --> Index of current node in the segment
               tree. Initially 0 is passed as root
               is always at index 0
   ss & se --> Starting and ending indexes of the
               segment represented by current
               node, i.e., st[index]
   qs & qe --> Starting and ending indexes of query
               range */
int RMQUtil(int *st, int ss, int se, int qs, int qe,
            int index)
{
    // If segment of this node is a part of given range,
    // then return the min of the segment
    if (qs <= ss && qe >= se)
        return st[index];

    // If segment of this node is outside the given range
    if (se < qs || ss > qe)
        return INT_MAX;

    // If a part of this segment overlaps with the given
    // range
    int mid = getMid(ss, se);
    return minVal(RMQUtil(st, ss, mid, qs, qe, 2*index+1),
                  RMQUtil(st, mid+1, se, qs, qe, 2*index+2));
}

// Return minimum of elements in range from index qs
// (query start) to qe (query end). It mainly uses RMQUtil()
int RMQ(int *st, int n, int qs, int qe)
{
    // Check for erroneous input values
    if (qs < 0 || qe > n-1 || qs > qe)
    {
        printf("Invalid Input");
        return -1;
    }

    return RMQUtil(st, 0, n-1, qs, qe, 0);
}

// A recursive function that constructs Segment Tree
// for array[ss..se]. si is index of current node in
// segment tree st
int constructSTUtil(int arr[], int ss, int se, int *st,
                    int si)
{
    // If there is one element in array, store it in
    // current node of segment tree and return
    if (ss == se)
    {
        st[si] = arr[ss];
        return arr[ss];
    }

    // If there is more than one element, then recur
    // for left and right subtrees and store the minimum
    // of two values in this node
    int mid = getMid(ss, se);
    st[si] = minVal(constructSTUtil(arr, ss, mid, st, si*2+1),
                    constructSTUtil(arr, mid+1, se, st, si*2+2));
    return st[si];
}

/* Function to construct segment tree from given array.
   This function allocates memory for segment tree and
   calls constructSTUtil() to fill the allocated memory */
int *constructST(int arr[], int n)
{
    // Allocate memory for segment tree

    // Height of segment tree
    int x = (int)(ceil(log2(n)));

    // Maximum size of segment tree
    int max_size = 2*(int)pow(2, x) - 1;

    int *st = new int[max_size];

    // Fill the allocated memory st
    constructSTUtil(arr, 0, n-1, st, 0);

    // Return the constructed segment tree
    return st;
}

// A comparison function used by sort() to compare
// two suffixes. Compares two pairs, returns 1 if
// first pair is smaller
int cmp(struct suffix a, struct suffix b)
{
    return (a.rank[0] == b.rank[0])?
           (a.rank[1] < b.rank[1] ?1: 0):
           (a.rank[0] < b.rank[0] ?1: 0);
}

// This is the main function that takes a string
// 'txt' of size n as an argument, builds and returns
// the suffix array for the given string
vector<int> buildSuffixArray(string txt, int n)
{
    // A structure to store suffixes and their indexes
    // (a vector, since variable-length arrays are not
    // standard C++)
    vector<suffix> suffixes(n);

    // Store suffixes and their indexes in an array
    // of structures. The structure is needed to sort
    // the suffixes alphabetically and maintain their
    // old indexes while sorting
    for (int i = 0; i < n; i++)
    {
        suffixes[i].index = i;
        suffixes[i].rank[0] = txt[i] - 'a';
        suffixes[i].rank[1] = ((i+1) < n)?
                              (txt[i + 1] - 'a'): -1;
    }

    // Sort the suffixes using the comparison function
    // defined above.
    sort(suffixes.begin(), suffixes.end(), cmp);

    // At this point, all suffixes are sorted according to first
    // 2 characters. Let us sort suffixes according to first 4
    // characters, then first 8 and so on

    // This array is needed to get the index in suffixes[]
    // from original index. This mapping is needed to get
    // next suffix.
    vector<int> ind(n);

    for (int k = 4; k < 2*n; k = k*2)
    {
        // Assigning rank and index values to first suffix
        int rank = 0;
        int prev_rank = suffixes[0].rank[0];
        suffixes[0].rank[0] = rank;
        ind[suffixes[0].index] = 0;

        // Assigning rank to suffixes
        for (int i = 1; i < n; i++)
        {
            // If first rank and next ranks are same as
            // that of previous suffix in array, assign
            // the same new rank to this suffix
            if (suffixes[i].rank[0] == prev_rank &&
                suffixes[i].rank[1] == suffixes[i-1].rank[1])
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = rank;
            }
            else // Otherwise increment rank and assign
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = ++rank;
            }
            ind[suffixes[i].index] = i;
        }

        // Assign next rank to every suffix
        for (int i = 0; i < n; i++)
        {
            int nextindex = suffixes[i].index + k/2;
            suffixes[i].rank[1] = (nextindex < n)?
                                  suffixes[ind[nextindex]].rank[0]: -1;
        }

        // Sort the suffixes according to first k characters
        sort(suffixes.begin(), suffixes.end(), cmp);
    }

    // Store indexes of all sorted suffixes in the suffix array
    vector<int> suffixArr;
    for (int i = 0; i < n; i++)
        suffixArr.push_back(suffixes[i].index);

    // Return the suffix array
    return suffixArr;
}

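/* To construct and return LCP */
vector<int> kasai(string txt, vector<int> suffixArr,
                  vector<int> &invSuff)
{
    int n = suffixArr.size();

    // To store LCP array
    vector<int> lcp(n, 0);

    // Fill values in invSuff[]
    for (int i=0; i < n; i++)
        invSuff[suffixArr[i]] = i;

    // Initialize length of previous LCP
    int k = 0;

    // Process all suffixes one by one starting from
    // first suffix in txt[]
    for (int i=0; i<n; i++)
    {
        // The suffix ranked last has no next suffix in the
        // suffix array, so its lcp value stays 0
        if (invSuff[i] == n-1)
        {
            k = 0;
            continue;
        }

        // j is the start of the next suffix in suffix array order
        int j = suffixArr[invSuff[i]+1];

        // Start matching from the k'th character, as at least
        // k characters are already known to match
        while (i+k<n && j+k<n && txt[i+k]==txt[j+k])
            k++;

        // lcp for the present suffix
        lcp[invSuff[i]] = k;

        // Dropping the first character loses at most one match
        if (k>0)
            k--;
    }

    // return the constructed lcp array
    return lcp;
}
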
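// A utility function to find longest common extension
// from index - L and index - R
// (the vectors are taken by reference so that a query does
// not copy them)
int LCE(int *st, const vector<int> &lcp, const vector<int> &invSuff,
        int n, int L, int R)
{
    // Handle the corner case
    if (L == R)
        return (n-L);

    // Order the two ranks so that the range is valid
    // whichever of the two suffixes sorts first
    int low = minVal(invSuff[L], invSuff[R]);
    int high = (invSuff[L] > invSuff[R])? invSuff[L]: invSuff[R];

    // Use the formula -
    // LCE (L, R) = RMQ lcp (low, high-1)
    return (RMQ(st, n, low, high-1));
}
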
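// A function to answer queries of longest common extension
void LCEQueries(string str, int n, Query q[],
                int m)
{
    // Build a suffix array
    vector<int> suffixArr = buildSuffixArray(str, str.length());

    // An auxiliary array to store inverse of suffix array
    // elements. For example if suffixArr[0] is 5, the
    // invSuff[5] would store 0. This is used to get next
    // suffix string from suffix array.
    vector<int> invSuff(n, 0);

    // Build a lcp vector
    vector<int> lcp = kasai(str, suffixArr, invSuff);

    // Build segment tree from the lcp array; lcp.data()
    // exposes the vector's storage as a plain int array
    int *st = constructST(lcp.data(), n);

    // Answer the queries one by one
    for (int i=0; i<m; i++)
    {
        int L = q[i].L;
        int R = q[i].R;

        printf("LCE (%d, %d) = %d\n", L, R,
               LCE(st, lcp, invSuff, n, L, R));
    }

    // Release the segment tree allocated by constructST()
    delete[] st;
}

// Driver program to test above functions
int main()
{
    string str = "abbababba";
    int n = str.length();

    Query q[] = {{1, 2}, {1, 6}, {0, 5}};
    int m = sizeof(q)/sizeof(q[0]);

    LCEQueries(str, n, q, m);

    return 0;
}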
Output:
LCE (1, 2) = 1
LCE (1, 6) = 3
LCE (0, 5) = 4
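Time complexity: Building the suffix array and the lcp array takes O(N.logN) time, assuming each doubling round is sorted with a radix sort; the std::sort-based construction shown above costs O(N.logN.logN) as written. Answering each query takes O(logN). The overall time complexity is therefore O(N.logN + Q.logN), although other algorithms can construct the lcp array and the suffix array in O(N) time.
where,
Q = number of LCE queries.
N = length of the input string.
Auxiliary space:
We use O(N) auxiliary space to store the lcp, suffix and inverse suffix arrays and the segment tree.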
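Performance comparison: We have seen three algorithms to compute the length of the LCE.
Set 1: Naive Method [O(N.Q)]
Set 2: RMQ-Direct Minimum Method [O(N.logN + Q.(|invSuff[R] - invSuff[L]|))]
Set 3: Segment Tree Method [O(N.logN + Q.logN)]
invSuff[] = inverse suffix array of the input string.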
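Judging by asymptotic time complexity alone, the Segment Tree method looks the most efficient and the other two look very inefficient.
In practice, however, this is not the case. If we plot time against log(|invSuff[R] - invSuff[L]|) for typical files of random strings over various runs, the result is as follows.
[Figure: running time plotted against log(|invSuff[R] - invSuff[L]|)]
The graph is taken from the reference listed at the end. The tests were run on 25 files of random strings ranging from 0.7 MB to 2 GB. The exact sizes of the strings are not known, but a 2 GB file clearly holds a great many characters, because 1 character = 1 byte. About 1000 characters therefore make 1 KB. If a page holds 2000 characters (a reasonable average for a double-spaced page), it takes up 2 KB, so about 500 pages of text make 1 MB. Hence 2 GB = 2000 MB = 2000 * 500 = 1,000,000 pages of text!
From the graph it is clear that the naive method (discussed in Set 1) performs best, better than the Segment Tree method.
This is surprising, because the asymptotic time complexity of the Segment Tree method is much lower than that of the naive method.
In fact, on typical files of random strings the naive method is generally 5-6 times faster than the Segment Tree method. The naive method is also an in-place algorithm, which makes it the most desirable algorithm for computing LCE.
The bottom line is that the naive method is the best choice for answering LCE queries when it comes to average-case performance.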
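One way to see why the naive scan does well on random input: in a uniformly random string over 26 letters, each further character of a query matches with probability 1/26, so the scan is expected to stop after about 1/25 of a character. The sketch below measures this on a generated string. It is an illustration under those assumptions, not the experiment from the reference; the function name averageRandomLCE, its parameters and the seed are hypothetical.

```cpp
#include <random>
#include <string>

// Average number of matching characters seen by the naive LCE scan over
// `queries` random (L, R) pairs on a uniformly random lowercase string
// of length n. All names and parameters here are hypothetical.
double averageRandomLCE(int n, int queries, unsigned seed)
{
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> letter(0, 25), pos(0, n - 1);

    // Build the random test string
    std::string s(n, 'a');
    for (char &c : s)
        c = static_cast<char>('a' + letter(gen));

    long long total = 0;
    for (int q = 0; q < queries; q++)
    {
        int L = pos(gen), R = pos(gen);
        if (L == R)
            continue; // skip the trivial case LCE(L, L) = n - L

        // Naive scan, as in Set 1
        int len = 0;
        while (L + len < n && R + len < n && s[L + len] == s[R + len])
            len++;
        total += len;
    }
    return static_cast<double>(total) / queries;
}
```

With such short scans, a query costs a handful of character comparisons, whereas a segment tree query visits O(logN) nodes spread across memory. Real text is not uniformly random, so the average on actual files can differ.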
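It rarely happens in computer science that an algorithm which looks faster is beaten in practical tests by a less efficient one.
The lesson is that although asymptotic analysis is one of the most effective ways to compare two algorithms on paper, in practical use things can sometimes turn out the other way round.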
References:
http://www.sciencedirect.com/science/article/pii/S1570866710000377