📅  Last modified: 2023-12-03 15:26:29.078000             🧑  Author: Mango
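In string-processing algorithms, two closely related tasks come up often: finding the longest repeated substring and finding the longest non-overlapping repeated substring. The longest repeated substring is the longest substring that occurs at least twice in a string, where the occurrences are allowed to overlap. The longest non-overlapping repeated substring is the longest substring that occurs at least twice at positions that do not overlap.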
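To make the difference between the two definitions concrete, here is a small brute-force sketch. The helper name `longest_repeat` and its `allow_overlap` flag are illustrative assumptions, not part of the algorithms discussed in the rest of this article.

```python
def longest_repeat(s, allow_overlap):
    """Brute-force longest substring that occurs at least twice in s.

    With allow_overlap=False the two occurrences must not share any position.
    """
    n = len(s)
    for length in range(n - 1, 0, -1):  # longest candidates first
        for i in range(n - length + 1):
            sub = s[i:i + length]
            # The second occurrence may start at i + 1 if overlap is allowed,
            # otherwise it must start at i + length or later.
            second_from = i + 1 if allow_overlap else i + length
            if s.find(sub, second_from) != -1:
                return sub
    return ""


print(longest_repeat("banana", allow_overlap=True))   # ana
print(longest_repeat("banana", allow_overlap=False))  # an
```

In `"banana"` the two copies of `"ana"` share the middle `a`, so the non-overlapping answer drops to length 2.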
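The most naive idea is to compare every pair of starting positions and measure how far the two substrings agree. The time complexity is $O(n^3)$, where $n$ is the length of the string: the first loop enumerates the start of the first occurrence, the second loop enumerates the start of the second occurrence, and the third loop compares the two substrings character by character.
Here is a Python implementation: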
def longest_repeated_substring(s):
    n = len(s)
    max_len = 0
    max_sub = ""
    for i in range(n):
        for j in range(i + 1, n):
            # Length of the common prefix of s[i:] and s[j:].
            # The bound j + k < n keeps the comparison inside the string.
            k = 0
            while j + k < n and s[i + k] == s[j + k]:
                k += 1
            if k > max_len:
                max_len = k
                max_sub = s[i:i + k]
    return max_sub
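For a string $s$, once its suffix array has been built, the longest repeated substring is the longest common prefix of some pair of adjacent suffixes in the suffix array, because suffixes that share a long prefix end up next to each other in sorted order.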
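The suffix-array idea can be sketched as follows. This is a minimal version that builds the suffix array by sorting suffix slices directly, which is easy to read but far from the fastest construction; the function name `longest_repeated_substring_sa` is a hypothetical name chosen for this example.

```python
def longest_repeated_substring_sa(s):
    """Longest repeated substring via a naively built suffix array."""
    n = len(s)
    # Suffix array: start positions of all suffixes in lexicographic order.
    # Sorting the slices is simple but costs O(n^2 log n) in the worst case.
    sa = sorted(range(n), key=lambda i: s[i:])

    best_start, best_len = 0, 0
    for a, b in zip(sa, sa[1:]):
        # Longest common prefix of two suffixes adjacent in sorted order.
        k = 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        if k > best_len:
            best_start, best_len = a, k
    return s[best_start:best_start + best_len]


print(longest_repeated_substring_sa("banana"))  # ana
```

For `"banana"` the sorted suffixes are `a`, `ana`, `anana`, `banana`, `na`, `nana`, and the adjacent pair `ana` / `anana` shares the longest prefix.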
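The KMP algorithm is a fast search algorithm that runs in $O(n)$ time. Its core is a prefix table (also called the failure function), which tells the matcher where to fall back to on a mismatch: the end of the longest prefix that is still a valid partial match, so characters that have already been matched are never scanned again. Applied to $s$ itself, the table entry for position $i$ describes the longest prefix of $s$ that also ends at $i$, so a single pass only finds repeats that start at position 0. To cover repeats that start anywhere, the implementation below builds the table for every suffix of $s$ and keeps the largest entry, which takes $O(n^2)$ time overall.
def build_prefix_table(s):
    # t[i] is the last index of the longest proper prefix of s[:i+1]
    # that is also a suffix of it, or -1 if there is none.
    n = len(s)
    t = [-1] * n
    j = -1
    for i in range(1, n):
        while j >= 0 and s[i] != s[j + 1]:
            j = t[j]
        if s[i] == s[j + 1]:
            j += 1
        t[i] = j
    return t
def longest_repeated_substring(s):
    n = len(s)
    max_len = 0
    max_sub = ""
    for start in range(n):
        # Treat s[start:] as the string whose prefix is being repeated.
        suffix = s[start:]
        t = build_prefix_table(suffix)
        for j in t:
            # j + 1 is the length of a prefix of suffix that occurs again later.
            if j + 1 > max_len:
                max_len = j + 1
                max_sub = suffix[:j + 1]
    return max_sub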
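To see how the fallback rule described for KMP is used during an actual search, here is a minimal sketch of pattern matching. The names `prefix_lengths` and `kmp_find_all` are hypothetical, and this version stores border lengths rather than the index of the last border character, so its values are one larger than those in the table above.

```python
def prefix_lengths(p):
    """lps[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    lps = [0] * len(p)
    j = 0
    for i in range(1, len(p)):
        while j > 0 and p[i] != p[j]:
            j = lps[j - 1]  # fall back to the next shorter border
        if p[i] == p[j]:
            j += 1
        lps[i] = j
    return lps


def kmp_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    lps = prefix_lengths(pattern)
    hits = []
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]  # mismatch: reuse the longest border, text is not rescanned
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            hits.append(i - j + 1)
            j = lps[j - 1]  # continue so overlapping matches are found too
    return hits


print(kmp_find_all("banana", "ana"))  # [1, 3]
```

The index into `text` only moves forward, which is where the linear running time comes from.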
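For the non-overlapping variant, the two occurrences must not share any position. Enumerate the candidate substrings from longest to shortest and return the first one that occurs again at or after the position where its first occurrence ends. With a linear-time search for each of the $O(n^2)$ candidates, the time complexity is $O(n^3)$.
Here is a Python implementation: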
def longest_non_overlap_substring(s):
    n = len(s)
    # Two non-overlapping copies need 2 * length <= n, so start from n // 2.
    for length in range(n // 2, 0, -1):
        for i in range(n - 2 * length + 1):
            sub = s[i:i + length]
            # Look for a second occurrence that starts after the first one ends.
            if s.find(sub, i + length) != -1:
                return sub
    return ""
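The suffix tree is an important tool in string processing. It can answer whether any given string occurs as a substring in $O(m)$ time, where $m$ is the length of the query. A suffix tree can be used to find not only the longest repeated substring but also the longest non-overlapping repeated substring. Given a tree built in linear time (for example with Ukkonen's algorithm), both take $O(n)$ time; the construction below inserts the suffixes one at a time and takes $O(n^2)$ time in the worst case.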
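class SuffixTreeNode:
    def __init__(self):
        self.children = {}    # first character of the edge label -> child node
        self.start_idx = -1   # the edge into this node is labeled s[start_idx:end_idx + 1]
        self.end_idx = -1
        self.index = -1       # start of the suffix for a leaf, -1 for an internal node
def build_suffix_tree(s):
    # Naive construction: insert the suffixes one at a time.
    # s should end with a character that occurs nowhere else, so every suffix gets a leaf.
    n = len(s)
    root = SuffixTreeNode()
    for i in range(n):
        node = root
        j = i
        while j < n:
            if s[j] not in node.children:
                # No edge starts with s[j]: add a leaf for the rest of the suffix.
                leaf = SuffixTreeNode()
                leaf.start_idx = j
                leaf.end_idx = n - 1
                leaf.index = i
                node.children[s[j]] = leaf
                break
            child = node.children[s[j]]
            k = child.start_idx
            while k <= child.end_idx and j < n and s[k] == s[j]:
                k += 1
                j += 1
            if k > child.end_idx:
                # The whole edge matched; keep walking from the child.
                node = child
                continue
            if j == n:
                # The suffix ran out inside an edge (only possible without a terminator).
                break
            # Mismatch inside the edge: split it at k and branch off a new leaf.
            mid = SuffixTreeNode()
            mid.start_idx = child.start_idx
            mid.end_idx = k - 1
            node.children[s[mid.start_idx]] = mid
            child.start_idx = k
            mid.children[s[k]] = child
            leaf = SuffixTreeNode()
            leaf.start_idx = j
            leaf.end_idx = n - 1
            leaf.index = i
            mid.children[s[j]] = leaf
            break
    return root
def find_longest_non_overlap_substring(node, depth=0):
    # depth is the length of the path label from the root to node.
    # Returns (smallest suffix index, largest suffix index, best length, best start)
    # for the subtree rooted at node.
    if node.index != -1:
        return node.index, node.index, 0, 0
    lo, hi, best_len, best_start = float("inf"), -1, 0, 0
    for c in node.children.values():
        c_depth = depth + (c.end_idx - c.start_idx + 1)
        c_lo, c_hi, c_len, c_start = find_longest_non_overlap_substring(c, c_depth)
        lo = min(lo, c_lo)
        hi = max(hi, c_hi)
        if c_len > best_len:
            best_len, best_start = c_len, c_start
    # The path label of this node occurs at lo and at hi;
    # cap its length at hi - lo so the two copies do not overlap.
    length = min(depth, hi - lo)
    if length > best_len:
        best_len, best_start = length, lo
    return lo, hi, best_len, best_start
def longest_non_overlap_substring(s):
    # "\0" is assumed not to occur in s; it serves as the unique terminator.
    root = build_suffix_tree(s + "\0")
    _, _, best_len, best_start = find_longest_non_overlap_substring(root)
    return s[best_start:best_start + best_len]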
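The $O(m)$ substring query mentioned for suffix trees can be illustrated with a much simpler structure. The sketch below uses an uncompressed suffix trie made of nested dictionaries, which is an assumption made for brevity: it needs $O(n^2)$ time and space to build, unlike a real suffix tree, but the query walks exactly one step per pattern character in the same way. The names `build_suffix_trie` and `contains_substring` are hypothetical.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a trie of nested dicts (O(n^2) time and space)."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def contains_substring(trie, pattern):
    """Follow one edge per pattern character, so the query costs O(m)."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("banana")
print(contains_substring(trie, "nan"))  # True
print(contains_substring(trie, "nab"))  # False
```

Every substring of a string is a prefix of one of its suffixes, which is why walking down from the root is enough to answer the query.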