Shortest Superstring Problem | Set 2 (Using Set Cover)

Last modified: 2021-04-29 12:33:46             Author: Mango

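Given a set S of n strings, find the shortest string that contains every string in S as a substring. We may assume that no string in S is a substring of another string in S.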

Examples:

Input:  S = {"001", "01101", "010"}
Output: 0011010  

Input:  S = {"geeks", "quiz", "for"}
Output: geeksquizfor

Input:  S = {"catg", "ctaagt", "gcta", "ttca", "atgcatc"}
Output: gctaagttcatgcatc
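
The outputs above can be checked mechanically. The following is a minimal Python sketch; `is_superstring` is a helper name introduced here for illustration and is not part of the original post. It only verifies that a candidate contains every input string, it does not prove that the candidate is the shortest possible.

```python
def is_superstring(candidate, strings):
    """Return True if every string in `strings` occurs inside `candidate`."""
    return all(s in candidate for s in strings)


# (input set, claimed output) pairs taken from the examples above.
examples = [
    (["001", "01101", "010"], "0011010"),
    (["geeks", "quiz", "for"], "geeksquizfor"),
    (["catg", "ctaagt", "gcta", "ttca", "atgcatc"], "gctaagttcatgcatc"),
]

for strings, candidate in examples:
    print(candidate, len(candidate), is_superstring(candidate, strings))
```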

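In the previous post, we discussed a solution that is proven to be a 4-approximation (and conjectured to be a 2-approximation).
This post discusses a solution that can be proven to be a 2H_n-approximation, where H_n = 1 + 1/2 + 1/3 + ... + 1/n. The idea is to transform the Shortest Superstring problem into the Set Cover problem. (In Set Cover, we are given some subsets of a universe, each with an associated cost, and the task is to find a minimum-cost collection of the given subsets that covers the universe.) To build a Set Cover instance, we therefore need a universe, a collection of its subsets, and a cost for each subset.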
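
To get a feel for the 2H_n guarantee, the factor can be evaluated for a few values of n. This is a small sketch that assumes only the definition of H_n given above; the function name `harmonic` is chosen here for illustration.

```python
from fractions import Fraction


def harmonic(n):
    """Exact n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, i) for i in range(1, n + 1))


for n in (1, 2, 3, 4, 10, 100):
    factor = 2 * harmonic(n)
    print(f"n = {n:3d}   2*H_n = {float(factor):.3f}")
```

H_n grows roughly like ln n, so the factor increases slowly with the number of strings. Note that 2H_3 = 11/3 is below 4 while 2H_4 = 25/6 is already above it, so for four or more strings this bound is numerically weaker than a factor of 4.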

Following are the steps to transform Shortest Superstring into Set Cover.

1) Let S be the set of given strings.
   S = {s1, s2, ... sn}

2) Universe for Set Cover problem is S (We need
   to find a superstring that has every string
   as substring)

3) Let us initialize subsets to be considered for universe as
     Subsets =  {{s1}, {s2}, ... {sn}}
   Cost of every subset is length of string in it.

4) For all ordered pairs of strings si and sj in S,
     If a suffix of si matches a prefix of sj
      a) Construct the string rijk by merging si and sj,
         where k is the maximum overlap between the two.
      b) Add the set represented by rijk to Subsets,
           i.e., Subsets = Subsets U Set(rijk)
         The set represented by rijk is the set
         of all strings of S which are substrings of it.
         Cost of the subset is length of rijk.

5) Now the problem is transformed to Set Cover. We can
   run the Greedy Set Cover approximation algorithm to find
   a set cover of S using Subsets. Concatenating the strings
   that represent the chosen subsets gives a superstring of S.
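
The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the original author's code: the function names are invented here, the greedy rule is the standard "lowest cost per newly covered element" rule, and the final superstring is formed by concatenating the representative strings of the chosen subsets.

```python
def max_overlap(a, b):
    """Length of the longest proper suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_subsets(strings):
    """Return a list of (representative string, covered strings, cost)."""
    def covered(r):
        return frozenset(s for s in strings if s in r)

    # Singleton subsets: one per input string, cost = its length.
    subsets = [(s, frozenset([s]), len(s)) for s in strings]
    # One more subset for every ordered pair that overlaps.
    for a in strings:
        for b in strings:
            if a != b:
                k = max_overlap(a, b)
                if k > 0:
                    r = a + b[k:]  # a followed by b, sharing k characters
                    subsets.append((r, covered(r), len(r)))
    return subsets


def greedy_superstring(strings):
    """Greedy Set Cover over the subsets, then concatenate the picks."""
    subsets = build_subsets(strings)
    uncovered = set(strings)
    chosen = []
    while uncovered:
        # Only subsets that cover something new are candidates.
        useful = [t for t in subsets if t[1] & uncovered]
        # Lowest cost per newly covered string wins.
        best = min(useful, key=lambda t: t[2] / len(t[1] & uncovered))
        chosen.append(best[0])
        uncovered -= best[1]
    return "".join(chosen)


print(greedy_superstring(["001", "01101", "010"]))  # prints 001001101
```

For this input the sketch returns a superstring of length 9, while the shortest superstring 0011010 has length 7. That gap is expected: the method is an approximation and does not guarantee an optimal answer.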

Example:

S = {s1, s2, s3}.
s1 = "001"
s2 = "01101"
s3 = "010"

[Combination of s1 and s2 with 2 overlapping characters]
r122 = 001101 

[Combination of s1 and s3 with 2 overlapping characters]
r132 = 0010 

Similarly,
r232 = 011010
r311 = 01001
r321 = 0101101
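
The merged strings above can be reproduced with a short overlap routine. This is an illustrative sketch; `max_overlap` and `merge` are helper names assumed here. A pair with no overlap, such as s2 followed by s1, produces no string.

```python
def max_overlap(a, b):
    """Length of the longest proper suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b):
    """Return (k, r): r is `a` followed by `b` sharing k characters, or None."""
    k = max_overlap(a, b)
    return (k, a + b[k:]) if k > 0 else None


s = {1: "001", 2: "01101", 3: "010"}
for i in s:
    for j in s:
        if i != j:
            merged = merge(s[i], s[j])
            if merged:
                k, r = merged
                print(f"r{i}{j}{k} = {r}")
```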

Now set cover problem becomes as following:

Universe to cover is {s1, s2, s3}

Subsets of the universe and their costs :

{s1}, cost 3 (length of s1)
{s2}, cost 5 (length of s2)
{s3}, cost 3 (length of s3)

set(r122), cost 6 (length of r122)
The set r122 represents all strings which are
substrings of r122. 
Therefore set(r122) = {s1, s2}

set(r132), cost 4 (length of r132)
The set r132 represents all strings which are
substrings of r132.
Therefore set(r132) = {s1, s3}

Similarly there are more subsets for set(r232), 
set(r311), and set(r321).

So we have a set cover problem with universe and subsets
of universe with costs associated with every subset.
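
This instance can also be written out and solved directly. The sketch below hard-codes the representative strings from this example, derives each subset and its cost from them, and applies the usual greedy rule (lowest cost per newly covered element). The name `greedy_cover` is introduced here for illustration.

```python
strings = ["001", "01101", "010"]

# Representative strings: the inputs themselves plus the merged strings rijk.
representatives = ["001", "01101", "010",
                   "001101", "0010", "011010", "01001", "0101101"]

# Each representative covers the inputs it contains; its cost is its length.
instance = [(r, {s for s in strings if s in r}, len(r)) for r in representatives]


def greedy_cover(universe, instance):
    """Repeatedly pick the subset with the lowest cost per new element."""
    uncovered = set(universe)
    picked = []
    while uncovered:
        useful = [t for t in instance if t[1] & uncovered]
        best = min(useful, key=lambda t: t[2] / len(t[1] & uncovered))
        picked.append(best[0])
        uncovered -= best[1]
    return picked


for r, members, cost in instance:
    print(f"{r:8s} covers {sorted(members)} cost {cost}")

picked = greedy_cover(strings, instance)
print(picked, "".join(picked))
```

Greedy first takes set(r132), which covers two strings for cost 4, and then {s2} for cost 5. Concatenating the two representatives gives 001001101, a valid superstring of length 9.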

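We have seen that an instance of the Shortest Superstring problem can be transformed into an instance of the Set Cover problem in polynomial time.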

See the references below for a proof that the Set Cover based algorithm is a 2H_n-approximation.

References:
http://www.cs.dartmouth.edu/~ac/Teach/CS105-Winter05/Notes/wan-ba-notes.pdf
http://fileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/superstring.pdf
http://math.mit.edu/~goemans/18434S06/superstring-lele.pdf