在 Java 中找到给定有序子序列的超序列

Question

我在编写的一个无关程序中遇到了这个问题，我花了好几个小时试图解决它，因为我认为它会很有趣。是的，但我无法一路做到。我的代码只解决了一些子集的序列。这个问题也感觉像是一个一般的数学问题，几十年来可能已经以多种方式解决了，但我缺乏数学技能和术语来找到解决方案或在线找到关于这个特定问题的任何信息。

我有一组子序列，我知道它们是更大的未知（超级？）序列的一部分。我不认为这些子序列是数学意义上的 sets 因为它们是有序的但是它们的相似之处在于它们不包含重复元素。 master/super/whateversequence 也是如此。（为清楚起见，我将其称为超序列。）

子序列都包含相同类型的数据，但数据未按字母顺序、升序或类似顺序排序。从某种意义上说，数据是任意顺序的：在超序列中。这就是我感兴趣的。我想找到这些子序列的未知超序列。

为了简单起见，我尝试使用字母表中的字母来解决这个问题，但我可以稍后重构代码以满足我的需要。显然，因为我仍在尝试解决这个问题，所以我首先想出了一个合适的词来表示不包含重复元素的超序列：FLOWCHARTS.

然后我想出了以下六个子序列：

F,W,C,R
L,H,A
L,O,H,A,R,S
C,S
R,T,S
F,O,W,H,A,S

这是我的顺序排序方法：

// LinkedHashMappedKeyValueList keeps the data in the order it was inserted and allows one key to have multiple values.
private static LinkedHashSet<Character> orderSequence(final Set<Character> unorderedSequence, final LinkedHashMappedKeyValueList ruleMap)
{
    List<Character> orderedSequence = new ArrayList<Character>(unorderedSequence);

    // Order the sequence according to the rules.
    System.out.println("---- ORDERING SEQUENCE ----");

    for (Map.Entry<Character, LinkedHashSet<Character>> rule : ruleMap.entrySet())
    {
        char currentChar = rule.getKey();
        LinkedHashSet<Character> ruleChars = rule.getValue();

        System.out.println("Processing rule " + currentChar + "<" + ruleChars.toString());

        if (orderedSequence.contains(currentChar))
        {
            int ruleCharIndex = -1;
            int smallestRuleCharIndex = Integer.MAX_VALUE;
            Iterator<Character> it = ruleChars.iterator();

            // Find the rule character with the smallest index.
            while (it.hasNext())
            {
                char ruleChar = it.next();
                ruleCharIndex = orderedSequence.indexOf(ruleChar);
                System.out.println("\tChecking for rule character: " + ruleChar + " (" + ruleCharIndex + ")");

                if (ruleCharIndex > -1 && smallestRuleCharIndex > ruleCharIndex)
                    smallestRuleCharIndex = ruleCharIndex;
            }

            if (smallestRuleCharIndex != Integer.MAX_VALUE)
                System.out.println("\tMoving '" + currentChar + "' before '"
                        + orderedSequence.get(smallestRuleCharIndex) + "'.");
            else
                System.out.println("\tMoving '" + currentChar + "' to the end of the sequence.");

            if (!moveBefore(orderedSequence.indexOf(currentChar), smallestRuleCharIndex, orderedSequence))
                System.out.println("\tAlready in correct position.");
            else
                System.out.println("\tCurrent sequence: " + listToString(orderedSequence));
        }
        else
            throw new ArithmeticException("Element of a subsequence not a part of the sequence.");
    }

    return new LinkedHashSet<Character>(orderedSequence);
}

最后，我的代码找到了这些子序列的超序列 F,L,O,W,H,C,A,R,T,S，它非常接近但并不完美。我还需要多次运行我的订购方法，所以我想出的 "algorithm" 也不完美。 "rule map" 是一个散列图，其中键是另一个字符对象的散列图，它位于子序列（因此在超序列中）中的键字符之后。

有没有我可以使用的某种 Java 库来进行这种序列查找？有人可以告诉我这叫做什么吗 and/or 帮助我找到适合这项工作的正确算法？

此外，我的程序的控制台输出缩短了：

---- BUILDING RULE MAP ----
Subsequences:   F,W,C,R
        L,H,A
        L,O,H,A,R,S
        C,S
        R,T,S
        F,O,W,H,A,S

All subsequences processed. Number of ordering rules: 10
Rule map: (F<[W, O]),(W<[C, H]),(C<[R, S]),(R<[, S, T]),(L<[H, O]),(H<[A]),(A<[, R, S]),(O<[H, W]),(S<[]),(T<[S])

---- BUILDING UNORDERED SEQUENCE ----
Sequence size is 10.
Unordered sequence: F,W,C,R,L,H,A,O,S,T

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'W'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Already in correct position.
Processing rule R<[, S, T]
    Moving 'R' before 'S'.
    Current sequence: F,W,C,L,H,A,O,R,S,T
Processing rule L<[H, O]
    Moving 'L' before 'H'.
    Already in correct position.
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,W,C,L,H,O,A,R,S,T
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Current sequence: F,O,W,C,L,H,A,R,S,T
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Current sequence: F,O,W,C,L,H,A,R,T,S
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,W,C,R,L,H,A,O,S,T
Ordered sequence:   F,O,W,C,L,H,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: F,O,W,L,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,O,W,C,L,H,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Current sequence: L,F,O,W,H,C,A,R,T,S
Processing rule W<[C, H]
    Moving 'W' before 'H'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: L,F,O,W,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,L,O,W,H,C,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    true
Sequence ordered according to the limits of the rule map.
Sequence found after 2 tries.

Expected sequence:  F,L,O,W,C,H,A,R,T,S FLOWCHARTS
Found sequence:     F,L,O,W,H,C,A,R,T,S FLOWHCARTS
Sequences match:    false

Answer 1

您要的是根据部分订单计算总订单。我找不到太多这方面的工作。不过我们可以在这里稍微讨论一下这个问题。

考虑 A<B<C<D。如果我们有序列 A<C、B<D 和 C<D，我们将永远无法计算总阶数。我们只得到 A<C<D 和 B<D.

我认为可以证明，我们需要 X<Y 与 X 和 Y 形式的所有 N-1 关系在最终链中连续出现为了重建总顺序（可能还有其他的，但那些是额外的信息）。作为 非严格 演示，假设我们有 A1<A2<A3<...<AN 并且假设我们能够从偏序 A_begin 重构到 A_end。现在为了将其放入总顺序中的正确位置，我们需要知道 A_(begin-1)<A_begin。没有其他关系可以让它适应总秩序。继续向下进入 A_begin..A_end 我们应该能够通过某种归纳/无限下降表明我们需要单词的连续字符给出的所有关系才能重建它。

上面这组序列中缺失的信息是F<L和C<H。可以得到的序列有W->C->R->T->S、F->O->W->H->A->R->T->S和L->O->W->H->A->R->T->S。计算余数需要更多信息。

在上面的例子中，我们经过分解和去重后有如下关系：

A,R
A,S <-- redundant since R<S and A<R
C,R
C,S <-- redundant since R<S and A<R
F,O 
F,W <-- redundant since O<W and F<O
H,A
L,H <-- redundant since O<H and L<O
L,O
O,H <-- redundant since O<W and W<H
O,W
R,S <-- redundant since T<S and R<T
R,T
T,S
W,C
W,H

有 16 个关系，其中 6 个直接是多余的。删除冗余我们得到以下 10 个关系：

A,R <-- consecutive letters in actual word
C,R
F,O 
H,A <-- consecutive letters in actual word
L,O <-- consecutive letters in actual word
O,W <-- consecutive letters in actual word
R,T <-- consecutive letters in actual word
T,S <-- consecutive letters in actual word
W,C <-- consecutive letters in actual word
W,H

原始序列中唯一缺少的是 F<L 和 C<H。附加的给定关系 C<R、F<O 和 W,H 重复了 LHS 或 RHS 并给出了不可操作的信息（基本上这些 link 两个部分有序的链但不在端点, 所以你知道一个链要合并或小于另一个但不知道在哪里）。

添加缺失的关系后，有多种方法可以实现这一点。您的代码可能自己工作。

Answer 2

我在问题中描述的问题是shortest common supersequence problem。我没有在网上搜索这个，因为我在编写代码时过于专注于数学集，只有当我开始写出我的问题时，我才意识到我正在处理序列。事实上，我以为我只是当场编造了 "supersequence" 这个词。原来这正是我需要在线搜索的词，以找到与此问题相关的 material 的几页。

is absolutely correct. The given subsequences lack information to make the correct supersequence so in my example, there is no way of finding the actual supersequence. about bioinformatics most likely refers to shortest common superstring problem which has applications in reconstruction of DNA sequences, something discussed in this question。最短公共超序列和超串问题乍一看非常相似，然而，在超串问题中，子串必须由超串的 adjacent 个元素组成，这与超序列问题不同。（如下图所示。）

两道题也是NP-complete或"NP-hard"。我的理解是，这些问题没有最佳解决方案或一些神奇的算法，您可以将它们复制粘贴到您的代码中。只有近似值和 "good enough" 解。

令人惊讶的是，我无法找到解决此问题的完整 Java 代码，但我确实遇到了一些包含其他语言代码的站点。我在下面列出了我在研究这个问题时发现的一些有用的资源。我还包含了与最短常见超弦问题相关的资源。

进一步研究的其他资源

Google Scholar
The Shortest Common Supersequence Problem by Andreas Westling (With C code)
Science Direct papers on the subject
Papers on CiteSeerX
Approximation Algorithms for the Shortest Common Superstring Problem by Jonathan Turner
Related Whosebug questions
Rosetta Code (various languages)
Python
Java (superstring problem)
Ruby (The title says superstring but it seems more like the supersequence problem to me.)

在 Java 中找到给定有序子序列的超序列

Find the supersequence of given ordered subsequences in Java

java

algorithm

sequence

subsequence

进一步研究的其他资源