寻找最长复合词的时间复杂度

Time complexity of finding longest compound word

我这里指的是算法题:

http://www.ardendertat.com/2012/06/15/programming-interview-questions-28-longest-compound-word/

Given a sorted list of words, find the longest compound word in the list that is constructed by concatenating the words in the list. For example, if the input list is: ['cat', 'cats', 'catsdogcats', 'catxdogcatsrat', 'dog', 'dogcatsdog', 'hippopotamuses', 'rat', 'ratcat', 'ratcatdog', 'ratcatdogcat']. Then the longest compound word is ‘ratcatdogcat’ with 12 letters. Note that the longest individual words are ‘catxdogcatsrat’ and ‘hippopotamuses’ with 14 letters, but they’re not fully constructed by other words. Former one has an extra ‘x’ letter, and latter is an individual word by itself not a compound word.

我实现的算法如下:

  1. 从输入列表中的所有单词构造一个 Trie。每个节点代表一个字符,一个单词的结尾通过设置 isTerminal=true.

  2. 来标记
  3. 现在我有另一种方法来检查每个输入的单词,以找出组成它的组件的数量(比如说复合长度)。例如,在上面的示例中,ratcatdogcat 由输入列表 ratcatdogcat 中的单个单词组成。我通过递归解析输入词的有效前缀并找到词的其余部分的复合长度来做到这一点,即解析 rat 并获得 catdogcat 的复合长度。如果 rest 的复合长度是 zero,意味着 rest 不是一个有效的结构,我尝试下一个前缀 ratcat 并递归 dogcat.

伪代码如下所示:

Node {
  Character ch
  Map<Character, Node> children
  boolean isTerminal
}

int getCompoundLength(word) {

  if (dpTable.contains(word))
    return dpTable.get(word)

  dpTable.put(word, 0) // memoize word to compound length

  Node current = root;
  for (i=0 to word.length) {
    ch = word[i]
    if (!current.children.contains(ch)) // invalid character
      return 0;

    current = current.children.get(ch)

    if (!current.isTerminal) // not a valid prefix yet
      continue

    lenRest = getCompoundLength(word.substring(i+1));

    if (lenRest != 0) { // rest of the string is valid
      dpTable.put(word, lenRest+1)
      return lenRest + 1
    }
  }

  // Could not split the word into multiple components.
  // Check if word is a valid word at least.
  if (current.isTerminal) {
    dpTable.put(word, 1)
    return 1;
  }

我知道构建 trie 需要 O(W),其中 W 是输入单词的总数。但是我不清楚如何计算 getCompoundLength 方法的运行时间。感谢任何帮助。

你的理解有误。插入一个单词到一个trie中有运行个单词长度的时间,所以构建trie的时间是O(W*s),其中s是单词的平均大小

查找 trie 中的单词是最坏的情况 O(s) 其中 s 是单词的长度。

至于你的getCompoundLength方法,你需要想出最悲观的可能输入。考虑以下示例:

a
aa
aaa
aaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab

最后一个词不是复合词。但是你刚刚遇到了一个指数回溯问题......

(现实世界中的正则表达式实现确实存在这个问题。它们在大多数字符串上都能正常工作,但有一些病态的输入会让它们哭泣。)