Haskell 中无重复字符的最长子串的性能

Question

阅读并提出解决方案后，我尝试解决 Haskell 中的相同挑战。

我想出了下面的代码，它似乎可以工作。但是，由于我对这种语言还很陌生，所以我需要一些帮助来了解代码是否具有良好的性能。

lswrc :: String -> String
lswrc s = reverse $ fst $ foldl' step ("","") s
  where
    step ("","") c = ([c],[c])
    step (maxSubstr,current) c
      | c `elem` current = step (maxSubstr,init current) c
      | otherwise = let candidate = (c:current)
                        longerThan = (>) `on` length
                        newMaxSubstr = if maxSubstr `longerThan` candidate
                                       then maxSubstr
                                       else candidate
                    in (newMaxSubstr, candidate)

有些观点我认为可以做得更好

我携带了一对字符串（跟踪时间最长的，也是目前的候选人），但我只需要前者；从程序上考虑，没有办法避免这种情况，但也许 FP 允许另一种方法？
我构建了(c:current)，但我只在else中使用它；我可以做一个更复杂的 longerThan 来在它的第二个参数的长度上加 1，这样我就可以将它应用到 maxSubstr 和 current，并在else，连名字都没有。
当 c 在 current 字符串中时，我删除 current 的最后一个元素，因为我正在用 : 堆积字符串；我可以在针对字符串检查 c 时进行模式匹配（如 c `elem` current@(a:as)），但是在添加新字符时我应该做 current ++ [c]，我知道它的性能不如 c:current.
我用foldl'（据我所知foldl没有意义）； foldr 可能是另一种选择，但由于我看不出懒惰是如何进入这个问题的，所以我无法判断哪个更好。

Answer 1

运行 elem 每次迭代都会使您的算法 Ω(n^2)（对于没有重复的字符串）。运行 length 上，在最坏的情况下，每次迭代都会使您的算法 Ω(n^2)（对于没有重复的字符串）。运行 init 很多使您的算法 Ω(n*sqrt(n))（对于 sqrt(n) 重复 sqrt(n) 长字符串的字符串，每个其他字符串都反转，并假设 O(1) elem 替换）。

更好的方法是预先支付一次 O(n) 成本以复制到具有恒定时间索引的数据结构，并保留一组（或类似的数据结构）可见元素而不是一个平面列表.像这样：

import Data.Set (Set)
import Data.Vector (Vector)
import qualified Data.Set as S
import qualified Data.Vector as V

lswrc2 :: String -> String
lswrc2 "" = ""
lswrc2 s_ = go S.empty 0 0 0 0 where
    s = V.fromList s_
    n = V.length s
    at = V.unsafeIndex s
    go seen lo hi bestLo bestHi
        | hi == n = V.toList (V.slice bestLo (bestHi-bestLo+1) s)
        -- it is probably faster (possibly asymptotically so?) to use findIndex
        -- to immediately pick the correct next value of lo
        | at hi `S.member` seen = go (S.delete (at lo) seen) (lo+1) hi bestLo bestHi
        | otherwise = let rec = go (S.insert (at hi) seen) lo (hi+1) in
            if hi-lo > bestHi-bestLo then rec lo hi else rec bestLo bestHi

这应该具有 O(n*log(n)) 最坏情况性能（在没有重复的字符串上实现最坏情况）。可能还有更好的方法；我还没想那么多。

在我的机器上，lswrc2 在随机字符串上的表现始终优于 lswrc。在字符串 ['[=17=]' .. '0000'] 上， lswrc 大约需要 40 秒， lswrc2 需要 0.03 秒。 lswrc2可以在0.4s左右处理[minBound .. maxBound]；让 lswrc 咀嚼该列表 20 多分钟后，我放弃了。

Haskell 中无重复字符的最长子串的性能

Performance of Longest Substring Without Repeating Characters in Haskell

performance

haskell

longest-substring