如何在单次传递中将值插入已排序的 Vector 中?

How to insert a value into a sorted Vector in a single pass?

我有 Vector 个排序值,例如

fromList [1, 2, 4, 5]

现在我想插入另一个值,比方说 3 并创建一个新向量。在命令式语言中,我会分配一个大小为 5 的数组,遍历原始向量,复制旧值,并在适当的位置插入新值,这样我就可以获得

fromList [1, 2, 3, 4, 5]

我可以使用 vector API as

let (pre, post) = span (< x) n
in V.concat [pre, pure x, post]

有效,但遍历原始向量两次:一次是在搜索拆分时,一次是在合并拆分时。有没有办法一次完成? (另一种解决方案是使用二进制搜索来搜索分裂点,但我很感兴趣是否可以创建真正的单遍解决方案。)

似乎可用的最佳工具是 unfoldr,例如:

import qualified Data.Vector as V
import Data.Vector (Vector)

insertElem :: Int -> Vector Int -> Vector Int
insertElem e v = V.unfoldrN (len+1) go (0,False)
  where
    len = V.length v
    go (i,found)
      | i >= len  = if found then Nothing else Just (e, (i+1, True))
      | found     = Just (x, (i+1, True))
      | x <= e    = Just (x, (i+1, False))
      | otherwise = Just (e, (i, True))
      where x = v V.! i


test1 = insertElem 3 (V.fromList [1,2,4,5])
test2 = insertElem 0 (V.fromList [1,2,4,5])
test3 = insertElem 6 (V.fromList [1,2,4,5])

我并没有非常努力地删除 go 函数中的逻辑。

is quite a pretty way to do it, but it falls prey to an efficiency problem described in the Data.Vector documentation。具体来说,一旦找到插入点并进行盲目复制,它就不再强制实际从源向量中读取值。相反,它用 thunk 填充目标向量,当强制时,从源向量读取。

原解

注意:这是我想到的第一个解决方案。很好理解,但是和vector中的stream fusion系统配合不好,会导致性能比较差。总的来说,下面的解决方案更好。

如文档中所述,一种解决方案是使用 monadic indexM 操作来执行这些盲读。这会强制执行读取,但不会强制读取值。因此它将一个指针(可能是指向 thunk 的指针)从源向量复制到目标向量。为了获得最大效率,下面的所有内容都应替换为它的 unsafe 变体(特别是 unsafeIndexMunsafeIndexunsafeWrite)。

{-# Language ScopedTypeVariables #-}
module Insert where
import qualified Data.Vector as V
import Data.Vector (Vector)
import qualified Data.Vector.Mutable as MV
import Data.Vector.Mutable (MVector)
import Control.Monad.ST

insertElem :: forall a . Ord a => a -> Vector a -> Vector a
insertElem e v = V.create act
 where
  act :: forall s . ST s (MVector s a)
  act = do
    mv <- MV.new (V.length v + 1)
    let
      start :: Int -> ST s (MVector s a)
      start i
        | i == V.length v ||
          e <= v V.! i = MV.write mv i e >> finish i
        | otherwise = MV.write mv i (v V.! i) >> start (i+1)
      finish :: Int -> ST s (MVector s a)
      finish i
        | i == V.length v = return mv
        | otherwise = do
             V.indexM v i >>= MV.write mv (i+1)
             finish (i+1)
      in start 0

insertElemInt :: Int -> Vector Int -> Vector Int
insertElemInt = insertElem

请注意,命名 act 操作并使用 ScopedTypeVariables 实际上并不是必需的,但我发现它们对追查我的错误非常有帮助。

基于流的解决方案

以上代码不适用于流融合,因为索引到处都是。以下方法应该正确融合,并且仍然避免创建不必要的 thunk。这是我第一次接触流融合代码,所以可能有些地方可以改进。第一个基于流的版本的唯一问题是,如果它 确实 融合,那么输入流的阶跃函数将在其中一个元素上 运行 两次。这通常不是问题,但如果 step 函数非常昂贵(很少见),则可能是。因此,我给出了一个在这种情况下应该更好用的替代方案。我不确定哪一个在实践中会更好,所以我将两者都包括在内。

使用这些基于流的解决方案中的任何一个,代码

testEnum :: Word -> Word -> Word -> Word
testEnum ins low high = V.product $
  insertElem ins $ V.enumFromStepN low 1 (fromIntegral high)

将编译成 运行 常量 space 的循环,根本不会真正创建向量。

一步一个种子两次

{-# Language ScopedTypeVariables #-}
module Insert where

import Data.Vector (Vector)
import Data.Word (Word)
import qualified Data.Vector.Fusion.Stream.Monadic as S
import qualified Data.Vector.Generic as G
import Data.Vector.Fusion.Util (Id(..))

-- To check on unboxing and such
insertElemWord :: Word -> Vector Word -> Vector Word
insertElemWord = insertElem

{-# INLINE insertElem #-}
insertElem :: forall a . Ord a => a -> Vector a -> Vector a
insertElem a v = G.unstream (insertElemS a (G.stream v))

{-# INLINE [1] insertElemS #-}
insertElemS :: forall a . Ord a => a -> S.Stream Id a -> S.Stream Id a
insertElemS e (S.Stream step (state::s) size) = S.Stream step' (state, False) (size + 1)
  where
    {-# INLINE [0] step' #-}
    step' :: (s, Bool) -> Id (S.Step (s, Bool) a)
    step' (s, True) = Id $ case unId (step s) of
       S.Yield a s' -> S.Yield a (s', True)
       S.Skip s' -> S.Skip (s', True)
       S.Done -> S.Done
    step' (s, False) = Id $ case unId (step s) of
       S.Yield a s' ->
          if e <= a
          then S.Yield e (s, True)
          else S.Yield a (s', False)
       S.Skip s' -> S.Skip (s', False)
       S.Done -> S.Yield e (s, True)

一次准确通过,永不重复lookup/step

{-# Language ScopedTypeVariables #-}
module Insert where

import Data.Vector (Vector)
import Data.Word (Word)
import qualified Data.Vector.Fusion.Stream.Monadic as S
import qualified Data.Vector.Generic as G
import Data.Vector.Fusion.Util (Id(..))

data Status s a = Pre s | During s a | Post s | End

insertElemWord :: Word -> Vector Word -> Vector Word
insertElemWord = insertElem

{-# INLINE insertElem #-}
insertElem :: forall a . Ord a => a -> Vector a -> Vector a
insertElem a v = G.unstream (insertElemS a (G.stream v))

{-# INLINE [1] insertElemS #-}
insertElemS :: forall a . Ord a => a -> S.Stream Id a -> S.Stream Id a
insertElemS e (S.Stream step (state::s) size) = S.Stream step' (Pre state) (size+1)
  where
    {-# INLINE [0] step' #-}
    step' :: Status s a -> Id (S.Step (Status s a) a)
    step' (Post s) = Id $ case unId (step s) of
      S.Yield a s' -> S.Yield a (Post s')
      S.Skip s' -> S.Skip (Post s')
      S.Done -> S.Done
    step' (Pre s) = Id $ case unId (step s) of
      S.Yield a s'
        | e <= a    -> S.Yield e (During s' a)
        | otherwise -> S.Yield a (Pre s')
      S.Skip s' -> S.Skip (Pre s')
      S.Done -> S.Yield e End
    step' (During s a) = Id (S.Yield a (Post s))
    step' End = Id S.Done