Conduit pipeline skips some elements of a stream

Question

I am trying to implement a simple word count using Haskell's Conduit library:

import Conduit
import Data.Char
import Data.HashMap.Strict (empty, insertWith, toList)

wordcountCv2 :: IO ()
wordcountCv2 = do
    hashMap <- runConduitRes $ sourceFile "input.txt"
        .| decodeUtf8C
        .| omapCE Data.Char.toLower
        .| peekForeverE (do
            word <- takeWhileCE isAlphaNum
            dropCE 1
            return word)
        .| foldMC insertInHashMap empty
    print (toList hashMap)

insertInHashMap x v = do
    return (insertWith (+) v 1 x)

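The problem is that this function works fine for small and medium-sized input files, but as the file size grows it tends to break some words apart. For example, with a small file containing the word "hello" 100 times, the result is [("hello",100)]. With a file containing "hello" 100000 times, the result is [("hello",99988),("he",6),("hell",6),("o",6),("llo",6)]. The larger the file, the more words get broken. Is there something wrong with my implementation?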

Answer 1

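chi is correct that takeWhileCE returns () and sends its results downstream instead of returning them. They are wrong about one thing, though: that behavior is, in fact, the source of the problem.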

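Your conduit operates on a chunked stream, and one of the reasons takeWhileCE sends its results downstream is so that it can leave the input split on the original chunk boundaries. That way it does not force you to consume unbounded memory just because you might receive a very long run of matching values.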
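To make the chunk-boundary behavior concrete, here is a minimal sketch. It assumes the conduit and text packages; the in-memory list `chunks` is made up for illustration and stands in for the blocks that sourceFile would deliver, and `pieces` is a hypothetical name.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.Char (isAlphaNum)
import qualified Data.Text as T

-- Illustrative input: the second "hello" straddles a chunk boundary.
chunks :: [T.Text]
chunks = ["hello hel", "lo hello"]

-- Same loop shape as in the question: takeWhileCE forwards each matching
-- run downstream chunk by chunk, so a word split across two chunks
-- arrives as two separate pieces.
pieces :: [T.Text]
pieces = runConduitPure
     $ yieldMany chunks
    .| peekForeverE (takeWhileCE isAlphaNum >> dropCE 1)
    .| sinkList

main :: IO ()
main = print pieces
-- prints ["hello","hel","lo","hello"]
```

The stray "hel" and "lo" are the same kind of fragment as the "he" and "llo" entries in the question's output: the more chunks a file is read in, the more words land on a boundary.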

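But if you want to combine the potentially multiple chunks that make up each word, you need to do a bit more work. Sending them through foldC is one way to do it.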

        .| peekForeverE (do
            word <- takeWhileCE isAlphaNum .| foldC
            dropCE 1
            yield word)
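As a rough check of the foldC approach, the following sketch runs the same loop over in-memory chunks. The name `wordsC`, the sample `chunks`, and the `unless` guard are assumptions added here: without the guard, two separators in a row would yield an empty word, because takeWhileCE matches nothing and foldC returns mempty.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Control.Monad (unless)
import Data.Char (isAlphaNum)
import qualified Data.Text as T

-- Hypothetical helper: reassembles whole words from chunked text.
wordsC :: Monad m => ConduitT T.Text T.Text m ()
wordsC = peekForeverE $ do
    -- foldC concatenates every piece takeWhileCE emits for this word,
    -- even when the word spans several input chunks.
    word <- takeWhileCE isAlphaNum .| foldC
    dropCE 1
    -- Skip the empty result produced by consecutive separators.
    unless (T.null word) (yield word)

-- Illustrative input: a word split across chunks, plus a double space.
chunks :: [T.Text]
chunks = ["hello hel", "lo  hello"]

main :: IO ()
main = print (runConduitPure (yieldMany chunks .| wordsC .| sinkList))
-- prints ["hello","hello","hello"]
```

Note that foldC buffers a whole word in memory, which is the trade-off takeWhileCE avoids by default; for ordinary text this is unlikely to matter.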

In your case, it would be easier to use the splitOnUnboundedE combinator, which does all of this for you.

        .| splitOnUnboundedE (not . isAlphaNum)
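Putting it together, a complete word count built on splitOnUnboundedE might look like the sketch below. It is not the asker's exact program: `countWords` and `wordcountFile` are hypothetical names, Data.Map from containers is swapped in for the hash map, and the filterC step is an added assumption that drops the empty strings produced by consecutive separators.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.Char (isAlphaNum, toLower)
import qualified Data.Map.Strict as Map
import qualified Data.Text as T
import Data.Void (Void)

-- Hypothetical helper: consumes text chunks, returns word frequencies.
countWords :: Monad m => ConduitT T.Text Void m (Map.Map T.Text Int)
countWords =
       omapCE toLower
    .| splitOnUnboundedE (not . isAlphaNum)  -- joins words across chunks
    .| filterC (not . T.null)                -- drop empty words
    .| foldlC (\acc w -> Map.insertWith (+) w 1 acc) Map.empty

-- File-based variant, mirroring the pipeline in the question.
wordcountFile :: FilePath -> IO ()
wordcountFile path = do
    counts <- runConduitRes $ sourceFile path .| decodeUtf8C .| countWords
    print (Map.toList counts)

-- In-memory demo with a word split across two chunks.
main :: IO ()
main = print . Map.toList . runConduitPure $
    yieldMany (["hello hel", "lo HELLO"] :: [T.Text]) .| countWords
-- prints [("hello",3)]
```

Calling `wordcountFile "input.txt"` would run the same counting logic on the file from the question.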

haskell

conduit