在 Haskell 中有效地解析 ASCII 文件

Question

我想在 Haskell 中重新实现我的一些 ASCII 解析器，因为我认为我可以提高一些速度。然而，即使是简单的 "grep and count" 也比草率的 Python 实施慢得多。

有人可以向我解释为什么以及如何正确地做到这一点吗？

所以任务是，计算以字符串 "foo".

开头的行数

我非常基本的Python实现：

with open("foo.txt", 'r') as f:
    print len([line for line in f.readlines() if line.startswith('foo')])

和Haskell版本：

import System.IO 
import Data.List

countFoos :: String -> Int
countFoos str = length $ filter (isPrefixOf "foo") (lines str)

main = do
    contents <- readFile "foo.txt"
    putStr (show $ countFoos contents)

运行和 time 在一个 ~600MB 的文件上有 17001895 行表明 Python 实现比 [=51= 快了将近 4 倍] 一个（运行在配备 PCIe SSD 的 MacBook Pro Retina 2015 上）：

> $ time ./FooCounter                                                           
1770./FooCounter  20.92s user 0.62s system 98% cpu 21.858 total

> $ time python foo_counter.py                                                   
1770
python foo_counter.py  5.19s user 1.01s system 97% cpu 6.332 total

与 unix 命令行工具相比：

> $ time grep -c foo foo.txt                                                     
1770
grep -c foo foo.txt   4.87s user 0.10s system 99% cpu 4.972 total

> $ time fgrep -c foo foo.txt                                                     
1770
fgrep -c foo foo.txt  6.21s user 0.10s system 99% cpu 6.319 total

> $ time egrep -c foo foo.txt                                                     
1770
egrep -c foo foo.txt  6.21s user 0.11s system 99% cpu 6.317 total

有什么想法吗？

更新：

使用 András Kovács 的实现 (ByteString)，我不到半秒就搞定了！

> $ time ./FooCounter                                                                                                               
1770
./EvtReader  0.47s user 0.48s system 97% cpu 0.964 total

Answer 1

您的 Haskell 版本使用类型 String 来表示文件的文本。 String 是 [Char] 的别名，它是一个字符链表。这不是大字符串的良好表示。

尝试将 text package instead. It represents strings as arrays (in the Data.Text.* modules) or as linked lists of arrays (in the Data.Text.Lazy.* modules). To port your existing code, you probably want the latter, because I guess you don't want to load the full 600MB file into memory at once. Look in the Data.Text.Lazy and Data.Text.Lazy.IO 模块用于您正在使用的 readFile、filter、isPrefixOf 等函数的变体。

如果您确定只想支持 ASCII，您还可以考虑使用 bytestring 包而不是 text 包。

Answer 2

我对以下解决方案进行了基准测试：

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as B

main =
  print . length . filter (B.isPrefixOf "foo") . B.lines =<< B.readFile "test.txt"

text.txt 是一个 170 MB 的文件，有 800 万行，其中一半行以 "foo" 开头。我用 GHC 7.10 和 -O2 -fllvm 编译。

ByteString版本运行在0.27秒，而原始版本运行在5.16[=27] =] 秒.

然而，严格的 ByteString 版本使用 170 MB 内存加载完整文件。将导入更改为 Data.ByteString.Lazy.Char8 我得到 0.39 秒运行时间和 1 MB 内存使用量。

在 Haskell 中有效地解析 ASCII 文件

Parsing ASCII file efficiently in Haskell

file-io

haskell

text-parsing