Increasing performance in file manipulation

Question

I have a file that contains a matrix of numbers, like this:

0 10 24 10 13 4 101 ...
6 0 52 10 4 5 0 4 ...
3 4 0 86 29 20 77 294 ...
4 1 1 0 78 100 83 199 ...
5 4 9 10 0 58 8 19 ...
6 58 60 13 68 0 148 41 ...
. .
.   .
.     .

What I want to do is sum each row and write the sums to a new file, with each row's sum on its own line.

I tried doing this in Haskell using ByteStrings, but it runs about 3 times slower than the Python implementation. Here is the Haskell implementation:

import qualified Data.ByteString.Char8 as B

-- This function is for summing a row
sumrows r = foldr (\x y -> (maybe 0 (*1) $ fst <$> (B.readInt x)) + y) 0 (B.split ' ' r)

-- This function is for mapping the sumrows function to each line
sumfile f = map (\x -> (show x) ++ "\n") (map sumrows (B.split '\n' f)) 

main = do
  contents <- B.readFile "telematrix"
  -- I get the sum of each line, and then pack up all the results so that it can be written
  B.writeFile "teleDensity" $ (B.pack . unwords) (sumfile contents)
  print "complete"

For a 25 MB file, this takes about 14 seconds.

Here is the Python implementation:

fd = open("telematrix", "r")
nfd = open("teleDensity", "w")

for line in fd: 
  nfd.write(str(sum(map(int, line.split(" ")))) + "\n")

fd.close()
nfd.close()

For the same 25 MB file, this takes about 5 seconds.

Any suggestions on how to speed up the Haskell implementation?

Answer 1

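At first glance, I'd bet your first bottleneck is the ++ on strings in sumfile, which deconstructs the left operand and rebuilds it every time. Instead of appending "\n" to the end of each string, you can replace the unwords call with unlines, which does exactly what you want. That should give you a small speed boost.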

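A more minor nitpick is that the (*1) in the maybe call is unnecessary. Using id would be more efficient, since (*1) wastes a multiplication, but that is only a few processor cycles.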
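The two suggestions above could be applied roughly as follows. This is only a sketch, not a measured improvement. The names sumRow and sumFile are my own, `maybe 0 fst` stands in for `maybe 0 id $ fst <$> ...`, and I have also swapped foldr for sum and B.split '\n' for B.lines, which the answer did not ask for. B.lines avoids the empty trailing field that B.split '\n' yields when the file ends with a newline.

```haskell
import qualified Data.ByteString.Char8 as B

-- Sum one row. A field that does not parse as an Int counts as 0,
-- which matches the `maybe 0 ...` default in the question's code.
sumRow :: B.ByteString -> Int
sumRow = sum . map (maybe 0 fst . B.readInt) . B.split ' '

-- One decimal sum per input line. unlines adds the newline after each
-- entry, so no (++ "\n") is needed.
sumFile :: B.ByteString -> String
sumFile = unlines . map (show . sumRow) . B.lines

main :: IO ()
main = do
  -- File names are taken from the question.
  contents <- B.readFile "telematrix"
  writeFile "teleDensity" (sumFile contents)
```

Note that the output is now newline-separated only, whereas the original unwords version also put a space between entries.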

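Finally, I have to ask why you are using ByteString here. ByteString stores string data efficiently as an array, like traditional strings in imperative languages. However, what you are doing here involves splitting the string and iterating over the elements, which are operations that linked lists are suited for. Honestly, I would suggest using the traditional [Char] type in this case. The B.split call may be what is killing you, since it has to copy the whole line into separate arrays in split form, whereas the words function for linked lists of characters just splits the linked structure off at a few points.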
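For comparison, a [Char]-based version along the lines suggested here might look like the sketch below. The function names are my own, and it assumes every field is a valid integer, because read throws an error on malformed input instead of defaulting to 0. Whether it actually beats the ByteString version is not established in this thread, so it would need to be timed on the real file.

```haskell
-- Sum one row held as a plain String.
-- Assumption: every whitespace-separated field parses as an Int.
sumRow :: String -> Int
sumRow = sum . map read . words

-- One decimal sum per input line, newline-terminated.
sumFile :: String -> String
sumFile = unlines . map (show . sumRow) . lines

main :: IO ()
main = do
  -- File names are taken from the question.
  contents <- readFile "telematrix"
  writeFile "teleDensity" (sumFile contents)
```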

Answer 2

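The problem seemed to be that I was running the program with runhaskell instead of compiling it with ghc and then running the resulting executable. By compiling first and then running, I brought the Haskell version down to 1 second.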

Answer 3

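The main reason for the poor performance was that I was using runhaskell instead of compiling first and then running the program. So I switched from: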

runhaskell program.hs

to

ghc program.hs

./program

