How can I repeatedly read in shuffled lines of a large data file in Haskell?

Question

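I have a data file of 60k lines, where each line has ~1k comma-separated integers (which I would like to convert to doubles immediately).

I want to iterate over a random sequence of 32-line "batches", where a batch is a random subset of all of the lines and no two batches share a common line. Since there are 60k lines and 32 lines per batch, there should be 1875 batches.

I'm willing to change things if necessary, but I would like the batches to be in the form of a lazily evaluated list. The code that needs this is a foldM, which I use like this: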

resulting_struct <- foldM fold_fn my_struct batch_list

so that it repeatedly calls fold_fn on the current accumulator (starting from my_struct) and the next element of batch_list.
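To make the threading of the accumulator concrete, here is a minimal sketch. The names sumStep and demoFold and the three small batches are made up for illustration; sumStep stands in for fold_fn and a plain Double stands in for my_struct.

```haskell
import Control.Monad (foldM)

-- Hypothetical step function standing in for fold_fn: it adds the sum of
-- one batch to a running total.
sumStep :: Double -> [Double] -> IO Double
sumStep acc batch = return (acc + sum batch)

-- Folds three small made-up batches, starting from 0. This behaves like
--   sumStep 0 [1,2] >>= \a -> sumStep a [3,4] >>= \b -> sumStep b [5]
demoFold :: IO Double
demoFold = foldM sumStep 0 [[1, 2], [3, 4], [5]]
```

Each call sees the accumulator produced by the previous call, so only one batch needs to be alive at a time as long as the list itself is produced lazily.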

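I'm stumped. It was easy when I didn't need to shuffle: I simply read the lines in and chunked them, and since they were evaluated lazily I had no problems. Now I'm completely stuck and feel like I must be missing something simple.

I have tried the following:

1. Reading the file into a list of lines and shuffling the input naively. This doesn't work: readFile is lazy, but shuffling forces the entire file into memory, which quickly eats all of my ~8 GB of RAM.
2. Getting the length of the file and then creating a list of batches of shuffled indices from 0 to 60k, where each index is the number of a line that will be selected to form a batch. Then, when I actually want to get the data batches, I do this: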

ind_batches <- get_shuffled_ind_batches_from_file fname
batch_list <- mapM (get_data_batch_from_ind_batch fname) ind_batches

where:

get_shuffled_ind_batches_from_file :: String -> IO [[Int]]
get_shuffled_ind_batches_from_file fname = do
  contents <- get_contents_from_file fname -- uses readFile, returns [[Double]]
  let n_samps = length contents
      ind = [0..(n_samps-1)]
  shuffled_indices <- shuffle_list ind
  let shuffled_ind_chunks = take 1800 $ chunksOf 32 shuffled_indices
  return shuffled_ind_chunks

get_data_batch_from_ind_batch :: String -> [Int] -> IO [[Double]]
get_data_batch_from_ind_batch fname ind_chunk = do
  contents <- get_contents_from_file fname
  let data_batch = get_elems_at_indices contents ind_chunk
  return data_batch

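-- Fisher-Yates shuffle on a mutable array.
-- Needs: Control.Monad (forM), System.Random (randomRIO),
-- Data.Array.IO (IOArray, newListArray, readArray, writeArray).
shuffle_list :: [a] -> IO [a]
shuffle_list xs = do
    ar <- newArray n xs
    forM [1..n] $ \i -> do
        j <- randomRIO (i, n)      -- pick from the not-yet-emitted slots
        vi <- readArray ar i
        vj <- readArray ar j
        writeArray ar j vi         -- move the displaced element into slot j
        return vj
  where
    n = length xs
    newArray :: Int -> [a] -> IO (IOArray Int a)
    newArray n xs = newListArray (1, n) xs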

-- Note: each (!!) walks the list from its head, so every lookup is linear
-- in the index.
get_elems_at_indices :: [a] -> [Int] -> [a]
get_elems_at_indices my_list ind_list = map (my_list !!) ind_list

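However, it seems that mapM evaluates immediately and then tries to read the file contents over and over (I think; either way, the RAM blows up).

More searching told me that I could try using unsafeInterleaveIO to make it evaluate an action lazily, so I tried sticking it in like this: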
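A small experiment is consistent with that suspicion. The sketch below uses a made-up loader (load) that only bumps a counter instead of touching a file; the name loadsBeforeFirstUse is hypothetical. In IO, mapM runs every action in sequence before it returns the result list, so nothing about the list being lazy delays the loads.

```haskell
import Data.IORef (newIORef, modifyIORef', readIORef)

-- | Counts how many loader actions have run by the time mapM returns,
-- before any element of the resulting list has been looked at.
loadsBeforeFirstUse :: IO Int
loadsBeforeFirstUse = do
  counter <- newIORef (0 :: Int)
  -- Stand-in for a real batch loader: records that it ran, returns a batch.
  let load i = modifyIORef' counter (+ 1) >> return [fromIntegral i :: Double]
  _batches <- mapM load [1 .. 5 :: Int]
  readIORef counter   -- all five loads have already happened here
```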

get_data_batch_from_ind_batch :: String -> [Int] -> IO [[Double]]
get_data_batch_from_ind_batch fname ind_chunk = unsafeInterleaveIO $ do
  contents <- get_contents_from_file fname
  let data_batch = get_elems_at_indices contents ind_chunk
  return data_batch

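But no luck: same problem as above.

I feel like I've been banging my head against a wall and must be missing something really simple. Someone suggested using streams or pipes instead, but when I looked at their documentation it wasn't clear to me how to use them to solve this problem.

How can I read in a large data file and shuffle it without using up all my memory?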

Answer 1

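hGetContents returns the contents of the file lazily, but if you do much of anything with the result you will force the whole file right away. I suggest reading the file once and scanning it for newlines, so that you can build an index of which chunk begins at which byte offset. That index will be quite small, so you can easily shuffle it. You can then iterate over the index, each time opening the file, reading just one defined sub-range of it, and parsing only that chunk.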
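One way this could look is sketched below. It is an illustration under stated assumptions, not a tuned implementation: the function names (lineOffsets, parseLine, loadBatch) are made up here, the bytestring package is assumed to be available, lines are assumed to end in '\n', and a field that is not an integer is read as 0.

```haskell
import qualified Data.ByteString.Char8 as B
import System.IO

-- | Byte offset at which each line starts, found in one pass over the
-- file. Only the offsets are kept, never the line contents.
lineOffsets :: FilePath -> IO [Integer]
lineOffsets path = withBinaryFile path ReadMode (go [])
  where
    go acc h = do
      eof <- hIsEOF h
      if eof
        then return (reverse acc)
        else do
          pos <- hTell h        -- where the next line begins
          _ <- B.hGetLine h     -- skip over that line
          go (pos : acc) h

-- | Parse one comma-separated line of integers into doubles.
-- Assumption: a field that is not an integer is read as 0.
parseLine :: B.ByteString -> [Double]
parseLine = map toDouble . B.split ','
  where
    toDouble field = maybe 0 (fromIntegral . fst) (B.readInt field)

-- | Load one batch: seek to each stored offset and parse only that line.
loadBatch :: FilePath -> [Integer] -> IO [[Double]]
loadBatch path offsets = withBinaryFile path ReadMode $ \h ->
  mapM (\off -> hSeek h AbsoluteSeek off >> fmap parseLine (B.hGetLine h)) offsets
```

With the question's own helpers, the offsets could be shuffled with shuffle_list and grouped with chunksOf 32, and the loading could move inside the fold step, for example `foldM (\acc offs -> loadBatch fname offs >>= fold_fn acc) my_struct offset_batches` (offset_batches being a hypothetical name for the chunked, shuffled offsets). Only the 60k offsets and the current batch are then held in memory, and no lazy list of batches is needed.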

Tags: haskell, file, input, lazy-evaluation