How to read large files (>1GB) in Lua?

I am new to Lua (using it with the Torch7 framework). I have an input feature file (a text file) that is about 1.4GB in size. A plain io.open throws a 'not enough memory' error when it tries to open this file. Browsing the user groups and documentation, I see this may be a Lua limitation. Is there a workaround, or am I doing something wrong when reading the file?

local function parse_file(path)
    -- read file
    local file = assert(io.open(path,"r"))
    local content = file:read("*all")
    file:close()

    -- split on start/end tags.
    local sections = string.split(content, start_tag)
    for j=1,#sections do
        sections[j] = string.split(sections[j],'\n')
        -- remove the end_tag
        table.remove(sections[j], #sections[j])
    end 
    return sections
end

local train_data = parse_file(file_loc .. '/' .. train_file)

Edit: The input file I am trying to read contains image features that I want to use to train my model. The file is ordered ({start-tag} ...content... {end-tag}{start-tag} ...and so on...), so it would be fine if I could load these sections (start tag to end tag) one at a time. However, I would prefer to have all of them loaded into memory.
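
For illustration, a minimal sketch of what loading one section at a time might look like, assuming each tag sits on its own line (the function name and tag strings below are hypothetical placeholders, not part of my actual code):

-- Hypothetical sketch: stream one {start-tag}...{end-tag} section at a time
-- instead of reading the whole file into a single string.
local function iterate_sections(path, start_tag, end_tag)
    local file = assert(io.open(path, "r"))
    return function()
        local section
        for line in file:lines() do
            if line == start_tag then
                section = {}                    -- begin a new section
            elseif line == end_tag then
                return section                  -- hand back one complete section
            elseif section then
                table.insert(section, line)     -- collect lines inside the section
            end
        end
        file:close()                            -- end of file: no more sections
        return nil
    end
end

-- usage (placeholder tag strings):
-- for section in iterate_sections(path, "{start-tag}", "{end-tag}") do ... end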

I have never needed to read a file that large, but if you are running out of memory you will probably need to read it line by line. After some quick research I found this on the Lua website:

buff = buff..line.."\n"

buff is a new string with 50,020 bytes, and the old string is now garbage. After two loop cycles, buff is a string with 50,040 bytes, and there are two old strings making a total of more than 100 Kbytes of garbage. Therefore, Lua decides, quite correctly, that it is a good time to run its garbage collector, and so it frees those 100 Kbytes. The problem is that this will happen every two cycles, and so Lua will run its garbage collector two thousand times before finishing the loop. Even with all this work, its memory usage will be around three times the file size. To make things worse, each concatenation must copy the whole string content (50 Kbytes and growing) into the new string.

So it seems that loading a large file takes a lot of memory even when you read it line by line and concatenate each time like this:

local buff = ""
while true do
    local line = io.read()
    if line == nil then break end
    buff = buff..line.."\n"
end

They then propose a more memory-efficient procedure:

function newBuffer ()
    return {}    -- stack of partial strings; use #stack for its size
end

function addString (stack, s)
    table.insert(stack, s)    -- push 's' onto the top of the stack
    -- merge downwards while the string below is not longer than the one
    -- just pushed, so concatenation costs stay balanced
    for i = #stack - 1, 1, -1 do
        if string.len(stack[i]) > string.len(stack[i+1]) then break end
        stack[i] = stack[i] .. table.remove(stack)
    end
end

function toString (stack)
    -- collapse whatever is left on the stack into a single string
    for i = #stack - 1, 1, -1 do
        stack[i] = stack[i] .. table.remove(stack)
    end
    return stack[1] or ""
end
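
For context, here is a small usage sketch (mine, not from the note) showing how these helpers would replace the naive concatenation loop above when reading a file line by line; path is a placeholder file name:

local buf = newBuffer()
for line in io.lines(path) do    -- 'path' is a placeholder
    addString(buf, line .. "\n")
end
local content = toString(buf)

In Lua 5.1 and later you can get much the same effect even more simply by collecting the lines in a plain table and calling table.concat once at the end, which also avoids the quadratic copying.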

This takes up a lot less memory than before. All of the material comes from: http://www.lua.org/notes/ltn009.html
Hope that helps.

It turns out that the easiest way to solve the problem of loading large files is to upgrade Torch to Lua 5.2 or greater, as advised by the Torch developers on the torch7 Google group:

cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA52 ./install.sh

The memory limit no longer exists as of version 5.2! I have tested it and it works just fine!
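
If you want to confirm which interpreter the rebuilt Torch is actually running, a quick check using standard Lua globals (nothing Torch-specific):

print(_VERSION)                            -- e.g. "Lua 5.2"
print(jit and jit.version or "plain Lua")  -- LuaJIT defines the global 'jit' table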

Reference: https://groups.google.com/forum/#!topic/torch7/fi8a0RTPvDo


Another possible solution (more elegant, and similar to what @Adam suggested in his answer) is to read the file line by line and use Tensors or tds to store the data, since this memory is allocated outside of LuaJIT. The code example below is courtesy of Vislab.

local ffi = require 'ffi'
require 'torch' -- provides torch.CharTensor
-- this function loads a file line by line to avoid having memory issues
local function load_file_to_tensor(path)
  -- initialize a tensor for the file
  local file_tensor = torch.CharTensor()
  
  -- Now we must determine the maximum size of the tensor in order to allocate it into memory.
  -- This is necessary to allocate the tensor in one sweep, where columns correspond to letters and rows correspond to lines in the text file.
  
  --[[ get  number of rows/columns ]]
  local file = io.open(path, 'r') -- open file
  local max_line_size = 0
  local number_of_lines = 0
  for line in file:lines() do
    -- get maximum line size
    max_line_size = math.max(max_line_size, #line + 1) -- the +1 leaves room for the terminating null byte that ffi.copy appends
    
    -- increment the number of lines counter
    number_of_lines = number_of_lines +1
  end
  file:close() --close file
  
  -- Now that we have the dimensions of the tensor, we just have to allocate memory for it (as long as there is enough memory in RAM)
  file_tensor = file_tensor:resize(number_of_lines, max_line_size):fill(0)
  local f_data = file_tensor:data()
  
  -- The only thing left to do is to fetch the data into the tensor.
  -- Let's open the file again and fill the tensor using ffi
  local file = io.open(path, 'r') -- open file
  for line in file:lines() do
    -- copy data into the tensor line by line
    ffi.copy(f_data, line)
    f_data = f_data + max_line_size
  end
  file:close() --close file

  return file_tensor
end

Reading data from this tensor is simple and fast. For example, if you want to read the 10th line of the file (which will be at the 10th position of the tensor), you can simply do the following:

local line_string = ffi.string(file_tensor[10]:data()) -- this will convert into a string var

A word of warning: this takes up more memory, and may not be optimal for cases where some lines are much longer than others. But if you do not have memory problems, this can even be ignored, because it is very fast when loading the tensor from the file into memory, and it may save you a few grey hairs in the process.
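
If the padded tensor wastes too much space because line lengths vary a lot, the tds package mentioned above is an alternative that also keeps its storage outside the LuaJIT heap. A minimal sketch, assuming tds is installed (it ships with the Torch distro and is also available through luarocks); the function name here is a placeholder:

local tds = require 'tds'

-- Sketch: store each line as an entry of a tds.Vec, whose contents live
-- in C memory rather than on the LuaJIT heap.
local function load_file_to_vec(path)
  local lines = tds.Vec()
  for line in io.lines(path) do
    lines:insert(line)      -- append the line to the vector
  end
  return lines
end

-- usage: local lines = load_file_to_vec(path); print(lines[10]); print(#lines)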

Reference: https://groups.google.com/forum/#!topic/torch7/fi8a0RTPvDo