Does Apache Commons CSV framework offer a memory-efficient incremental/sequential mode for reading large files?

Question

The Apache Commons CSV project works quite well for parsing comma-separated values, tab-delimited data, and similar data formats.

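My impression is that this tool reads a file entirely, with the resulting line objects kept in memory. But I am not sure, and I cannot find any documentation of this behavior.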

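For parsing very large files, I would like to do an incremental read, one line at a time, or perhaps a relatively small number of lines at a time, to avoid overwhelming memory limits.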

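With regard only to memory usage, the idea here is like how a SAX parser for XML reads incrementally to minimize use of RAM, as opposed to a DOM-style XML parser that reads a document entirely into memory to provide tree traversal.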

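Questions:

What is the default behavior of Apache Commons CSV with regard to reading documents: entirely into memory, or incremental?
Can this behavior be switched between incremental and whole-document?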

Answer 1

My impression is that this tool reads a file entirely with the resulting line objects kept in memory

No. Memory usage depends on how you choose to interact with your CSVParser object.

The Javadoc for CSVParser addresses this issue explicitly in its sections "Parsing record wise" and "Parsing into memory", with this caution:

Parsing into memory may consume a lot of system resources depending on the input. For example if you're parsing a 150MB file of CSV data the contents will be read completely into memory.

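I took a quick glance at the source code, and indeed "Parsing record wise" seems to read from its input source one chunk at a time, not all at once. But see for yourself.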
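To make the idea of reading one chunk at a time concrete, here is a minimal sketch that uses only the JDK. It is not the Commons CSV implementation: the class name LazyLines is hypothetical, and the naive comma split ignores quoting entirely. It only illustrates how an Iterable can pull one line from a Reader per iteration instead of loading everything first.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Illustration only (hypothetical class): NOT how Commons CSV is implemented. */
final class LazyLines implements Iterable<String[]> {
    private final BufferedReader reader;

    LazyLines(Reader source) {
        this.reader = new BufferedReader(source);
    }

    @Override
    public Iterator<String[]> iterator() {
        return new Iterator<String[]>() {
            // Only one look-ahead line is held at any time.
            private String pending = readLine();

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public String[] next() {
                if (pending == null) {
                    throw new NoSuchElementException();
                }
                // Naive split: no support for quoted fields or embedded commas.
                String[] fields = pending.split(",", -1);
                pending = readLine();
                return fields;
            }
        };
    }

    // Reads the next line, or returns null at end of input.
    private String readLine() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Looping over a LazyLines instance with a for-each statement keeps at most one pending line in memory, which is the same shape of interaction the record-wise loop over CSVParser offers.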

Parsing record wise

The "Parsing record wise" section shows how to read incrementally, one CSVRecord at a time, by looping over the Iterable that is CSVParser.

CSVParser parser = CSVParser.parse(csvData, CSVFormat.RFC4180);
for (CSVRecord csvRecord : parser) {
    ...
}
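For a large input, the same loop would typically be fed from a Reader rather than from a String already held in memory. The following is a sketch under assumptions: the class and method names (StreamingCsvExample, countRecords) are made up for illustration, and the per-record work is just a counter.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

/** Hypothetical helper class, for illustration only. */
final class StreamingCsvExample {

    /** Iterates the input record by record and returns how many records were seen. */
    static long countRecords(Reader in) {
        long count = 0;
        // try-with-resources closes the parser, which also closes the reader.
        try (CSVParser parser = CSVFormat.RFC4180.parse(in)) {
            for (CSVRecord record : parser) {
                // Process the record here; this code keeps no reference to it
                // after the iteration, so records are not accumulated.
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file on disk, the Reader could come from Files.newBufferedReader(path, StandardCharsets.UTF_8), so the file is consumed as the loop advances.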

Parsing into memory

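In contrast, the "Parsing into memory" section shows the use of CSVParser::getRecords to load all the CSVRecord objects into a List at once, in memory. So a very large input file could obviously exhaust memory on a constrained machine.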

Reader in = new StringReader("a;b\nc;d");
CSVParser parser = new CSVParser(in, CSVFormat.EXCEL);
List<CSVRecord> list = parser.getRecords();
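Between these two extremes sits the case the question mentions: reading a relatively small number of rows at a time. Commons CSV is not known here to ship a helper for that, so the following is a hypothetical sketch (the names Batches and nextBatch are made up) that drains any Iterator in bounded batches.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper, not part of Commons CSV. */
final class Batches {

    /**
     * Pulls up to batchSize items from the iterator.
     * Returns an empty list once the iterator is exhausted.
     * batchSize is expected to be positive.
     */
    static <T> List<T> nextBatch(Iterator<T> source, int batchSize) {
        List<T> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && source.hasNext()) {
            batch.add(source.next());
        }
        return batch;
    }
}
```

With a parser, one would obtain the iterator once via parser.iterator() and then call Batches.nextBatch(iterator, 1000) repeatedly until it returns an empty list, so that only one batch of records is held at a time.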
