Why are column-oriented file formats not well suited to streaming writes?

There is a passage on page 137 of Hadoop: The Definitive Guide (4th edition):

Column-oriented formats need more memory for reading and writing, since they have to buffer a row split in memory, rather than just a single row. Also, it’s not usually possible to control when writes occur (via flush or sync operations), so column-oriented formats are not suited to streaming writes, as the current file cannot be recovered if the writer process fails. On the other hand, row-oriented formats like sequence files and Avro datafiles can be read up to the last sync point after a writer failure. It is for this reason that Flume (see Chapter 14) uses row-oriented formats.

I don't understand why the current block cannot be recovered in the case of a failure. Can someone explain the technical difficulty behind this statement:

we cannot control when writes occur (via flush or sync operations)

It is simply that there is no block to recover. The explanation is clear: columnar formats (ORC, Parquet, etc.) decide for themselves when to flush. If no flush has happened, there is no 'block'. Since Flume cannot control when a columnar format's in-memory buffer is written to storage, it cannot rely on such formats.
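To make the difference concrete, here is a minimal toy sketch (not real ORC/Parquet or SequenceFile code; the class names and the in-memory `io.BytesIO` "storage" are illustrative assumptions). A row-oriented writer pushes each record to storage as it arrives, so a crash loses at most the record in flight; a column-oriented writer accumulates a whole row group in memory and writes nothing until the group fills or the file is closed, so a crash before that point leaves an empty, unrecoverable file:

```python
import io

class RowWriter:
    """Toy row-oriented writer: every record reaches storage immediately."""
    def __init__(self):
        self.storage = io.BytesIO()          # stands in for durable storage
    def write(self, record: bytes):
        self.storage.write(record + b"\n")   # record is on storage right away
        # a real format (SequenceFile, Avro) would also emit sync markers,
        # letting a reader recover everything up to the last sync point

class ColumnWriter:
    """Toy column-oriented writer: buffers a whole row group in memory."""
    def __init__(self, group_size=1000):
        self.storage = io.BytesIO()
        self.buffer = []                     # row group held only in memory
        self.group_size = group_size
    def write(self, record: bytes):
        self.buffer.append(record)           # nothing reaches storage yet
        if len(self.buffer) >= self.group_size:
            self._flush_group()
    def _flush_group(self):
        self.storage.write(b"\n".join(self.buffer) + b"\n")
        self.buffer.clear()

row, col = RowWriter(), ColumnWriter()
for i in range(10):                          # 10 records, then the process "crashes"
    rec = f"record-{i}".encode()
    row.write(rec)
    col.write(rec)

# After the crash: the row-oriented file holds all 10 records, but the
# columnar file is empty because the row group never filled up.
print(len(row.storage.getvalue()) > 0)   # True  - recoverable
print(len(col.storage.getvalue()) == 0)  # True  - nothing to recover
```

The caller cannot force `ColumnWriter` to persist its buffer from outside; that decision belongs to the format, which is exactly why Flume sticks to row-oriented formats.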