Why are column-oriented file formats not well suited to streaming writes?
There is a passage on page 137 of Hadoop: The Definitive Guide (4th Edition):
Column-oriented formats need more memory for reading and writing,
since they have to buffer a row split in memory, rather than just a
single row. Also, it’s not usually possible to control when writes
occur (via flush or sync operations), so column-oriented formats are
not suited to streaming writes, as the current file cannot be
recovered if the writer process fails. On the other hand, row-oriented
formats like sequence files and Avro datafiles can be read up to the
last sync point after a writer failure. It is for this reason that
Flume (see Chapter 14) uses row-oriented formats.
I don't understand why the current file cannot be recovered if the writer fails. Can someone explain the technical difficulty behind this statement:

we can not control when writes occur (via flush or sync operations)
It's simply that there is no block to recover. The explanation is right there in the quote: columnar formats (ORC, Parquet, etc.) decide on their own when to flush, because they must accumulate an entire row group in memory before the columns can be encoded and written out. If no flush has happened yet, there is no 'block' on disk at all. Since Flume cannot control when a columnar writer's in-memory buffer actually reaches storage, it cannot rely on such formats.
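To make that concrete, here is a minimal toy sketch (not real Parquet or SequenceFile code; the classes and buffer sizes are made up for illustration) contrasting a row-oriented writer that persists each record at a sync point with a columnar writer that buffers a whole row group in memory:

```python
class RowWriter:
    """Row-oriented: each record reaches storage at a per-record sync point."""
    def __init__(self):
        self.storage = []            # data that actually reached disk

    def write(self, record):
        self.storage.append(record)  # synced immediately, so it survives a crash


class ColumnarWriter:
    """Column-oriented: records are buffered until a full row group exists."""
    def __init__(self, row_group_size=3):
        self.storage = []            # data that actually reached disk
        self.buffer = []             # in-memory row group, lost on a crash
        self.row_group_size = row_group_size

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) == self.row_group_size:
            # Only now are the columns encoded and flushed to storage.
            self.storage.extend(self.buffer)
            self.buffer.clear()


row, col = RowWriter(), ColumnarWriter(row_group_size=3)
for r in ["r1", "r2"]:   # the writer "crashes" after two records,
    row.write(r)         # before the columnar row group has filled
    col.write(r)

# Simulated crash: in-memory state is gone; only `storage` survives.
print(row.storage)   # ['r1', 'r2']  -> readable up to the last sync point
print(col.storage)   # []            -> nothing ever reached disk to recover
```

The row-oriented file is readable up to the last sync marker; the columnar writer's records were still sitting in its buffer, so after a failure there is nothing on disk to recover. This is the behavior the quoted passage attributes to sequence files and Avro datafiles versus columnar formats.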