Appending to an ORC file

Question

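I am new to Big Data and related technologies, so I am unsure whether data can be appended to an existing ORC file. I am writing the ORC file with the Java API, and once I close the Writer I cannot open the file again to write new content to it, that is, to append new data.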

Is there a way to append data to an existing ORC file, using the Java API, Hive, or any other means?

One more point: when a Java util.Date object is saved into the ORC file, the ORC type is stored as:

struct<timestamp:struct<fasttime:bigint,cdate:struct<cachedyear:int,cachedfixeddatejan1:bigint,cachedfixeddatenextjan1:bigint>>,

and for a Java BigDecimal it is:

<margin:struct<intval:struct<signum:int,mag:struct<>,bitcount:int,bitlength:int,lowestsetbit:int,firstnonzerointnum:int>

Are these correct, and is there any documentation on this?

Answer 1

Yes, this is possible through Hive, where you can essentially 'concatenate' newer data. From the official Hive documentation: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-WhatisACIDandwhyshouldyouuseit?

Answer 2

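No, you cannot append directly to an ORC file. Nor to a Parquet file. Nor to any columnar format with a complex internal structure in which metadata is interleaved with data.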

Quoting the official "Apache Parquet" site...

Metadata is written after the data to allow for single pass writing.

Then quoting the official "Apache ORC" site...

Since HDFS does not support changing the data in a file after it is written, ORC stores the top level index at the end of the file (...) The file’s tail consists of 3 parts; the file metadata, file footer and postscript.

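Well, technically, nowadays you can append to an HDFS file; you can even truncate it. But these tricks are only useful for certain edge cases (e.g. Flume feeding messages into an HDFS "log file", micro-batch-wise, with an fflush from time to time).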
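To make the point about tail metadata concrete, here is a minimal sketch using only the Java standard library. It does not use the real ORC layout: the toy format below (a run of 8-byte values followed by a 4-byte row count) and the class and method names are assumptions made purely for illustration. It shows that a reader which locates its metadata at the end of the file is misled as soon as raw bytes are appended after the old footer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Toy format (NOT the real ORC layout): N 8-byte values, then a 4-byte row count.
class TailFooterDemo {

    // Writes all values first and the footer last, in a single pass.
    static void write(Path file, long... values) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES + Integer.BYTES);
        for (long v : values) {
            buf.putLong(v);
        }
        buf.putInt(values.length); // footer: row count at the very end
        Files.write(file, buf.array());
    }

    // Reads the footer from the last 4 bytes, as a tail-footer reader would.
    static int readRowCount(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return ByteBuffer.wrap(bytes, bytes.length - Integer.BYTES, Integer.BYTES).getInt();
    }

    // Naive append: raw bytes land AFTER the old footer, which is never updated.
    static void naiveAppend(Path file, long value) throws IOException {
        byte[] raw = ByteBuffer.allocate(Long.BYTES).putLong(value).array();
        Files.write(file, raw, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("toy", ".bin");
        write(file, 10L, 20L, 30L);
        System.out.println("rows before append: " + readRowCount(file)); // 3

        naiveAppend(file, 40L);
        // The reader now interprets the tail of the appended value as the footer.
        System.out.println("rows after naive append: " + readRowCount(file)); // 40, not 4
        Files.delete(file);
    }
}
```

In this toy case the only safe way to add a row is to rewrite the file with a fresh footer, which is why an append at the file-system level does not by itself give you an appendable columnar file.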

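For Hive transaction support they use a different trick: a new ORC file is created on each transaction (i.e. micro-batch), and periodic compaction jobs run in the background, à la HBase.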
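The "new file per transaction plus background compaction" idea can be sketched with the Java standard library alone. This is not Hive's implementation: plain text files stand in for ORC files, and the file names (`delta_…`, `base.txt`) and the class `DeltaCompactionDemo` are hypothetical. The point is only that no existing file is ever reopened for writing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Toy sketch: text files stand in for ORC files; all names are hypothetical.
class DeltaCompactionDemo {

    // Each batch ("transaction") is written to a brand-new file.
    static Path writeBatch(Path tableDir, int txnId, List<String> rows) throws IOException {
        Path delta = tableDir.resolve(String.format("delta_%07d.txt", txnId));
        return Files.write(delta, rows);
    }

    // Readers treat the table as the union of every file in the directory.
    static List<String> readTable(Path tableDir) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(tableDir)) {
            files = listing.sorted().collect(Collectors.toList());
        }
        List<String> rows = new ArrayList<>();
        for (Path f : files) {
            rows.addAll(Files.readAllLines(f));
        }
        return rows;
    }

    // Compaction rewrites the small files into a single one.
    // (A real compactor would swap files atomically; this toy version does not.)
    static void compact(Path tableDir) throws IOException {
        List<String> rows = readTable(tableDir);
        List<Path> old;
        try (Stream<Path> listing = Files.list(tableDir)) {
            old = listing.collect(Collectors.toList());
        }
        for (Path f : old) {
            Files.delete(f);
        }
        Files.write(tableDir.resolve("base.txt"), rows);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tab1");
        writeBatch(dir, 1, Arrays.asList("10,20"));
        writeBatch(dir, 2, Arrays.asList("20,30"));
        System.out.println(readTable(dir)); // rows from two separate files
        compact(dir);
        System.out.println(readTable(dir)); // same rows, now from one file
    }
}
```

The trade-off is the usual one: writes stay cheap and append-like, while readers pay for scanning many small files until compaction merges them.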

Answer 3

Update 2017

Yes, now you can! Hive provides new support for ACID, but you can also append data to your table using append mode, mode("append"), with Spark.

Below is an example:

Seq((10, 20)).toDF("a", "b").write.mode("overwrite").saveAsTable("tab1")
Seq((20, 30)).toDF("a", "b").write.mode("append").saveAsTable("tab1")
sql("select * from tab1").show

Or a more complete example with ORC; an extract follows:

val command = spark.read.format("jdbc").option("url" .... ).load()
command.write.mode("append").format("orc").option("orc.compression","gzip").save("command.orc")
