Apache Spark and Hudi: tons of output files

Question

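I'm trying to read data from many different .csv files (all with the same "structure"), perform some operations with Spark, and finally save the result in Hudi format.
To store the data in the same Hudi table, I thought the best approach was to use append mode when writing.
The problem is that this creates tons of small files whose total size far exceeds the size of the input dataset (10x in some cases).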

This is my Hudi configuration:

hudi_options = {
  'hoodie.table.name': tableName,
  'hoodie.datasource.write.recordkey.field': 'uuid',
  'hoodie.datasource.write.partitionpath.field': 'main_partition',
  'hoodie.datasource.write.table.name': tableName,
  'hoodie.datasource.write.operation': 'upsert',
  'hoodie.datasource.write.precombine.field': 'ts',
  'hoodie.upsert.shuffle.parallelism': 10, 
  'hoodie.insert.shuffle.parallelism': 10,
  'hoodie.delete.shuffle.parallelism': 10
}

The write operation is performed like this:

result_df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)

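where result_df is a Spark DataFrame that always has the same schema but different data, and basePath is constant.
I checked the contents of the output files and they have the correct schema/data. So, is there a way to append data to the same Hudi table files?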
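
One way to put numbers on the growth described above is to count the Parquet data files per partition and group them by file ID. This is only a minimal sketch under stated assumptions: the table lives on a local filesystem, and base file names follow the pattern fileId_writeToken_instantTime.parquet, so the text before the first underscore identifies the file group. The helper name summarize_hudi_files is hypothetical and not part of Hudi.

```python
import os
from collections import defaultdict


def summarize_hudi_files(base_path):
    """Count Parquet data files, bytes and file groups per partition (local FS only)."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0, "file_groups": set()})
    for root, dirs, files in os.walk(base_path):
        # Skip Hudi's own metadata directory (timeline, metadata table, ...).
        dirs[:] = [d for d in dirs if d != ".hoodie"]
        for name in files:
            if not name.endswith(".parquet"):
                continue  # ignores e.g. .hoodie_partition_metadata
            entry = summary[os.path.relpath(root, base_path)]
            entry["files"] += 1
            entry["bytes"] += os.path.getsize(os.path.join(root, name))
            # Assumption: the file ID is everything before the first underscore.
            entry["file_groups"].add(name.split("_", 1)[0])
    return {
        partition: {
            "files": e["files"],
            "bytes": e["bytes"],
            "file_groups": len(e["file_groups"]),
        }
        for partition, e in summary.items()
    }
```

If "files" is much larger than "file_groups" for a partition, most of the extra space is taken by older versions of the same files rather than by new data.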

I'm fairly new to Apache Spark and Hudi, so any help/suggestions would be much appreciated ;-)

Answer 1

Please raise a GitHub issue (http://github.com/apache/hudi/issues) to get a timely response from the community.

Answer 2

Apache Hudi works on the principle of MVCC (multi-version concurrency control), so every write creates a new version of an existing file in the following scenarios:

1. the file is smaller than the small-file limit (hoodie.parquet.small.file.limit, 100 MB by default), so new records are packed into it;
2. you are updating existing records in the file.

Add these two options to your hudi_options so that only the latest two versions are kept at any given time:

  "hoodie.cleaner.commits.retained": 1,
  "hoodie.keep.min.commits": 2
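
As a rough sketch of how the two cleaner settings could be merged into the options from the question: the helper with_cleaner_options and the table name "my_table" are hypothetical, and the check that the minimum number of kept commits exceeds the retained commits is a conservative assumption that matches the values above (2 > 1). The actual write is left commented out because it needs a running Spark session with the Hudi bundle on the classpath.

```python
def with_cleaner_options(base_options, commits_retained=1, min_commits=2):
    """Return a copy of base_options with the two cleaner settings added."""
    # Assumption: keep hoodie.keep.min.commits above the cleaner's retained commits.
    if min_commits <= commits_retained:
        raise ValueError("min_commits must be greater than commits_retained")
    merged = dict(base_options)  # leave the caller's dict untouched
    merged["hoodie.cleaner.commits.retained"] = commits_retained
    merged["hoodie.keep.min.commits"] = min_commits
    return merged


# Options from the question; "my_table" is a placeholder table name.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "main_partition",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "ts",
}

write_options = with_cleaner_options(hudi_options)

# With a Spark session available, the write from the question stays the same:
# result_df.write.format("hudi").options(**write_options).mode("append").save(basePath)
```

Older file versions are removed by the cleaner after later commits, so the file count should drop over subsequent writes rather than immediately.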

