How to export DynamoDB to S3 as a single file?

I have a DynamoDB table that needs to be exported to an S3 bucket every 24 hours using Data Pipeline. The exported data will in turn be queried by a Spark job.

The problem is that whenever I set up a Data Pipeline activity to do this, the output in S3 is multiple partitioned files.

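Is there a way to ensure that the entire table is exported to S3 as a single file? If not, is there a way in Spark to read the partitioned files using the manifest and combine them into one in order to query the data?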

There are two options here (the function should be called on the DataFrame before writing):

repartition(1)
coalesce(1)

However, as the documentation emphasizes, repartition is the better choice in your case:

However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
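As a rough illustration of the repartition approach, here is a minimal PySpark sketch. The helper name write_single_file, the bucket paths, and the choice of line-delimited JSON as the format are all assumptions made for this example, not details from the question; adjust them to match what your export actually produces.

```python
def write_single_file(df, output_path, fmt="json"):
    """Write a Spark DataFrame to output_path as a single part file.

    df is expected to be a pyspark.sql.DataFrame. repartition(1) adds a
    shuffle, so upstream stages can still run in parallel and only the
    final write is done by one task.
    """
    (df.repartition(1)
       .write
       .mode("overwrite")
       .format(fmt)
       .save(output_path))


def main():
    # Imported here so the helper above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("single-file-export").getOrCreate()

    # Hypothetical locations. Pointing the reader at a prefix makes Spark
    # load every part file under it into one DataFrame.
    df = spark.read.json("s3a://my-bucket/dynamodb-export/")
    write_single_file(df, "s3a://my-bucket/dynamodb-export-merged/")

    spark.stop()
```

Calling main() from a script submitted with spark-submit would run the whole flow. Note that Spark still writes a directory at the output path; with one partition it contains a single part-* data file alongside marker files such as _SUCCESS. If the Spark job only needs to query the data, reading the export prefix directly may be enough and the merge step can be skipped.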

Documentation:

repartition

coalesce


amazon-s3

amazon-data-pipeline

apache-spark