How to increase gcsfuse throughput when looping through many files?

I am processing more than 200,000 netCDF files, each about 17 MB. They are all in a Google Cloud Storage bucket, and I am trying to find a way to increase throughput when using gcsfuse.

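I am using a Google Compute Engine VM and gcsfuse to access the files. I looked into gsutil, but read in the Google Cloud documentation that "individual I/O streams run approximately as fast as gsutil." With gcsfuse, my NCL script will take 8 days to run, which is too long. Any suggestions on how to improve throughput? Thanks.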

Some things to consider to optimize transfer performance:

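Locate your Cloud Storage bucket and your Compute Engine VM instance in the same region.
Increase your Compute Engine VM instance's network bandwidth by creating an instance with more vCPU cores.
Increase the persistent disk throughput.
Use gsutil with the -m option (and -r for recursive copies) to run tasks in parallel. You can tune this further through parallel_thread_count.
Take a look at the guidance on scripting transfers.
If you use gcsfuse, check that you have version 0.27.0, which is optimized for parallel transfers.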
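
Since a single I/O stream is the limiting factor, one way to act on the "run tasks in parallel" advice is to read many files concurrently through the gcsfuse mount instead of looping over them one at a time. Below is a minimal Python sketch of that idea, not a tested recipe for this workload. The mount point /mnt/gcs-bucket, the *.nc pattern, the read_size worker and the max_workers value are all assumptions for illustration; you would replace read_size with whatever per-file processing your NCL script does and tune the worker count to your VM.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_size(path):
    """Placeholder worker (hypothetical): read the whole file, return its size in bytes."""
    with open(path, "rb") as f:
        return len(f.read())


def process_in_parallel(paths, worker=read_size, max_workers=16):
    """Apply `worker` to every path using a thread pool; return {path: result}."""
    paths = [str(p) for p in paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = list(pool.map(worker, paths))
    return dict(zip(paths, results))


if __name__ == "__main__":
    # Assumed gcsfuse mount point; adjust to wherever the bucket is mounted.
    mount_dir = Path("/mnt/gcs-bucket")
    files = sorted(mount_dir.glob("*.nc"))
    sizes = process_in_parallel(files)
    print(f"read {len(sizes)} files, {sum(sizes.values())} bytes")
```

Threads are used here because the work is mostly waiting on network I/O. If the per-file processing is CPU-heavy, splitting the file list across several independent processes may be a better fit.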

google-cloud-platform

gcsfuse