Disk I/O extremely slow on P100-NC6s-V2

Question

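I am training an image segmentation model on an Azure ML pipeline. During the testing step, I save the model's output to the associated blob storage. I then want to compute the IOU (Intersection over Union) between the computed output and the ground truth. Both sets of images are in blob storage. However, the IOU computation is extremely slow, and I think it is disk bound. In my IOU computation code I am only loading the two images (with the other code commented out), and each iteration still takes close to 6 seconds, while training and testing are fast enough.

Is this behavior normal? How should I debug this step?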

Answer 1

Some notes on the drives that an AzureML remote run has available:

Here is what I see when I run df on a remote run (in this one, I am using a blob Datastore):

Filesystem                             1K-blocks     Used  Available Use% Mounted on
overlay                                103080160 11530364   86290588  12% /
tmpfs                                      65536        0      65536   0% /dev
tmpfs                                    3568556        0    3568556   0% /sys/fs/cgroup
/dev/sdb1                              103080160 11530364   86290588  12% /etc/hosts
shm                                      2097152        0    2097152   0% /dev/shm
//danielscstorageezoh...-620830f140ab 5368709120  3702848 5365006272   1% /mnt/batch/tasks/.../workspacefilestore
blobfuse                               103080160 11530364   86290588  12% /mnt/batch/tasks/.../workspaceblobstore

The interesting items are overlay, /dev/sdb1, //danielscstorageezoh...-620830f140ab and blobfuse:

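overlay and /dev/sdb1 are both mounts of the local SSD on the machine (I am using a STANDARD_D2_V2, which has a 100 GB SSD).

//danielscstorageezoh...-620830f140ab is the mount of the Azure File Share that contains the project files (your script, etc.). It is also the current working directory for your run.

blobfuse is the blob store that I requested to be mounted in the Estimator when I executed the run.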
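To check which of these drives a given path actually lives on, something like the following may help. It is only a sketch: `find_mount` is a hypothetical helper (not part of any Azure SDK), and it assumes a Linux host where mount information is available in the `/proc/mounts` format (device, mount point, filesystem type per line).

```python
import os


def find_mount(path, mounts_text):
    """Return (device, mount_point, fs_type) of the mount that holds `path`.

    `mounts_text` is expected in /proc/mounts format. The longest matching
    mount point wins, so a path below a blobfuse mount is not attributed
    to the root overlay filesystem.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        device, mount_point, fs_type = fields[:3]
        # A mount applies if the path is the mount point itself or lies below it.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[1]):
                best = (device, mount_point, fs_type)
    return best
```

Inside a run, calling `find_mount(os.getcwd(), open("/proc/mounts").read())` should report whether the working directory sits on the file share, and the same call on an image path should show whether reads go through blobfuse.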

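I was curious about the performance differences between these three types of drives. My mini benchmark was to download and extract this file: http://download.tensorflow.org/example_images/flower_photos.tgz (a 220 MB tar file that contains about 3600 JPEG images of flowers).

Here are the results: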

Filesystem/Drive         Download_and_save       Extract
Local_SSD                               2s            2s  
Azure File Share                        9s          386s
Premium File Share                     10s          120s
Blobfuse                               10s          133s
Blobfuse w/ Premium Blob                8s          121s

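In summary, writing small files is much, much slower on the network drives, so it is highly recommended to use /tmp or Python tempfile if you are writing smaller files.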
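One way to apply this is to keep all small-file work on the local disk and move data to or from the network mount only as a single archive, since the table above suggests that one large transfer is far cheaper than many small writes. The sketch below assumes that pattern fits your job; the function names and the `local_data_` prefix are made up for illustration.

```python
import os
import shutil
import tarfile
import tempfile


def extract_to_local(archive_path):
    """Extract a tar archive into a new directory under the local temp dir."""
    local_dir = tempfile.mkdtemp(prefix="local_data_")
    with tarfile.open(archive_path) as tar:
        tar.extractall(local_dir)
    return local_dir


def write_small_files_as_archive(files, dest_archive):
    """Write small files on local disk, then store them as one .tar.gz.

    `files` maps file names to bytes. `dest_archive` may point at a network
    mount; only the single archive file is written there.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "files")
        os.makedirs(src)
        for name, data in files.items():
            with open(os.path.join(src, name), "wb") as f:
                f.write(data)
        # Build the archive locally, then copy it over in one go.
        local_archive = shutil.make_archive(
            os.path.join(tmp, "bundle"), "gztar", root_dir=src
        )
        shutil.copyfile(local_archive, dest_archive)
    return dest_archive
```

For the IOU step in the question, this would mean archiving the predictions once, extracting them (and the ground truth) to local disk with `extract_to_local`, and reading the image pairs from there.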

For reference, here is the script I ran to measure: https://gist.github.com/danielsc/9f062da5e66421d48ac5ed84aabf8535

And this is how I ran it: https://gist.github.com/danielsc/6273a43c9b1790d82216bdaea6e10e5c
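If the gists are unavailable, a measurement of this kind can be approximated roughly as follows. This is not the script linked above, only a minimal sketch of the same idea: time the download and the extraction separately for a given target directory. The function names are assumptions.

```python
import os
import tarfile
import time
import urllib.request

URL = "http://download.tensorflow.org/example_images/flower_photos.tgz"


def time_download(url, dest_path):
    """Download `url` to `dest_path` and return the elapsed seconds."""
    start = time.perf_counter()
    urllib.request.urlretrieve(url, dest_path)
    return time.perf_counter() - start


def time_extract(archive_path, target_dir):
    """Extract a tar archive into `target_dir` and return the elapsed seconds."""
    start = time.perf_counter()
    with tarfile.open(archive_path) as tar:
        tar.extractall(target_dir)
    return time.perf_counter() - start


def benchmark(target_dir, url=URL):
    """Run both steps against one directory (local SSD, file share, blobfuse)."""
    os.makedirs(target_dir, exist_ok=True)
    archive = os.path.join(target_dir, "archive.tgz")
    download_s = time_download(url, archive)
    extract_s = time_extract(archive, target_dir)
    return {"download_s": download_s, "extract_s": extract_s}
```

Running `benchmark` once per mount point (for example a directory under /tmp and one under the blobfuse mount) gives numbers comparable in spirit to the table above, though absolute values will depend on the VM size and storage tier.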


tensorflow

azure-machine-learning-service