Randomly selected files take longer to load with numpy.load than sequential ones

Question

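Context

While training a neural network, I noticed that the time spent per batch increases when I increase the size of the dataset (without changing the batch size). Importantly, I need to fetch 20 .npy files for each data point, and this number does not depend on the dataset size.

Problem

Training goes from 2 s/iteration to 10 s/iteration. There is no obvious reason why training should take longer. However, I managed to track down the bottleneck: it seems to be related to the loading of the .npy files.

To reproduce this behavior, here is a small script you can run to generate 10,000 dummy .npy files: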

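import os
import random

import numpy as np

def path(i):
    return os.sep.join(('../datasets/test', str(i)))

def create_dummy_files(N=10000):
    # Each file gets a random 128-bit integer as its name; np.save appends ".npy".
    for i in range(N):
        x = np.random.random((100, 100))
        np.save(path(random.getrandbits(128)), x)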

Then you can run the following two scripts and compare them yourself.

The first script, where 20 .npy files are randomly selected and loaded:

L = os.listdir('../datasets/test')
S = random.sample(L, 20)
for s in S:
    np.load(path(s)) # <- timed this

The second version, where 20 'sequential' .npy files are selected and loaded:

L = os.listdir('../datasets/test')
i = 100
S = L[i: i + 20]
for s in S:
    np.load(path(s)) # <- timed this

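I tested both scripts and ran them 100 times each (in the second script I used the iteration count as the value for i, so the same files are not loaded twice). I wrapped the np.load(path(s)) line with time.time() calls. I did not time the sampling, only the loading. Here are the results:

Random loading: times roughly between 0.1 s and 0.4 s, average 0.25 s.
Non-random loading: times roughly between 0.010 s and 0.014 s, average 0.01 s.

I assume these times are related to the CPU's activity when the scripts are run. However, that does not explain the gap. Why are the two results so different? Does it have something to do with the way the files are indexed?

Edit: I printed S in the random-sample script, copied the list of 20 filenames, and ran it again with S defined as that literal list. The time it took was comparable to the 'sequential' script. This means it is not related to the files being non-contiguous in the fs or anything like that. It seems the random sampling gets counted in the timer, yet the timing is defined as: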

t = time.time()
np.load(path(s))
print(time.time() - t)

I also tried wrapping np.load (exclusively) with cProfile: same result.
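For reference, here is a minimal sketch of timing nothing but the np.load call, so that selection and sampling cannot leak into the measurement. It is not the original script: time_loads is a hypothetical helper, the files are small demo arrays written to a temporary directory, and time.perf_counter stands in for time.time because it is a monotonic, higher-resolution clock.

```python
import os
import tempfile
import time

import numpy as np


def time_loads(paths):
    """Return the wall-clock time spent in np.load for each path."""
    timings = []
    for p in paths:
        start = time.perf_counter()  # started right before the load, nothing else is timed
        np.load(p)
        timings.append(time.perf_counter() - start)
    return timings


# Hypothetical demo data: a few arrays in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(5):
        p = os.path.join(tmp, f"{i}.npy")
        np.save(p, np.random.random((100, 100)))
        paths.append(p)
    timings = time_loads(paths)
    print(f"mean load time: {sum(timings) / len(timings):.6f}s")
```

Keeping the per-file timings in a list, rather than only a total, makes it possible to see whether a few slow loads or all of them account for the gap.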

Answer 1

I did say:

I tested both scripts and ran them 100 times each (in the 2nd script I used the iteration count as the value for i so the same files are not loaded twice)

But as tevemadar mentioned:

i should be randomized

I completely messed up the selection of different files in the second version. My code timed the scripts 100 times like this:

for i in trange(100):
   if rand:
      S = random.sample(L, 20)
   else:
      S = L[i: i+20] # <- every loop there's only 1 new file added in the selection, 
                     #    19 files will have already been cached in the previous fetch

The second script should have used S = L[100*i: 100*i+20], so that consecutive iterations share no files!
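The overlap can be checked without touching the disk. The sketch below uses a made-up list of names as a stand-in for os.listdir(...) and a hypothetical helper, shared, to count how many files each selection has in common with the previous one.

```python
# Stand-in for L = os.listdir('../datasets/test'); the names are made up.
names = [f"file_{k}" for k in range(10_000)]


def shared(windows):
    """Count the files each window has in common with the window before it."""
    return [len(set(a) & set(b)) for a, b in zip(windows, windows[1:])]


overlapping = [names[i: i + 20] for i in range(100)]          # the buggy selection
strided = [names[100 * i: 100 * i + 20] for i in range(100)]  # the corrected selection

print(set(shared(overlapping)))        # {19}
print(set(shared(strided)))            # {0}
print(len(set().union(*overlapping)))  # 119 distinct files over all 100 iterations
print(len(set().union(*strided)))      # 2000 distinct files over all 100 iterations
```

With the buggy slice, 100 iterations only ever touch 119 distinct files, so almost every load is served from files already read in the previous iteration.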

And yes, when timed that way, the results are comparable.
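A rough sketch of what the corrected comparison could look like, under a few assumptions: load_time and benchmark are hypothetical helpers, the data set is scaled down to 500 files in a temporary directory, and only 5 iterations are run. Freshly written files are most likely still in the OS cache, so this sketch illustrates the structure of a fair comparison rather than reproducing the cold-read timings from the question.

```python
import os
import random
import tempfile
import time

import numpy as np


def load_time(directory, names):
    """Total wall-clock time spent in np.load for the given file names."""
    total = 0.0
    for name in names:
        start = time.perf_counter()
        np.load(os.path.join(directory, name))
        total += time.perf_counter() - start
    return total


def benchmark(directory, n_iter=5, k=20, stride=100):
    """Time random samples against non-overlapping sequential windows."""
    names = os.listdir(directory)
    random_times, sequential_times = [], []
    for i in range(n_iter):
        random_times.append(load_time(directory, random.sample(names, k)))
        # Windows are `stride` apart, so no sequential window reuses an earlier one's files.
        window = names[stride * i: stride * i + k]
        sequential_times.append(load_time(directory, window))
    return random_times, sequential_times


with tempfile.TemporaryDirectory() as tmp:
    for i in range(500):  # assumption: scaled down from the 10,000 files in the question
        np.save(os.path.join(tmp, str(i)), np.random.random((100, 100)))
    rand_t, seq_t = benchmark(tmp)
    print(f"random:     {sum(rand_t) / len(rand_t):.4f}s per iteration")
    print(f"sequential: {sum(seq_t) / len(seq_t):.4f}s per iteration")
```

Note that a random sample may still happen to include a file that a later sequential window also loads; with 20 files drawn per iteration this is rare enough to ignore in a sketch, but a stricter benchmark would draw both selections from disjoint sets of files.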
