The first tf.Session.run() performs dramatically differently from later runs. Why?

Here is an example to clarify what I mean:

First session.run():
[Image: First run of a TensorFlow session]

Later session.run():
[Image: Later runs of a TensorFlow session]

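I understand that TensorFlow is doing some initialization here, but I'd like to know where in the source code this shows up. It happens on both CPU and GPU, but the effect is more prominent on GPU. For example, in the case of an explicit Conv2D operation, the first run has many more Conv2D operations in the GPU stream. In fact, if I change the input size of the Conv2D, it can go from tens to hundreds of streamed Conv2D operations. In later runs, however, there are always only five Conv2D operations in the GPU stream (regardless of input size). When running on CPU, the first run and later runs have the same operation list, but we do see the same time discrepancy.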

Which part of the TensorFlow source code is responsible for this behavior? Where are the GPU operations "split"?

Thanks for your help!

The tf.nn.conv2d() op takes much longer to run on the first tf.Session.run() invocation because, by default, TensorFlow uses cuDNN's autotune facility to choose how to run subsequent convolutions as fast as possible. You can see the autotune invocation here.
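To see the one-time autotune cost in isolation, it can help to time the first call separately from the rest. The following is only a minimal sketch: `time_calls` and `split_warmup` are hypothetical helper names, not part of TensorFlow, and `sess` and `conv_op` in the usage comment stand for whatever session and op you already have.

```python
import time


def time_calls(fn, n_calls=5):
    """Call fn() n_calls times; return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def split_warmup(durations):
    """Return (first-call duration, mean duration of the remaining calls)."""
    first, rest = durations[0], durations[1:]
    steady = sum(rest) / len(rest) if rest else float("nan")
    return first, steady


# Hypothetical usage with an existing TF 1.x session and op:
#   first, steady = split_warmup(time_calls(lambda: sess.run(conv_op)))
```

If autotuning is the cause, the first value should be much larger than the steady-state mean, and the gap should shrink when autotuning is turned off.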

There is an undocumented environment variable that you can use to disable autotuning. Set TF_CUDNN_USE_AUTOTUNE=0 when you start the process that runs TensorFlow (e.g. the python interpreter) to disable it.
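As a rough sketch, one way to set the variable only for the TensorFlow process is to launch it from a small Python wrapper. `run_without_autotune` is a hypothetical helper name, and `train.py` in the usage comment is a placeholder for your own script.

```python
import os
import subprocess
import sys


def run_without_autotune(args):
    """Run a command with TF_CUDNN_USE_AUTOTUNE=0 added to its environment."""
    # Copy the current environment and override only the autotune variable.
    env = dict(os.environ, TF_CUDNN_USE_AUTOTUNE="0")
    return subprocess.run(args, env=env, capture_output=True, text=True, check=True)


# Hypothetical usage:
#   result = run_without_autotune([sys.executable, "train.py"])
#   print(result.stdout)
```

From a shell, prefixing the command with `TF_CUDNN_USE_AUTOTUNE=0` achieves the same thing. Setting the variable in `os.environ` before importing tensorflow will probably work too, assuming the variable is only read after that point.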

cublas

tensorflow

cudnn

tensorflow-gpu

tensorflow-xla