How to interpret TensorFlow output?
How do you interpret the TensorFlow output produced while building and executing a computational graph on a GPGPU?
Given the following command, which executes an arbitrary TensorFlow script using the Python API:
python3 tensorflow_test.py > out
The first part, stream_executor, appears to be loading its dependencies:
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
What is a NUMA node?
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I assume this is where it finds the available GPU:
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:01:00.0
Total memory: 11.25GiB
Free memory: 11.15GiB
Some GPU initialization? And what is DMA?
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:01:00.0)
Why is an error (E) thrown here?
E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 11.15G (11976531968 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Finally, an explanation of what the pool_allocator is doing would be appreciated:
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 3160 get requests, put_count=2958 evicted_count=1000 eviction_rate=0.338066 and unsatisfied allocation rate=0.412025
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1743 get requests, put_count=1970 evicted_count=1000 eviction_rate=0.507614 and unsatisfied allocation rate=0.456684
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 256 to 281
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1986 get requests, put_count=2519 evicted_count=1000 eviction_rate=0.396983 and unsatisfied allocation rate=0.264854
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 655 to 720
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 28728 get requests, put_count=28680 evicted_count=1000 eviction_rate=0.0348675 and unsatisfied allocation rate=0.0418407
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 1694 to 1863
On NUMA -- https://software.intel.com/en-us/articles/optimizing-applications-for-numa
Roughly speaking, if you have a dual-socket CPU, each socket has its own memory and must access the other processor's memory over a slower QPI link. So each CPU-plus-memory pair is a NUMA node.
Potentially you could treat two different NUMA nodes as two different devices and structure your network to optimize for the different within-node/between-node bandwidths.
However, I don't think there's enough plumbing in TF right now to do this. The detection doesn't work either -- I just tried on a machine with 2 NUMA nodes, and it still printed the same message and initialized to 1 NUMA node.
DMA = Direct Memory Access. You could potentially copy things from one GPU to another GPU without involving the CPU (i.e., over NVLink). There's no NVLink integration yet.
As far as the error goes, TensorFlow tries to allocate memory close to the GPU's maximum, so it sounds like some of your GPU memory was already allocated to something else, and the allocation failed.
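As a sanity check on the numbers in that error line, the byte count it reports maps back to the GiB figure once you divide by 1024^3 (a quick Python check, just to illustrate the units):

```python
# The OOM message reports both bytes and GiB; they agree:
failed_bytes = 11976531968     # from the CUDA_ERROR_OUT_OF_MEMORY line
gib = failed_bytes / 2**30     # GiB = bytes / 1024^3

print(round(gib, 2))           # -> 11.15, matching "failed to allocate 11.15G"
```

So the allocator really did try to grab essentially all of the 11.15 GiB reported free at startup.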
You can do something like the following to avoid allocating too much memory:
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.3  # don't hog all the VRAM
config.operation_timeout_in_ms = 15000  # terminate on long hangs
sess = tf.InteractiveSession("", config=config)
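Alternatively (still the TF1-era API used in the snippet above), gpu_options.allow_growth lets the process start small and grow its allocation on demand, rather than reserving a fixed fraction up front:

```python
import tensorflow as tf  # TF1-era API, matching the snippet above

config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True  # grab GPU memory on demand, not all at once
sess = tf.InteractiveSession("", config=config)
```

This is a config fragment rather than a complete script; which option fits better depends on whether other processes share the GPU.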
successfully opened CUDA library xxx locally
means the library was loaded; it does not mean that it will be used.
successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
means your kernel does not have NUMA support. You can read about NUMA here and here.
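What the log line describes can be sketched as: TensorFlow reads the device's numa_node attribute from sysfs, and when the kernel reports a negative value (no NUMA support), it falls back to node zero. A minimal Python sketch of that fallback logic, assuming the standard Linux sysfs location for a PCI device's NUMA node:

```python
import os

def numa_node_for_pci_device(pci_bus_id):
    """Mimic TF's fallback: read the sysfs NUMA node, substitute 0 for a negative value."""
    path = "/sys/bus/pci/devices/%s/numa_node" % pci_bus_id
    try:
        with open(path) as f:
            node = int(f.read().strip())
    except (OSError, ValueError):
        node = -1  # attribute missing or unreadable: treat like "no NUMA info"
    # Kernels without NUMA support report -1; TF returns node zero instead.
    return node if node >= 0 else 0
```

On the machine from the question, numa_node_for_pci_device("0000:01:00.0") would read -1 from sysfs and return 0, which is exactly the substitution the log message announces.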
Found device 0 with properties:
You have 1 GPU available to use, and it lists that GPU's properties.
- DMA is Direct Memory Access. More information on Wikipedia.
failed to allocate 11.15G
The error explains clearly what happened, but it is hard to say why so much memory was needed without looking at the code.
- The pool allocator messages are explained in this answer.
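The rates in those PoolAllocator lines are plain ratios of the counters they report: eviction_rate is evicted_count divided by put_count. A quick check against the first logged line (assuming that formula, which the numbers bear out):

```python
# From the first PoolAllocator line: put_count=2958, evicted_count=1000
put_count, evicted_count = 2958, 1000
eviction_rate = evicted_count / put_count

print(round(eviction_rate, 6))  # -> 0.338066, matching the logged eviction_rate
```

The "Raising pool_size_limit_" lines that follow are the allocator reacting to those rates by growing the pool so fewer requests miss.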