SVD in TensorFlow is slower than in numpy
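I observe that on my machine SVD in TensorFlow runs significantly slower than in numpy. I have a GTX 1080 GPU and expected SVD to be at least as fast as the same code running on the CPU (numpy).
Environment info
Operating system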
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.10
Release: 16.10
Codename: yakkety
Installed versions of CUDA and cuDNN:
ls -l /usr/local/cuda-8.0/lib64/libcud*
-rw-r--r-- 1 root root 556000 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root 19 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0 -> libcudart.so.8.0.61
-rwxr-xr-x 1 root root 415432 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
-rw-r--r-- 1 root root 775162 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart_static.a
lrwxrwxrwx 1 voldemaro users 13 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 voldemaro users 18 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5 -> libcudnn.so.5.1.10
-rwxr-xr-x 1 voldemaro users 84163560 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5.1.10
-rw-r--r-- 1 voldemaro users 70364814 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn_static.a
TensorFlow setup
python -c "import tensorflow; print(tensorflow.__version__)"
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
1.0.0
Code:
'''
Created on Sep 21, 2017
@author: voldemaro
'''
import time

import numpy as np
import numpy.linalg as NLA
import tensorflow as tf

N = 1534
svd_array = np.random.random_sample((N, N))
svd_array = svd_array.astype(complex)

# The TensorFlow variable is complex64, while the numpy array stays complex128.
specVar = tf.Variable(svd_array, dtype=tf.complex64)
# tf.svd returns the singular values first, then the left and right singular vectors.
[D2, E1, E2] = tf.svd(specVar)
init_OP = tf.global_variables_initializer()

with tf.Session() as sess:
    # Initialize all tensorflow variables
    start = time.time()
    sess.run(init_OP)
    print('initializing variables: {} s'.format(time.time() - start))

    start_time = time.time()
    [d, e1, e2] = sess.run([D2, E1, E2])
    print("Tensorflow SVD ---: {} s".format(time.time() - start_time))

# Equivalent numpy
start = time.time()
u, s, v = NLA.svd(svd_array)
print('numpy SVD ---: {} s'.format(time.time() - start))
Code trace:
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.11GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
initializing variables: 0.230546951294 s
Tensorflow SVD ---: 6.56117296219 s
numpy SVD ---: 4.41714000702 s
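GPU execution usually outperforms CPU execution only when the work can be parallelized effectively.
However, parallelizing the SVD algorithm is still an active area of research, which means no parallel version has yet been found that is much more advantageous than a serial implementation.
Most likely, the NumPy version is based on a very optimized FORTRAN implementation, while I believe TensorFlow has its own C++ implementation, which apparently is not as optimized as the code NumPy calls.
EDIT: You may not be the first one to observe poorer performance of TensorFlow with SVD compared to a FORTRAN implementation.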
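One way to check the claim about NumPy's backend is to ask NumPy which BLAS/LAPACK libraries it was built against. The sketch below is only illustrative: the function name `report_numpy_backend` is made up for this example, and the layout of what `np.show_config()` prints differs between NumPy versions.

```python
import numpy as np


def report_numpy_backend():
    """Print the NumPy version and its BLAS/LAPACK build information."""
    print("NumPy version: {}".format(np.__version__))
    # show_config() prints the libraries NumPy was linked against
    # (for example MKL or OpenBLAS); the layout depends on the NumPy version.
    np.show_config()


if __name__ == "__main__":
    report_numpy_backend()
```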
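It looks like the TensorFlow op implements gesvd, whereas an MKL-enabled numpy/scipy (i.e., what you get if you use conda) defaults to the faster (but numerically less robust) gesdd.
You can try comparing against gesvd in scipy: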
from scipy import linalg
u0, s0, vt0 = linalg.svd(target0, lapack_driver="gesvd")
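To see how much of the difference comes from the LAPACK driver alone, both drivers can be timed on the same matrix. This is a rough sketch rather than a careful benchmark: the function name `time_svd_drivers` and the default matrix size are assumptions for illustration, and a single timing per driver will be noisy.

```python
import time

import numpy as np
from scipy import linalg


def time_svd_drivers(n=300, seed=0):
    """Time scipy's SVD once per LAPACK driver on one random n x n matrix.

    Returns a dict mapping driver name to (elapsed_seconds, singular_values).
    """
    rng = np.random.RandomState(seed)
    a = rng.random_sample((n, n))
    results = {}
    for driver in ("gesdd", "gesvd"):
        start = time.time()
        _, s, _ = linalg.svd(a, lapack_driver=driver)
        results[driver] = (time.time() - start, s)
    return results


if __name__ == "__main__":
    for driver, (elapsed, _) in sorted(time_svd_drivers().items()):
        print("{}: {:.3f} s".format(driver, elapsed))
```

Calling `time_svd_drivers(n=1534)` mirrors the matrix size used in the question. Both drivers should return the same singular values up to rounding, so any gap in the timings reflects the driver and not the result.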
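I have also seen better results with the MKL version, so I have been using this helper class to switch transparently between the TensorFlow and numpy versions of SVD, using tf.Variable to store the results.
You use it like this: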
result = SvdWrapper(tensor)
result.update()
sess.run([result.u, result.s, result.v])
Issue with more details about the slowness: https://github.com/tensorflow/tensorflow/issues/13222