Multi GPU training slower than single GPU on Tensorflow
I created 3 virtual GPUs (on a machine with 1 physical GPU) and tried to speed up image vectorization. However, with the code below and the manual device placement from the official docs (here), I get a strange result: running on all GPUs is twice as slow as running on a single one. I also checked this code (with the virtual device initialization removed) on a machine with 3 physical GPUs - it behaves the same.
Environment: Python 3.6, Ubuntu 18.04.3, tensorflow-gpu 1.14.0.
Code (this example creates 3 virtual devices, so you can test it on a PC with one GPU):
import os
import time
import numpy as np
import tensorflow as tf
from PIL import Image

start = time.time()

def load_graph(frozen_graph_filename):
    # We load the protobuf file from the disk and parse it to retrieve the
    # unserialized graph_def
    with tf.gfile.GFile(frozen_graph_filename, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    # Then, we import the graph_def into a new Graph and return it
    with tf.Graph().as_default() as graph:
        # The name var will prefix every op/node in your graph
        # Since we load everything in a new graph, this is not needed
        tf.import_graph_def(graph_def, name="")
    return graph

path_to_graph = '/imagenet/'  # Path to imagenet folder where graph file is placed
GRAPH = load_graph(os.path.join(path_to_graph, 'classify_image_graph_def.pb'))

# Create Session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
session = tf.Session(graph=GRAPH, config=config)

output_dir = '/vectors/'  # where to save vectors from images
selected_list = ['1.jpg', '2.jpg', '3.jpg']  # list with images to vectorize (same list as image_list below)

# Single GPU vectorization
for image_index, image in enumerate(selected_list):
    with Image.open(image) as f:
        image_data = f.convert('RGB')
        feature_tensor = session.graph.get_tensor_by_name('pool_3:0')
        feature_vector = session.run(feature_tensor, {'DecodeJpeg:0': image_data})
        feature_vector = np.squeeze(feature_vector)
        outfile_name = os.path.basename(image) + ".vc"
        out_path = os.path.join(output_dir, outfile_name)
        # Save vector
        np.savetxt(out_path, feature_vector, delimiter=',')

print(f"Single GPU: {time.time() - start}")
start = time.time()
print("Start calculation on multiple GPU")

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Create 3 virtual GPUs with 1GB memory each
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

print("Create prepared ops")
start1 = time.time()

gpus = logical_gpus  # comment this line to use physical GPU devices for calculations
image_list = ['1.jpg', '2.jpg', '3.jpg']  # list with images to vectorize (tested on 100 and 1000 examples)

# Assign a chunk of the list to each GPU
# image_list1, image_list2, image_list3 = image_list[:len(image_list)//3], \
#                                         image_list[len(image_list)//3:2*len(image_list)//3], \
#                                         image_list[2*len(image_list)//3:]
selected_list = image_list  # comment this line if you want to assign a chunk of the list manually to each GPU
output_vectors = []

if gpus:
    # Replicate your computation on multiple GPUs
    feature_vectors = []
    for gpu in gpus:  # iterating over virtual GPU devices, not physical
        with tf.device(gpu.name):
            print(f"Assign list of images to {gpu.name.split(':', 4)[-1]}")
            # Try to assign a chunk of the image list to each GPU - takes the same time as a single GPU
            # if gpu.name.split(':', 4)[-1] == "GPU:0":
            #     selected_list = image_list1
            # if gpu.name.split(':', 4)[-1] == "GPU:1":
            #     selected_list = image_list2
            # if gpu.name.split(':', 4)[-1] == "GPU:2":
            #     selected_list = image_list3
            for image_index, image in enumerate(selected_list):
                with Image.open(image) as f:
                    image_data = f.convert('RGB')
                    feature_tensor = session.graph.get_tensor_by_name('pool_3:0')
                    feature_vector = session.run(feature_tensor, {'DecodeJpeg:0': image_data})
                    feature_vectors.append(feature_vector)

print("All images has been assigned to GPU's")
print(f"Time spend on prep ops: {time.time() - start1}")
print("Start calculation on multiple GPU")
start1 = time.time()

for image_index, image in enumerate(image_list):
    feature_vector = np.squeeze(feature_vectors[image_index])
    outfile_name = os.path.basename(image) + ".vc"
    out_path = os.path.join(output_dir, outfile_name)
    # Save vector
    np.savetxt(out_path, feature_vector, delimiter=',')

# Close session
session.close()

print(f"Calc on GPU's spend: {time.time() - start1}")
print(f"All time, spend on multiple GPU: {time.time() - start}")
Output (from a list of 100 images):
1 Physical GPU, 3 Logical GPUs
Single GPU: 18.76301646232605
Start calculation on multiple GPU
Create prepared ops
Assign list of images to GPU:0
Assign list of images to GPU:1
Assign list of images to GPU:2
All images has been assigned to GPU's
Time spend on prep ops: 18.263537883758545
Start calculation on multiple GPU
Calc on GPU's spend: 11.697082042694092
All time, spend on multiple GPU: 29.960679531097412
What I tried: splitting the list of images into 3 chunks and assigning each chunk to a GPU (see the commented-out lines in the code). This reduces the multi-GPU time to 17 seconds, which is only slightly (~5%) faster than the single-GPU run of 18 seconds.
Expected result: the multi-GPU version is faster than the single-GPU version (at least a 1.5x speedup).
Ideas why this happens: I have written the computation in the wrong way.
There are two basic misunderstandings that are causing your trouble:
1. with tf.device(...): applies to the graph nodes created inside the scope, not to Session.run calls.
2. Session.run is a blocking call. Session.run calls do not run in parallel; TensorFlow can only parallelize the contents of a single Session.run.
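To make the first point concrete, here is a minimal sketch (not part of the original answer; the ops are made up) showing that placement is fixed when the ops are created, so a tf.device scope around Session.run changes nothing; log_device_placement prints where each op actually runs:

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    with tf.device('/GPU:0'):  # placement is decided here, at graph-build time
        x = tf.random_uniform((1000, 1000))
        y = tf.matmul(x, x)

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(graph=g, config=config) as sess:
    with tf.device('/GPU:1'):  # this scope has no effect: y is already pinned to GPU:0
        result = sess.run(y)   # and the call blocks until the result is ready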
Modern TF (>= 2.0) can make this much easier.
Mainly, you can stop using tf.Session and tf.Graph. Use @tf.function instead; I believe this basic structure will work:
@tf.function
def my_function(inputs, gpus, model):
    results = []
    for input, gpu in zip(inputs, gpus):
        with tf.device(gpu):
            results.append(model(input))
    return results
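A rough usage sketch, not from the original answer: it assumes the logical GPUs created earlier, a Keras InceptionV3 standing in for the frozen Inception graph, and batch_0..batch_2 as hypothetical tensors of already-decoded images, one per GPU:

# Hypothetical usage: batch_0..batch_2 are assumed tensors of preprocessed images
logical_gpus = [g.name for g in tf.config.experimental.list_logical_devices('GPU')]
model = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')  # stand-in for the frozen graph
vectors = my_function([batch_0, batch_1, batch_2], logical_gpus, model)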
But you will want to try a more realistic test: with only 3 images you are not measuring real performance at all.
Also note:
- The tf.distribute.Strategy class can help simplify some of this by separating the device specification from the @tf.function being run: strategy.experimental_run_v2(my_function, args=(dataset_inputs,))
- Using a tf.data.Dataset input pipeline will help you overlap the loading/preprocessing with the model execution (see the sketch after this list).
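A combined sketch of both notes, assuming TF 2.x, a list of file paths in image_paths, and InceptionV3 standing in for the frozen graph (in later releases experimental_run_v2 was renamed to Strategy.run):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible (logical) GPU

def load_and_preprocess(path):
    # Decode and resize on the CPU as part of the input pipeline
    img = tf.image.decode_jpeg(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, (299, 299))
    return tf.keras.applications.inception_v3.preprocess_input(img)

dataset = (tf.data.Dataset.from_tensor_slices(image_paths)  # image_paths: assumed list of files
           .map(load_and_preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.experimental.AUTOTUNE))  # overlap loading with model execution
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    model = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')

@tf.function
def extract_features(images):
    return strategy.experimental_run_v2(model, args=(images,))

for batch in dist_dataset:
    per_replica_vectors = extract_features(batch)  # one result per replica/GPU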
However, if you really intend to do this with tf.Graph and tf.Session, I think you basically need to reorganize your code from this:
# Your code:
# Builds a graph
graph = build_graph()

for gpu in gpus():
    with tf.device(gpu):
        # Calls Session.run in each device scope.
        session.run(...)
To this:
g = tf.Graph()
with g.as_default():
    results = []
    for gpu in gpus:
        # Build the graph, on each device
        input = iterator.get_next()
        with tf.device(gpu):
            results.append(my_function(input))

# Use a single `Session.run` call
np_result = session.run(results, feed_dict={inputs: my_inputs})
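Applied to the frozen Inception graph from the question, a hedged sketch of that reorganization could look like the following. It assumes graph_def is the parsed GraphDef from load_graph, logical_gpus and config are as in the question, and one_image_per_gpu is a hypothetical list of three decoded images:

g = tf.Graph()
with g.as_default():
    placeholders, outputs = [], []
    for i, gpu in enumerate(logical_gpus):
        with tf.device(gpu.name):
            image_in = tf.placeholder(tf.uint8, shape=(None, None, 3), name=f'image_{i}')
            # Re-import the frozen graph once per device, wiring its input to the placeholder
            pool_3, = tf.import_graph_def(graph_def,
                                          input_map={'DecodeJpeg:0': image_in},
                                          return_elements=['pool_3:0'],
                                          name=f'replica_{i}')
            placeholders.append(image_in)
            outputs.append(pool_3)

with tf.Session(graph=g, config=config) as sess:
    # One blocking call fetches all three outputs, so TensorFlow can run the replicas concurrently
    vectors = sess.run(outputs,
                       feed_dict={p: img for p, img in zip(placeholders, one_image_per_gpu)})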