Multi-GPU TFF simulation errors "Detected dataset reduce op in multi-GPU TFF simulation"
I am running my code for an emotion detection model simulated with TensorFlow Federated. The code runs perfectly using only the CPU, but I get this error when I try to run TFF with a GPU:
ValueError: Detected dataset reduce op in multi-GPU TFF simulation: `use_experimental_simulation_loop=True` for `tff.learning`; or use `for ... in iter(dataset)` for your own dataset iteration.Reduce op will be functional after b/159180073.
What does this error mean, and how can I fix it? I have searched in many places but found no answer.
In case it helps, here is the call stack. It is long, so I pasted it at this link: https://pastebin.com/b1R93gf1
Edit:
Here is the code containing the iterative_process:
def startTraining(output_file):
    iterative_process = tff.learning.build_federated_averaging_process(
        model_fn,
        client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.01),
        server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0),
        use_experimental_simulation_loop=True
    )
    flstate = iterative_process.initialize()
    evaluation = tff.learning.build_federated_evaluation(model_fn)
    output_file.write(
        'round,available_users,loss,sparse_categorical_accuracy,val_loss,val_sparse_categorical_accuracy,test_loss,test_sparse_categorical_accuracy\n')
    curr_round_result = [0, 0, 100, 0, 100, 0]
    min_val_loss = 100
    for round in range(1, ROUND_COUNT + 1):
        available_users = fetch_available_users_and_increase_time(
            ROUND_DURATION_AVERAGE + random.randint(-ROUND_DURATION_VARIATION, ROUND_DURATION_VARIATION + 1))
        if len(available_users) == 0:
            write_to_file(curr_round_result)
            continue
        train_data = make_federated_data(available_users, 'train')
        flstate, metrics = iterative_process.next(flstate, train_data)
        val_data = make_federated_data(available_users, 'val')
        val_metrics = evaluation(flstate.model, val_data)
        curr_round_result[0] = round
        curr_round_result[1] = len(available_users)
        curr_round_result[2] = metrics['train']['loss']
        curr_round_result[3] = metrics['train']['sparse_categorical_accuracy']
        curr_round_result[4] = val_metrics['loss']
        curr_round_result[5] = val_metrics['sparse_categorical_accuracy']
        write_to_file(curr_round_result)
Here is the code for make_federated_data:
def make_federated_data(users, dataset_type):
    offset = 0
    if dataset_type == 'val':
        offset = train_size
    elif dataset_type == 'test':
        offset = train_size + val_size
    global LOADED_USER
    for id in users:
        if id + offset not in LOADED_USER:
            LOADED_USER[id + offset] = getDatasetFromFilePath(filepaths[id + offset])
    return [
        LOADED_USER[id + offset]
        for id in users
    ]
I found that TFF does not yet support multiple GPUs. Therefore, we need to limit the number of visible GPUs to one, using:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
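Note that the environment variable only takes effect if it is set before TensorFlow initializes its devices. A minimal sketch (assuming a standard TensorFlow install; on a CPU-only machine the GPU list is simply empty):

```python
import os

# Must run before TensorFlow touches the GPUs, i.e. before the first
# `import tensorflow` in the process (or at least before any op runs).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU

import tensorflow as tf

# With the variable set, TensorFlow sees at most one GPU.
print(tf.config.list_physical_devices('GPU'))
```

Setting `CUDA_VISIBLE_DEVICES = ""` instead would hide all GPUs and force CPU-only execution.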
TFF does support multiple GPUs, and as the error message says, one of two things is happening:
- The code uses tff.learning, but leaves the use_experimental_simulation_loop argument at its default value of False. For multiple GPUs, it must be set to True when using APIs including tff.learning.build_federated_averaging_process. For example, call:
training_process = tff.learning.build_federated_averaging_process(
    ..., use_experimental_simulation_loop=True)
- The code contains a custom tf.data.Dataset.reduce(...) call somewhere. This must be replaced with Python code that iterates over the dataset. For example:
result = dataset.reduce(initial_state=0, reduce_func=lambda s, x: s + x)
becomes
s = 0
for x in iter(dataset):
    s += x
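As a quick sanity check, the two forms compute the same result. A minimal sketch on a toy in-memory dataset of integers:

```python
import tensorflow as tf

# A toy dataset yielding the int64 values 0..4.
dataset = tf.data.Dataset.range(5)

# Graph-level reduce op (the form that triggers the multi-GPU error).
reduced = dataset.reduce(tf.constant(0, dtype=tf.int64),
                         lambda s, x: s + x)

# Equivalent Python-level iteration (the recommended replacement).
s = tf.constant(0, dtype=tf.int64)
for x in iter(dataset):
    s += x

print(int(reduced), int(s))  # both sums are 0+1+2+3+4 = 10
```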