Running Out of RAM using FilePerUserClientData
I have a problem training with tff.simulation.FilePerUserClientData - after 5-6 rounds with 10 clients per round I quickly run out of RAM.
The RAM usage steadily increases with every round.
I tried to narrow it down and realized that the problem is not the actual iterative process but the creation of the client datasets.
Simply calling create_tf_dataset_for_client(client) in a loop causes the problem.
So this is a minimal version of my code:
import tensorflow as tf
import tensorflow_federated as tff
import numpy as np
import pickle

BATCH_SIZE = 16
EPOCHS = 2
MAX_SEQUENCE_LEN = 20
NUM_ROUNDS = 100
CLIENTS_PER_ROUND = 10

def decode_fn(record_bytes):
    return tf.io.parse_single_example(
        record_bytes,
        {"x": tf.io.FixedLenFeature([MAX_SEQUENCE_LEN], dtype=tf.string),
         "y": tf.io.FixedLenFeature([MAX_SEQUENCE_LEN], dtype=tf.string)}
    )

def dataset_fn(path):
    return tf.data.TFRecordDataset([path]).map(decode_fn).padded_batch(BATCH_SIZE).repeat(EPOCHS)

def sample_client_data(data, client_ids, sampling_prob):
    # Each client is kept independently with probability sampling_prob,
    # so the number of sampled clients varies from round to round.
    clients_total = len(client_ids)
    x = np.random.uniform(size=clients_total)
    sampled_ids = [client_ids[i] for i in range(clients_total) if x[i] < sampling_prob]
    return [data.create_tf_dataset_for_client(client) for client in sampled_ids]

with open('users.pkl', 'rb') as f:
    users = pickle.load(f)

train_client_ids = users["train"]
client_id_to_train_file = {i: "reddit_leaf_tf/" + i for i in train_client_ids}

train_data = tff.simulation.datasets.FilePerUserClientData(
    client_ids_to_files=client_id_to_train_file,
    dataset_fn=dataset_fn
)

sampling_prob = CLIENTS_PER_ROUND / len(train_client_ids)

for round_num in range(0, NUM_ROUNDS):
    print('Round {r}'.format(r=round_num))
    participants_data = sample_client_data(train_data, train_client_ids, sampling_prob)
    print("Round Completed")
I'm using tensorflow-federated 19.0.
Is there something wrong with the way I create the client datasets, or is it somehow expected that the RAM from the previous round is not released?
schmana@ noticed this happening when the cardinality of the CLIENTS placement (the number of client datasets) changes from round to round. This causes a cache to fill up, as described in http://github.com/tensorflow/federated/issues/1215.
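For illustration only, one way to keep the CLIENTS cardinality constant is to sample a fixed number of clients every round instead of keeping each client independently with probability sampling_prob. This is a hedged sketch based on the explanation above, not part of the original answer; np.random.choice and the reuse of the question's CLIENTS_PER_ROUND are assumptions.

import numpy as np

def sample_fixed_client_data(data, client_ids, clients_per_round):
    # Draw exactly clients_per_round ids without replacement, so every round
    # yields the same number of client datasets (constant CLIENTS cardinality).
    sampled_ids = np.random.choice(client_ids, size=clients_per_round, replace=False).tolist()
    return [data.create_tf_dataset_for_client(client) for client in sampled_ids]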
A near-term workaround is to call:
tff.framework.get_context_stack().current.executor_factory.clean_up_executors()
at the beginning or end of each round.
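As a minimal sketch, the question's training loop with that cleanup call added at the end of each round could look like the following; only the final line is new, everything else reuses the names from the code above, and the spot where the actual federated round would run on participants_data is left as a placeholder comment.

for round_num in range(NUM_ROUNDS):
    print('Round {r}'.format(r=round_num))
    participants_data = sample_client_data(train_data, train_client_ids, sampling_prob)
    # ... run one federated training round on participants_data here ...
    print("Round Completed")
    # Release the executors (and their cached state) built up during this round.
    tff.framework.get_context_stack().current.executor_factory.clean_up_executors()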