在联邦学习中将数据拆分为训练和测试

Question

我是联邦学习的新手我目前正在按照官方 TFF 文档试验一个模型。但我遇到了一个问题，希望能在这里找到一些解释。

我正在使用自己的数据集，数据分布在多个文件中，每个文件都是一个客户端（因为我正计划构建模型）。并且已经定义了因变量和自变量。

现在，我的问题是如何在联邦学习中将数据拆分为每个客户端（文件）中的训练集和测试集？就像我们通常在集中式 ML 模型中所做的那样 下面的代码是我目前已经实现的：注意我的代码是受官方文档和这个的启发，它与我的应用程序几乎相似，但它旨在将客户端拆分为培训和测试客户端本身，而我的目标是就是拆分这些clients内部的数据。

dataset_paths = {
  'client_0': '/content/drive/MyDrive/Colab Notebooks/1.csv',
  'client_1': '/content/drive/MyDrive/Colab Notebooks/2.csv',
  'client_2': '/content/drive/MyDrive/Colab Notebooks/3.csv'
}
record_defaults = [int(), int(), int(), int(), float(),float(),float(),
                   float(),float(),float(), int(), int(),float(),float(),int()]

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
   return tf.data.experimental.CsvDataset(dataset_path,
                                          record_defaults=record_defaults,
                                          header=True)

@tf.function
def add_parsing(dataset):
  def parse_dataset(*x):
    ## x defines the dependant varable & y defines the independant 
    return OrderedDict([('x', x[-1]), ('y', x[1:-1])])
  return dataset.map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)

source = tff.simulation.datasets.FilePerUserClientData(
  dataset_paths, create_tf_dataset_for_client_fn) 

source = source.preprocess(add_parsing)
## Creat the the datasets from client data 
dataset_creation=source.create_tf_dataset_for_client(source.client_ids[0-2])
print(dataset_creation)
>>> _VariantDataset element_spec=OrderedDict([('x', TensorSpec(shape=(), dtype=tf.int32, name=None)), ('y', (TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None)))])>
## Convert the x into array(I think it is necessary for spliting to training and testing sets ) 
test= tf.nest.map_structure(lambda x: x.numpy(),next(iter(dataset_creation)))
print(test)
>>> OrderedDict([('x', 1), ('y', (0, 1, 9, 85.0, 7.75, 85.0, 95.0, 75.0, 50.0, 6))])

我对监督式 ML 的理解是将数据分成训练集和测试集，如下面的代码所示，我不确定在联邦学习中如何做到这一点以及它是否会以这种方式工作？

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

所以，我正在寻找这个问题的解释，以便我可以继续训练阶段。

Answer 1

看到这个tutorial。您应该能够根据客户及其数据创建两个数据集（训练和测试）：

import tensorflow as tf
import tensorflow_federated as tff
from collections import OrderedDict

record_defaults = [int(), int(), int(), int(), float(),float(),float(),float(),float(),float(), int(), int()]

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
   return tf.data.experimental.CsvDataset(dataset_path, record_defaults=record_defaults, header=True)
   
@tf.function
def add_parsing(dataset):
  def parse_dataset(*x):
    return OrderedDict([('label', x[:-1]), ('features', x[1:-1])])
  return dataset.map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)

def split_train_test(client_ids):
  train, test = [], []
  for x in client_ids:
    d = source.create_tf_dataset_for_client(x)
    d_length = d.reduce(0, lambda x,_: x+1).numpy()
    d = d.shuffle(d_length)
    train.append(list(d.take(int(d_length*.8)))) 
    test.append(list(d.skip(int(d_length*.2))))
  return train[0], test[0]

dataset_paths = {'client1': '/content/client1.csv', 'client2': '/content/client2.csv', 
                 'client3': '/content/client2.csv', 'client4': '/content/client2.csv'}
source = tff.simulation.datasets.FilePerUserClientData(
  dataset_paths, create_tf_dataset_for_client_fn) 

client_ids = sorted(source.client_ids)

federated_train_data, federated_test_data = split_train_test(client_ids)
print(*federated_train_data, sep='\n')

(<tf.Tensor: shape=(), dtype=int32, numpy=24>, <tf.Tensor: shape=(), dtype=int32, numpy=17>, <tf.Tensor: shape=(), dtype=int32, numpy=27>, <tf.Tensor: shape=(), dtype=int32, numpy=4>, <tf.Tensor: shape=(), dtype=float32, numpy=0.17308392>, <tf.Tensor: shape=(), dtype=float32, numpy=1.889401>, <tf.Tensor: shape=(), dtype=float32, numpy=1.6235029>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.56010467>, <tf.Tensor: shape=(), dtype=float32, numpy=-1.0171211>, <tf.Tensor: shape=(), dtype=float32, numpy=0.43558818>, <tf.Tensor: shape=(), dtype=int32, numpy=40>, <tf.Tensor: shape=(), dtype=int32, numpy=14>)
(<tf.Tensor: shape=(), dtype=int32, numpy=8>, <tf.Tensor: shape=(), dtype=int32, numpy=32>, <tf.Tensor: shape=(), dtype=int32, numpy=14>, <tf.Tensor: shape=(), dtype=int32, numpy=11>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.91828436>, <tf.Tensor: shape=(), dtype=float32, numpy=0.29887632>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.4598584>, <tf.Tensor: shape=(), dtype=float32, numpy=-1.1088414>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.4057387>, <tf.Tensor: shape=(), dtype=float32, numpy=-2.1537204>, <tf.Tensor: shape=(), dtype=int32, numpy=15>, <tf.Tensor: shape=(), dtype=int32, numpy=45>)
(<tf.Tensor: shape=(), dtype=int32, numpy=11>, <tf.Tensor: shape=(), dtype=int32, numpy=17>, <tf.Tensor: shape=(), dtype=int32, numpy=17>, <tf.Tensor: shape=(), dtype=int32, numpy=2>, <tf.Tensor: shape=(), dtype=float32, numpy=0.93560874>, <tf.Tensor: shape=(), dtype=float32, numpy=-2.4382026>, <tf.Tensor: shape=(), dtype=float32, numpy=-1.7638668>, <tf.Tensor: shape=(), dtype=float32, numpy=0.65431964>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.7130539>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.96356>, <tf.Tensor: shape=(), dtype=int32, numpy=15>, <tf.Tensor: shape=(), dtype=int32, numpy=18>)
(<tf.Tensor: shape=(), dtype=int32, numpy=42>, <tf.Tensor: shape=(), dtype=int32, numpy=27>, <tf.Tensor: shape=(), dtype=int32, numpy=34>, <tf.Tensor: shape=(), dtype=int32, numpy=8>, <tf.Tensor: shape=(), dtype=float32, numpy=0.3965425>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.2588629>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.84179455>, <tf.Tensor: shape=(), dtype=float32, numpy=0.114052325>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.9591451>, <tf.Tensor: shape=(), dtype=float32, numpy=0.94621265>, <tf.Tensor: shape=(), dtype=int32, numpy=28>, <tf.Tensor: shape=(), dtype=int32, numpy=7>)

如果您按照我链接的教程进行操作，您应该能够将拆分数据直接提供给 tff.learning.from_keras_model。

在联邦学习中将数据拆分为训练和测试

splitting the data into training and testing in federated learning

python

tensorflow

tensorflow-datasets

tensorflow-federated