Split tensor into training and test sets

Say I've read in a text file using a TextLineReader. Is there some way to split it into training and test sets in TensorFlow? Something like:

def read_my_file_format(filename_queue):
  reader = tf.TextLineReader()
  key, record_string = reader.read(filename_queue)
  raw_features, label = tf.decode_csv(record_string)
  features = some_processing(raw_features)
  # tf.train_split does not exist; it is the hypothetical function being asked for
  features_train, labels_train, features_test, labels_test = tf.train_split(features,
                                                                            labels,
                                                                            frac=.1)
  return features_train, labels_train, features_test, labels_test

Something like the following should work: tf.split_v(tf.random_shuffle(...

EDIT: For tensorflow > 0.12, this should now be called as tf.split(tf.random_shuffle(...

Reference

See the documentation for tf.split and for tf.random_shuffle for examples.
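
A minimal sketch of that tf.random_shuffle + tf.split idea, assuming the full feature and label tensors fit in memory; the dataset size, feature shape and 10% test fraction below are example values, not from the answer. Shuffling a tensor of row indices (rather than the data itself) keeps features and labels aligned:

import tensorflow as tf

# Toy data for illustration only.
num_examples = 1000
num_test = num_examples // 10  # hold out 10% for testing
features = tf.random_normal([num_examples, 4])
labels = tf.random_uniform([num_examples], maxval=2, dtype=tf.int32)

# Shuffle row indices so features and labels stay aligned, then split.
shuffled_indices = tf.random_shuffle(tf.range(num_examples))
test_indices, train_indices = tf.split(shuffled_indices,
                                       [num_test, num_examples - num_test])

features_train = tf.gather(features, train_indices)
labels_train = tf.gather(labels, train_indices)
features_test = tf.gather(features, test_indices)
labels_test = tf.gather(labels, test_indices)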

import sklearn.model_selection as sk

X_train, X_test, y_train, y_test = sk.train_test_split(features, labels,
                                                        test_size=0.33,
                                                        random_state=42)

As elham mentioned, you can use scikit-learn to do this easily. scikit-learn is an open source library for machine learning. There are tons of tools for data preparation, including the model_selection module, which handles comparing, validating and choosing parameters.

The model_selection.train_test_split() method is specifically designed to split your data into training and test sets randomly and by percentage:
X_train, X_test, y_train, y_test = train_test_split(features,
                                                    labels,
                                                    test_size=0.33,
                                                    random_state=42)

test_size is the percentage of data to hold out for testing, and random_state seeds the random sampling.

I usually use it to provide training and validation data sets, and keep the true test data separate. You could also just run train_test_split twice to do this, i.e. split the data into (train + validation) and test, then split train + validation into two separate tensors, as sketched below.
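
A short sketch of that two-stage split, assuming features and labels are NumPy arrays; the toy data and the 0.2/0.25 fractions are just example values (0.25 of the remaining 80% gives a 60/20/20 split overall):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only.
features = np.arange(100).reshape(50, 2)
labels = np.arange(50)

# First split off the true test set (20% here).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Then split the remainder into train and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)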

I had good results using the map and filter capabilities of the tf.data.Dataset API. Just use the map function to randomly select between train and test for each example: draw a sample from a uniform distribution and check whether the sample value is below the rate division.

def split_train_test(parsed_features, train_rate):
    # Draw an integer in [0, 100) and mark the example as training data
    # when the draw falls below train_rate * 100.
    parsed_features['is_train'] = tf.gather(
        tf.random_uniform([1], maxval=100, dtype=tf.int32) < tf.cast(train_rate * 100, tf.int32), 0)
    return parsed_features

def grab_train_examples(parsed_features):
    # Filter predicate: keep only examples tagged as training data.
    return parsed_features['is_train']

def grab_test_examples(parsed_features):
    # Filter predicate: keep only examples NOT tagged as training data.
    return ~parsed_features['is_train']
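
A sketch of how these functions might be wired into a pipeline; the toy dataset and the 0.8 train rate below are assumptions on my part. Caching the tagged dataset is one way to keep the random draw consistent when both filters iterate over it:

# Assumed setup: a dataset of parsed-feature dictionaries (toy example).
dataset = tf.data.Dataset.from_tensor_slices({'x': tf.range(10)})

# Tag every example once, then derive the two filtered datasets from it.
tagged = dataset.map(lambda parsed: split_train_test(parsed, 0.8)).cache()

train_dataset = tagged.filter(grab_train_examples)
test_dataset = tagged.filter(grab_test_examples)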

I came up with an ad hoc solution by wrapping the train_test_split function from sklearn so that it accepts tensors as input and returns tensors as well.

I'm new to tensorflow and faced the same problem, so if you have a better solution that doesn't use a different package, I'd appreciate it.

def train_test_split_tensors(X, y, **options):
    """
    Wrapper around the sklearn.model_selection.train_test_split function
    in order to split tensor objects and return tensors as output.

    Note: X.numpy() and y.numpy() require eager execution.

    :param X: tensorflow.Tensor object
    :param y: tensorflow.Tensor object
    :param options: the usual sklearn options, such as test_size and train_size
    """
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X.numpy(), y.numpy(), **options)

    X_train, X_test = tf.constant(X_train), tf.constant(X_test)
    y_train, y_test = tf.constant(y_train), tf.constant(y_test)

    return X_train, X_test, y_train, y_test
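
For example, calling the wrapper under eager execution (the toy tensors below are just for illustration):

import tensorflow as tf

features = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
labels = tf.constant([0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split_tensors(
    features, labels, test_size=0.25, random_state=42)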