Split tensor into training and test sets
Let's say I've read in a text file using a TextLineReader. Is there some way to split this into train and test sets in Tensorflow? Something like:
def read_my_file_format(filename_queue):
    reader = tf.TextLineReader()
    key, record_string = reader.read(filename_queue)
    raw_features, labels = tf.decode_csv(record_string)
    features = some_processing(raw_features)
    # tf.train_split does not exist; it stands for the function I am looking for.
    features_train, labels_train, features_test, labels_test = tf.train_split(
        features, labels, frac=.1)
    return features_train, labels_train, features_test, labels_test
Something like the following should work:
tf.split_v(tf.random_shuffle(...
Edit: for tensorflow>0.12 this should now be called tf.split(tf.random_shuffle(...
See the docs for tf.split and for tf.random_shuffle for examples.
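The one-liner above leaves its arguments out. Here is a minimal sketch of the same idea, assuming TensorFlow 2.x in eager mode, where tf.random_shuffle is spelled tf.random.shuffle. It shuffles a range of row indices once, gathers features and labels with that order so they stay aligned, then cuts both with tf.split. The helper name shuffle_and_split and its test_frac parameter are made up for illustration and are not part of TensorFlow.

```python
import tensorflow as tf


def shuffle_and_split(features, labels, test_frac=0.1, seed=None):
    """Shuffle rows of features/labels together, then split along axis 0.

    Hypothetical helper, not a TensorFlow API. Assumes eager execution and
    that both tensors share the same, statically known first dimension.
    """
    n = int(features.shape[0])
    n_test = int(round(n * test_frac))
    sizes = [n - n_test, n_test]

    # Shuffle the indices once so features and labels get the same permutation.
    order = tf.random.shuffle(tf.range(n), seed=seed)
    features = tf.gather(features, order)
    labels = tf.gather(labels, order)

    features_train, features_test = tf.split(features, sizes, axis=0)
    labels_train, labels_test = tf.split(labels, sizes, axis=0)
    return features_train, labels_train, features_test, labels_test


# Toy data: row i of `features` is [i, i] and its label is i.
features = tf.stack([tf.range(10), tf.range(10)], axis=1)
labels = tf.range(10)
f_train, l_train, f_test, l_test = shuffle_and_split(features, labels, test_frac=0.2)
```

Shuffling the two tensors independently with tf.random.shuffle would break the pairing between a row and its label, which is why the sketch permutes indices instead.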
import sklearn.model_selection as sk

X_train, X_test, y_train, y_test = sk.train_test_split(
    features, labels, test_size=0.33, random_state=42)
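As elham mentioned, you can use scikit-learn to do this easily. scikit-learn is an open source library for machine learning. It has many tools for data preparation, including the model_selection module, which handles comparing, validating and choosing parameters.
The model_selection.train_test_split() method is specifically designed to split your data into train and test sets randomly and by percentage.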
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features,
                                                    labels,
                                                    test_size=0.33,
                                                    random_state=42)
test_size is the fraction of the data to reserve for testing (0.33 here, i.e. 33%), and random_state is the seed for the random sampling.
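I typically use this to provide my train and validation datasets, and keep the true test data separate. You could also just run train_test_split twice to do this, i.e. split the data into (Train + Validation) and Test, then split Train + Validation into two separate tensors.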
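The two-pass split described above could look like the following sketch. The toy NumPy arrays and the 60/20/20 proportions are assumptions chosen for illustration; only train_test_split itself comes from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 samples, 3 features each.
features = np.arange(300).reshape(100, 3)
labels = np.arange(100)

# First pass: hold out 20% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Second pass: 25% of the remaining 80% becomes the validation set,
# i.e. 20% of the original data, leaving 60% for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
```

Note that the second test_size is relative to what is left after the first pass, not to the original dataset.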
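I managed to get a nice result using the map and filter functions of the tf.data.Dataset API. Just use the map function to randomly assign each example to train or test. To do that, draw a sample from a uniform distribution for each example and check whether the sample value is below the train rate.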
def split_train_test(parsed_features, train_rate):
    # Draw an integer in [0, 100) and flag the example as train if it falls
    # below train_rate * 100. tf.random_uniform is the TF 1.x name; in TF 2.x
    # it is tf.random.uniform.
    parsed_features['is_train'] = tf.gather(
        tf.random_uniform([1], maxval=100, dtype=tf.int32)
        < tf.cast(train_rate * 100, tf.int32), 0)
    return parsed_features

def grab_train_examples(parsed_features):
    return parsed_features['is_train']

def grab_test_examples(parsed_features):
    return ~parsed_features['is_train']
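One thing to watch with this approach: as far as I can tell, a stateful random draw inside map may be redrawn every time the dataset is iterated, so a train pass and a separate test pass are not guaranteed to be exact complements. The sketch below is a variant rather than the code above. It assumes TensorFlow 2.x and keys a stateless draw (tf.random.stateless_uniform) on each example's index from Dataset.enumerate, so the assignment is reproducible. TRAIN_RATE, SPLIT_SEED and the helper names are illustrative choices, and Dataset.range stands in for the parsed examples.

```python
import tensorflow as tf

TRAIN_RATE = 0.8   # fraction of examples sent to the training set (assumed)
SPLIT_SEED = 42    # arbitrary constant that fixes the assignment (assumed)


def is_train(index, example):
    # Stateless draw keyed on the example's index: the same index always
    # yields the same value, so both filters agree on every example.
    seed = tf.stack([tf.constant(SPLIT_SEED, dtype=tf.int64), index])
    draw = tf.random.stateless_uniform([], seed=seed)  # float32 in [0, 1)
    return draw < TRAIN_RATE


def is_test(index, example):
    return tf.logical_not(is_train(index, example))


def drop_index(index, example):
    # Remove the index added by enumerate() once the split is decided.
    return example


# Toy dataset standing in for the parsed examples.
dataset = tf.data.Dataset.range(100).enumerate()

train_ds = dataset.filter(is_train).map(drop_index)
test_ds = dataset.filter(is_test).map(drop_index)

train_ids = [int(x) for x in train_ds.as_numpy_iterator()]
test_ids = [int(x) for x in test_ds.as_numpy_iterator()]
```

The split sizes are only approximately TRAIN_RATE, since each example is assigned independently, but every example lands in exactly one of the two sets.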
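I improvised a solution by wrapping the train_test_split function from sklearn so that it accepts tensors as input and returns tensors as well.
I'm new to tensorflow and facing the same problem, so if you have a better solution that doesn't use a different package, I would appreciate it.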
import tensorflow as tf
from sklearn.model_selection import train_test_split

def train_test_split_tensors(X, y, **options):
    """
    encapsulation for the sklearn.model_selection.train_test_split function
    in order to split tensors objects and return tensors as output
    :param X: tensorflow.Tensor object
    :param y: tensorflow.Tensor object
    :dict **options: typical sklearn options are available, such as test_size and train_size
    """
    # .numpy() requires eager tensors (the default in TF 2.x).
    X_train, X_test, y_train, y_test = train_test_split(X.numpy(), y.numpy(), **options)
    X_train, X_test = tf.constant(X_train), tf.constant(X_test)
    y_train, y_test = tf.constant(y_train), tf.constant(y_test)
    return X_train, X_test, y_train, y_test