Serialize a tf.data.Dataset with Tensor features into a tfrecord file?
My dataset looks like this:
dataset1 = tf.data.Dataset.from_tensor_slices((
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32),
    tf.random.uniform([4])))
for record in dataset1.take(2):
    print(record)
print(type(record))
(<tf.Tensor: shape=(100,), dtype=int32, numpy=
array([28, 96, 6, 22, 36, 33, 34, 29, 20, 77, 40, 82, 45, 81, 62, 59, 30,
86, 44, 17, 43, 32, 19, 32, 96, 24, 14, 65, 43, 59, 0, 96, 20, 17,
54, 31, 88, 72, 88, 55, 57, 63, 92, 50, 95, 76, 99, 63, 95, 82, 22,
36, 87, 56, 44, 29, 12, 45, 82, 27, 56, 32, 44, 66, 77, 99, 97, 58,
52, 81, 42, 54, 78, 3, 29, 86, 59, 98, 67, 39, 25, 27, 16, 46, 68,
81, 72, 30, 53, 95, 33, 71, 93, 82, 95, 55, 13, 53, 30, 21],
dtype=int32)>, <tf.Tensor: shape=(), dtype=float32, numpy=0.42071342>)
(<tf.Tensor: shape=(100,), dtype=int32, numpy=
array([71, 52, 9, 25, 94, 45, 64, 56, 99, 92, 62, 96, 13, 97, 39, 10, 27,
41, 81, 37, 38, 20, 77, 11, 26, 28, 55, 99, 50, 7, 89, 2, 66, 64,
11, 97, 4, 30, 34, 20, 81, 86, 68, 84, 75, 4, 22, 35, 87, 44, 57,
94, 27, 19, 60, 37, 38, 83, 39, 75, 65, 80, 97, 72, 20, 69, 35, 20,
37, 5, 60, 11, 84, 46, 25, 30, 13, 74, 5, 82, 34, 1, 79, 91, 41,
83, 94, 80, 79, 6, 3, 26, 84, 20, 53, 78, 93, 36, 54, 44],
dtype=int32)>, <tf.Tensor: shape=(), dtype=float32, numpy=0.73927164>)
<class 'tuple'>
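So each record is a tuple of two tensors: one is the input to the model and the other is the output. I am trying to convert this dataset into a .tfrecord file, which requires me to create an Example from each record. Here is my attempt: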
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def serialize_example(feature1, feature2):
    feature = {
        'feature1': _bytes_feature(tf.io.serialize_tensor(feature1)),
        'feature2': _float_feature(feature2),
    }
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()
I expected my code to work when I call dataset1.map(serialize_example) before running:
writer = tf.data.experimental.TFRecordWriter(some_path)
writer.write(dataset1)
However, when I try dataset1.map(serialize_example), I get the following error:
...
value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
AttributeError: 'Tensor' object has no attribute 'numpy'
How should I convert this dataset into a .tfrecord file?
I tried to follow the doc, and this is what I could come up with (you can test it right away here in a colab):
import tensorflow as tf

dataset1 = tf.data.Dataset.from_tensor_slices((
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32),
    tf.random.uniform([4])))

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def serialize_example(feature1, feature2):
    feature = {
        'feature1': _bytes_feature(tf.io.serialize_tensor(feature1)),
        'feature2': _float_feature(feature2),
    }
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

def tf_serialize_example(f0, f1):
    tf_string = tf.py_function(
        serialize_example,
        (f0, f1),   # Pass these args to the above function.
        tf.string)  # The return type is `tf.string`.
    return tf.reshape(tf_string, ())  # The result is a scalar.

dataset1 = dataset1.map(tf_serialize_example)
writer = tf.data.experimental.TFRecordWriter('test.tfrecord')
writer.write(dataset1)
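Basically, the main part is wrapping the call in a tf.py_function. This is needed because serialize_example is not a pure tensor function: Dataset.map traces the mapped function in graph mode, and you cannot use .numpy() on the symbolic tensors it receives there. That is what AttributeError: 'Tensor' object has no attribute 'numpy' is trying to tell you (albeit clumsily).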
The difference is that an EagerTensor does have a .numpy() method.
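To make the graph-versus-eager difference concrete, here is a minimal sketch. The names modes, traced, eager_body and wrapped are made up for illustration and are not part of the code above. It records whether TensorFlow is executing eagerly inside a plain Dataset.map function and inside a tf.py_function body:

```python
import tensorflow as tf

modes = {}  # hypothetical helper dict, only used to record what we observe

def traced(x):
    # Dataset.map traces this function into a graph, so x is symbolic here
    # and calling x.numpy() at this point would fail.
    modes["map"] = tf.executing_eagerly()
    return x

def eager_body(x):
    # Inside tf.py_function the body runs eagerly, so x.numpy() is available.
    modes["py_function"] = tf.executing_eagerly()
    return x.numpy() * 2

def wrapped(x):
    # Same wrapping pattern as tf_serialize_example, with an int32 result.
    return tf.py_function(eager_body, [x], tf.int32)

ds = tf.data.Dataset.from_tensor_slices(tf.constant([1, 2, 3], dtype=tf.int32))
_ = ds.map(traced)                            # tracing happens when map is called
doubled = [int(v) for v in ds.map(wrapped)]   # py_function runs during iteration
print(modes, doubled)
```

The py_function body only runs once the dataset is iterated, which is why the second entry is recorded in the list comprehension rather than at the map call.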
One more thing: if you do not need tf.int32 as the dtype of your input, you could use tf.int64 together with the following function:
def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
I think this function is tensor-like, so you might not need tf.py_function, but I have not tried it.
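As a rough sketch of the int64 route, the snippet below stores a whole 100-element row in an Int64List and parses it back with a fixed-length spec. The helper _int64_list_feature is a hypothetical variant of _int64_feature that takes a sequence instead of a single value. The sketch runs eagerly and converts with .numpy(), because the proto is built from concrete Python values; it does not show this working inside Dataset.map without tf.py_function.

```python
import tensorflow as tf

def _int64_list_feature(values):
    """Hypothetical variant of _int64_feature that accepts a sequence of ints."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

# One input row, generated as int64 instead of int32.
row = tf.random.uniform([100], maxval=100, dtype=tf.int64)

# Build the Example eagerly from plain Python ints.
example = tf.train.Example(features=tf.train.Features(feature={
    "feature1": _int64_list_feature(row.numpy().tolist()),
}))

# With an Int64List the parser can restore the vector directly,
# so no tf.io.parse_tensor step is needed on the reading side.
parsed = tf.io.parse_single_example(
    example.SerializeToString(),
    {"feature1": tf.io.FixedLenFeature([100], tf.int64)})
print(parsed["feature1"].shape)
```

The trade-off is that the shape has to be known in the parsing spec, whereas the serialize_tensor route carries shape and dtype inside the bytes.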
Of course, you could also cast to float32 or float64, but that would be heavier in terms of storage.
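To check that records written this way can be read back, here is a self-contained round-trip sketch. It is a different route from the tf.py_function approach: it iterates the dataset eagerly and writes with tf.io.TFRecordWriter, so .numpy() is available without any wrapping. The temporary file path, feature_spec and parse_example are assumptions made for the illustration.

```python
import os
import tempfile
import tensorflow as tf

def serialize_example(feature1, feature2):
    # Runs eagerly, so both tensors can be converted to Python values.
    feature = {
        "feature1": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(feature1).numpy()])),
        "feature2": tf.train.Feature(float_list=tf.train.FloatList(
            value=[float(feature2)])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

dataset = tf.data.Dataset.from_tensor_slices((
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32),
    tf.random.uniform([4])))

# Write one serialized Example per record in a plain Python loop.
path = os.path.join(tempfile.mkdtemp(), "roundtrip.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for f1, f2 in dataset:
        writer.write(serialize_example(f1, f2))

feature_spec = {
    "feature1": tf.io.FixedLenFeature([], tf.string),   # serialized tensor bytes
    "feature2": tf.io.FixedLenFeature([], tf.float32),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    # Undo tf.io.serialize_tensor; the dtype must match what was written.
    feature1 = tf.io.parse_tensor(parsed["feature1"], out_type=tf.int32)
    feature1 = tf.ensure_shape(feature1, [100])
    return feature1, parsed["feature2"]

original = list(dataset)
restored = list(tf.data.TFRecordDataset(path).map(parse_example))
print(len(restored), restored[0][0].shape)
```

Parsing uses only TensorFlow ops, so parse_example can be passed to Dataset.map directly without tf.py_function.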