Chunk tensorflow dataset records into multiple records
I have an unbatched tensorflow dataset that looks like this:
ds = ...
for record in ds.take(3):
    print('data shape={}'.format(record['data'].shape))
-> data shape=(512, 512, 87)
-> data shape=(512, 512, 277)
-> data shape=(512, 512, 133)
I would like to feed the data to my network in chunks of depth 5. In the example above, the tensor of shape (512, 512, 87) would be split into 17 tensors of shape (512, 512, 5). The last 2 slices of the matrix (tensor[:, :, 85:87]) should be discarded.
For example:
chunked_ds = ...
for record in chunked_ds.take(1):
    print('chunked data shape={}'.format(record['data'].shape))
-> chunked data shape=(512, 512, 5)
How do I get from ds to chunked_ds? tf.data.Dataset.window() looks like what I need, but I cannot get it to work.
To demonstrate my solution, I will first create a dummy dataset with 10 samples, each of shape [512, 512, 87]:
import tensorflow as tf

data = tf.random.normal(shape=[10, 512, 512, 87])
ds = tf.data.Dataset.from_tensor_slices(data)
Running the code below,
for record in ds.take(3):
    print(record.shape)
we get the output:
(512, 512, 87)
(512, 512, 87)
(512, 512, 87)
For convenience, I have created a dataset in which the length of the last dimension is constant, i.e. 87 (which differs from your setup). The solution provided is, however, independent of the length of the last dimension.
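As an aside, if you want dummy records with varying depth like in the question, a minimal sketch using tf.data.Dataset.from_generator (assuming TF 2.4+, where the output_signature argument is available) could look like this:

import tensorflow as tf

# Hypothetical depths mirroring the question's three example records.
depths = [87, 277, 133]

def gen():
    for d in depths:
        yield tf.random.normal(shape=[512, 512, d])

# The last dimension is left unspecified so each record may have a different depth.
var_ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=tf.TensorSpec(shape=[512, 512, None], dtype=tf.float32),
)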
The solution:
# chunk/window size
chunk_depth = 5
# list to store the chunks
chunks = []
# Iterate through each sample in ds (Note: ds.as_numpy_iterator() yields NumPy arrays)
for sample in ds.as_numpy_iterator():
    # Length of the last dimension
    feature_size = sample.shape[2]
    # Number of complete chunks that can be produced
    num_chunks = feature_size // chunk_depth
    # Slice along the last dimension, storing the "chunks" in the chunks list;
    # any remainder (feature_size % chunk_depth slices) is discarded.
    for i in range(0, num_chunks * chunk_depth, chunk_depth):
        chunk = sample[:, :, i:i + chunk_depth]
        chunks.append(chunk)
# Convert list -> tf.data.Dataset
chunked_ds = tf.data.Dataset.from_tensor_slices(chunks)
The output of the code below,
for sample in chunked_ds.take(1):
    print(sample.shape)
matches what the question asks for:
(512, 512, 5)
The solution is available as a Colab notebook.
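If you prefer to avoid the explicit inner loop, the same slicing can be expressed with np.split; a minimal sketch under the same assumptions (chunk_depth defined as above, NumPy imported as np):

import numpy as np

chunks = []
for sample in ds.as_numpy_iterator():
    num_chunks = sample.shape[2] // chunk_depth
    # Truncate the last axis to a multiple of chunk_depth, then split it
    # into num_chunks pieces of depth chunk_depth each.
    usable = num_chunks * chunk_depth
    chunks.extend(np.split(sample[:, :, :usable], num_chunks, axis=2))
chunked_ds = tf.data.Dataset.from_tensor_slices(chunks)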
This can actually be done using only tf.data.Dataset operations:
data = tf.random.normal(shape=[10, 512, 512, 87])
ds = tf.data.Dataset.from_tensor_slices(data)
chunk_size = 5
chunked_ds = ds.flat_map(
    lambda x: tf.data.Dataset.from_tensor_slices(
        tf.transpose(x, perm=[2, 0, 1])
    ).batch(chunk_size, drop_remainder=True)
).map(lambda rec: tf.transpose(rec, perm=[1, 2, 0]))
What is going on there:
First, we treat each record as a separate dataset and permute it so that the last dimension becomes the batch dimension (flat_map will flatten our inner datasets back into tensors again):
.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(tf.transpose(x, perm=[2, 0, 1])
Then we batch by 5, dropping the remainder we do not care about (for a record of depth 87 this yields 17 chunks and discards the last 2 slices):
.batch(chunk_size, drop_remainder=True))
Finally, we transpose the tensors back so that the 512x512 dimensions come first:
.map(lambda rec: tf.transpose(rec, perm=[1, 2, 0]))
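Note that the records in the question are dicts with a 'data' key, whereas both answers use bare tensors. A minimal sketch of the same flat_map pipeline adapted to dict-structured elements (assuming each element looks like {'data': tensor}):

chunk_size = 5
chunked_ds = ds.flat_map(
    lambda rec: tf.data.Dataset.from_tensor_slices(
        tf.transpose(rec['data'], perm=[2, 0, 1])
    ).batch(chunk_size, drop_remainder=True)
).map(lambda chunk: {'data': tf.transpose(chunk, perm=[1, 2, 0])})

for record in chunked_ds.take(1):
    print('chunked data shape={}'.format(record['data'].shape))
# -> chunked data shape=(512, 512, 5)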