TensorFlow Dataset API for time series classification

Question

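I'm getting used to the new Dataset API and trying to do some time series classification. I have a dataset stored as TFRecords with the following shape: (time_steps x features). I also have a label for each time step: (time_steps x 1).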

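What I want to do is reshape the dataset into rolling windows of time steps, like this: (n x window_size x features), where n is the number of windows, time_steps - window_size + 1 (if I use a stride of 1 for the rolling window).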

The labels should be (n x 1), meaning that for each window we take the label of the last time step in that window.

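I already know that I can use tf.contrib.data.sliding_window_batch() to create sliding windows for the features. However, the labels then come out windowed the same way, as (n x window_size x 1), and I don't know how to get this right.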

How can I do this with the TensorFlow Dataset API? https://www.tensorflow.org/programmers_guide/datasets

Thanks for your help!

Answer 1

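I couldn't figure out how to do this with the Dataset API, but I realized I could just as well do it with numpy.

I found this great answer and applied it to my case.

After that, it was just a matter of using numpy like this: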

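# window_nd is not defined here (presumably the helper from the answer mentioned above).
# 50 is the window size; the window moves 1 step at a time along the time axis.
train_df2 = window_nd(train_df, 50, steps=1, axis=0)
train_features = train_df2[:, :, :-1]  # every column except the last
train_labels = train_df2[:, :, -1:].squeeze()[:, -1:]  # last column, last time step of each window
train_labels.shape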

My labels are in the last column, so you may need to adjust this slightly.
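
Since the definition of window_nd is not shown in this answer, here is a minimal sketch of the same idea using numpy's own sliding_window_view (this assumes NumPy 1.20 or later). It is a stand-in, not the original helper. The name rolling_windows and the toy data are hypothetical and only serve the illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_windows(data, window_size):
    """Return stride-1 windows over axis 0, shaped (n, window_size, columns)."""
    # sliding_window_view puts the window axis last: (n, columns, window_size),
    # so swap the last two axes to get (n, window_size, columns).
    windows = sliding_window_view(data, window_size, axis=0)
    return windows.transpose(0, 2, 1)


# Hypothetical toy data: 10 time steps, 3 feature columns, label in the last column.
time_steps, n_features, window_size = 10, 3, 4
data = np.arange(time_steps * (n_features + 1)).reshape(time_steps, n_features + 1)

windows = rolling_windows(data, window_size)
features = windows[:, :, :-1]  # (n, window_size, n_features)
labels = windows[:, -1, -1:]   # (n, 1): label of the last time step in each window

print(features.shape, labels.shape)
```

With a stride of 1 this yields n = time_steps - window_size + 1 windows. sliding_window_view returns a read-only view, so copy the result before modifying it in place.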

Answer 2

I have a slow solution for TF 1.13.

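import tensorflow as tf  # written for TF 1.13

# data1: features, shape (time_steps, features); data2: labels, shape (time_steps, 1).
# Both are assumed to be in-memory arrays defined elsewhere.
WIN_SIZE = 5000

# Group the features into windows of WIN_SIZE consecutive time steps.
# shift=WIN_SIZE yields non-overlapping windows; shift=1 would yield a stride-1 rolling window.
dataset_input = (
    tf.data.Dataset.from_tensor_slices(data1)
    .window(size=WIN_SIZE, shift=WIN_SIZE, drop_remainder=False)
    .flat_map(lambda x: x.batch(WIN_SIZE))
)

# Window the labels the same way, then keep only the last label of each window.
dataset_label = (
    tf.data.Dataset.from_tensor_slices(data2)
    .window(size=WIN_SIZE, shift=WIN_SIZE, drop_remainder=False)
    .flat_map(lambda x: x.batch(WIN_SIZE))
    .map(lambda x: x[-1])
)

dataset = tf.data.Dataset.zip((dataset_input, dataset_label))
dataset = dataset.repeat(1)
data_iter = dataset.make_one_shot_iterator()  # create the iterator
next_sample = data_iter.get_next()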

with tf.Session() as sess:
    i = 0
    while True:
        try:
            r_ = sess.run(next_sample)
            i += 1
            print(i)
            print(r_)
            print(r_[0].shape)  # shape of the feature window
            print(r_[1].shape)  # shape of the label
        except tf.errors.OutOfRangeError:
            print('end')
            break

I call it a 'slow solution' because the following snippet can probably be optimized, which I haven't done yet:

dataset_label = (
    tf.data.Dataset.from_tensor_slices(data2)
    .window(size=WIN_SIZE, shift=WIN_SIZE, drop_remainder=False)
    .flat_map(lambda x: x.batch(WIN_SIZE))
    .map(lambda x: x[-1])
)

A promising approach might be to find some 'skip' operation that skips the unneeded values in dataset_label, instead of using the 'window' operation as it does now.
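
The 'skip' idea can be sketched as follows for the stride-1 rolling window asked about in the question (not the non-overlapping windows used in the code above). This is only an illustration under stated assumptions: it assumes TensorFlow 2.x with eager execution, and the toy arrays and the small WIN_SIZE are hypothetical. Under TF 1.13 the dataset would instead be consumed through make_one_shot_iterator() and a session, as in the code above.

```python
import numpy as np
import tensorflow as tf

WIN_SIZE = 4  # small hypothetical window size, for illustration only

# Hypothetical toy data: 10 time steps, 3 features, one label per time step.
features = np.arange(30, dtype=np.float32).reshape(10, 3)
labels = np.arange(10, dtype=np.int64).reshape(10, 1)

# Stride-1 rolling windows over the features; each element has shape (WIN_SIZE, 3).
feature_windows = (
    tf.data.Dataset.from_tensor_slices(features)
    .window(size=WIN_SIZE, shift=1, drop_remainder=True)
    .flat_map(lambda w: w.batch(WIN_SIZE))
)

# Window k ends at time step k + WIN_SIZE - 1, so dropping the first
# WIN_SIZE - 1 labels lines each remaining label up with its window.
window_labels = tf.data.Dataset.from_tensor_slices(labels).skip(WIN_SIZE - 1)

dataset = tf.data.Dataset.zip((feature_windows, window_labels))

# Collect all pairs for inspection (relies on eager execution).
pairs = [(x.numpy(), y.numpy()) for x, y in dataset]
print(len(pairs), pairs[0][0].shape, pairs[0][1])
```

Because the labels are never windowed, this avoids building and then discarding WIN_SIZE - 1 label values per window.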

python

tensorflow

tensorflow-datasets