Correct way to divide a dataframe (or numpy array) by rows

Question

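I'm new to machine learning and am working with RNNs to classify time series. I'm studying this dataset, https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State#, which consists of 14 time series, each 14980 steps long. What I want is a set of time series with exactly 20 time steps each, that is, a numpy array of shape (749, 20, 14), where 749 is the number of time series, 20 is the number of time steps per series, and 14 is the number of values per time step. This array would then be fed to the network for training. What is the correct way to achieve this?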

This is the starting dataframe; the last column contains the integer used to classify the time series.

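# how to divide it right?
from scipy.io import arff
import pandas as pd

# loadarff returns (records, metadata); the records go into the DataFrame
data = arff.loadarff('./datasets/eeg_eye_state.arff')

df = pd.DataFrame(data[0])
# Nominal ARFF attributes load as bytes, so decode them before casting to int
df['eyeDetection'] = df['eyeDetection'].str.decode('utf-8').astype(int)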

Answer 1

Since you are using the EEG Eye State dataset, and since:

All values are in chronological order with the first measured value at the top of the data.

you can use the TimeseriesGenerator utility class from tensorflow.keras to generate batches of temporal data.

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

n_input = 20      # time steps per window
batch_size = 749  # windows per batch
# Pass plain NumPy arrays so the generator indexes rows by position
data_input = df.drop(columns=['eyeDetection']).to_numpy()
targets = df['eyeDetection'].to_numpy()

data_gen = TimeseriesGenerator(data_input, targets, length=n_input, batch_size=batch_size)

batch_0 = data_gen[0]
x, y = batch_0

print(x.shape)
print(y.shape)

# possibly feed it to model.fit()
# model.fit(data_gen, ...)

(749, 20, 14)
(749,)
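
Note that with the default stride of 1 the generator above yields overlapping windows (each sample starts one row after the previous one, and its target is the label of the row right after the window), so the 749 in the printed shape is just the batch size. If what is wanted is the non-overlapping split described in the question, where the 14980 rows are cut into 14980 / 20 = 749 consecutive blocks, a plain NumPy reshape may be enough. The following is only a minimal sketch under some assumptions: `split_into_windows` is a hypothetical helper name, the random arrays are stand-ins for the real data, and labelling each window with the label of its last time step is one possible choice, not something the dataset prescribes.

```python
import numpy as np

def split_into_windows(features, labels, window=20):
    """Cut a (n_rows, n_features) array into non-overlapping windows.

    Returns x with shape (n_windows, window, n_features) and y with shape
    (n_windows,). Trailing rows that do not fill a whole window are dropped.
    """
    features = np.asarray(features)
    labels = np.asarray(labels)
    n_windows = len(features) // window
    usable = n_windows * window
    # Rows are kept in chronological order, so window k holds rows
    # k*window .. (k+1)*window - 1.
    x = features[:usable].reshape(n_windows, window, features.shape[1])
    # Assumption: each window takes the label of its last time step.
    y = labels[:usable].reshape(n_windows, window)[:, -1]
    return x, y

# Synthetic stand-in with the same shape as the EEG Eye State data.
rng = np.random.default_rng(0)
fake_features = rng.normal(size=(14980, 14))
fake_labels = rng.integers(0, 2, size=14980)

x, y = split_into_windows(fake_features, fake_labels, window=20)
print(x.shape)  # (749, 20, 14)
print(y.shape)  # (749,)
```

With the real dataframe, the inputs would be `df.drop(columns=['eyeDetection']).to_numpy()` and `df['eyeDetection'].to_numpy()`. Since the eye state can change inside a 20-step block, a majority vote over the block's labels is another reasonable labelling rule.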

python

numpy

dataframe

pandas

recurrent-neural-network