How to check preprocessing time/speed in Colab?

Question

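I am training a neural network on a Google Colab GPU. To do that, I synced my input images (about 180k in total: 105k for training, 76k for validation) to my Google Drive, mounted the drive in Colab, and work from there. I load a CSV file with image paths and labels into Colab and store it as a pandas DataFrame. From that I build the lists of image paths and labels.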

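I use the following function to one-hot encode my labels, because each label needs a special output shape of (7, 35), which the existing default functions cannot produce: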

# One-hot encode the labels; the target array has a shape of (7, 35)

def my_onehot_encoded(label):
    # define universe of possible input values
    characters = '0123456789ABCDEFGHIJKLMNPQRSTUVWXYZ'
    # define a mapping of chars to integers
    char_to_int = dict((c, i) for i, c in enumerate(characters))
    int_to_char = dict((i, c) for i, c in enumerate(characters))
    # integer encode input data
    integer_encoded = [char_to_int[char] for char in label]
    # one hot encode
    onehot_encoded = list()
    for value in integer_encoded:
        character = [0 for _ in range(len(characters))]
        character[value] = 1
        onehot_encoded.append(character)

    return onehot_encoded

After that, I use a custom DataGenerator to feed the data to my model in batches. x_set is the list of image paths and y_set holds the one-hot encoded labels:

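# Imports assumed by this snippet (not shown in the original post);
# imread/resize are presumably the scikit-image functions.
import math
import numpy as np
from skimage.io import imread
from skimage.transform import resize
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):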

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        batch_x = self.x[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
        batch_x = batch_x * 1./255
        batch_y = self.y[idx*self.batch_size : (idx + 1)*self.batch_size]
        batch_y = np.array(batch_y)

        return batch_x, batch_y

With this code I apply the DataGenerator to my data:

training_generator = DataGenerator(X_train, y_train, batch_size=32)
validation_generator = DataGenerator(X_val, y_val, batch_size=32)

When I train my model now, one epoch takes 25-40 minutes, which is very long.

model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch = num_train_samples // 16,
                    validation_steps = num_val_samples // 16,
                    epochs = 10, workers=6, use_multiprocessing=True)

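I would now like to know how to measure the preprocessing time. I do not think the model size is the cause, because I have already experimented with models that have fewer parameters and the training time did not drop significantly. So I suspect the preprocessing.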

Answer 1

To measure time in Colab, you can use the ipython-autotime package:

!pip install ipython-autotime

%load_ext autotime

In addition, for profiling you can use the %time magic.
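To time the preprocessing in isolation from the model, one option is to index the generator directly and time each batch. The following is only a minimal sketch: time_batches and DummyGenerator are hypothetical names introduced here, and the dummy class merely stands in for the DataGenerator from the question so that the snippet runs without any images.

```python
import time

def time_batches(generator, num_batches=10):
    """Return the load time in seconds for each of the first batches."""
    n = min(num_batches, len(generator))
    durations = []
    for idx in range(n):
        start = time.perf_counter()
        _ = generator[idx]  # runs __getitem__: read + resize for one batch
        durations.append(time.perf_counter() - start)
    return durations

class DummyGenerator:
    """Hypothetical stand-in for DataGenerator; sleeps instead of loading images."""
    def __len__(self):
        return 5

    def __getitem__(self, idx):
        time.sleep(0.01)  # simulated preprocessing cost
        return [idx], [idx]

durations = time_batches(DummyGenerator(), num_batches=3)
print(f"mean seconds per batch: {sum(durations) / len(durations):.4f}")
```

In the notebook you would pass training_generator instead of the dummy. Multiplying the mean time per batch by the number of batches per epoch gives a rough estimate of how much of the epoch is spent on preprocessing when a single worker is used.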

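In general, to make the generator run faster, copy the data from Google Drive to the local disk of the Colab instance; reading directly from the mounted drive is slower.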
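One way to do the copy from Python is sketched below. The helper names copy_to_local and localize_paths are hypothetical, and the directory paths in the usage comment are placeholders, not paths taken from the question.

```python
import shutil
from pathlib import Path

def copy_to_local(src_dir, dst_dir):
    """Copy a directory tree once; skip the copy if dst_dir already exists."""
    src, dst = Path(src_dir), Path(dst_dir)
    if not dst.exists():
        shutil.copytree(src, dst)
    return dst

def localize_paths(paths, src_dir, dst_dir):
    """Rewrite image paths so they point at the local copy instead of Drive."""
    src, dst = Path(src_dir), Path(dst_dir)
    # relative_to raises ValueError if a path is not located under src_dir
    return [str(dst / Path(p).relative_to(src)) for p in paths]

# Hypothetical Colab usage (both directories are placeholders):
# local = copy_to_local("/content/drive/MyDrive/images", "/content/images")
# X_train = localize_paths(X_train, "/content/drive/MyDrive/images", local)
```

Copying many small files from Drive can itself be slow, so it may be worth storing the images as a single archive on Drive and unpacking it on the local disk instead.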

If you are using TensorFlow 2.0, the slowdown may be caused by a known bug.

The workarounds are:

Call tf.compat.v1.disable_eager_execution() at the start of your code.
Use model.fit instead of model.fit_generator. The former supports generators anyway.
Downgrade to TF 1.14.

Regardless of the TensorFlow version, limit the amount of disk access you perform, since that is often the bottleneck.
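To check whether disk access really is the bottleneck, you could time raw file reads without any decoding or resizing. This is a rough sketch, and read_throughput is a hypothetical helper, not part of any library.

```python
import time
from pathlib import Path

def read_throughput(paths):
    """Read each file fully and return (total_bytes, elapsed_seconds)."""
    total_bytes = 0
    start = time.perf_counter()
    for p in paths:
        total_bytes += len(Path(p).read_bytes())
    elapsed = time.perf_counter() - start
    return total_bytes, elapsed

# Hypothetical usage on a sample of the image paths from the question:
# n_bytes, seconds = read_throughput(X_train[:200])
# print(f"{n_bytes / 1e6 / seconds:.1f} MB/s")
```

Running this once on paths under the mounted drive and once on local copies shows how much of the batch time is pure I/O. Use different files for each run, because files that were read recently may be served from the operating system's cache.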

Note that there does appear to be an issue with generators being slow in TF 1.13.2 and 2.0.1 (at least).

Tags: python, conv-neural-network, tensorflow, image-preprocessing