How can I clean memory or use SageMaker instead to avoid MemoryError: Unable to allocate for an array with shape (25000, 2000) and data type float64

Question

I am training a model with Keras on SageMaker. This is the code I am using, but I get the following error:

MemoryError: Unable to allocate 381. MiB for an array with shape (25000, 2000) 
    and data type float64

The code is as follows:

import pandas as pd
import numpy as np
from keras.datasets import imdb
from keras import models, layers, optimizers, losses, metrics
import matplotlib.pyplot as plt

# load the preprocessed IMDB dataset, keeping only the 2000 most frequent words
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=2000)

# encode each list of word indices as a binary (multi-hot) row of length `dimension`
def vectorize_sequences(sequences, dimension=2000):
    results = np.zeros((len(sequences), dimension))        
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.                          
    return results

x_train = vectorize_sequences(train_data)                  
x_test = vectorize_sequences(test_data)

Then I get the error.

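The first time I ran this code it worked, but when I tried to re-run it, it failed. How can I fix this by freeing memory, or is there a way to make use of SageMaker's memory instead?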

Answer 1

I don't know much about SageMaker or AWS, but you can cast your input to float32, which takes up half the memory of float64. You can cast it like this:

import tensorflow as tf
x_train = tf.cast(x_train, tf.float32)  # cast the vectorized matrix, not the ragged train_data lists

float32 is the default dtype for TensorFlow weights, so you don't need float64 anyway. Proof:

import tensorflow as tf
layer = tf.keras.layers.Dense(8)
print(layer(tf.random.uniform((10, 100), 0, 1)).dtype)

<dtype: 'float32'>
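
Casting after the fact still requires the float64 array to be built first, so it may be simpler to allocate the matrix in the smaller dtype from the start. The following is a minimal sketch, assuming you are free to modify vectorize_sequences; the dtype parameter and the tiny sample list are additions for illustration and are not part of the original code. It also prints the size the full (25000, 2000) matrix would have for a few dtypes.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=2000, dtype=np.float32):
    # Allocate the multi-hot matrix directly in the target dtype,
    # so no float64 intermediate is ever created.
    results = np.zeros((len(sequences), dimension), dtype=dtype)
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results

# Tiny stand-in for the IMDB word-index lists (hypothetical data).
sample = [[1, 5, 7], [0, 3], [1999]]
x_small = vectorize_sequences(sample)
print(x_small.dtype, x_small.shape)

# Size of the full (25000, 2000) matrix for each dtype, in MiB.
for dt in (np.float64, np.float32, np.uint8):
    mib = 25000 * 2000 * np.dtype(dt).itemsize / 2**20
    print(np.dtype(dt).name, round(mib, 1))
```

With float64 the full matrix needs about 381 MiB, which matches the figure in the error message; float32 halves that, and an integer dtype such as uint8 would shrink it further if your model accepts it.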

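My other suggestions are to take fewer words from the dataset, or not to one-hot encode them at all. If you plan to train a recurrent model with an embedding layer, you don't need the one-hot encoding anyway.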
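
To illustrate the last point: an embedding layer takes integer word indices, so the reviews only need to be padded to a common length rather than expanded to 2000 columns. Below is a rough sketch using plain NumPy. The helper pad_to_length, the maxlen of 200, and the sample data are all hypothetical choices for illustration; Keras has its own pad_sequences utility, whose defaults may differ from this helper.

```python
import numpy as np

def pad_to_length(sequences, maxlen=200, pad_value=0):
    # Hypothetical helper: keep the first `maxlen` word indices of each
    # review and right-pad shorter reviews with `pad_value`.
    padded = np.full((len(sequences), maxlen), pad_value, dtype=np.int32)
    for i, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        padded[i, :len(trimmed)] = trimmed
    return padded

# Tiny stand-in for the IMDB word-index lists (hypothetical data).
sample = [[1, 14, 22, 16], [1, 194], list(range(1, 300))]
x_ids = pad_to_length(sample, maxlen=5)
print(x_ids)
print(x_ids.dtype, x_ids.shape)
```

At the assumed maxlen of 200, the full training set would be a (25000, 200) int32 matrix of roughly 19 MiB, compared with about 381 MiB for the float64 one-hot matrix.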

Tags: python, amazon-web-services, keras, tensorflow, amazon-sagemaker