Can't use Keras CSVLogger callbacks in SageMaker script mode. It fails to write the log file on S3 (error: No such file or directory)

Question

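I have this script in which I want to write the callback logs to a separate CSV file inside the SageMaker custom-script Docker container. But when I try to run it in local mode, it fails with a "No such file or directory" error. I have a hyperparameter tuning (HPO) job to run, and it keeps giving me this error. I need to get this local-mode run working before running HPO.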

In the notebook I used the code below.

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='lstm_model.py', 
                          role=role,
                          code_location=custom_code_upload_location,
                          output_path=model_artifact_location+'/',
                          train_instance_count=1, 
                          train_instance_type='local',
                          framework_version='1.12', 
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 1},
                          base_job_name='hpo-lstm-local-test'
                         )

tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})

In my lstm_model.py script I used the following code.

lgdir = os.path.join(model_dir, 'callbacks_log.csv')
csv_logger = CSVLogger(lgdir, append=True)

regressor.fit(x_train, y_train, batch_size=batch_size,
              validation_data=(x_val, y_val), 
              epochs=epochs,
              verbose=2,
              callbacks=[csv_logger]
              )

I tried to create the file beforehand using the TensorFlow backend, as shown below, but it does not create the file. (K: Keras TensorFlow backend, tf: tensorflow)

filename = tf.Variable(lgdir , tf.string)
content = tf.Variable("", tf.string)
sess = K.get_session()
tf.io.write_file(filename, content)

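I can't use other packages such as pandas to create the file, because the TensorFlow Docker containers that SageMaker provides for custom scripts don't include them. They ship only a limited set of packages.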
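For reference, creating an empty file does not need pandas. The following is a minimal sketch using only the Python standard library. It assumes the target is a local filesystem path, since the built-in open() cannot write to an s3:// URI. The helper name precreate_csv and the demo path are hypothetical and not part of the original script.

```python
import csv
import os
import tempfile


def precreate_csv(path, header=None):
    """Create the parent directory and an empty CSV file at a LOCAL path.

    Hypothetical helper for illustration; it does not work for s3:// URIs.
    """
    parent = os.path.dirname(path)
    if parent:
        # Create missing directories so open() does not fail with
        # "No such file or directory".
        os.makedirs(parent, exist_ok=True)
    # newline="" is the mode the csv module recommends for writing.
    with open(path, "w", newline="") as handle:
        if header:
            csv.writer(handle).writerow(header)
    return path


if __name__ == "__main__":
    # Demo in a throwaway temp directory (assumed location, not SageMaker-specific).
    demo_path = os.path.join(tempfile.mkdtemp(), "logs", "callbacks_log.csv")
    precreate_csv(demo_path)
    print(os.path.isfile(demo_path))
```

If lgdir is built from an S3 location, this helper will not help; the path handed to CSVLogger has to be local to the container.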

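Is there a way to write the CSV file to the S3 bucket location before the fit method tries to write the callback log? Or is that even the right way to solve the problem? I'm not sure.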

If you can suggest any other way to capture the callback logs, I will accept that as an answer too, as long as it is worth the effort.

This Docker image really narrows down what I can do.

Answer 1

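First, you can always build your own Docker image using a TensorFlow image as the base. I work in TensorFlow 2.0, so this will be slightly different for you, but here is an example of my image pattern: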

# Downloads the TensorFlow library used to run the Python script
# (use the equivalent tag for your TF version)
FROM tensorflow/tensorflow:2.0.0a0

# Contains the common functionality necessary to create a container compatible with Amazon SageMaker
RUN pip install sagemaker-containers -q

# Wandb allows us to customize and centralize logging while maintaining open-source agility
# (this is where you would install pandas instead)
RUN pip install wandb -q

# Copies the training code inside the container to the design pattern created by the Tensorflow estimator
# (here you could also copy over a callbacks csv)
COPY mnist-2.py /opt/ml/code/mnist-2.py
COPY callbacks.py /opt/ml/code/callbacks.py
COPY wandb_setup.sh /opt/ml/code/wandb_setup.sh

# Set the login script as the entry point
# (you would launch lstm_model.py here instead)
ENV SAGEMAKER_PROGRAM wandb_setup.sh

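I believe you are looking for a pattern similar to this, though I prefer to log all my model data with Weights and Biases. Their SageMaker integration documentation is a bit sparse, but I am actually writing an updated tutorial for them. It should be finished this month and will cover logging and comparing runs from hyperparameter tuning jobs.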
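If a custom image is more than you need, another option may be to point CSVLogger at a local directory inside the container. This is a sketch under two assumptions: that model_dir in the script resolves to an s3:// URI (as the error in the title suggests), and that the container sets the SM_OUTPUT_DATA_DIR environment variable (typically /opt/ml/output/data), whose contents SageMaker uploads to the job's output path when training ends. The helper names resolve_log_path, is_s3_uri and make_csv_logger are hypothetical, and the tensorflow.keras import is a guess at what lstm_model.py uses.

```python
import os
import tempfile


def is_s3_uri(path):
    """Return True for paths that plain open() (and CSVLogger) cannot write to."""
    return path.startswith("s3://")


def resolve_log_path(filename="callbacks_log.csv"):
    """Return a local path for the CSV log, creating its directory if needed."""
    # Assumption: SM_OUTPUT_DATA_DIR is set inside the SageMaker container.
    # Outside SageMaker, fall back to the system temp directory.
    log_dir = os.environ.get("SM_OUTPUT_DATA_DIR", tempfile.gettempdir())
    os.makedirs(log_dir, exist_ok=True)
    return os.path.join(log_dir, filename)


def make_csv_logger():
    """Build a CSVLogger that writes to the local path (needs TensorFlow)."""
    # Imported lazily so the path helpers work without TensorFlow installed.
    from tensorflow.keras.callbacks import CSVLogger
    return CSVLogger(resolve_log_path(), append=True)


if __name__ == "__main__":
    path = resolve_log_path()
    print(path, is_s3_uri(path))
```

In lstm_model.py this would replace the os.path.join(model_dir, 'callbacks_log.csv') line, for example callbacks=[make_csv_logger()]. The log should then appear in the output archive under output_path rather than next to the model artifacts, so check where it lands after a local-mode run before starting the HPO job.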

keras

tensorflow

amazon-sagemaker