Is it a good idea to store my dataset on my notebook instance in SageMaker?
I'm new to AWS and am considering using Amazon SageMaker to train my deep learning models, because the large dataset and neural network I have to train are causing me to run into memory problems. I'm confused about whether to store my data on the notebook instance or in S3. If I store it in S3, will I still be able to access it for training from my notebook instance? Can anyone explain what S3 is used for in AWS machine learning?
Yes, you can use S3 as the storage for your training dataset.
See the diagram in this link, which describes how everything works together: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html
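To make the workflow concrete, here is a minimal sketch of launching a training job against data kept in S3 using the SageMaker Python SDK. The bucket name, image URI, and IAM role below are placeholders you would replace with your own values; this is a configuration sketch, not a runnable script on its own:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Placeholders -- substitute your own container image, role, and bucket.
estimator = Estimator(
    image_uri="<your-training-image-uri>",
    role="<your-sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="File",  # or "Pipe" to stream data instead of downloading it
    sagemaker_session=session,
)

# The channel name ("train") maps to an S3 prefix holding your dataset.
estimator.fit({"train": "s3://<your-bucket>/train/"})
```

The notebook instance only submits the job; the training itself runs on the separate `ml.m5.xlarge` instance, which pulls the data from S3. That is why the dataset does not need to live on the notebook instance at all.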
You may also want to check out the following blog, which details File mode and Pipe mode, the two mechanisms for transferring training data:
In File mode, the training data is first downloaded to an encrypted EBS volume attached to the training instance before training begins. In Pipe mode, by contrast, the input data is streamed directly to the training algorithm while it is running.
- https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/
With Pipe input mode, your data is fed on-the-fly into the algorithm container without involving any disk I/O. This approach shortens the lengthy download process and dramatically reduces startup time. It also offers generally better read throughput than File input mode. This is because your data is fetched from Amazon S3 by a highly optimized multi-threaded background process. It also allows you to train on datasets that are much larger than the 16 TB Amazon Elastic Block Store (EBS) volume size limit.
The blog also includes Python code snippets that use Pipe input mode, for reference.
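The difference between the two modes can be illustrated with a toy Python sketch. This is not SageMaker's actual implementation; it just simulates "copy everything first, then read" (File mode) versus "read chunk by chunk from a stream" (Pipe mode) using an in-memory buffer in place of S3:

```python
import io


def file_mode_read(source: bytes) -> bytes:
    """File mode (simulated): materialize a full local copy, then read it."""
    local_copy = bytes(source)  # stands in for the download to the EBS volume
    return local_copy


def pipe_mode_read(source: bytes, chunk_size: int = 4):
    """Pipe mode (simulated): yield chunks as they arrive, no full local copy."""
    stream = io.BytesIO(source)  # stands in for the S3-backed stream
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


data = b"0123456789"
assert file_mode_read(data) == data
assert b"".join(pipe_mode_read(data)) == data
```

Both paths deliver the same bytes to the algorithm; the practical difference is that the streaming path never waits for, or stores, the whole dataset up front, which is why Pipe mode shortens startup time and sidesteps the EBS volume size limit for very large datasets.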