What causes an unpickling stack underflow when trying to deserialize a successfully generated SageMaker model

Question

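I am currently setting up a pipeline in Amazon SageMaker. To do so, I set up an XGBoost estimator and trained it on my dataset. The training job runs as expected, and the freshly trained model is saved to the specified output bucket. Later I want to re-import the model, which I do by fetching model.tar.gz from the output bucket, extracting the model, and deserializing the binary file with pickle.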

import pickle as pkl
import tarfile

# download the model artifact from AWS S3
!aws s3 cp s3://my-bucket/output/sagemaker-xgboost-2021-09-06-12-19-41-306/output/model.tar.gz .

# open the downloaded model artifact and load it as the 'model' variable
model_path = "model.tar.gz"
with tarfile.open(model_path) as tar:
    tar.extractall(path=".")

model = pkl.load(open("xgboost-model", "rb"))

Whenever I try to run this, I get an unpickling stack underflow:

---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-9-b88a7424f790> in <module>
     10     tar.extractall(path=".")
     11 
---> 12 model = pkl.load(open("xgboost-model", "rb"))
     13 

UnpicklingError: unpickling stack underflow

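So far I have retrained the model to see whether the error also occurs with a different model file, and it does. I also downloaded model.tar.gz and validated it with gunzip. The binary file xgboost-model is extracted correctly, but I cannot unpickle it. Every occurrence of this error that I found on Stack Overflow points to a corrupted file, but this file is generated directly by SageMaker, and I do not perform any transformation on it other than extracting it from model.tar.gz. Reloading a model like this seems to be a very common use case, judging by the documentation and various tutorials. Locally, I get the same error with the downloaded file. I tried stepping into pickle to debug it, but I could not make sense of it. The full error stack looks like this: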

Exception has occurred: UnpicklingError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
unpickling stack underflow
  File "/sagemaker_model.py", line 10, in <module>
    model = pkl.load(open('xgboost-model', 'rb'))
  File "/usr/local/Cellar/python@3.9/3.9.1_5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/Cellar/python@3.9/3.9.1_5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/local/Cellar/python@3.9/3.9.1_5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/local/Cellar/python@3.9/3.9.1_5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/Cellar/python@3.9/3.9.1_5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,

What could be causing this issue, and at which step in the process can I apply a change to fix or work around it?

Answer 1

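The root of the problem is the version of the XGBoost framework used for the model. Starting with 1.3.0, the default output format changed from pickle to JSON, and the SageMaker documentation does not seem to have been updated accordingly. So if you want to read the model via

with tarfile.open(model_path) as tar:
    tar.extractall(path=".")

model = pkl.load(open("xgboost-model", "rb"))

as described in the SageMaker documentation, you need to use an earlier version of the XGBoost framework, e.g. 1.2.1.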
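The underlying point can be reproduced without SageMaker: pickle rejects any byte stream that was not written by pickle, even when the file itself is intact. The sketch below is only an illustration. The helper name `try_unpickle` and the JSON keys are made up, and the JSON payload merely stands in for a model saved in a non-pickle format as this answer describes.

```python
import json
import pickle


def try_unpickle(data: bytes):
    """Return (object, None) on success or (None, error message) on failure."""
    try:
        return pickle.loads(data), None
    except pickle.UnpicklingError as exc:
        return None, str(exc)


# Stand-in for a model file that is valid, but not a pickle.
# The keys below are hypothetical and only serve the illustration.
json_bytes = json.dumps({"learner": {"version": [1, 3, 0]}}).encode("utf-8")

obj, err = try_unpickle(json_bytes)
print("unpickled:", obj)
print("error:", err)

# Bytes that really come from pickle round-trip without an error.
obj, err = try_unpickle(pickle.dumps({"ok": True}))
print("unpickled:", obj)
```

An `UnpicklingError` therefore does not have to mean a damaged download. It can equally mean that the file was written in a different format.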

Answer 2

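After reading @Imoe41's answer, I would like to contribute to this question as well. If you follow the link (https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html), you will see that starting with XGBoost 1.0 models are saved as JSON, whereas versions before 1.0 saved them with pickle. I trained an XGBoost model with SageMaker in 2020 using XGBoost version 0.90. However, the xgboost package version in my notebook is 1.5.1.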

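Solution:

Check the installed xgboost version

import xgboost as xgb
print(xgb.__version__)

If the version is higher than 1.0, you need to downgrade. To downgrade xgboost, you also need to downgrade other packages.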

pip install scipy==1.4.1
pip install shap==0.37.0
pip install xgboost==0.90.0

Load the model as a pickle

import pickle as pkl
import tarfile
t = tarfile.open('model.tar.gz', 'r:gz')
t.extractall()
model = pkl.load(open("xgboost-model", 'rb'))

Answer 3

The latest XGBoost versions seem to have changed this process. This works for 1.3.x:

import xgboost as xgb

model = xgb.Booster()
model.load_model('xgboost-model')
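If the same code has to handle artifacts from both older and newer training jobs, one option is to try pickle first and fall back to the native loader. This is a sketch under assumptions rather than a tested recipe: the function name `load_model_artifact` is hypothetical, and it assumes that a non-pickle artifact fails with `UnpicklingError`, as in the question, and can then be read by `Booster.load_model`.

```python
import pickle


def load_model_artifact(path):
    """Load an extracted 'xgboost-model' file, whichever way it was saved."""
    try:
        with open(path, "rb") as f:
            # Older artifacts: a pickled model object.
            return pickle.load(f)
    except pickle.UnpicklingError:
        # Assumption: the file was written in XGBoost's own format instead.
        # xgboost is imported lazily because only this branch needs it.
        import xgboost as xgb

        booster = xgb.Booster()
        booster.load_model(path)
        return booster
```

Calling `load_model_artifact("xgboost-model")` after extracting model.tar.gz returns either the unpickled object or a `Booster`. As with any pickle, this should only be used on artifacts you trust, since unpickling can execute arbitrary code.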

pickle

python-3.x

amazon-sagemaker