How to ensure persistent sklearn models on bit level

I want to create a persistent scikit-learn model and reference it later via a hash. I use joblib for serialization, and as long as my data does not change I expect full (bit-level) integrity. But every time I run the code, the model file on disk has a different hash. Why is that, and how can I get a truly identical serialization when the code is unchanged between runs? Setting a fixed seed does not help (I am not sure whether sklearn's algorithm even uses random numbers in this simple example).

import numpy as np
from sklearn import linear_model
import joblib
import hashlib

# set a fixed seed … 
np.random.seed(1979)

# internal md5sum function
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# dummy regression data
X = [[0., 0., 0., 1.], [1., 0., 0., 0.], [2., 2., 0., 1.], [2., 5., 1., 0.]]
Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]

# create model
reg = linear_model.LinearRegression()

# save model to disk to make it persistent
# (joblib.dump opens the file itself; wrapping it in an extra
# `with open(...)` is unnecessary, and opening with "w" would
# even truncate the file first)
joblib.dump(reg, "reg.joblib")

# load persistent model from disk
model = joblib.load("reg.joblib")

# fit & predict
reg.fit(X,Y)
model.fit(X,Y)
myprediction1 = reg.predict([[2., 2., 0.1, 1.1]])
myprediction2 = model.predict([[2., 2., 0.1, 1.1]])

# run several times … why does the md5sum change every time?
print(md5("reg.joblib"))
print(myprediction1, myprediction2)

After some research I found the answer to my question. The differing hash of the joblib file on each run has nothing to do with scikit-learn or the trained model. In fact, joblib.hash(reg) demonstrates that the hash of the bare model is identical across runs, which means the weights of the trained regression model do not change. This handy function also solves my original "business" problem.
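To illustrate the point, here is a minimal sketch (using the same toy data as above) showing that joblib.hash, which hashes the in-memory object rather than the dumped file, yields identical digests for identically trained models:

```python
import joblib
from sklearn import linear_model

X = [[0., 0., 0., 1.], [1., 0., 0., 0.], [2., 2., 0., 1.], [2., 5., 1., 0.]]
Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]

# train the same model twice from scratch
reg1 = linear_model.LinearRegression().fit(X, Y)
reg2 = linear_model.LinearRegression().fit(X, Y)

# joblib.hash computes a deterministic hash of the object's contents
# (with special handling for NumPy arrays), so two identically
# trained models produce the same digest
print(joblib.hash(reg1) == joblib.hash(reg2))  # True
```

This is the object-level integrity check that the file-level MD5 sum fails to provide.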

The root cause of the non-reproducible MD5 sum of the dumped file lies in the underlying pickle serialization that joblib.dump is built on. The decisive hint came from How to hash a large object (dataset) in Python?. Somewhere in the depths of the internet, this old finding provides some background:

Since the pickle data format is actually a tiny stack-oriented programming language, and some freedom is taken in the encodings of certain objects, it is possible that the two modules produce different data streams for the same input objects. However it is guaranteed that they will always be able to read each other's data streams.
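So the practical takeaway is to compare object-level hashes instead of file-level checksums. A minimal sketch of a round trip through joblib.dump and joblib.load, verified with joblib.hash:

```python
import joblib
from sklearn import linear_model

X = [[0., 0., 0., 1.], [1., 0., 0., 0.], [2., 2., 0., 1.], [2., 5., 1., 0.]]
Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]

reg = linear_model.LinearRegression().fit(X, Y)

# the on-disk bytes may differ between runs (pickle takes some
# freedom in its encoding), but the restored object's contents are
# identical to the original
joblib.dump(reg, "reg.joblib")
restored = joblib.load("reg.joblib")

print(joblib.hash(restored) == joblib.hash(reg))  # True
```

In other words: hash the model, not the file.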