How to "save" an IsolationForest model in Python?

Question

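Hey, I am using sklearn.ensemble.IsolationForest to predict outliers in my data.

Is it possible to train (fit) the model once on my clean data and then save it for later use? For example, by saving some attributes of the model, so that next time I don't need to call fit again to train it.

For example, for a GMM I would save the weights_, means_ and covs_ of each component, so that later I wouldn't need to train the model again.

To make this clear: I am using this for online fraud detection, where the Python script is called many times for the same "category" of data, and I don't want to train the model every time I need to perform a prediction or test action.

Thanks in advance.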

Answer 1

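sklearn estimators implement methods that make it easy to save the relevant trained properties of an estimator. Some estimators implement __getstate__ themselves, but others, like GMM, just use the base implementation, which simply saves the object's inner dictionary: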

def __getstate__(self):
    try:
        state = super(BaseEstimator, self).__getstate__()
    except AttributeError:
        state = self.__dict__.copy()

    if type(self).__module__.startswith('sklearn.'):
        return dict(state.items(), _sklearn_version=__version__)
    else:
        return state
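To see what that state looks like for the estimator in the question, here is a small sketch that fits an IsolationForest and inspects what pickle would store. The random toy data and the variable names are my own assumptions, not part of the question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy "clean" data, purely illustrative.
rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))

forest = IsolationForest(n_estimators=10, random_state=0).fit(X_train)

# pickle calls __getstate__ to decide what to store.
state = forest.__getstate__()

# Fitted attributes (public names with a trailing underscore) are part of
# the state, next to the sklearn version that produced them.
fitted_keys = sorted(
    k for k in state if k.endswith("_") and not k.startswith("_")
)
print(fitted_keys)
print(state["_sklearn_version"])
```

Since everything learned during fit lives in that dictionary, pickling the whole estimator saves you from picking out individual attributes by hand.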

The recommended way to save your model to disk is to use the pickle module:

import pickle

from sklearn import datasets
from sklearn.svm import SVC

# Two-class subset of iris, first two features only
iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]

model = SVC()
model.fit(X, y)

# Serialize the fitted estimator to disk
with open('mymodel', 'wb') as f:
    pickle.dump(model, f)
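Loading the model back is the mirror image of the dump. A minimal sketch, assuming you want to try it standalone: it repeats the fit and writes to a temporary directory instead of the working directory. The path and the names restored and same are illustrative.

```python
import os
import pickle
import tempfile

from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]
model = SVC().fit(X, y)

# Write the fitted model somewhere disposable.
path = os.path.join(tempfile.mkdtemp(), "mymodel")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, e.g. in another process: restore it without calling fit again.
# Only unpickle files you trust, since unpickling can execute arbitrary code.
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```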

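However, you should save additional data so that you can retrain your model in the future, or you may suffer dire consequences (such as being locked into an old version of sklearn).

From the documentation: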

In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along the pickled model:

The training data, e.g. a reference to a immutable snapshot

The python source code used to generate the model

The versions of scikit-learn and its dependencies

The cross validation score obtained on the training data

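This is especially true for ensemble estimators that rely on the tree.pyx module written in Cython (such as IsolationForest), since it creates a coupling to the implementation, which is not guaranteed to be stable between versions of sklearn. It has seen backwards-incompatible changes in the past.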
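One way to act on that advice is to write a small metadata file next to the pickle and check it before loading. This is only a sketch under my own assumptions: the file names, the metadata keys and the helper load_if_compatible are hypothetical, not something sklearn provides.

```python
import hashlib
import json
import os
import pickle
import platform
import tempfile

import numpy as np
import sklearn
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))  # stand-in for the real clean data
model = IsolationForest(n_estimators=10, random_state=0).fit(X_train)

out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "iforest.pkl")      # hypothetical names
meta_path = os.path.join(out_dir, "iforest.meta.json")

with open(model_path, "wb") as f:
    pickle.dump(model, f)

metadata = {
    "sklearn_version": sklearn.__version__,
    "numpy_version": np.__version__,
    "python_version": platform.python_version(),
    # Fingerprint of the training data, so the snapshot can be identified.
    "training_data_sha256": hashlib.sha256(
        np.ascontiguousarray(X_train).tobytes()
    ).hexdigest(),
    "params": model.get_params(),
}
with open(meta_path, "w") as f:
    json.dump(metadata, f, indent=2, default=str)


def load_if_compatible(model_path, meta_path):
    """Unpickle only if the recorded sklearn version matches the running one."""
    with open(meta_path) as f:
        meta = json.load(f)
    if meta["sklearn_version"] != sklearn.__version__:
        raise RuntimeError(
            "model was saved with sklearn %s, running %s: retrain instead"
            % (meta["sklearn_version"], sklearn.__version__)
        )
    with open(model_path, "rb") as f:
        return pickle.load(f)


restored = load_if_compatible(model_path, meta_path)
```

Refusing to load on a version mismatch is a deliberately strict choice; with the training data snapshot and the source code at hand, refitting under the new version is then a routine step.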

If your models become very large and loading becomes a nuisance, you can also use the more efficient joblib. From the documentation:

In the specific case of the scikit, it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on objects that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators, but can only pickle to the disk and not to a string:
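A minimal sketch of the joblib variant, assuming the standalone joblib package is installed (it is a dependency of current scikit-learn). The toy data and the file name are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))  # stand-in for the real clean data
model = IsolationForest(n_estimators=10, random_state=0).fit(X_train)

path = os.path.join(tempfile.mkdtemp(), "iforest.joblib")
joblib.dump(model, path)      # write the fitted estimator to disk
restored = joblib.load(path)  # read it back, no fit needed

# The restored forest should score samples exactly like the original.
scores_match = bool(
    np.allclose(
        restored.decision_function(X_train),
        model.decision_function(X_train),
    )
)
print(scores_match)
```

joblib.dump also accepts a compress argument (for example compress=3) if file size matters more than load speed.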

Answer 2

https://docs.python.org/2/library/pickle.html

Use the pickle library.

Fit your model.

Save it with pickle.dump(obj, file[, protocol]).

Load it with pickle.load(file).

Classify your outliers.
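Put together for the fraud-detection use case, the steps above could look roughly like this. The two helper functions, the file name and the synthetic data are assumptions for illustration; in a real setup the training part would run once and the prediction part on every call of the script.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import IsolationForest


def train_and_save(X_clean, path):
    """Fit once on clean data and write the fitted model to disk."""
    model = IsolationForest(n_estimators=100, random_state=0).fit(X_clean)
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_and_predict(path, X_new):
    """Load the stored model and label samples: 1 = inlier, -1 = outlier."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    return model.predict(X_new)


# One-off training step on synthetic "clean" data.
rng = np.random.RandomState(42)
X_clean = rng.normal(size=(500, 2))
path = os.path.join(tempfile.mkdtemp(), "iforest_category_a.pkl")
train_and_save(X_clean, path)

# Every later call only loads and predicts.
X_new = np.array([[0.1, -0.2], [10.0, 10.0]])
labels = load_and_predict(path, X_new)
print(labels)
```

Keeping one file per data "category" means each call only pays for unpickling, not for refitting.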
