How to pickle files > 2 GiB by splitting them into smaller fragments

I have a classifier object larger than 2 GiB that I want to pickle, but I get this:

cPickle.dump(clf, fo, protocol = cPickle.HIGHEST_PROTOCOL)

OverflowError: cannot serialize a string larger than 2 GiB

I found this question with the same problem, where the suggestions were to:

  1. use Python 3's protocol 4 - not acceptable, because I need to use Python 2
  2. use from pyocser import ocdumps, ocloads - not acceptable, because I can't use other (non-trivial) modules
  3. break the object into bytes and pickle each fragment

Is there a way to do this with my classifier? i.e. turn it into bytes, split them, pickle each fragment, unpickle the fragments, join the bytes, and use the classifier?
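
Conceptually, the split / rejoin I have in mind is sketched below (the helper names and the 1 GiB chunk size are made up; the blocker is that cPickle.dumps(clf) itself raises the same OverflowError, so this only helps once the object can be turned into a byte string at all):

import cPickle

CHUNK = 2 ** 30  # 1 GiB fragments, safely below the 2 GiB limit

def dump_in_chunks(data, path):
    # data is an already-serialized byte string; producing it is the hard
    # part, since cPickle.dumps(clf) hits the same OverflowError
    with open(path, 'wb') as fo:
        cPickle.dump(len(data), fo, cPickle.HIGHEST_PROTOCOL)
        for i in xrange(0, len(data), CHUNK):
            cPickle.dump(data[i:i + CHUNK], fo, cPickle.HIGHEST_PROTOCOL)

def load_from_chunks(path):
    # read the fragments back in order and join them into one byte string
    with open(path, 'rb') as fo:
        total = cPickle.load(fo)
        parts, read = [], 0
        while read < total:
            part = cPickle.load(fo)
            parts.append(part)
            read += len(part)
    return ''.join(parts)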


My code:

from sklearn.svm import SVC
import cPickle
import time

def train_clf(X, y, clf_name):
    start_time = time.time()
    # after many tests, this was found to be the best classifier
    clf = SVC(C=0.01, kernel='poly')
    clf.fit(X, y)
    print 'fit done... {} seconds'.format(time.time() - start_time)
    with open(clf_name, "wb") as fo:
        cPickle.dump(clf, fo, protocol=cPickle.HIGHEST_PROTOCOL)
        # cPickle.HIGHEST_PROTOCOL == 2
        # the error occurs inside the dump method
    return time.time() - start_time

After this, I want to unpickle it and use it:

with open(clf_name, 'rb') as fo:
    clf, load_time = cPickle.load(fo), time.time()

You can use sklearn.externals.joblib. If the model is large, it automatically splits the model file into pickled numpy array files:

from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl') 
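
Without compression, joblib may write the large numpy arrays to separate side files next to the main pickle; joblib.dump returns the list of files it created, so you can check what has to be kept together (a sketch; exact side-file names such as filename.pkl_01.npy depend on the joblib version):

from sklearn.externals import joblib

filenames = joblib.dump(clf, 'filename.pkl')
# older joblib versions report side files like 'filename.pkl_01.npy' here;
# all of them are needed for joblib.load to succeed
print filenames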

Update: newer versions of sklearn will show

DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.

So use this instead:

import joblib
joblib.dump(clf, 'filename.pkl') 
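
If the extra side files are a concern, joblib can also write a single self-contained file by enabling compression through its compress parameter (an integer from 0 to 9, trading CPU time for fewer and smaller files):

import joblib
joblib.dump(clf, 'filename.pkl', compress=3)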

Later, it can be unpickled with:

clf = joblib.load('filename.pkl')
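
Applied to the question's train_clf, only the dump call needs to change; a minimal sketch:

import time

import joblib
from sklearn.svm import SVC

def train_clf(X, y, clf_name):
    start_time = time.time()
    clf = SVC(C=0.01, kernel='poly')
    clf.fit(X, y)
    # joblib stores the large numpy arrays (e.g. the support vectors)
    # in its own format, avoiding cPickle's 2 GiB single-string limit
    joblib.dump(clf, clf_name)
    return time.time() - start_time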