How to pickle files > 2 GiB by splitting them into smaller fragments
I have a classifier object larger than 2 GiB that I want to pickle, and I get this:
cPickle.dump(clf, fo, protocol = cPickle.HIGHEST_PROTOCOL)
OverflowError: cannot serialize a string larger than 2 GiB
I found this question with the same problem. The suggestions there were:
- use Python 3 with protocol 4 - not acceptable, because I need to use Python 2
- use from pyocser import ocdumps, ocloads - not acceptable, because I can't use other (non-trivial) modules
- break the object into bytes and pickle each fragment

Is there a way to do this with my classifier? I.e. convert it to bytes, split it, pickle the pieces, unpickle them, concatenate the bytes, and use the classifier?
My code:
from sklearn.svm import SVC
import cPickle
import time

def train_clf(X, y, clf_name):
    start_time = time.time()
    # after many tests, this was found to be the best classifier
    clf = SVC(C=0.01, kernel='poly')
    clf.fit(X, y)
    print 'fit done... {} seconds'.format(time.time() - start_time)
    with open(clf_name, "wb") as fo:
        cPickle.dump(clf, fo, protocol=cPickle.HIGHEST_PROTOCOL)
        # cPickle.HIGHEST_PROTOCOL == 2
        # the error occurs inside the dump method
    return time.time() - start_time
After this, I want to unpickle it and use it:
with open(clf_name, 'rb') as fo:
    clf, load_time = cPickle.load(fo), time.time()
You can use sklearn.externals.joblib; if the model is large, it automatically splits the model file into pickled numpy array files:
from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl')
Update: sklearn will show
DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
So use this instead:
import joblib
joblib.dump(clf, 'filename.pkl')
It can later be unpickled with:
clf = joblib.load('filename.pkl')
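A full round trip with standalone joblib looks like this minimal sketch (the payload is a stand-in dict rather than a fitted estimator, and the compress argument is optional; it trades dump/load speed for disk space):

```python
import os
import tempfile

import joblib

# Any picklable object works; a fitted estimator is the usual payload.
model = {'weights': list(range(10)), 'kernel': 'poly'}

path = os.path.join(tempfile.mkdtemp(), 'filename.pkl')
joblib.dump(model, path, compress=3)  # compress=3: zlib level 3, single-file output
restored = joblib.load(path)
```

With compress set, joblib writes one compressed file, which is often the more convenient layout when the model has to be copied between machines.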