MemoryError using MultinomialNB
I am getting a MemoryError when using sklearn.naive_bayes.MultinomialNB for named-entity recognition on a large dataset, where train.shape = (416330, 97896).
import csv
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the tagged train/test sentences
data_train = pd.read_csv(path[0] + "train_SENTENCED.tsv", encoding="utf-8", sep='\t', quoting=csv.QUOTE_NONE)
data_test = pd.read_csv(path[0] + "test_SENTENCED.tsv", encoding="utf-8", sep='\t', quoting=csv.QUOTE_NONE)
print('TRAIN_DATA:\n', data_train.tail(5))

# FIT TRANSFORM
X_TRAIN = data_train.drop('Tag', axis=1)
X_TEST = data_test.drop('Tag', axis=1)

# Vectorize the token features; sparse=False forces a dense output matrix
v = DictVectorizer(sparse=False)
X_train = v.fit_transform(X_TRAIN.to_dict('records'))
X_test = v.transform(X_TEST.to_dict('records'))

y_train = data_train.Tag.values
y_test = data_test.Tag.values
classes = np.unique(y_train).tolist()

nb = MultinomialNB(alpha=0.01)
nb.partial_fit(X_train, y_train, classes)

new_classes = classes.copy()
new_classes.pop()

predictions = nb.predict(X_test)
The error output is as follows:
Traceback (most recent call last):
File "naive_bayes_classifier/main.py", line 107, in <module>
X_train = v.fit_transform(X_TRAIN.to_dict('records'))
File "lib/python3.8/site-packages/sklearn/feature_extraction/_dict_vectorizer.py", line 313, in fit_transform
return self._transform(X, fitting=True)
File "lib/python3.8/site-packages/sklearn/feature_extraction/_dict_vectorizer.py", line 282, in _transform
result_matrix = result_matrix.toarray()
File "lib/python3.8/site-packages/scipy/sparse/compressed.py", line 1031, in toarray
out = self._process_toarray_args(order, out)
File "lib/python3.8/site-packages/scipy/sparse/base.py", line 1202, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions.MemoryError: Unable to allocate 304. GiB for an array with shape (416330, 97896) and data type float64
I also tried giving the vectorizer a smaller dtype, DictVectorizer(sparse=False, dtype=np.short), but then the code fails on the nb.partial_fit(X_train, y_train, classes) line instead.
How can I prevent this memory error? Is there a proper way to solve it? I have considered splitting the training set, but since the vectorizer is fitted on the corresponding dataset, would that even be the right solution?
Sadly, the problem is that, the way sklearn handles this, it really does need a huge amount of memory.
.partial_fit() is the right tool here, but I suggest chunking your data into smaller parts and partially fitting the classifier on each chunk. Try splitting your dataset into small chunks and see if that works; if you still hit the same error, try even smaller chunks.
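For reference, a dense float64 matrix of shape (416330, 97896) takes about 416330 × 97896 × 8 bytes ≈ 304 GiB, which is exactly what the traceback reports. Below is a minimal sketch of the chunked partial_fit idea, reusing the X_TRAIN, X_TEST, y_train and classes objects from your question; the chunk size of 10000 is just a guess you will need to tune, and keeping the DictVectorizer's default sparse output is my own addition so each chunk stays small.

# Sketch: fit the vocabulary once, then train on sparse chunks.
records = X_TRAIN.to_dict('records')

v = DictVectorizer(sparse=True)      # default; chunks stay as sparse matrices
v.fit(records)                       # fit() only learns the vocabulary, no big matrix is built

nb = MultinomialNB(alpha=0.01)
chunk_size = 10000                   # assumption: tune to your available RAM

for start in range(0, len(records), chunk_size):
    X_chunk = v.transform(records[start:start + chunk_size])
    y_chunk = y_train[start:start + chunk_size]
    nb.partial_fit(X_chunk, y_chunk, classes=classes)

predictions = nb.predict(v.transform(X_TEST.to_dict('records')))

This also addresses your concern about the vectorizer: fitting it once on the full record list only needs the feature names, so every chunk is transformed against the same vocabulary.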