Memory error when calling todense() in Python using CountVectorizer

Question

Here is my code and the memory error I get when calling todense(). I am using a GBDT model. Does anyone have a good idea how to solve the memory error? Thanks.

  for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()
  y_train = y_train.astype('int')
  grd = GradientBoostingClassifier(n_estimators=n_estimator, max_depth=10)
  grd.fit(X_train.values, y_train.values)

Detailed error message:

in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
...

Regards, Lin

Answer 1

There are several mistakes here:

for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()

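1) You are trying to assign multiple columns (the result of CountVectorizer is a 2-D array whose columns represent features) to a single DataFrame column, 'feature_colunm_name'. That will not work and will raise an error.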

2) You are fitting CountVectorizer again on the test data, which is wrong. On the test data you should use the same CountVectorizer object that you used on the training data, and only call transform(), not fit_transform().

Something like:

cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train[feature_colunm_name])
X_test_cv = cv.transform(X_test[feature_colunm_name])
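
To make point 2) concrete, here is a minimal sketch on made-up toy documents (the strings and the name refit are illustrative, not from the question). It shows that a vectorizer fitted separately on the test set learns a different vocabulary, while reusing the fitted object keeps the columns aligned:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, made up for illustration only.
train_docs = ["red apple", "green apple", "red car"]
test_docs = ["blue car", "green bike fast"]

# Wrong: a second vectorizer fitted on the test set learns its own vocabulary.
refit = CountVectorizer().fit(test_docs)

# Right: fit once on the training set, then reuse the same object.
cv = CountVectorizer()
X_train_cv = cv.fit_transform(train_docs)
X_test_cv = cv.transform(test_docs)

train_vocab = sorted(cv.vocabulary_)
refit_vocab = sorted(refit.vocabulary_)

print(train_vocab)                        # vocabulary learned from the training set
print(refit_vocab)                        # a different vocabulary learned from the test set
print(X_train_cv.shape, X_test_cv.shape)  # same number of columns in both
```

Words that appear only in the test set are simply ignored by transform(), so the test matrix always has the same columns, in the same order, as the training matrix.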

3) GradientBoostingClassifier works with sparse data. The documentation does not mention this yet (it appears to be an omission in the docs).

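4) You appear to be converting multiple columns of the original data to bag-of-words form. For that you need one CountVectorizer object per column, then merge all the outputs into a single array and pass it to GradientBoostingClassifier.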
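
As a rough illustration of why todense() runs out of memory (the traceback shows it calling np.zeros(self.shape, ...)), the sketch below compares the bytes held by the sparse matrix with the bytes a dense copy would need. The corpus is synthetic and its sizes are arbitrary assumptions; real data will give different numbers:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Synthetic corpus, made up for illustration: 2000 short documents
# drawn from a pool of 5000 distinct tokens.
rng = np.random.default_rng(0)
pool = np.array([f"tok{i}" for i in range(5000)])
docs = [" ".join(rng.choice(pool, size=10)) for _ in range(2000)]

X = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix in CSR format

# Bytes actually held by the sparse representation.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
# Bytes a dense copy would need: rows * columns * bytes per element.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print(X.shape, X.nnz)
print(f"sparse: {sparse_bytes / 1e6:.2f} MB, dense: {dense_bytes / 1e6:.2f} MB")
```

The dense size grows with rows times vocabulary size, while the sparse size grows only with the number of non-zero counts, which is why keeping the matrix sparse avoids the error.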

Update:

You need to set it up like this:

# To merge sparse matrices
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

result_matrix_train = None
result_matrix_test = None

for feature_colunm_name in feature_columns_to_use:
    cv = CountVectorizer()
    X_train_cv = cv.fit_transform(X_train[feature_colunm_name])

    # Merge the vector with others (parentheses allow the expression to span two lines)
    result_matrix_train = (hstack((result_matrix_train, X_train_cv))
                           if result_matrix_train is not None else X_train_cv)

    # Now transform the test data with the same fitted vectorizer
    X_test_cv = cv.transform(X_test[feature_colunm_name])
    result_matrix_test = (hstack((result_matrix_test, X_test_cv))
                          if result_matrix_test is not None else X_test_cv)

Note: if you have other columns that you did not run through CountVectorizer because they are already numeric or similar, and you want to merge them with result_matrix_train, you can do that as well:

result_matrix_train = hstack((result_matrix_train, X_train[other_columns].values))
result_matrix_test = hstack((result_matrix_test, X_test[other_columns].values))
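
A slightly more defensive variant of the merge above, as a sketch: it wraps the numeric block in csr_matrix first so that every block handed to hstack is sparse, and asks for CSR output. The DataFrame and its column names ("text", "price", "qty") are hypothetical and exist only for this example:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy frame, made up for illustration; the column names are hypothetical.
X_train = pd.DataFrame({
    "text": ["red apple", "green apple", "red car"],
    "price": [1.5, 2.0, 30.0],
    "qty": [3, 1, 2],
})
other_columns = ["price", "qty"]

X_train_cv = CountVectorizer().fit_transform(X_train["text"])

# Wrap the numeric block as sparse so every block passed to hstack is sparse.
numeric_train = csr_matrix(X_train[other_columns].values.astype(np.float64))

# format="csr" returns a CSR matrix instead of hstack's default COO format.
result_matrix_train = hstack((X_train_cv, numeric_train), format="csr")
print(result_matrix_train.shape)
```

The same wrapping would apply to the test frame, using the vectorizer fitted on the training data.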

Now train with these:

...
grd.fit(result_matrix_train, y_train.values)
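
Putting the pieces together, here is a small end-to-end sketch on invented toy data. The column names "title" and "body", the labels, and the n_estimators and max_depth values are assumptions chosen to keep the example tiny; they are not the asker's settings. It collects the per-column matrices in lists and stacks them once, which is equivalent to the incremental merge above:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, made up for illustration only.
X_train = pd.DataFrame({
    "title": ["cheap phone", "luxury watch", "cheap watch", "luxury phone"],
    "body": ["buy now", "fine gold", "buy now", "fine gold"],
})
y_train = pd.Series([0, 1, 0, 1])
X_test = pd.DataFrame({"title": ["cheap tablet"], "body": ["buy now"]})
feature_columns_to_use = ["title", "body"]

train_blocks, test_blocks = [], []
for feature_colunm_name in feature_columns_to_use:
    cv = CountVectorizer()  # one vectorizer per text column
    train_blocks.append(cv.fit_transform(X_train[feature_colunm_name]))
    test_blocks.append(cv.transform(X_test[feature_colunm_name]))

# Stack all per-column matrices side by side; nothing is ever made dense.
result_matrix_train = hstack(train_blocks, format="csr")
result_matrix_test = hstack(test_blocks, format="csr")

grd = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0)
grd.fit(result_matrix_train, y_train.values)  # sparse input is accepted
pred = grd.predict(result_matrix_test)
print(result_matrix_train.shape, result_matrix_test.shape, pred)
```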
