Memory error when calling todense() in Python using CountVectorizer
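Here is my code and the memory error I get when calling todense(). I am using a GBDT model and am wondering if anyone has a good idea how to work around the memory error. Thanks.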
for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()
y_train = y_train.astype('int')
grd = GradientBoostingClassifier(n_estimators=n_estimator, max_depth=10)
grd.fit(X_train.values, y_train.values)
Detailed error message:
  in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
...
Regards,
Lin
There are several mistakes here:
for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()
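1) You are trying to assign multiple columns (the result of CountVectorizer is a two-dimensional array in which each column represents a feature) to a single DataFrame column, 'feature_colunm_name'. That will not work and will raise an error.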
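2) You are fitting CountVectorizer again on the test data, which is wrong. You should use the same CountVectorizer object on the test data that you fitted on the training data, and call only transform(), not fit_transform(). Otherwise the test matrix gets its own vocabulary and its columns no longer line up with those of the training matrix.
Something like: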
cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train[feature_colunm_name])
X_test_cv = cv.transform(X_test[feature_colunm_name])
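3) GradientBoostingClassifier works with sparse data, so there is no need to call todense() at all. At the time of writing this is not mentioned in the documentation (which seems to be an oversight in the docs).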
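4) You appear to be converting several columns of the original data into bag-of-words form. For that you need one CountVectorizer object per column, and you then need to merge all of the outputs into a single array to pass to GradientBoostingClassifier.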
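To get a feel for why todense() runs out of memory while the sparse result does not, a back-of-the-envelope estimate helps. The sketch below is only an illustration: the row count, vocabulary size, and tokens-per-row figure are made-up assumptions, and the CSR formula is an approximation (8-byte values, 4-byte indices), not something taken from the question.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a fully materialised n_rows x n_cols array."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, itemsize=8, index_size=4):
    """Approximate bytes for a CSR matrix: values + column indices + row pointers."""
    return nnz * (itemsize + index_size) + (n_rows + 1) * index_size


# Hypothetical sizes, chosen only for illustration.
n_rows, n_cols = 100_000, 50_000
nnz = n_rows * 20  # assume about 20 distinct tokens per row

dense_gib = dense_bytes(n_rows, n_cols) / 2**30
sparse_mib = csr_bytes(n_rows, nnz) / 2**20
print(f"dense:  {dense_gib:.1f} GiB")
print(f"sparse: {sparse_mib:.1f} MiB")
```

With these assumed numbers the dense array needs tens of gigabytes, while the sparse matrix fits in a few tens of megabytes, which is why keeping the CountVectorizer output sparse is the practical fix.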
Update:
You need to set it up like this:
# To merge sparse matrices
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

result_matrix_train = None
result_matrix_test = None
for feature_colunm_name in feature_columns_to_use:
    # One vectorizer per column, fitted on the training data only
    cv = CountVectorizer()
    X_train_cv = cv.fit_transform(X_train[feature_colunm_name])
    # Merge the vector with others
    result_matrix_train = (hstack((result_matrix_train, X_train_cv))
                           if result_matrix_train is not None else X_train_cv)
    # Now transform the test data
    X_test_cv = cv.transform(X_test[feature_colunm_name])
    result_matrix_test = (hstack((result_matrix_test, X_test_cv))
                          if result_matrix_test is not None else X_test_cv)
Note: if you have other columns that you did not run through CountVectorizer, for example because they are already numeric, and you want to merge them with result_matrix_train, you can do that as well:
result_matrix_train = hstack((result_matrix_train, X_train[other_columns].values))
result_matrix_test = hstack((result_matrix_test, X_test[other_columns].values))
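When mixing a sparse bag-of-words block with plain numeric columns, one cautious option is to wrap the dense block in csr_matrix before stacking, so every block passed to hstack is sparse. The snippet below is a small sketch with made-up stand-in data (the names bow and numeric are hypothetical, not from the question):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-in for the output of CountVectorizer (3 rows, 4 token columns).
bow = csr_matrix(np.array([[1, 0, 0, 2],
                           [0, 1, 0, 0],
                           [0, 0, 3, 0]]))

# Stand-in for X_train[other_columns].values (3 rows, 2 numeric columns).
numeric = np.array([[0.5, 10.0],
                    [1.5, 20.0],
                    [2.5, 30.0]])

# Wrap the dense block as sparse, stack column-wise, and ask for CSR output.
combined = hstack([bow, csr_matrix(numeric)], format="csr")
print(combined.shape)
```

Both blocks must have the same number of rows. If the extra columns come out of the DataFrame with object dtype, casting them to float first (for example with .astype(float)) is probably needed before wrapping them.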
Now use these to train:
...
grd.fit(result_matrix_train, y_train.values)
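For reference, here is a compact end-to-end sketch of the whole approach on toy data. Everything in it is an assumption for illustration: the column names, the texts, the labels, and the small n_estimators and max_depth values. Plain dicts stand in for the X_train and X_test DataFrames.

```python
from scipy.sparse import hstack
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the text columns of X_train / X_test (hypothetical data).
train = {
    "title": ["cheap pills", "meeting notes", "cheap watches", "project notes"],
    "body": ["buy now", "see agenda attached", "buy today", "see plan attached"],
}
test = {
    "title": ["cheap offer", "weekly notes"],
    "body": ["buy this now", "agenda attached"],
}
y_train = [1, 0, 1, 0]

train_blocks, test_blocks = [], []
for column in train:
    cv = CountVectorizer()
    train_blocks.append(cv.fit_transform(train[column]))  # learn vocabulary on train
    test_blocks.append(cv.transform(test[column]))        # reuse it on test

# Stack the per-column matrices side by side, staying sparse throughout.
X_train_sparse = hstack(train_blocks, format="csr")
X_test_sparse = hstack(test_blocks, format="csr")

grd = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=0)
grd.fit(X_train_sparse, y_train)  # no todense() needed
predictions = grd.predict(X_test_sparse)
print(X_train_sparse.shape, X_test_sparse.shape, predictions)
```

Because each vectorizer is fitted once on the training column and only applied to the test column, the train and test matrices end up with the same number of columns, and words that appear only in the test data are simply ignored.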