Gradient Boosting Classifier sparse matrix issue using pandas and scikit-learn
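I have been using the following code for multiclass classification. It uses GradientBoostingClassifier from scikit-learn. I am running into the known issue where a sparse matrix has to be converted to a dense matrix.
I applied the solution suggested on Stack Overflow, but it does not work in my case. The solution I used works for RandomForestClassifier, and as far as I know it should work for GradientBoostingClassifier too!
If I replace GradientBoostingClassifier with RandomForestClassifier, the same code runs perfectly.
The data in this case has 93 numeric features and 8 target classes. It can be downloaded from Kaggle.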
import pandas as pd
from sklearn import ensemble, feature_extraction, preprocessing
# load data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
sample = pd.read_csv('submissions/sampleSubmission.csv')
labels = train.target.values
ids = train.id.values
train = train.drop('id', axis=1)
train = train.drop('target', axis=1)
train_orig = train
test = test.drop('id', axis=1)
# transform counts to TFIDF features
tfidf = feature_extraction.text.TfidfTransformer()
train = tfidf.fit_transform(train)
test = tfidf.transform(test).toarray() # Update line
# encode labels
lbl_enc = preprocessing.LabelEncoder()
labels = lbl_enc.fit_transform(labels)
# train a gradient boosting classifier
print('starting training ... ')
clf = ensemble.GradientBoostingClassifier(n_estimators=config.estimators)
clf.fit(train, labels)
# predict on test set
print('starting prediction ... ')
preds = clf.predict_proba(test) # Error on this line even when test is dense
train_pred = clf.predict(tfidf.transform(train_orig))
Traceback:
python boosted_trees.py
starting training ...
Traceback (most recent call last):
File "boosted_trees.py", line 57, in <module>
clf.fit(train, labels)
File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/gradient_boosting.py", line 941, in fit
X, y = check_X_y(X, y, dtype=DTYPE)
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 439, in check_X_y
ensure_min_features)
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 331, in check_array
copy, force_all_finite)
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
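Note that the traceback is raised from clf.fit(train, labels), so the matrix being rejected is train, not test. One way to check which inputs are still sparse is scipy.sparse.issparse. The sketch below is only an illustration: the counts array is a small made-up stand-in for the count features, not the Kaggle data.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical stand-in for the count features: 4 samples, 3 features.
counts = np.array([[1, 0, 2],
                   [0, 3, 0],
                   [4, 0, 0],
                   [0, 1, 1]])

tfidf = TfidfTransformer()
train = tfidf.fit_transform(counts)   # TfidfTransformer returns a SciPy sparse matrix
train_dense = train.toarray()         # explicit conversion to a dense NumPy array

print(type(train).__name__, sparse.issparse(train))
print(type(train_dense).__name__, sparse.issparse(train_dense))
```

Running the same issparse check on train and test just before fit and predict_proba shows which of them still needs a toarray() call.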
Thanks to @imaluengo.
Posting this in case anyone else needs it. The problem was in these lines.
train = tfidf.fit_transform(train)
test = tfidf.transform(test).toarray() # Update line
Both lines need a toarray() call to fix the problem.
train = tfidf.fit_transform(train).toarray()
test = tfidf.transform(test).toarray() # Update line
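For reference, here is a minimal end-to-end sketch of the fix. It assumes random count data in place of the Kaggle CSV files, and the variable names (X_train_counts, X_test_counts, raw_labels) and n_estimators=10 are made up for the illustration. The point is only that every matrix coming out of the TfidfTransformer is converted with toarray() before it reaches the classifier, including the one passed to the final predict call.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: random counts standing in for the real feature columns.
rng = np.random.RandomState(0)
X_train_counts = rng.randint(0, 5, size=(60, 6))
X_test_counts = rng.randint(0, 5, size=(10, 6))
raw_labels = np.array(['Class_1', 'Class_2', 'Class_3'] * 20)

# transform counts to TF-IDF features and densify both matrices
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(X_train_counts).toarray()
X_test = tfidf.transform(X_test_counts).toarray()

# encode string labels as integers
labels = LabelEncoder().fit_transform(raw_labels)

# train a gradient boosting classifier on the dense training matrix
clf = GradientBoostingClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, labels)

# predict on dense inputs only
preds = clf.predict_proba(X_test)
train_pred = clf.predict(tfidf.transform(X_train_counts).toarray())

print(preds.shape)  # (10, 3): one row per test sample, one column per class
```

With only 93 features a dense array should be cheap, but toarray() materialises every zero, so on much wider data the memory cost is worth checking first.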