Issues with Pipelining in sklearn
I am new to sklearn. I am using Pipeline to combine a Vectorizer and a Classifier in a text mining problem. Here is my code:
from sklearn import cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    # TF-IDF over word 1- to 3-grams, followed by Gaussian Naive Bayes
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    clf = GaussianNB()
    pipeline = Pipeline([('vect', tfidf_ngrams), ('clf', clf)])
    return pipeline

def get_trains():
    # Skip the header row; column 0 is the label, column 2 is the text
    data=open('../cleaning data/cleaning the sentences/cleaned_comments.csv','r').readlines()[1:]
    lines=len(data)
    features_train=[]
    labels_train=[]
    for i in range(lines):
        l=data[i].split(',')
        labels_train+=[int(l[0])]
        a=l[2]
        features_train+=[a]
    return features_train,labels_train

def train_model(clf_factory,features_train,labels_train):
    features_train,labels_train=get_trains()
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features_train, labels_train, test_size=0.1, random_state=42)
    clf=clf_factory()
    clf.fit(features_train,labels_train)
    pred = clf.predict(features_test)
    accuracy = accuracy_score(pred,labels_test)
    return accuracy

X,Y=get_trains()
print train_model(create_ngram_model,X,Y)
The features returned from get_trains() are strings.
I am getting this error:
    clf.fit(features_train,labels_train)
  File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 149, in fit
    X, y = check_arrays(X, y, sparse_format='dense')
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 263, in check_arrays
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
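I have run into this error many times. Before, I just changed the features to features_transformed.toarray(), but since I am using a pipeline here I cannot do that, because the transformed features are passed on automatically. I also tried writing a new class that returns features_transformed.toarray(), but that threw the same error.
I have searched a lot but could not find a solution. Please help!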
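You have 2 options:
1. Use a classifier that is compatible with sparse data. For instance, the documentation says that Bernoulli Naive Bayes and Multinomial Naive Bayes support sparse input for fit.
2. Add a "densifier" to the pipeline. Apparently yours was not quite right; this one worked for me (when I needed to densify my sparse data along the way):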
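class Densifier(object):
    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained
        return self

    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X, y=None):
        # Convert the sparse matrix to a dense numpy array
        return X.toarray()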
Make sure to put it in the pipeline right before the classifier.
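To make the two options concrete, here is a minimal sketch of both pipelines side by side. It assumes a recent scikit-learn on Python 3 (not the Python 2.7 install from the traceback), and the tiny `docs`/`labels` corpus and the step name `'dense'` are made up purely for illustration. The `Densifier` here is a variant of the class above that inherits from scikit-learn's base classes, which is my own choice rather than something the original setup requires.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline


class Densifier(TransformerMixin, BaseEstimator):
    """Turn a SciPy sparse matrix into a dense NumPy array."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self lets the step be chained.
        return self

    def transform(self, X):
        return X.toarray()


# Hypothetical toy data, only here so the sketch can run end to end.
docs = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

# Option 1: swap in a classifier that accepts sparse input directly.
sparse_model = Pipeline([
    ("vect", TfidfVectorizer(ngram_range=(1, 3), analyzer="word")),
    ("clf", MultinomialNB()),
])
sparse_model.fit(docs, labels)

# Option 2: keep GaussianNB, but densify right before it.
dense_model = Pipeline([
    ("vect", TfidfVectorizer(ngram_range=(1, 3), analyzer="word")),
    ("dense", Densifier()),
    ("clf", GaussianNB()),
])
dense_model.fit(docs, labels)

sparse_pred = sparse_model.predict(["good film"])
dense_pred = dense_model.predict(["good film"])
```

Keep in mind that a 1- to 3-gram TF-IDF matrix can have a very large number of columns, so converting it to a dense array may use a lot of memory on a real corpus. If that becomes a problem, option 1 is usually the easier route.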